promptfoo-evaluation
Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
Best use case
promptfoo-evaluation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using promptfoo-evaluation should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/promptfoo-evaluation/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
Frequently Asked Questions
What does this skill do?
It configures and runs LLM evaluations with the Promptfoo framework: setting up prompt tests, writing evaluation configs (promptfooconfig.yaml), authoring Python custom assertions, implementing llm-rubric for LLM-as-judge grading, and managing few-shot examples in prompts.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Promptfoo Evaluation
## Overview
This skill provides guidance for configuring and running LLM evaluations using [Promptfoo](https://www.promptfoo.dev/), an open-source CLI tool for testing and comparing LLM outputs.
## Quick Start
```bash
# Initialize a new evaluation project
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
```
## Configuration Structure
A typical Promptfoo project structure:
```
project/
├── promptfooconfig.yaml      # Main configuration
├── prompts/
│   ├── system.md             # System prompt
│   └── chat.json             # Chat format prompt
├── tests/
│   └── cases.yaml            # Test cases
└── scripts/
    └── metrics.py            # Custom Python assertions
```
## Core Configuration (promptfooconfig.yaml)
```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Concurrency control (MUST be under commandLineOptions, NOT top-level)
commandLineOptions:
  maxConcurrency: 2

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json
```
## Prompt Formats
### Text Prompt (system.md)
```markdown
You are a helpful assistant.
Task: {{task}}
Context: {{context}}
```
### Chat Format (chat.json)
```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "{{user_input}}"}
]
```
### Few-Shot Pattern
Embed examples directly in the prompt, or use chat format with assistant messages:
```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Example input: {{example_input}}"},
  {"role": "assistant", "content": "{{example_output}}"},
  {"role": "user", "content": "Now process: {{actual_input}}"}
]
```
## Test Cases (tests/cases.yaml)
```yaml
- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8
```
## Python Custom Assertions
Create a Python file for custom assertions (e.g., `scripts/metrics.py`):
```python
def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})
    # Access test variables
    expected = vars_dict.get('expected', '')
    # Return result
    return {
        "pass": expected in output,
        "score": 0.8,
        "reason": "Contains expected content",
        "named_scores": {"relevance": 0.9}
    }


def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500
    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }
```
**Key points:**
- Default function name is `get_assert`
- Specify function with `file://path.py:function_name`
- Return `bool`, `float` (score), or `dict` with pass/score/reason (see the sketch after this list)
- Access variables via `context['vars']`
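For the simplest case, an assertion can return a bare float, which is treated as the score and compared against any `threshold` on the test case. A minimal sketch (the 300-word target is an arbitrary assumption for illustration):

```python
def brevity_score(output: str, context: dict) -> float:
    """Return a bare float; Promptfoo interprets it as the score."""
    word_count = len(output.split())
    # Full score at or under 300 words, decaying for longer outputs
    return min(1.0, 300 / max(word_count, 1))
```

Reference it as `value: file://scripts/metrics.py:brevity_score` and pair it with a `threshold` in the test case.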
## LLM-as-Judge (llm-rubric)
```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness
      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model
```
**When using a relay/proxy API**, each `llm-rubric` assertion needs its own `provider` config with `apiBaseUrl`. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:
```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate quality on a 0-1 scale.
    threshold: 0.7
    provider:
      id: anthropic:messages:claude-sonnet-4-6
      config:
        apiBaseUrl: https://your-relay.example.com/api
```
**Best practices:**
- Provide clear scoring criteria
- Use `threshold` to set minimum passing score
- Default grader uses available API keys (OpenAI → Anthropic → Google)
- **When using relay/proxy**: every `llm-rubric` must have its own `provider` with `apiBaseUrl` — the main provider's `apiBaseUrl` is NOT inherited
## Common Assertion Types
| Type | Usage | Example |
|------|-------|---------|
| `contains` | Check substring | `value: "hello"` |
| `icontains` | Case-insensitive | `value: "HELLO"` |
| `equals` | Exact match | `value: "42"` |
| `regex` | Pattern match | `value: "\\d{4}"` |
| `python` | Custom logic | `value: file://script.py` |
| `llm-rubric` | LLM grading | `value: "Is professional"` |
| `latency` | Response time | `threshold: 1000` |
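Several of these can be stacked on a single test case. A small sketch combining three of the types above:

```yaml
assert:
  - type: icontains
    value: "hello"
  - type: regex
    value: "\\d{4}"
  - type: latency
    threshold: 1000  # milliseconds
```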
## File References
All `file://` paths are resolved relative to `promptfooconfig.yaml` location (NOT the YAML file containing the reference). This is a common gotcha when `tests:` references a separate YAML file — the `file://` paths inside that test file still resolve from the config root.
```yaml
# Load file content as variable
vars:
  content: file://data/input.txt

# Load prompt from file
prompts:
  - file://prompts/main.md

# Load test cases from file
tests: file://tests/cases.yaml

# Load Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
```
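To make the gotcha concrete: a path written inside `tests/cases.yaml` still resolves from the directory holding `promptfooconfig.yaml`, not from `tests/`. A sketch:

```yaml
# tests/cases.yaml
- vars:
    # Resolves to <config dir>/data/context.txt, NOT tests/data/context.txt
    context: file://data/context.txt
```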
## Running Evaluations
```bash
# Basic run
npx promptfoo@latest eval
# With specific config
npx promptfoo@latest eval --config path/to/config.yaml
# Output to file
npx promptfoo@latest eval --output results.json
# Filter tests
npx promptfoo@latest eval --filter-metadata category=math
# View results
npx promptfoo@latest view
```
## Relay / Proxy API Configuration
When using an API relay or proxy instead of direct Anthropic/OpenAI endpoints:
```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
    config:
      max_tokens: 4096
      apiBaseUrl: https://your-relay.example.com/api  # Promptfoo appends /v1/messages

# CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
commandLineOptions:
  maxConcurrency: 1  # Respect relay rate limits
```
**Key rules:**
- `apiBaseUrl` goes in `providers[].config` — Promptfoo appends `/v1/messages` automatically
- `maxConcurrency` must be under `commandLineOptions:` — placing it at top level is silently ignored
- When using relay with LLM-as-judge, set `maxConcurrency: 1` to avoid concurrent request limits (generation + grading share the same pool)
- Pass the relay token as the `ANTHROPIC_API_KEY` environment variable, as shown below
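A minimal sketch (the token value is a placeholder for whatever your relay issues):

```bash
export ANTHROPIC_API_KEY="<relay-token>"
npx promptfoo@latest eval
```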
## Troubleshooting
**Python not found:**
```bash
export PROMPTFOO_PYTHON=python3
```
**Large outputs truncated:**
Outputs over 30000 characters are truncated. Use `head_limit` in assertions.
**File not found errors:**
All `file://` paths resolve relative to `promptfooconfig.yaml` location.
**maxConcurrency ignored (shows "up to N at a time"):**
`maxConcurrency` must be under `commandLineOptions:`, not at the YAML top level. This is a common mistake.
**LLM-as-judge returns 401 with relay API:**
Each `llm-rubric` assertion must have its own `provider` with `apiBaseUrl`. The main provider config is not inherited by grader assertions.
**HTML tags in model output inflating metrics:**
Models may output `<br>`, `<b>`, etc. in structured content. Strip HTML in Python assertions before measuring:
```python
import re

# Strip HTML tags before computing word/character metrics
clean_text = re.sub(r'<[^>]+>', '', raw_text)
```
## Echo Provider (Preview Mode)
Use the **echo provider** to preview rendered prompts without making API calls:
```yaml
# promptfooconfig-preview.yaml
providers:
  - echo  # Returns prompt as output, no API calls

tests:
  - vars:
      input: "test content"
```
**Use cases:**
- Preview prompt rendering before expensive API calls
- Verify few-shot examples are loaded correctly
- Debug variable substitution issues
- Validate prompt structure
```bash
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
```
**Cost:** Free; no API tokens are consumed.
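Since echo returns the rendered prompt as the output, you can even assert on the rendering itself. A sketch (the prompt string and variable name are illustrative):

```yaml
providers:
  - echo
prompts:
  - "Process this: {{input}}"
tests:
  - vars:
      input: "test content"
    assert:
      - type: contains
        value: "test content"  # passes only if {{input}} was substituted
```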
## Advanced Few-Shot Implementation
### Multi-turn Conversation Pattern
For complex few-shot learning with full examples. Note that JSON does not allow comments, so the turns below are ordered by position: example 1, an optional example 2, then the actual test input:
```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Task: {{example_input_1}}"},
  {"role": "assistant", "content": "{{example_output_1}}"},
  {"role": "user", "content": "Task: {{example_input_2}}"},
  {"role": "assistant", "content": "{{example_output_2}}"},
  {"role": "user", "content": "Task: {{actual_input}}"}
]
```
**Test case configuration:**
```yaml
tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt
```
**Best practices:**
- Use 1-3 few-shot examples (more may dilute effectiveness)
- Ensure examples match the task format exactly
- Load examples from files for better maintainability
- Use echo provider first to verify structure
## Long Text Handling
For Chinese/long-form content evaluations (10k+ characters):
**Configuration:**
```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 8192  # Increase for long outputs

defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length
```
**Python assertion for text metrics:**
```python
import re


def strip_tags(text: str) -> str:
    """Remove HTML tags for pure text."""
    return re.sub(r'<[^>]+>', '', text)


def check_length(output: str, context: dict) -> dict:
    """Check output length constraints."""
    raw_input = context['vars'].get('raw_input', '')
    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))
    reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0
    return {
        "pass": 0.7 <= reduction_ratio <= 0.9,
        "score": reduction_ratio,
        "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
        "named_scores": {
            "input_length": input_len,
            "output_length": output_len,
            "reduction_ratio": reduction_ratio
        }
    }
```
## Real-World Example
**Project:** Chinese short-video content curation from long transcripts
**Structure:**
```
tiaogaoren/
├── promptfooconfig.yaml          # Production config
├── promptfooconfig-preview.yaml  # Preview config (echo provider)
├── prompts/
│   ├── tiaogaoren-prompt.json    # Chat format with few-shot
│   └── v4/system-v4.md           # System prompt
├── tests/cases.yaml              # 3 test samples
├── scripts/metrics.py            # Custom metrics (reduction ratio, etc.)
├── data/                         # 5 samples (2 few-shot, 3 eval)
└── results/
```
**See:** `./tiaogaoren/` (example project root) for full implementation.
## Resources
For detailed API reference and advanced patterns, see [references/promptfoo_api.md](references/promptfoo_api.md).
Related Skills
model-evaluation-metrics
Model Evaluation Metrics - Auto-activating skill for ML training. Triggers on: model evaluation metrics. Part of the ML Training skill category.
llm-evaluation
Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
agent-evaluation
Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring (even top agents achieve less than 50% on real-world benchmarks). Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
advanced-evaluation
This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
evaluation
Build evaluation frameworks for agent systems. Use when testing agent performance, validating context engineering choices, or measuring improvements over time.
content-evaluation-framework
This skill should be used when evaluating the quality of book chapters, lessons, or educational content. It provides a systematic 6-category rubric with weighted scoring (Technical Accuracy 30%, Pedagogical Effectiveness 25%, Writing Quality 20%, Structure & Organization 15%, AI-First Teaching 10%, Constitution Compliance Pass/Fail) and multi-tier assessment (Excellent/Good/Needs Work/Insufficient). Use this during iterative drafting, after content completion, on-demand review requests, or before validation phases.