Promptfoo

## Overview

25 stars

byComeOnOliver

View on GitHub Installation ↓

Best use case

Promptfoo is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

## Overview

Teams using Promptfoo should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/promptfoo/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/TerminalSkills/skills/promptfoo/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/promptfoo/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How Promptfoo Compares

Feature / Agent	Promptfoo	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

## Overview

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Promptfoo

## Overview

Promptfoo is an open-source framework for testing LLM prompts — define test cases, run them against one or more models, and assert on outputs. Think "unit tests for prompts." Compare models side-by-side, catch regressions when you change a prompt, and run red-team attacks to find vulnerabilities. Web UI for viewing results, CLI for CI integration.

## When to Use

- Changing a prompt and want to make sure it doesn't break existing behavior
- Comparing model performance (GPT-4o vs Claude vs Gemini on your use case)
- Red-teaming an LLM application for prompt injection and harmful outputs
- Building a prompt evaluation suite for CI/CD
- Systematic prompt engineering (not vibes-based)

## Instructions

### Setup

```bash
npm install -g promptfoo
# Or: npx promptfoo@latest
```

### Basic Evaluation

```yaml
# promptfooconfig.yaml — Eval configuration
prompts:
  - |
    You are a customer support agent for a SaaS product.
    Answer the following customer question concisely and helpfully.

    Question: {{question}}

providers:
  - openai:gpt-4o
  - anthropic:messages:claude-sonnet-4-20250514

tests:
  - vars:
      question: "How do I reset my password?"
    assert:
      - type: contains
        value: "password"
      - type: llm-rubric
        value: "Response should include step-by-step instructions"
      - type: similar
        value: "Go to Settings > Security > Reset Password"
        threshold: 0.7

  - vars:
      question: "Can I get a refund?"
    assert:
      - type: contains-any
        value: ["refund", "return", "money back"]
      - type: llm-rubric
        value: "Response should mention the refund policy and timeline"
      - type: not-contains
        value: "I don't know"

  - vars:
      question: "Your product sucks and I hate it"
    assert:
      - type: llm-rubric
        value: "Response should be professional and empathetic, not defensive"
      - type: not-contains-any
        value: ["sorry you feel that way", "I understand your frustration but"]
```

```bash
# Run evaluation
promptfoo eval

# View results in web UI
promptfoo view
```

### Assertion Types

```yaml
tests:
  - vars: { input: "Translate 'hello' to French" }
    assert:
      # Exact/partial match
      - type: equals
        value: "Bonjour"
      - type: contains
        value: "bonjour"
      - type: icontains           # Case-insensitive
        value: "bonjour"

      # Regex
      - type: regex
        value: "\\b[Bb]onjour\\b"

      # LLM-as-judge
      - type: llm-rubric
        value: "Translation is accurate and natural-sounding"

      # Semantic similarity
      - type: similar
        value: "Hello in French is Bonjour"
        threshold: 0.8

      # JSON validation
      - type: is-json
      - type: javascript
        value: "output.length < 500"

      # Safety
      - type: not-contains
        value: "I cannot"
      - type: llm-rubric
        value: "Response does not contain harmful content"

      # Latency and cost
      - type: latency
        threshold: 3000           # Max 3 seconds
      - type: cost
        threshold: 0.01           # Max $0.01 per call
```

### Model Comparison

```yaml
# compare.yaml — Side-by-side model comparison
prompts:
  - "Summarize this article in 3 bullet points:\n\n{{article}}"

providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
  - anthropic:messages:claude-sonnet-4-20250514
  - anthropic:messages:claude-haiku-4-20250514

tests:
  - vars:
      article: "{{file://test-articles/ai-regulation.txt}}"
    assert:
      - type: llm-rubric
        value: "Summary captures the 3 most important points"
      - type: javascript
        value: "output.split('\\n').filter(l => l.startsWith('•')).length === 3"
      - type: latency
        threshold: 5000
```

### Red-Teaming

```bash
# Auto-generate adversarial test cases
promptfoo redteam init
promptfoo redteam run

# Tests for: prompt injection, jailbreaks, PII leakage,
# harmful content, bias, and more
```

### CI Integration

```yaml
# .github/workflows/prompt-eval.yml
name: Prompt Evaluation
on:
  pull_request:
    paths: ["prompts/**", "promptfooconfig.yaml"]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo@latest eval --ci
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - run: npx promptfoo@latest eval --output results.json
      - uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results.json
```

## Examples

### Example 1: Test a customer support chatbot

**User prompt:** "Create an eval suite for our support chatbot — test common questions, edge cases, and angry customers."

The agent will create test cases across categories (FAQ, billing, technical, hostile), with LLM-rubric assertions for quality and safety checks.

### Example 2: Choose the best model for my use case

**User prompt:** "I need to pick between GPT-4o, Claude Sonnet, and Gemini Flash for code review. Help me decide."

The agent will set up a comparison eval with code review prompts, assertions for accuracy and helpfulness, and cost/latency thresholds.

## Guidelines

- **`llm-rubric` is the most flexible assertion** — uses an LLM to judge quality
- **`similar` for semantic matching** — doesn't require exact text match
- **`vars` for test data** — parameterize prompts with different inputs
- **File-based test data** — `{{file://path}}` for long test inputs
- **Red-team before production** — `promptfoo redteam` finds injection vulnerabilities
- **CI integration catches regressions** — run on every prompt change
- **Web UI for analysis** — `promptfoo view` shows results side-by-side
- **Cost assertions** — prevent expensive prompts from slipping into production
- **Multiple providers = comparison** — run same tests across models
- **Start with 10-20 test cases** — cover happy path, edge cases, and adversarial inputs

Related Skills

promptfoo-evaluation

from ComeOnOliver/skillshub

Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".

Daily Logs

from ComeOnOliver/skillshub

Record the user's daily activities, progress, decisions, and learnings in a structured, chronological format.

Socratic Method: The Dialectic Engine

from ComeOnOliver/skillshub

This skill transforms Claude into a Socratic agent — a cognitive partner who guides

Sokratische Methode: Die Dialektik-Maschine

from ComeOnOliver/skillshub

Dieser Skill verwandelt Claude in einen sokratischen Agenten — einen kognitiven Partner, der Nutzende durch systematisches Fragen zur Wissensentdeckung führt, anstatt direkt zu instruieren.

College Football Data (CFB)

from ComeOnOliver/skillshub

Before writing queries, consult `references/api-reference.md` for endpoints, conference IDs, team IDs, and data shapes.

College Basketball Data (CBB)

from ComeOnOliver/skillshub

Before writing queries, consult `references/api-reference.md` for endpoints, conference IDs, team IDs, and data shapes.

Betting Analysis

from ComeOnOliver/skillshub

Before writing queries, consult `references/api-reference.md` for odds formats, command parameters, and key concepts.

Research Proposal Generator

from ComeOnOliver/skillshub

Generate high-quality academic research proposals for PhD applications following Nature Reviews-style academic writing conventions.

Paper Slide Deck Generator

from ComeOnOliver/skillshub

Transform academic papers and content into professional slide deck images with automatic figure extraction.

Medical Imaging AI Literature Review Skill

from ComeOnOliver/skillshub

Write comprehensive literature reviews following a systematic 7-phase workflow.

Meeting Briefing Skill

from ComeOnOliver/skillshub

You are a meeting preparation assistant for an in-house legal team. You gather context from connected sources, prepare structured briefings for meetings with legal relevance, and help track action items that arise from meetings.

Canned Responses Skill

from ComeOnOliver/skillshub

You are a response template assistant for an in-house legal team. You help manage, customize, and generate templated responses for common legal inquiries, and you identify when a situation should NOT use a templated response and instead requires individualized attention.