eval-runner
Run LLM evaluation test suites and detect regressions. Use when you need to: test LLM responses against expected outputs, score responses with exact match, regex, or AI judge, compare model performance across runs, detect quality regressions in CI, or benchmark multiple models. Triggers include "LLM eval", "test my prompts", "evaluate model", "run evals", "regression test LLM", "score responses", "compare models", or any task requiring systematic LLM quality measurement.
Best use case
eval-runner is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using eval-runner should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/eval-runner/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How eval-runner Compares
| Feature / Agent | eval-runner | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It runs LLM evaluation test suites defined in YAML against model endpoints, scores the responses (exact match, regex, or an LLM judge), and detects regressions between runs, including in CI.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# eval-runner
LLM evaluation framework. Define test suites in YAML, run against model endpoints, score responses, detect regressions.
## When to use
- Testing LLM responses against expected outputs
- Scoring response quality with exact match, regex, or LLM-as-judge
- Comparing model performance (gpt-4o vs gpt-4-turbo, etc.)
- Detecting quality regressions after model or prompt changes
- Running evals in CI/CD pipelines with exit code support
## Quick Start
```bash
# Start the server
eval-runner start
# Run a suite from a YAML file
eval-runner run --suite my-suite.yaml --endpoint openai-prod
# Check for regressions (exits 1 if regression detected)
eval-runner check --run <run-id> --exit-code
```
## YAML Suite Format
```yaml
name: "My Test Suite"
model: "gpt-4o"
endpoint: "openai-prod"
defaults:
  temperature: 0
  max_tokens: 512
  system: "You are a helpful assistant."
cases:
  - id: "case-001"
    name: "Basic factual question"
    input: "What is the capital of France?"
    expected: "Paris"
    scoring:
      - method: contains
        value: "Paris"
  - id: "case-002"
    name: "Quality response check"
    input: "Explain recursion briefly"
    scoring:
      - method: llm-judge
        rubric: "The response clearly explains recursion with an example"
        threshold: 0.8
## Scoring Methods
| Method | Description |
|---|---|
| `exact_match` | Response must exactly match expected value (normalized) |
| `contains` | Response must contain the expected substring |
| `regex` | Response must match the regex pattern |
| `json_match` | Parse response as JSON and check a field value |
| `llm-judge` | LLM rates the response against a rubric (0.0-1.0) |
Multiple scorers per case are combined as a weighted average.
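
The scorers in the table and the weighted-average rule can be sketched in Python roughly as follows. This is an illustration, not eval-runner's implementation. The function names, the normalization used for `exact_match` (trim and lowercase), the use of `re.search` for `regex`, and the per-scorer weights are all assumptions; the source does not say how weights are assigned.

```python
import json
import re


def exact_match(response, expected):
    # Assumed normalization: strip whitespace and ignore case.
    return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0


def contains(response, value):
    return 1.0 if value in response else 0.0


def regex(response, pattern):
    # Assumption: a match anywhere in the response counts.
    return 1.0 if re.search(pattern, response) else 0.0


def json_match(response, field, value):
    # Parse the response as JSON and compare one top-level field.
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(data, dict) and data.get(field) == value else 0.0


def combine(scores_and_weights):
    """Weighted average of (score, weight) pairs; 0.0 if the weights sum to zero."""
    total_weight = sum(weight for _, weight in scores_and_weights)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in scores_and_weights) / total_weight


response = "The capital of France is Paris."
case_score = combine([
    (contains(response, "Paris"), 1.0),     # passes
    (regex(response, r"\bParis\b"), 1.0),   # passes
    (exact_match(response, "Paris"), 2.0),  # fails: the response is a full sentence
])
print(case_score)  # 0.5
```

With these hypothetical weights, two passing scorers of weight 1.0 and one failing scorer of weight 2.0 average to 0.5.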
## CLI Reference
| Command | Description |
|---|---|
| `eval-runner run --suite <file>` | Run a suite from YAML file |
| `eval-runner run --suite <file> --model <model>` | Override model |
| `eval-runner run --suite <file> --baseline <run-id>` | Compare against baseline |
| `eval-runner check --run <id> --exit-code` | Exit 0/1 based on regression |
| `eval-runner suite list` | List suites stored in server |
| `eval-runner suite import <file>` | Import YAML to server |
| `eval-runner export --run <id> --format json` | Export run results |
| `eval-runner --help` | Show help |
## Environment Variables
| Variable | Description | Default |
|---|---|---|
| `EVAL_PORT` | Server port | 4090 |
| `EVAL_DASHBOARD_PORT` | Dashboard port | 4091 |
| `EVAL_DATA_DIR` | SQLite and data directory | ~/.eval-runner |
| `EVAL_ENCRYPTION_KEY` | 32-byte hex for endpoint key encryption | required |
| `EVAL_CONCURRENCY` | Max parallel case executions per run | 5 |
| `EVAL_CASE_TIMEOUT_MS` | Timeout per case in milliseconds | 30000 |
| `EVAL_REGRESSION_THRESHOLD` | Per-case score drop threshold | -0.1 |
| `EVAL_DEV` | Dev mode (1 or 0) | 0 |
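
As a rough sketch, the defaults in the table could be read like this in Python. The `load_config` and `generate_encryption_key` helpers are hypothetical and only mirror the table; they are not part of eval-runner. The key generator assumes "32-byte hex" means 32 random bytes written as 64 hexadecimal characters.

```python
import os
import secrets


def generate_encryption_key():
    # Assumption: 32 random bytes rendered as 64 hex characters.
    return secrets.token_hex(32)


def load_config(env=os.environ):
    """Read the documented variables, falling back to the documented defaults."""
    key = env.get("EVAL_ENCRYPTION_KEY")
    if not key:
        # The table marks this variable as required, so there is no default.
        raise ValueError("EVAL_ENCRYPTION_KEY is required")
    return {
        "port": int(env.get("EVAL_PORT", "4090")),
        "dashboard_port": int(env.get("EVAL_DASHBOARD_PORT", "4091")),
        "data_dir": os.path.expanduser(env.get("EVAL_DATA_DIR", "~/.eval-runner")),
        "encryption_key": key,
        "concurrency": int(env.get("EVAL_CONCURRENCY", "5")),
        "case_timeout_ms": int(env.get("EVAL_CASE_TIMEOUT_MS", "30000")),
        "regression_threshold": float(env.get("EVAL_REGRESSION_THRESHOLD", "-0.1")),
        "dev": env.get("EVAL_DEV", "0") == "1",
    }


config = load_config({"EVAL_ENCRYPTION_KEY": generate_encryption_key()})
print(config["port"], config["case_timeout_ms"])  # 4090 30000
```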
## Regression Detection
A case is regressed if: `current_score - baseline_score < EVAL_REGRESSION_THRESHOLD`
A run is marked as regression if: `regressed_cases / total_cases > regression_rate_threshold` (default 0.05)
Set a baseline when starting a run:
```bash
eval-runner run --suite my-suite.yaml --baseline run_abc123
```
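
The two rules above can be sketched in Python as follows. This is a minimal illustration, not the tool's code. The function names are hypothetical, and comparing only the case ids present in both runs is an assumption; the source does not say how cases missing from the baseline are handled.

```python
def is_case_regressed(current, baseline, threshold=-0.1):
    # Mirrors: current_score - baseline_score < EVAL_REGRESSION_THRESHOLD
    return (current - baseline) < threshold


def is_run_regressed(current, baseline, threshold=-0.1, rate_threshold=0.05):
    """current and baseline map case id -> score for one run each."""
    # Assumption: only case ids present in both runs are compared.
    shared = [case_id for case_id in current if case_id in baseline]
    if not shared:
        return False
    regressed = sum(
        is_case_regressed(current[case_id], baseline[case_id], threshold)
        for case_id in shared
    )
    # Mirrors: regressed_cases / total_cases > regression_rate_threshold
    return regressed / len(shared) > rate_threshold


baseline_scores = {"case-001": 1.0, "case-002": 0.9}
current_scores = {"case-001": 1.0, "case-002": 0.6}
print(is_run_regressed(current_scores, baseline_scores))  # True
```

Here `case-002` drops by 0.3, which is below the -0.1 threshold, so one of two cases regressed (a rate of 0.5) and the run is flagged.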
## Troubleshooting
### Endpoint unreachable
Check the base URL in Settings. Test with `curl <base_url>/models`. Verify the API key is saved correctly.
### LLM judge returns null score
The judge model returned malformed JSON. The case scores 0.0. Check the judge model endpoint is working. Try a different judge model.
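
The fallback described above might look like this in Python. It is a sketch under stated assumptions: the judge's reply format (a JSON object with a `score` field) and the clamping to the 0.0-1.0 range are hypothetical, since the source only says that malformed JSON yields a score of 0.0.

```python
import json


def parse_judge_score(raw):
    """Return a score in [0.0, 1.0], or 0.0 when the judge output is unusable."""
    try:
        data = json.loads(raw)
        score = float(data["score"])  # "score" is an assumed field name
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed JSON, a missing field, or a non-numeric value scores 0.0.
        return 0.0
    if score != score:  # NaN never equals itself
        return 0.0
    return min(1.0, max(0.0, score))


print(parse_judge_score('{"score": 0.85}'))          # 0.85
print(parse_judge_score("Sure! I would say 0.8."))   # 0.0
```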
### Cases time out
Increase `EVAL_CASE_TIMEOUT_MS`. Long-running cases may need 60,000 ms or more.