eval-runner

Run LLM evaluation test suites and detect regressions. Use when you need to: test LLM responses against expected outputs, score responses with exact match, regex, or AI judge, compare model performance across runs, detect quality regressions in CI, or benchmark multiple models. Triggers include "LLM eval", "test my prompts", "evaluate model", "run evals", "regression test LLM", "score responses", "compare models", or any task requiring systematic LLM quality measurement.


Best use case

eval-runner is best used when you need a repeatable AI agent workflow instead of a one-off prompt.


Teams using eval-runner should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/eval-runner/SKILL.md --create-dirs "https://raw.githubusercontent.com/heldernoid/agentic-build-templates/main/projects/ai-llm-tools/eval-runner/skills/eval-runner/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/eval-runner/SKILL.md inside your project
  3. Restart your AI agent; it will auto-discover the skill

How eval-runner Compares

| Feature / Agent | eval-runner | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |


SKILL.md Source

# eval-runner

LLM evaluation framework. Define test suites in YAML, run against model endpoints, score responses, detect regressions.

## When to use

- Testing LLM responses against expected outputs
- Scoring response quality with exact match, regex, or LLM-as-judge
- Comparing model performance (gpt-4o vs gpt-4-turbo, etc.)
- Detecting quality regressions after model or prompt changes
- Running evals in CI/CD pipelines with exit code support

## Quick Start

```bash
# Start the server
eval-runner start

# Run a suite from a YAML file
eval-runner run --suite my-suite.yaml --endpoint openai-prod

# Check for regressions (exits 1 if regression detected)
eval-runner check --run <run-id> --exit-code
```

## YAML Suite Format

```yaml
name: "My Test Suite"
model: "gpt-4o"
endpoint: "openai-prod"

defaults:
  temperature: 0
  max_tokens: 512
  system: "You are a helpful assistant."

cases:
  - id: "case-001"
    name: "Basic factual question"
    input: "What is the capital of France?"
    expected: "Paris"
    scoring:
      - method: contains
        value: "Paris"

  - id: "case-002"
    name: "Quality response check"
    input: "Explain recursion briefly"
    scoring:
      - method: llm-judge
        rubric: "The response clearly explains recursion with an example"
        threshold: 0.8
```

## Scoring Methods

| Method | Description |
|---|---|
| `exact_match` | Response must exactly match expected value (normalized) |
| `contains` | Response must contain the expected substring |
| `regex` | Response must match the regex pattern |
| `json_match` | Parse response as JSON and check a field value |
| `llm-judge` | LLM rates the response against a rubric (0.0-1.0) |

Multiple scorers per case are combined as a weighted average.
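
As a rough illustration of how deterministic scorers and the weighted average could fit together, here is a minimal Python sketch. It is not eval-runner's implementation. The `weight` field, the reuse of `value` for the regex pattern, and the lowercase/whitespace normalization are all assumptions made for the example.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())

def score_exact_match(response: str, value: str) -> float:
    return 1.0 if normalize(response) == normalize(value) else 0.0

def score_contains(response: str, value: str) -> float:
    return 1.0 if value in response else 0.0

def score_regex(response: str, pattern: str) -> float:
    return 1.0 if re.search(pattern, response) else 0.0

# Only the deterministic methods are sketched; json_match and llm-judge are omitted.
SCORERS = {
    "exact_match": score_exact_match,
    "contains": score_contains,
    "regex": score_regex,
}

def score_case(response: str, scoring: list) -> float:
    """Combine per-scorer results as a weighted average.

    Each rule is a dict such as {"method": "contains", "value": "Paris"}.
    The optional "weight" key is hypothetical and defaults to 1.0.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for rule in scoring:
        weight = float(rule.get("weight", 1.0))
        scorer = SCORERS[rule["method"]]
        weighted_sum += weight * scorer(response, rule["value"])
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

if __name__ == "__main__":
    rules = [
        {"method": "contains", "value": "Paris", "weight": 3},
        {"method": "regex", "value": r"^\d+$", "weight": 1},
    ]
    print(score_case("The capital of France is Paris.", rules))
```

With these weights, a response that passes `contains` but fails `regex` scores 3 / 4 = 0.75.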

## CLI Reference

| Command | Description |
|---|---|
| `eval-runner run --suite <file>` | Run a suite from YAML file |
| `eval-runner run --suite <file> --model <model>` | Override model |
| `eval-runner run --suite <file> --baseline <run-id>` | Compare against baseline |
| `eval-runner check --run <id> --exit-code` | Exit 0 if no regression, 1 if a regression is detected |
| `eval-runner suite list` | List suites stored in server |
| `eval-runner suite import <file>` | Import YAML to server |
| `eval-runner export --run <id> --format json` | Export run results |
| `eval-runner --help` | Show help |

## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `EVAL_PORT` | Server port | 4090 |
| `EVAL_DASHBOARD_PORT` | Dashboard port | 4091 |
| `EVAL_DATA_DIR` | SQLite and data directory | ~/.eval-runner |
| `EVAL_ENCRYPTION_KEY` | 32-byte hex for endpoint key encryption | required |
| `EVAL_CONCURRENCY` | Max parallel case executions per run | 5 |
| `EVAL_CASE_TIMEOUT_MS` | Timeout per case in milliseconds | 30000 |
| `EVAL_REGRESSION_THRESHOLD` | Per-case score drop threshold | -0.1 |
| `EVAL_DEV` | Dev mode (1 or 0) | 0 |
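
If a wrapper script needs the same settings, the table above could be mirrored roughly as follows. This is a sketch, not eval-runner's own loader: the `EvalConfig` class and `load_config` function are hypothetical names, and "32-byte hex" is read here as 64 hexadecimal characters, which is an assumption.

```python
import os
import string
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class EvalConfig:
    port: int
    dashboard_port: int
    data_dir: str
    encryption_key: str
    concurrency: int
    case_timeout_ms: int
    regression_threshold: float
    dev: bool

def load_config(env: Mapping[str, str] = os.environ) -> EvalConfig:
    """Read the documented variables, falling back to the documented defaults."""
    key = env.get("EVAL_ENCRYPTION_KEY", "")
    # Assumption: 32 bytes encoded as hex means exactly 64 hex characters.
    if len(key) != 64 or any(c not in string.hexdigits for c in key):
        raise ValueError("EVAL_ENCRYPTION_KEY must be 64 hex characters (32 bytes)")
    return EvalConfig(
        port=int(env.get("EVAL_PORT", "4090")),
        dashboard_port=int(env.get("EVAL_DASHBOARD_PORT", "4091")),
        data_dir=os.path.expanduser(env.get("EVAL_DATA_DIR", "~/.eval-runner")),
        encryption_key=key,
        concurrency=int(env.get("EVAL_CONCURRENCY", "5")),
        case_timeout_ms=int(env.get("EVAL_CASE_TIMEOUT_MS", "30000")),
        regression_threshold=float(env.get("EVAL_REGRESSION_THRESHOLD", "-0.1")),
        dev=env.get("EVAL_DEV", "0") == "1",
    )

if __name__ == "__main__":
    # Example key for illustration only; never hard-code a real key.
    print(load_config({"EVAL_ENCRYPTION_KEY": "ab" * 32}))
```

Failing fast on a missing or malformed key matches the "required" default in the table; every other variable falls back silently.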

## Regression Detection

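A case is regressed if: `current_score - baseline_score < EVAL_REGRESSION_THRESHOLD`. With the default threshold of -0.1, that means the score dropped by more than 0.1 from the baseline.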

A run is marked as regression if: `regressed_cases / total_cases > regression_rate_threshold` (default 0.05)

Set a baseline when starting a run:
```bash
eval-runner run --suite my-suite.yaml --baseline run_abc123
```
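
The per-case and per-run rules in this section can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the tool's code: `check_run` is a hypothetical helper, scores are keyed by case id, and only cases present in both runs are counted toward `total_cases`.

```python
def is_case_regressed(current: float, baseline: float, threshold: float = -0.1) -> bool:
    """Per-case rule: current_score - baseline_score < EVAL_REGRESSION_THRESHOLD."""
    return (current - baseline) < threshold

def check_run(current_scores: dict, baseline_scores: dict,
              threshold: float = -0.1, rate_threshold: float = 0.05):
    """Per-run rule: regressed_cases / total_cases > regression_rate_threshold.

    Returns (run_regressed, regressed_case_ids).
    Assumption: cases missing from either run are ignored.
    """
    shared = [cid for cid in current_scores if cid in baseline_scores]
    if not shared:
        return False, []
    regressed = [
        cid for cid in shared
        if is_case_regressed(current_scores[cid], baseline_scores[cid], threshold)
    ]
    return len(regressed) / len(shared) > rate_threshold, regressed

if __name__ == "__main__":
    baseline = {"case-001": 1.0, "case-002": 0.9}
    current = {"case-001": 1.0, "case-002": 0.5}
    run_regressed, ids = check_run(current, baseline)
    print(run_regressed, ids)
    # A CI wrapper could exit non-zero here, e.g. sys.exit(1 if run_regressed else 0).
```

Both comparisons are strict, so a run with exactly 5% regressed cases is not flagged under the default rate threshold.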

## Troubleshooting

### Endpoint unreachable

Check the base URL in Settings. Test with `curl <base_url>/models`. Verify the API key is saved correctly.

### LLM judge returns null score

The judge model returned malformed JSON, so the case scores 0.0. Check that the judge model endpoint is working, or try a different judge model.
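
The fallback described above might look something like the following sketch. The judge's actual response format is not documented here, so the `{"score": <number>}` shape and both function names are assumptions for illustration only.

```python
import json
from typing import Optional

def parse_judge_score(raw: str) -> Optional[float]:
    """Return a score in [0.0, 1.0], or None if the judge output is unusable.

    Assumes a hypothetical payload shape of {"score": <number>}.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    score = payload.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 0.0 <= score <= 1.0:
        return None
    return float(score)

def case_score(raw: str) -> float:
    """A null (unparseable) judge score counts as 0.0 for the case."""
    score = parse_judge_score(raw)
    return 0.0 if score is None else score

if __name__ == "__main__":
    print(case_score('{"score": 0.85}'))
    print(case_score("Score: 0.85"))
```

Keeping the `None` result separate from the 0.0 fallback makes it possible to log a malformed judge reply separately from a genuinely low score.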

### Cases time out

Increase `EVAL_CASE_TIMEOUT_MS` (default 30000). Long-running cases may need 60000 ms or more.
