eval-runner

Run LLM evaluation test suites and detect regressions. Use when you need to: test LLM responses against expected outputs, score responses with exact match, regex, or AI judge, compare model performance across runs, detect quality regressions in CI, or benchmark multiple models. Triggers include "LLM eval", "test my prompts", "evaluate model", "run evals", "regression test LLM", "score responses", "compare models", or any task requiring systematic LLM quality measurement.

7 stars

by heldernoid


Best use case

eval-runner is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using eval-runner should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/eval-runner/SKILL.md --create-dirs "https://raw.githubusercontent.com/heldernoid/agentic-build-templates/main/projects/ai-llm-tools/eval-runner/skills/eval-runner/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/eval-runner/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How eval-runner Compares

Feature / Agent	eval-runner	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

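What does this skill do?

eval-runner runs LLM evaluation test suites: you define cases in YAML, run them against a model endpoint, score the responses (exact match, regex, or an LLM judge), and compare runs to detect regressions.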

Where can I find the source code?

The source code is on GitHub in the heldernoid/agentic-build-templates repository, under projects/ai-llm-tools/eval-runner.

SKILL.md Source

# eval-runner

LLM evaluation framework. Define test suites in YAML, run against model endpoints, score responses, detect regressions.

## When to use

- Testing LLM responses against expected outputs
- Scoring response quality with exact match, regex, or LLM-as-judge
- Comparing model performance (gpt-4o vs gpt-4-turbo, etc.)
- Detecting quality regressions after model or prompt changes
- Running evals in CI/CD pipelines with exit code support

## Quick Start

```bash
# Start the server
eval-runner start

# Run a suite from a YAML file
eval-runner run --suite my-suite.yaml --endpoint openai-prod

# Check for regressions (exits 1 if regression detected)
eval-runner check --run <run-id> --exit-code
```

## YAML Suite Format

```yaml
name: "My Test Suite"
model: "gpt-4o"
endpoint: "openai-prod"

defaults:
  temperature: 0
  max_tokens: 512
  system: "You are a helpful assistant."

cases:
  - id: "case-001"
    name: "Basic factual question"
    input: "What is the capital of France?"
    expected: "Paris"
    scoring:
      - method: contains
        value: "Paris"

  - id: "case-002"
    name: "Quality response check"
    input: "Explain recursion briefly"
    scoring:
      - method: llm-judge
        rubric: "The response clearly explains recursion with an example"
        threshold: 0.8
```

## Scoring Methods

| Method | Description |
|---|---|
| `exact_match` | Response must exactly match expected value (normalized) |
| `contains` | Response must contain the expected substring |
| `regex` | Response must match the regex pattern |
| `json_match` | Parse response as JSON and check a field value |
| `llm-judge` | LLM rates the response against a rubric (0.0-1.0) |
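
The deterministic methods in the table can be sketched in a few lines of Python. This is an illustration only, not eval-runner's implementation: the function names are hypothetical, the normalization for `exact_match` (trim plus case-fold) is an assumption because the table only says "normalized", `regex` is assumed to match anywhere in the response, and `json_match` is assumed to check a single top-level field.

```python
import json
import re


def exact_match(response: str, expected: str) -> float:
    """1.0 if the strings are equal after normalization (assumed: trim + casefold)."""
    return 1.0 if response.strip().casefold() == expected.strip().casefold() else 0.0


def contains(response: str, value: str) -> float:
    """1.0 if the expected substring appears anywhere in the response."""
    return 1.0 if value in response else 0.0


def regex(response: str, pattern: str) -> float:
    """1.0 if the pattern matches somewhere in the response (assumed: search, not full match)."""
    return 1.0 if re.search(pattern, response) else 0.0


def json_match(response: str, field: str, expected) -> float:
    """1.0 if the response parses as a JSON object whose `field` equals `expected`."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict) or field not in data:
        return 0.0
    return 1.0 if data[field] == expected else 0.0
```

Each sketch returns 1.0 or 0.0 so that its result can sit on the same 0.0-1.0 scale as an `llm-judge` score.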

Multiple scorers per case are combined as a weighted average.
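
One way to read "weighted average" is sketched below. The weights are hypothetical: the suite format shown earlier does not include a weight field, so this sketch simply assumes each scorer result arrives paired with a positive weight.

```python
def combine_scores(results: list[tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs; scores are in the range 0.0-1.0."""
    total_weight = sum(weight for _, weight in results)
    if total_weight <= 0:
        raise ValueError("at least one scorer must have a positive weight")
    return sum(score * weight for score, weight in results) / total_weight


# A `contains` scorer that passed (1.0) combined with an `llm-judge` score of 0.6,
# with the judge weighted twice as heavily (hypothetical weights).
combined = combine_scores([(1.0, 1.0), (0.6, 2.0)])
```

With equal weights the result reduces to a plain mean, so a case with one passing and one failing binary scorer would land at 0.5.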

## CLI Reference

| Command | Description |
|---|---|
| `eval-runner run --suite <file>` | Run a suite from YAML file |
| `eval-runner run --suite <file> --model <model>` | Override model |
| `eval-runner run --suite <file> --baseline <run-id>` | Compare against baseline |
| `eval-runner check --run <id> --exit-code` | Exit 1 if a regression is detected, 0 otherwise |
| `eval-runner suite list` | List suites stored in server |
| `eval-runner suite import <file>` | Import YAML to server |
| `eval-runner export --run <id> --format json` | Export run results |
| `eval-runner --help` | Show help |

## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `EVAL_PORT` | Server port | 4090 |
| `EVAL_DASHBOARD_PORT` | Dashboard port | 4091 |
| `EVAL_DATA_DIR` | SQLite and data directory | ~/.eval-runner |
| `EVAL_ENCRYPTION_KEY` | 32-byte hex for endpoint key encryption | required |
| `EVAL_CONCURRENCY` | Max parallel case executions per run | 5 |
| `EVAL_CASE_TIMEOUT_MS` | Timeout per case in milliseconds | 30000 |
| `EVAL_REGRESSION_THRESHOLD` | Per-case score drop threshold | -0.1 |
| `EVAL_DEV` | Dev mode (1 or 0) | 0 |
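
`EVAL_ENCRYPTION_KEY` has no default, so a value has to be generated before first use. A minimal sketch using Python's standard library, assuming "32-byte hex" means 32 random bytes encoded as 64 hexadecimal characters (the table does not spell out the encoding):

```python
import secrets

# 32 random bytes from the OS CSPRNG, hex-encoded to 64 characters.
key = secrets.token_hex(32)

# Print in a form that can be pasted into a shell profile or .env file.
print(f"EVAL_ENCRYPTION_KEY={key}")
```

Keep the generated value stable across restarts; endpoint keys encrypted with one key presumably cannot be read back with another.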

## Regression Detection

A case is regressed if: `current_score - baseline_score < EVAL_REGRESSION_THRESHOLD`

A run is marked as regression if: `regressed_cases / total_cases > regression_rate_threshold` (default 0.05)

Set a baseline when starting a run:
```bash
eval-runner run --suite my-suite.yaml --baseline run_abc123
```
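
The two regression rules above can be sketched as follows. The function names are hypothetical, and the choice to compare only cases present in both runs is an assumption; the source does not say how added or removed cases are counted.

```python
def is_case_regressed(current: float, baseline: float, threshold: float = -0.1) -> bool:
    """A case regresses when its score delta falls below the (negative) threshold."""
    return (current - baseline) < threshold


def is_run_regressed(
    current: dict[str, float],
    baseline: dict[str, float],
    threshold: float = -0.1,
    rate_threshold: float = 0.05,
) -> bool:
    """A run regresses when the share of regressed cases exceeds rate_threshold."""
    # Assumption: only cases present in both runs are compared.
    shared = [case_id for case_id in current if case_id in baseline]
    if not shared:
        return False
    regressed = sum(
        is_case_regressed(current[c], baseline[c], threshold) for c in shared
    )
    return regressed / len(shared) > rate_threshold


baseline_scores = {"case-001": 1.0, "case-002": 0.9}
current_scores = {"case-001": 1.0, "case-002": 0.75}
run_regressed = is_run_regressed(current_scores, baseline_scores)
```

Both comparisons are strict. With the default rate of 0.05, a 20-case suite with one regressed case (1/20 = 0.05) is not flagged, while two regressed cases (2/20 = 0.10) are.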

## Troubleshooting

### Endpoint unreachable

Check the base URL in Settings. Test with `curl <base_url>/models`. Verify the API key is saved correctly.

### LLM judge returns null score

The judge model returned malformed JSON, so the case scores 0.0. Check that the judge model endpoint is working, or try a different judge model.
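
A sketch of the defensive parsing this behavior implies. It assumes, hypothetically, that the judge is asked to reply with a JSON object such as `{"score": 0.85}`; the actual judge prompt and reply schema are not documented here, and the function names are illustrative.

```python
import json
from typing import Optional


def parse_judge_score(raw: str) -> Optional[float]:
    """Return the judge's score in [0.0, 1.0], or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    score = data.get("score") if isinstance(data, dict) else None
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 0.0 <= score <= 1.0:
        return None
    return float(score)


def case_score(raw: str) -> float:
    """A null judge score counts as 0.0, matching the behavior described above."""
    parsed = parse_judge_score(raw)
    return 0.0 if parsed is None else parsed
```

Replies wrapped in prose or Markdown fences are a likely reason a strict parser like this one returns None, so inspecting the raw judge output is a reasonable first debugging step.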

### Cases time out

Increase `EVAL_CASE_TIMEOUT_MS`. Long-running cases may need 60,000 ms or more.

Related Skills

ssh-runner

from heldernoid/agentic-build-templates

Execute read-only commands on remote Linux servers via SSH using the ssh2 npm package.

load-test-runner

from heldernoid/agentic-build-templates

No description provided.

task-runner

from heldernoid/agentic-build-templates

Run monorepo tasks with dependency ordering using monorepo-task-runner. Covers mtr run, mtr status, tasks.yaml format, and the web dashboard.

Skill: Uptime Monitoring

from heldernoid/agentic-build-templates

## Overview

Skill: Status Page

from heldernoid/agentic-build-templates

## Overview

Skill: unit-conversion

from heldernoid/agentic-build-templates

## Overview

Skill: recipe-scaler

from heldernoid/agentic-build-templates

## Overview

reading-list

from heldernoid/agentic-build-templates

Operate the reading-list API to save, manage, tag, search, and export articles.

email-digest

from heldernoid/agentic-build-templates

Configure, test, and troubleshoot the reading-list daily email digest delivered via nodemailer.

websocket-realtime

from heldernoid/agentic-build-templates

Use the WebSocket connection in poll-builder to receive live vote updates. Use when you need to stream real-time poll results, monitor a poll for new votes, or build a live dashboard. Triggers include "live results", "real-time updates", "stream votes", "watch poll", or "WebSocket".

poll-builder

from heldernoid/agentic-build-templates

Self-hosted poll creation tool with real-time results. Use when you need to create a poll, check vote counts, close a poll, export results, or get the shareable link for a poll. Triggers include "create poll", "vote", "poll results", "survey", "collect votes", "share poll", or any task involving polling or voting.

Skill: personal-finance

from heldernoid/agentic-build-templates

## Overview