eval-loop
Configure and run the isolated eval loop pattern: generate, evaluate, and refine until the pass threshold is met.
Best use case
eval-loop is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
It is a strong fit for teams already working in Codex.
Teams using eval-loop should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/eval-loop/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How eval-loop Compares
| Feature / Agent | eval-loop | Standard Approach |
|---|---|---|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It configures and runs the isolated eval loop pattern: generate, evaluate, and refine until the pass threshold is met.
Which AI agents support this skill?
This skill is designed for Codex.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Eval Loop
**You are the Eval Loop Orchestrator** — configuring and running production quality gates for LLM inference pipelines.
## Natural Language Triggers
- "evaluate this pipeline"
- "set up evals for..."
- "run the eval loop on..."
- "add a quality gate to..."
- "test this prompt against cases"
## Parameters
### Pipeline directory (positional)
Path to pipeline directory containing `pipeline.config.yaml` and `prompts/`.
### --threshold (default: 0.85)
Pass threshold (0.0–1.0). Cases below this score trigger refinement.
### --max-attempts (default: 3)
Maximum generation attempts per case before marking as failed.
### --cases (optional)
Override test case file path (default: `eval/cases.jsonl`).
### --interactive (optional)
Pause after each batch to review failures before iterating.
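The parameters above could be mapped onto a command-line parser roughly as follows. This is a minimal sketch, assuming a standalone Python entry point: the `build_parser` and `threshold_value` names, the `eval-loop` program name, and the `pipelines/demo` path are hypothetical, and the skill itself does not prescribe a CLI implementation.

```python
import argparse


def threshold_value(text: str) -> float:
    """Parse a pass threshold and require it to lie in [0.0, 1.0]."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError("threshold must be between 0.0 and 1.0")
    return value


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; flag names and defaults mirror the Parameters section.
    parser = argparse.ArgumentParser(prog="eval-loop")
    parser.add_argument("pipeline_dir", help="directory with pipeline.config.yaml and prompts/")
    parser.add_argument("--threshold", type=threshold_value, default=0.85)
    parser.add_argument("--max-attempts", type=int, default=3)
    parser.add_argument("--cases", default="eval/cases.jsonl")
    parser.add_argument("--interactive", action="store_true")
    return parser


# Example invocation with a hypothetical pipeline path.
args = build_parser().parse_args(["pipelines/demo", "--threshold", "0.9"])
print(args.pipeline_dir, args.threshold, args.max_attempts, args.cases, args.interactive)
```

Validating the threshold at parse time means an out-of-range value is rejected before any generation calls are made.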
## Execution
### Step 1: Isolation Check
Before running, verify:
- `prompts/evaluator.prompt.md` exists and is **separate** from generator prompts
- Evaluator prompt contains `{{input}}` and `{{output}}` only — no generator context
- Evaluator prompt does NOT reference chain-of-thought, intermediate steps, or generator system prompt
If isolation check fails:
```
ERROR: Evaluator isolation violation detected.
The evaluator prompt at prompts/evaluator.prompt.md contains
generator context (found: "{{steps}}" on line 12).
Fix: Remove all generator-internal variables from evaluator prompt.
Only {{input}} and {{output}} are allowed.
```
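One way to automate the template-variable part of this check is sketched below. It assumes the evaluator prompt uses `{{name}}` placeholders as shown in the error message; `find_isolation_violations` is a hypothetical helper, and it does not detect prose references to chain-of-thought or to the generator system prompt, which still need manual review.

```python
import re

ALLOWED_VARIABLES = {"input", "output"}
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def find_isolation_violations(prompt_text: str) -> list[tuple[int, str]]:
    """Return (line_number, variable) for each disallowed template variable."""
    violations = []
    for line_number, line in enumerate(prompt_text.splitlines(), start=1):
        for name in PLACEHOLDER.findall(line):
            if name not in ALLOWED_VARIABLES:
                violations.append((line_number, name))
    return violations


# Hypothetical evaluator prompt with one leaked generator variable.
sample = "Score the answer.\nInput: {{input}}\nOutput: {{output}}\nSteps: {{steps}}\n"
for line_number, name in find_isolation_violations(sample):
    print('found: "{{' + name + '}}" on line ' + str(line_number))
```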
### Step 2: Load Test Cases
Read `eval/cases.jsonl`. Each line is a test case:
```json
{"id": "case_001", "input": "...", "expected": "...", "tags": ["happy-path"]}
```
Minimum recommended: 5 cases (3 happy path, 1 edge case, 1 failure/adversarial).
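Loading the case file might look like the sketch below. It assumes `id`, `input`, and `expected` are required and `tags` is optional, which the source does not state; `load_cases` is a hypothetical helper, and the demo writes to a temporary file so it runs without a real pipeline directory.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("id", "input", "expected")  # assumption: tags are optional
MIN_RECOMMENDED = 5


def load_cases(path):
    """Read one JSON object per non-blank line and check the required fields."""
    cases = []
    text = Path(path).read_text(encoding="utf-8")
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in case]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        case.setdefault("tags", [])
        cases.append(case)
    if len(cases) < MIN_RECOMMENDED:
        print(f"warning: {len(cases)} cases loaded; at least {MIN_RECOMMENDED} recommended")
    return cases


# Demo against a temporary file with two hypothetical cases.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "cases.jsonl"
    demo_path.write_text(
        '{"id": "case_001", "input": "a", "expected": "A", "tags": ["happy-path"]}\n'
        "\n"
        '{"id": "case_002", "input": "b", "expected": "B"}\n',
        encoding="utf-8",
    )
    cases = load_cases(demo_path)

print([case["id"] for case in cases])
```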
### Step 3: Run Eval Loop
For each test case:
```
attempt = 1
output = generator(case.input)
while attempt <= max_attempts:
    result = evaluator(case.input, output)   # isolated call: input and output only
    if result.pass:
        record(PASS, attempt, result)
        break
    if attempt == max_attempts:
        record(FAIL, attempt, result)
        break
    output = refine(output, result.feedback)  # refined output is evaluated on the next attempt
    attempt += 1
```
Write each result to `eval/results.jsonl` (append-only, validated against eval-result schema).
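A runnable version of the per-case loop is sketched below, with stub callables standing in for the real model calls. The `EvalResult` shape, the record fields, and the `run_case` and `append_result` names are assumptions for illustration; the eval-result schema is not shown in this document, so schema validation is omitted. The sketch also assumes a case passes when its score meets the threshold.

```python
import json
from dataclasses import dataclass


@dataclass
class EvalResult:
    score: float
    feedback: str


def run_case(case, generator, evaluator, refine, threshold=0.85, max_attempts=3):
    """Generate once, then evaluate and refine until the score meets the threshold."""
    output = generator(case["input"])
    for attempt in range(1, max_attempts + 1):
        result = evaluator(case["input"], output)  # isolated: input and output only
        if result.score >= threshold:
            status = "PASS"
            break
        if attempt < max_attempts:
            output = refine(output, result.feedback)
    else:
        status = "FAIL"  # loop ended without a passing score
    return {"id": case["id"], "status": status, "attempts": attempt, "score": result.score}


def append_result(path, record):
    """Append one JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


# Stub callables: the first output fails, the refined output passes.
def generator(text):
    return text.lower()


def evaluator(text, output):
    ok = output == text.upper()
    return EvalResult(1.0 if ok else 0.0, "" if ok else "use upper case")


def refine(output, feedback):
    return output.upper()


record = run_case({"id": "case_001", "input": "abc"}, generator, evaluator, refine)
print(json.dumps(record))
```

Passing the generator, evaluator, and refiner as separate callables keeps the evaluator call free of any generator state.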
### Step 4: Summary Report
After all cases:
```
Eval Results: pipelines/<name>/
✓ 21/23 passed (91.3%)
✗ 2 failures:
case_004: score 0.40 — missing 'variant' field
case_019: score 0.20 — hallucinated 'brand' from partial input
Avg score: 0.94
Avg attempts: 1.3
Total cost: $0.0041 (23 cases × haiku)
Top recommendation:
Tighten extract.prompt.md lines 12-15 re: variant extraction
```
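The headline figures in a report like this can be derived from the result records. The sketch below assumes each record carries `status`, `score`, and `attempts` fields (hypothetical names, since the eval-result schema is not shown here) and leaves out cost tracking and recommendations.

```python
def summarize(records):
    """Compute pass count, pass rate, average score, and average attempts."""
    total = len(records)
    if total == 0:
        raise ValueError("no results to summarize")
    passed = sum(1 for r in records if r["status"] == "PASS")
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total,
        "avg_score": sum(r["score"] for r in records) / total,
        "avg_attempts": sum(r["attempts"] for r in records) / total,
    }


# Hypothetical records for illustration only.
records = [
    {"id": "case_001", "status": "PASS", "score": 1.0, "attempts": 1},
    {"id": "case_002", "status": "PASS", "score": 1.0, "attempts": 2},
    {"id": "case_003", "status": "FAIL", "score": 0.5, "attempts": 3},
    {"id": "case_004", "status": "PASS", "score": 0.75, "attempts": 1},
]
summary = summarize(records)
print(f"✓ {summary['passed']}/{summary['total']} passed ({summary['pass_rate']:.1%})")
print(f"Avg score: {summary['avg_score']:.2f}")
print(f"Avg attempts: {summary['avg_attempts']:.1f}")
```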
### Step 5: Prompt Improvement Suggestions
If pass rate < threshold, aggregate feedback and suggest targeted prompt changes:
- Group failures by `failure_category`
- Surface the most common `suggested_fix`
- Do NOT rewrite the whole prompt — suggest one change at a time
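The aggregation described above can be sketched as follows. It assumes each failed result carries `failure_category` and `suggested_fix` fields, as named in the list; `top_suggestion` is a hypothetical helper and the sample values are invented for illustration. It returns a single suggestion, in line with the one-change-at-a-time rule.

```python
from collections import Counter, defaultdict


def top_suggestion(failures):
    """Return the most common suggested_fix within the largest failure_category."""
    by_category = defaultdict(list)
    for failure in failures:
        by_category[failure["failure_category"]].append(failure["suggested_fix"])
    if not by_category:
        return None
    # Largest category first; ties resolve to the category seen first.
    category, fixes = max(by_category.items(), key=lambda item: len(item[1]))
    fix, count = Counter(fixes).most_common(1)[0]
    return {"failure_category": category, "suggested_fix": fix, "count": count}


# Hypothetical failures for illustration only.
failures = [
    {"failure_category": "missing_field", "suggested_fix": "require the variant field"},
    {"failure_category": "hallucination", "suggested_fix": "forbid inferring brand"},
    {"failure_category": "missing_field", "suggested_fix": "require the variant field"},
]
print(top_suggestion(failures))
```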
## Isolation Protocol (critical)
The evaluator is a **separate agent call** from the generator. These invariants are enforced:
| Invariant | Enforcement |
|-----------|------------|
| Evaluator has no generator system prompt | Separate prompt file; no shared context |
| Evaluator has no chain-of-thought | Only `{{input}}` and `{{output}}` passed |
| Evaluator has no intermediate steps | Single call with final output only |
| Evaluator uses a cheaper model | `eval_model: haiku` in eval_config |
If you detect contamination mid-run, stop and flag it rather than continue with compromised results.
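The first three invariants can also be enforced when the evaluator prompt is rendered. The sketch below is one possible approach, not the skill's own mechanism: `render_evaluator_prompt` is a hypothetical helper that accepts only the case input and the final output, and raises if the template asks for anything else, so a contaminated run stops instead of continuing.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_evaluator_prompt(template: str, case_input: str, output: str) -> str:
    """Fill {{input}} and {{output}} in one pass; reject any other variable."""
    values = {"input": case_input, "output": output}

    def substitute(match):
        name = match.group(1)
        if name not in values:
            # Stop and flag rather than evaluate with leaked generator context.
            raise ValueError(f"evaluator template uses disallowed variable: {name}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


prompt = render_evaluator_prompt("Input: {{input}}\nOutput: {{ output }}", "abc", "ABC")
print(prompt)
```

Substituting in a single pass means placeholder-like text inside the case input or the output is not expanded a second time.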
## References
- @$AIWG_ROOT/agentic/code/addons/nlp-prod/README.md — nlp-prod addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete pass thresholds and max-attempts escape hatch requirements
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/subagent-scoping.md — Evaluator isolation as separate agent call
- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon providing complementary agent evaluation
Related Skills
gate-evaluation
Validate phase gate criteria with multi-agent review and generate pass/fail reports
eval-workflow
Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance
eval-report
Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations
eval-agent
Run evaluation tests against an agent to assess quality and archetype resistance
agent-loop
Detect requests for iterative autonomous agent loops and route to the appropriate loop executor
agent-loop-ext
Crash-resilient external agent loop with state persistence and CI/CD integration
aiwg-orchestrate
Route structured artifact work to AIWG workflows via MCP with zero parent context cost
venv-manager
Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.
pytest-runner
Execute Python tests with pytest, supporting fixtures, markers, coverage, and parallel execution. Use for Python test automation.
vitest-runner
Execute JavaScript/TypeScript tests with Vitest, supporting coverage, watch mode, and parallel execution. Use for JS/TS test automation.
eslint-checker
Run ESLint for JavaScript/TypeScript code quality and style enforcement. Use for static analysis and auto-fixing.
repo-analyzer
Analyze GitHub repositories for structure, documentation, dependencies, and contribution patterns. Use for codebase understanding and health assessment.