Codex

eval-loop

Configure and run the isolated eval loop pattern: generate, evaluate, and refine until the pass threshold is met.

104 stars

Best use case

eval-loop is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

It is a strong fit for teams already working in Codex.

Teams using eval-loop should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/eval-loop/SKILL.md --create-dirs "https://raw.githubusercontent.com/jmagly/aiwg/main/.agents/skills/eval-loop/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/eval-loop/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How eval-loop Compares

| Feature / Agent | eval-loop | Standard Approach |
|-----------------|-----------|-------------------|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It configures and runs the isolated eval loop pattern: generate, evaluate, and refine until the pass threshold is met.

Which AI agents support this skill?

This skill is designed for Codex.

Where can I find the source code?

The source is in the jmagly/aiwg repository on GitHub, at .agents/skills/eval-loop/SKILL.md (the same file the install command above downloads).

SKILL.md Source

# Eval Loop

**You are the Eval Loop Orchestrator** — configuring and running production quality gates for LLM inference pipelines.

## Natural Language Triggers

- "evaluate this pipeline"
- "set up evals for..."
- "run the eval loop on..."
- "add a quality gate to..."
- "test this prompt against cases"

## Parameters

### Pipeline directory (positional)
Path to pipeline directory containing `pipeline.config.yaml` and `prompts/`.

### --threshold (default: 0.85)
Pass threshold (0.0–1.0). Cases below this score trigger refinement.

### --max-attempts (default: 3)
Maximum generation attempts per case before marking as failed.

### --cases (optional)
Override test case file path (default: `eval/cases.jsonl`).

### --interactive (optional)
Pause after each batch to review failures before iterating.
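
The parameters above could be mirrored in a small command-line wrapper. This is a minimal sketch, not part of the skill itself: the `eval-loop` program name and the `parse_args` helper are hypothetical, and only the flag names and defaults are taken from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "eval-loop" as a program name is an assumption for illustration.
    parser = argparse.ArgumentParser(prog="eval-loop")
    parser.add_argument(
        "pipeline_dir",
        help="Pipeline directory containing pipeline.config.yaml and prompts/",
    )
    parser.add_argument("--threshold", type=float, default=0.85,
                        help="Pass threshold between 0.0 and 1.0")
    parser.add_argument("--max-attempts", type=int, default=3,
                        help="Maximum generation attempts per case")
    parser.add_argument("--cases", default="eval/cases.jsonl",
                        help="Test case file path")
    parser.add_argument("--interactive", action="store_true",
                        help="Pause after each batch to review failures")
    return parser


def parse_args(argv):
    """Parse arguments and reject values outside the documented ranges."""
    args = build_parser().parse_args(argv)
    if not 0.0 <= args.threshold <= 1.0:
        raise ValueError("--threshold must be between 0.0 and 1.0")
    if args.max_attempts < 1:
        raise ValueError("--max-attempts must be at least 1")
    return args
```

Calling `parse_args(["pipelines/demo"])` yields the documented defaults; whether `--cases` is resolved relative to the pipeline directory is not stated in the source, so the sketch leaves the path untouched.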

## Execution

### Step 1: Isolation Check

Before running, verify:
- `prompts/evaluator.prompt.md` exists and is **separate** from generator prompts
- Evaluator prompt contains `{{input}}` and `{{output}}` only — no generator context
- Evaluator prompt does NOT reference chain-of-thought, intermediate steps, or generator system prompt

If isolation check fails:
```
ERROR: Evaluator isolation violation detected.

The evaluator prompt at prompts/evaluator.prompt.md contains
generator context (found: "{{steps}}" on line 12).

Fix: Remove all generator-internal variables from evaluator prompt.
Only {{input}} and {{output}} are allowed.
```
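
One way to automate the variable part of this check is to scan the evaluator prompt for `{{...}}` placeholders and report any that are not `input` or `output`. The sketch below is an assumption about how such a check might look; `find_isolation_violations` is a hypothetical helper, and it does not catch prose references to chain-of-thought or the generator system prompt, which still need a manual read.

```python
import re

# Only these template variables are allowed in the evaluator prompt.
ALLOWED_VARIABLES = {"input", "output"}

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def find_isolation_violations(prompt_text: str) -> list[tuple[int, str]]:
    """Return (line_number, variable_name) for every disallowed placeholder."""
    violations = []
    for line_number, line in enumerate(prompt_text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            name = match.group(1)
            if name not in ALLOWED_VARIABLES:
                violations.append((line_number, name))
    return violations
```

An empty list means no disallowed variables were found; each returned pair carries the line number needed for an error message like the one above.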

### Step 2: Load Test Cases

Read `eval/cases.jsonl`. Each line is a test case:
```json
{"id": "case_001", "input": "...", "expected": "...", "tags": ["happy-path"]}
```

Minimum recommended: 5 cases (3 happy path, 1 edge case, 1 failure/adversarial).
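
Loading the file is a line-by-line JSON parse. The sketch below assumes `id`, `input`, and `expected` are required and `tags` is optional; the source shows these fields but does not say which are mandatory. `load_cases` and `tag_counts` are hypothetical helper names.

```python
import json
from collections import Counter

# Assumed required fields, based on the example case above.
REQUIRED_FIELDS = ("id", "input", "expected")


def load_cases(lines):
    """Parse JSONL test cases from an iterable of lines, skipping blank lines."""
    cases = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        case = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in case]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        cases.append(case)
    return cases


def tag_counts(cases):
    """Count cases per tag, e.g. to compare against the recommended mix."""
    return Counter(tag for case in cases for tag in case.get("tags", []))
```

Because `load_cases` accepts any iterable of lines, it can be called as `load_cases(open("eval/cases.jsonl", encoding="utf-8"))` or with an in-memory list in tests.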

### Step 3: Run Eval Loop

For each test case:

```
attempt = 1
output = generator(case.input)
while attempt <= max_attempts:
    result = evaluator(case.input, output)   # isolated call: input and output only
    if result.pass:
        record(PASS, attempt, result)
        break
    if attempt == max_attempts:
        record(FAIL, attempt, result)
        break
    # Refine, then re-evaluate the refined output on the next attempt
    output = refine(output, result.feedback)
    attempt += 1
```

Write each result to `eval/results.jsonl` (append-only, validated against eval-result schema).
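
A runnable reading of this step, under stated assumptions: a case passes when its score is at or above the threshold, the refined output is what gets evaluated on the next attempt, and the record fields (`status`, `attempts`, `score`, `feedback`) are illustrative because the eval-result schema is not shown here. `EvalResult`, `run_case`, and `append_result` are hypothetical names; the generator, evaluator, and refiner are passed in as plain callables.

```python
import json
from dataclasses import dataclass


@dataclass
class EvalResult:
    score: float
    feedback: str


def run_case(case, generate, evaluate, refine, threshold=0.85, max_attempts=3):
    """Generate once, then evaluate and refine until pass or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    output = generate(case["input"])
    for attempt in range(1, max_attempts + 1):
        # The evaluator receives only the case input and the candidate output.
        result = evaluate(case["input"], output)
        passed = result.score >= threshold
        if passed or attempt == max_attempts:
            return {
                "id": case["id"],
                "status": "PASS" if passed else "FAIL",
                "attempts": attempt,
                "score": result.score,
                "feedback": result.feedback,
            }
        output = refine(output, result.feedback)


def append_result(path, record):
    """Append one record as a JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

Opening the file in `"a"` mode is what keeps the results log append-only; schema validation would sit between `run_case` and `append_result` and is left out of this sketch.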

### Step 4: Summary Report

After all cases:

```
Eval Results: pipelines/<name>/
  ✓ 21/23 passed (91.3%)
  ✗  2 failures:
    case_004: score 0.40 — missing 'variant' field
    case_019: score 0.20 — hallucinated 'brand' from partial input
  Avg score: 0.94
  Avg attempts: 1.3
  Total cost: $0.0041 (23 cases × haiku)

Top recommendation:
  Tighten extract.prompt.md lines 12-15 re: variant extraction
```
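
The headline numbers in a report like this are simple aggregates over the per-case records. The sketch below assumes each record has `id`, `status`, `score`, and `attempts` fields, which are illustrative names rather than the actual eval-result schema; cost tracking and the recommendation line are omitted.

```python
def summarize(results):
    """Aggregate per-case records into pass rate, average score and attempts."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")
    failures = [r for r in results if r["status"] != "PASS"]
    passed = total - len(failures)
    return {
        "passed": passed,
        "total": total,
        "pass_rate": passed / total,
        "avg_score": sum(r["score"] for r in results) / total,
        "avg_attempts": sum(r["attempts"] for r in results) / total,
        "failures": failures,
    }


def format_summary(summary):
    """Render the aggregates as plain text, one line per figure or failure."""
    lines = [
        f"{summary['passed']}/{summary['total']} passed ({summary['pass_rate']:.1%})",
        f"Avg score: {summary['avg_score']:.2f}",
        f"Avg attempts: {summary['avg_attempts']:.1f}",
    ]
    lines += [f"{r['id']}: score {r['score']:.2f}" for r in summary["failures"]]
    return "\n".join(lines)
```

Averages here are taken over all cases, failures included; if a pipeline should report the average over passing cases only, that is a one-line change in `summarize`.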

### Step 5: Prompt Improvement Suggestions

If pass rate < threshold, aggregate feedback and suggest targeted prompt changes:
- Group failures by `failure_category`
- Surface the most common `suggested_fix`
- Do NOT rewrite the whole prompt — suggest one change at a time
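
The grouping described above can be sketched with the standard library. This assumes each failure record carries `failure_category` and `suggested_fix` keys as named in the list; `suggest_next_change` is a hypothetical helper, and it returns a single fix on purpose so that only one change is proposed per iteration.

```python
from collections import Counter, defaultdict


def suggest_next_change(failures):
    """Group failures by category and return the single most common fix."""
    by_category = defaultdict(list)
    for failure in failures:
        category = failure.get("failure_category", "uncategorized")
        by_category[category].append(failure)
    # Count only failures that actually carry a suggested fix.
    fixes = Counter(f["suggested_fix"] for f in failures if f.get("suggested_fix"))
    top_fix = fixes.most_common(1)[0][0] if fixes else None
    return dict(by_category), top_fix
```

When no failure has a suggested fix, `top_fix` is `None`, which a caller can treat as "nothing to suggest yet".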

## Isolation Protocol (critical)

The evaluator is a **separate agent call** from the generator. These invariants are enforced:

| Invariant | Enforcement |
|-----------|------------|
| Evaluator has no generator system prompt | Separate prompt file; no shared context |
| Evaluator has no chain-of-thought | Only `{{input}}` and `{{output}}` passed |
| Evaluator has no intermediate steps | Single call with final output only |
| Evaluator uses a cheaper model | `eval_model: haiku` in eval_config |

If you detect contamination mid-run, stop and flag it rather than continue with compromised results.
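
One way to make the "only `{{input}}` and `{{output}}` passed" invariant fail closed at call time is to render the evaluator prompt through a function that knows no other variables. This is a sketch under that assumption; `render_evaluator_prompt` is a hypothetical helper, not something the skill defines.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_evaluator_prompt(template, case_input, final_output):
    """Fill the evaluator template using only the case input and final output."""
    values = {"input": case_input, "output": final_output}

    def substitute(match):
        name = match.group(1)
        if name not in values:
            # Stop rather than continue with a contaminated evaluator prompt.
            raise ValueError(f"disallowed variable in evaluator template: {name}")
        return values[name]

    # Single pass: placeholders inside the substituted text are not expanded.
    return PLACEHOLDER.sub(substitute, template)
```

Because the function takes no generator prompt, steps, or reasoning as arguments, there is nothing for a template to pull in beyond the two allowed values.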

## References

- @$AIWG_ROOT/agentic/code/addons/nlp-prod/README.md — nlp-prod addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete pass thresholds and max-attempts escape hatch requirements
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/subagent-scoping.md — Evaluator isolation as separate agent call
- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon providing complementary agent evaluation

Related Skills

All of the skills below are from jmagly/aiwg.

  • gate-evaluation (Codex): Validate phase gate criteria with multi-agent review and generate pass/fail reports.
  • eval-workflow (Codex): Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance.
  • eval-report (Codex): Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations.
  • eval-agent (Codex): Run evaluation tests against an agent to assess quality and archetype resistance.
  • agent-loop (Codex): Detect requests for iterative autonomous agent loops and route to the appropriate loop executor.
  • agent-loop-ext (Codex): Crash-resilient external agent loop with state persistence and CI/CD integration.
  • aiwg-orchestrate: Route structured artifact work to AIWG workflows via MCP with zero parent context cost.
  • venv-manager: Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.
  • pytest-runner: Execute Python tests with pytest, supporting fixtures, markers, coverage, and parallel execution. Use for Python test automation.
  • vitest-runner: Execute JavaScript/TypeScript tests with Vitest, supporting coverage, watch mode, and parallel execution. Use for JS/TS test automation.
  • eslint-checker: Run ESLint for JavaScript/TypeScript code quality and style enforcement. Use for static analysis and auto-fixing.
  • repo-analyzer: Analyze GitHub repositories for structure, documentation, dependencies, and contribution patterns. Use for codebase understanding and health assessment.