Codex

eval-agent

Run evaluation tests against an agent to assess quality and archetype resistance

104 stars

Best use case

eval-agent is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

It is a strong fit for teams already working in Codex.

Teams using eval-agent should expect more consistent output, faster repeated runs, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/eval-agent/SKILL.md --create-dirs "https://raw.githubusercontent.com/jmagly/aiwg/main/.agents/skills/eval-agent/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/eval-agent/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How eval-agent Compares

| Feature / Agent | eval-agent | Standard Approach |
|-----------------|------------|-------------------|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Run evaluation tests against an agent to assess quality and archetype resistance

Which AI agents support this skill?

This skill is designed for Codex.

Where can I find the source code?

The source is in the jmagly/aiwg repository on GitHub, at .agents/skills/eval-agent/SKILL.md (the same file the installation command downloads).

SKILL.md Source

# Agent Evaluation

Run automated evaluation tests against an agent.

## Research Foundation

- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for failure archetype detection

## Usage

```bash
/eval-agent security-architect
/eval-agent architecture-designer --category archetype
/eval-agent test-engineer --scenario grounding-test --verbose
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| agent-name | Yes | Agent to evaluate |

## Options

| Option | Default | Description |
|--------|---------|-------------|
| --category | all | Test category: archetype, performance, quality |
| --scenario | all | Specific scenario to run |
| --verbose | false | Show detailed test output |
| --output | stdout | Output file for results |
| --strict | false | Fail on any test failure |
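
The two tables above can be read as a small command-line contract. The following is a minimal sketch of that contract using Python's `argparse`; it is not the skill's actual implementation (the skill is invoked as a slash command), and the `build_parser` name and the `choices` restriction on `--category` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the Arguments and Options tables."""
    parser = argparse.ArgumentParser(prog="eval-agent")
    # Required positional argument: the agent to evaluate.
    parser.add_argument("agent_name", help="Agent to evaluate")
    # Defaults below are taken from the Options table.
    parser.add_argument(
        "--category",
        default="all",
        choices=["all", "archetype", "performance", "quality"],
        help="Test category",
    )
    parser.add_argument("--scenario", default="all", help="Specific scenario to run")
    parser.add_argument("--verbose", action="store_true", help="Show detailed test output")
    parser.add_argument("--output", default="stdout", help="Output file for results")
    parser.add_argument("--strict", action="store_true", help="Fail on any test failure")
    return parser


# Equivalent of: /eval-agent test-engineer --scenario grounding-test --verbose
args = build_parser().parse_args(
    ["test-engineer", "--scenario", "grounding-test", "--verbose"]
)
print(args)
```

Options that are not passed keep the defaults listed in the table, so `args.category` is `"all"` and `args.strict` is `False` here.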

## Test Categories

### archetype

Tests for Roig (2025) failure archetypes:

- `grounding-test` - Archetype 1: Premature action
- `substitution-test` - Archetype 2: Over-helpfulness
- `distractor-test` - Archetype 3: Context pollution
- `recovery-test` - Archetype 4: Fragile execution

### performance

- `latency-test` - Response time benchmarks
- `token-test` - Token efficiency
- `parallel-test` - Concurrent execution correctness

### quality

- `output-format` - Output structure validation
- `tool-usage` - Appropriate tool selection
- `scope-adherence` - Stays within defined scope
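
As a rough illustration of how `--category` and `--scenario` could narrow this catalog, the sketch below stores the scenarios listed above in a dictionary and selects from it. The `SCENARIOS` and `select_scenarios` names are hypothetical, and the rule that `--scenario` takes precedence over `--category` is an assumption; the source does not say how the two options interact.

```python
# Scenario names copied from the Test Categories section.
SCENARIOS = {
    "archetype": [
        "grounding-test",
        "substitution-test",
        "distractor-test",
        "recovery-test",
    ],
    "performance": ["latency-test", "token-test", "parallel-test"],
    "quality": ["output-format", "tool-usage", "scope-adherence"],
}


def select_scenarios(category: str = "all", scenario: str = "all") -> list:
    """Return the scenario names to run (hypothetical selection logic)."""
    all_names = [name for names in SCENARIOS.values() for name in names]
    if scenario != "all":
        # Assumption: an explicit scenario overrides the category filter.
        if scenario not in all_names:
            raise ValueError(f"unknown scenario: {scenario}")
        return [scenario]
    if category == "all":
        return all_names
    if category not in SCENARIOS:
        raise ValueError(f"unknown category: {category}")
    return list(SCENARIOS[category])


print(select_scenarios(category="archetype"))
```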

## Process

1. **Load Agent**: Read agent definition
2. **Select Scenarios**: Based on --category or --scenario
3. **Setup Environment**: Create test workspace
4. **Execute Tests**: Run agent against each scenario
5. **Validate Results**: Check assertions
6. **Generate Report**: Output results
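
The six steps above can be sketched as a small driver loop. This is an illustrative skeleton only, not the skill's implementation: `run_evaluation`, the `ScenarioRunner` signature, and `fake_runner` are hypothetical, steps 1 and 2 are represented by the `agent_definition` and `scenarios` arguments, and a temporary directory stands in for the test workspace.

```python
import tempfile
import time
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (agent_definition, scenario, workspace) -> (passed, score, details)
ScenarioRunner = Callable[[str, str, str], Tuple[bool, float, str]]


def run_evaluation(
    agent_name: str,
    agent_definition: str,
    scenarios: List[str],
    run_scenario: ScenarioRunner,
) -> Dict:
    """Run each scenario in a throwaway workspace and build a report dict."""
    tests = {}
    # Step 3: create an isolated workspace that is removed after the run.
    with tempfile.TemporaryDirectory(prefix="eval-agent-") as workspace:
        for scenario in scenarios:
            start = time.perf_counter()
            # Steps 4 and 5: execute the scenario and collect its verdict.
            passed, score, details = run_scenario(agent_definition, scenario, workspace)
            tests[scenario] = {
                "passed": passed,
                "score": score,
                "details": details,
                "duration_ms": int((time.perf_counter() - start) * 1000),
            }
    # Step 6: assemble the report.
    passed_count = sum(1 for t in tests.values() if t["passed"])
    return {
        "agent": agent_name,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tests": tests,
        "summary": {
            "passed": passed_count,
            "failed": len(tests) - passed_count,
            "total": len(tests),
        },
    }


def fake_runner(agent_definition: str, scenario: str, workspace: str):
    """Stub runner for demonstration; a real one would invoke the agent."""
    passed = scenario != "distractor-test"
    return passed, (1.0 if passed else 0.6), "stubbed result"


report = run_evaluation(
    "security-architect",
    "# agent definition text",
    ["grounding-test", "distractor-test"],
    fake_runner,
)
print(report["summary"])
```

Passing the runner in as a callable keeps the loop testable without a live agent; the stub above decides pass or fail from the scenario name alone.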

## Output Format

```json
{
  "agent": "security-architect",
  "timestamp": "2025-01-15T10:30:00Z",
  "tests": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "details": "Read tool called before Edit",
      "duration_ms": 5000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.6,
      "details": "Used staging data in output",
      "evidence": ["Found 'staging' in response"],
      "duration_ms": 3000
    }
  },
  "summary": {
    "passed": 3,
    "failed": 1,
    "total": 4,
    "score": 0.85
  }
}
```
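
A report in this shape is straightforward to consume from a script. The sketch below parses a trimmed copy of the example above and lists the failing tests with their evidence; the `failed_tests` helper is a hypothetical name, and treating `evidence` as optional is an assumption based on it appearing only on the failing test in the example.

```python
import json

# Trimmed copy of the example report shown above.
REPORT_JSON = """
{
  "agent": "security-architect",
  "tests": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "details": "Read tool called before Edit",
      "duration_ms": 5000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.6,
      "details": "Used staging data in output",
      "evidence": ["Found 'staging' in response"],
      "duration_ms": 3000
    }
  }
}
"""


def failed_tests(report: dict) -> list:
    """Return (name, details, evidence) for every test that did not pass."""
    return [
        (name, result["details"], result.get("evidence", []))
        for name, result in report["tests"].items()
        if not result["passed"]
    ]


report = json.loads(REPORT_JSON)
for name, details, evidence in failed_tests(report):
    print(f"{name}: {details} {evidence}")
```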

## Examples

```bash
# Full evaluation
/eval-agent architecture-designer

# Archetype tests only
/eval-agent architecture-designer --category archetype

# Single scenario with verbose output
/eval-agent test-engineer --scenario grounding-test --verbose

# Save results
/eval-agent security-architect --output .aiwg/reports/security-eval.json

# Strict mode (fails on any test failure)
/eval-agent devops-engineer --strict
```
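
To make the `--output` and `--strict` examples concrete, here is a hedged sketch of how a runner might honor them. Both helpers are hypothetical: `write_report` assumes that the default value `stdout` means "print instead of writing a file" and that missing parent directories are created, and `exit_code` assumes strict mode maps any failure to a non-zero status while non-strict mode returns zero. None of this is confirmed by the source.

```python
import json
import sys
import tempfile
from pathlib import Path


def write_report(report: dict, output: str = "stdout") -> None:
    """Print the report, or save it to a file when a path is given."""
    text = json.dumps(report, indent=2)
    if output == "stdout":
        sys.stdout.write(text + "\n")
        return
    path = Path(output)
    # Assumption: directories such as .aiwg/reports/ are created on demand.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text + "\n", encoding="utf-8")


def exit_code(report: dict, strict: bool = False) -> int:
    """Assumed behavior: only strict mode turns failures into a non-zero status."""
    return 1 if strict and report["summary"]["failed"] > 0 else 0


report = {
    "agent": "security-architect",
    "tests": {},
    "summary": {"passed": 3, "failed": 1, "total": 4},
}

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / ".aiwg" / "reports" / "security-eval.json"
    write_report(report, str(target))
    saved = json.loads(target.read_text(encoding="utf-8"))

print(exit_code(report, strict=True))
```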

## Success Criteria

| Metric | Target |
|--------|--------|
| Grounding (A1) | >90% |
| Substitution (A2) | >85% |
| Distractor (A3) | >80% |
| Recovery (A4) | ≥80% |
| Overall | ≥85% |
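
The targets above can be checked mechanically against a report. The sketch below assumes that each metric row corresponds to the score of the matching archetype scenario, that scores are fractions where 1.0 means 100%, and that `summary.score` is the overall figure; the source does not state any of these mappings, and `TARGETS` and `check_targets` are hypothetical names. The comparison operators follow the table as written (strict `>` for A1 to A3, `≥` for A4 and Overall).

```python
import operator

# (comparison, threshold) per scenario, transcribed from the Success Criteria table.
TARGETS = {
    "grounding-test": (operator.gt, 0.90),
    "substitution-test": (operator.gt, 0.85),
    "distractor-test": (operator.gt, 0.80),
    "recovery-test": (operator.ge, 0.80),
}
OVERALL_TARGET = (operator.ge, 0.85)


def check_targets(report: dict) -> dict:
    """Return {metric: met?} for every target that has a score in the report."""
    results = {}
    for name, (compare, threshold) in TARGETS.items():
        test = report["tests"].get(name)
        if test is not None:  # scenarios that were not run are skipped
            results[name] = compare(test["score"], threshold)
    compare, threshold = OVERALL_TARGET
    results["overall"] = compare(report["summary"]["score"], threshold)
    return results


# Scores taken from the Output Format example.
report = {
    "tests": {
        "grounding-test": {"passed": True, "score": 1.0},
        "distractor-test": {"passed": False, "score": 0.6},
    },
    "summary": {"passed": 3, "failed": 1, "total": 4, "score": 0.85},
}
print(check_targets(report))
```

With the example scores, the grounding target is met, the distractor target is missed, and the overall score of 0.85 sits exactly on its threshold, which passes only because that row uses `≥`.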

## Related Commands

- `/eval-workflow` - Test multi-agent workflows
- `/eval-report` - Generate quality report
- `aiwg lint agents` - Static validation

Evaluate agent: $ARGUMENTS

## References

- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/god-session.md — Single-responsibility rules that agents are evaluated against
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success criteria and threshold definitions
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC agent catalog available for evaluation
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint agents

Related Skills

All of the following are from jmagly/aiwg.

  • gate-evaluation (Codex): Validate phase gate criteria with multi-agent review and generate pass/fail reports
  • eval-workflow (Codex): Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance
  • eval-report (Codex): Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations
  • eval-loop (Codex): Configure and run the isolated eval loop pattern: generate, evaluate, refine until the pass threshold is met
  • aiwg-orchestrate: Route structured artifact work to AIWG workflows via MCP with zero parent context cost
  • venv-manager: Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.
  • pytest-runner: Execute Python tests with pytest, supporting fixtures, markers, coverage, and parallel execution. Use for Python test automation.
  • vitest-runner: Execute JavaScript/TypeScript tests with Vitest, supporting coverage, watch mode, and parallel execution. Use for JS/TS test automation.
  • eslint-checker: Run ESLint for JavaScript/TypeScript code quality and style enforcement. Use for static analysis and auto-fixing.
  • repo-analyzer: Analyze GitHub repositories for structure, documentation, dependencies, and contribution patterns. Use for codebase understanding and health assessment.
  • pr-reviewer: Review GitHub pull requests for code quality, security, and best practices. Use for automated PR feedback and approval workflows.
  • YouTube Acquisition: yt-dlp patterns for acquiring content from YouTube and video platforms