eval-agent
Run evaluation tests against an agent to assess quality and archetype resistance
Best use case
eval-agent is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
It is a strong fit for teams already working in Codex.
Teams using eval-agent should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/eval-agent/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How eval-agent Compares
| Feature / Agent | eval-agent | Standard Approach |
|---|---|---|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Run evaluation tests against an agent to assess quality and archetype resistance
Which AI agents support this skill?
This skill is designed for Codex.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Agent Evaluation
Run automated evaluation tests against an agent.
## Research Foundation
- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for failure archetype detection
## Usage
```bash
/eval-agent security-architect
/eval-agent architecture-designer --category archetype
/eval-agent test-engineer --scenario grounding-test --verbose
```
## Arguments
| Argument | Required | Description |
|----------|----------|-------------|
| agent-name | Yes | Agent to evaluate |
## Options
| Option | Default | Description |
|--------|---------|-------------|
| --category | all | Test category: archetype, performance, quality |
| --scenario | all | Specific scenario to run |
| --verbose | false | Show detailed test output |
| --output | stdout | Output file for results |
| --strict | false | Fail on any test failure |
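
The Arguments and Options tables can be read as a command-line interface. The following is a minimal sketch in Python, assuming the slash command behaves like a conventional CLI; `build_parser` is a hypothetical helper and is not part of the skill, which is interpreted by the agent rather than by `argparse`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the Arguments and Options tables."""
    parser = argparse.ArgumentParser(prog="eval-agent")
    # Required positional argument: the agent to evaluate.
    parser.add_argument("agent_name", help="Agent to evaluate")
    # Defaults below are copied from the Options table.
    parser.add_argument(
        "--category",
        default="all",
        choices=["all", "archetype", "performance", "quality"],
        help="Test category",
    )
    parser.add_argument("--scenario", default="all", help="Specific scenario to run")
    parser.add_argument("--verbose", action="store_true", help="Show detailed test output")
    parser.add_argument("--output", default="stdout", help="Output file for results")
    parser.add_argument("--strict", action="store_true", help="Fail on any test failure")
    return parser


args = build_parser().parse_args(
    ["test-engineer", "--scenario", "grounding-test", "--verbose"]
)
print(args.category, args.scenario, args.verbose, args.strict)
# prints: all grounding-test True False
```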
## Test Categories
### archetype
Tests for Roig (2025) failure archetypes:
- `grounding-test` - Archetype 1: Premature action
- `substitution-test` - Archetype 2: Over-helpfulness
- `distractor-test` - Archetype 3: Context pollution
- `recovery-test` - Archetype 4: Fragile execution
### performance
- `latency-test` - Response time benchmarks
- `token-test` - Token efficiency
- `parallel-test` - Concurrent execution correctness
### quality
- `output-format` - Output structure validation
- `tool-usage` - Appropriate tool selection
- `scope-adherence` - Stays within defined scope
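
One way to picture how `--category` and `--scenario` narrow the test set is the sketch below. The scenario names come from the lists above; the `select_scenarios` function and the way the two options combine (category filters first, then scenario picks one entry) are assumptions, since the skill does not spell out that interaction.

```python
# Scenario names copied from the Test Categories section.
SCENARIOS = {
    "archetype": [
        "grounding-test",
        "substitution-test",
        "distractor-test",
        "recovery-test",
    ],
    "performance": ["latency-test", "token-test", "parallel-test"],
    "quality": ["output-format", "tool-usage", "scope-adherence"],
}


def select_scenarios(category: str = "all", scenario: str = "all") -> list[str]:
    """Hypothetical selection logic: filter by category, then by scenario."""
    if category == "all":
        pool = [name for names in SCENARIOS.values() for name in names]
    else:
        pool = list(SCENARIOS[category])
    if scenario == "all":
        return pool
    if scenario not in pool:
        raise ValueError(f"unknown scenario {scenario!r} for category {category!r}")
    return [scenario]


print(select_scenarios(category="performance"))
# prints: ['latency-test', 'token-test', 'parallel-test']
```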
## Process
1. **Load Agent**: Read agent definition
2. **Select Scenarios**: Based on --category or --scenario
3. **Setup Environment**: Create test workspace
4. **Execute Tests**: Run agent against each scenario
5. **Validate Results**: Check assertions
6. **Generate Report**: Output results
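
Steps 4 to 6 can be sketched as a small loop that runs each scenario, records the outcome, and assembles a report shaped like the one in Output Format. This is an illustration only: `run_evaluation` and the check callables are hypothetical, agent loading and workspace setup (steps 1 and 3) are left out, and the summary omits `score` because the skill does not say how the overall score is aggregated.

```python
import time
from datetime import datetime, timezone


def run_evaluation(agent: str, scenarios: dict) -> dict:
    """Run each check and build a report dict.

    `scenarios` maps a scenario name to a callable taking the agent name
    and returning (passed, score, details). Hypothetical interface.
    """
    tests = {}
    for name, check in scenarios.items():
        start = time.perf_counter()
        passed, score, details = check(agent)  # step 4: execute
        tests[name] = {  # step 5: record the validated result
            "passed": passed,
            "score": score,
            "details": details,
            "duration_ms": int((time.perf_counter() - start) * 1000),
        }
    passed_count = sum(1 for t in tests.values() if t["passed"])
    return {  # step 6: generate report
        "agent": agent,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tests": tests,
        "summary": {
            "passed": passed_count,
            "failed": len(tests) - passed_count,
            "total": len(tests),
        },
    }


# Stand-in checks for illustration; real scenarios would drive the agent.
demo = {
    "grounding-test": lambda agent: (True, 1.0, "Read tool called before Edit"),
    "distractor-test": lambda agent: (False, 0.6, "Used staging data in output"),
}
report = run_evaluation("security-architect", demo)
print(report["summary"])
# prints: {'passed': 1, 'failed': 1, 'total': 2}
```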
## Output Format
```json
{
  "agent": "security-architect",
  "timestamp": "2025-01-15T10:30:00Z",
  "tests": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "details": "Read tool called before Edit",
      "duration_ms": 5000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.6,
      "details": "Used staging data in output",
      "evidence": ["Found 'staging' in response"],
      "duration_ms": 3000
    }
  },
  "summary": {
    "passed": 3,
    "failed": 1,
    "total": 4,
    "score": 0.85
  }
}
```
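
A report in this shape is straightforward to post-process. The sketch below, using only Python's standard `json` module, lists failed tests together with any evidence. `REPORT_TEXT` is a trimmed copy of the sample above and `failed_tests` is a hypothetical helper; it assumes `evidence` is optional, as in the sample, where only the failing test carries it.

```python
import json

# Trimmed copy of the sample report; only the fields used here are kept.
REPORT_TEXT = """
{
  "agent": "security-architect",
  "tests": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "details": "Read tool called before Edit"
    },
    "distractor-test": {
      "passed": false,
      "score": 0.6,
      "details": "Used staging data in output",
      "evidence": ["Found 'staging' in response"]
    }
  }
}
"""


def failed_tests(report: dict) -> dict:
    """Return only the test entries whose "passed" flag is false."""
    return {name: t for name, t in report["tests"].items() if not t["passed"]}


report = json.loads(REPORT_TEXT)
for name, result in failed_tests(report).items():
    print(f"{name}: {result['details']} (score {result['score']})")
    # "evidence" is treated as optional.
    for item in result.get("evidence", []):
        print(f"  evidence: {item}")
# prints:
# distractor-test: Used staging data in output (score 0.6)
#   evidence: Found 'staging' in response
```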
## Examples
```bash
# Full evaluation
/eval-agent architecture-designer
# Archetype tests only
/eval-agent architecture-designer --category archetype
# Single scenario with verbose output
/eval-agent test-engineer --scenario grounding-test --verbose
# Save results
/eval-agent security-architect --output .aiwg/reports/security-eval.json
# Strict mode (fails on any test failure)
/eval-agent devops-engineer --strict
```
## Success Criteria
| Metric | Target |
|--------|--------|
| Grounding (A1) | >90% |
| Substitution (A2) | >85% |
| Distractor (A3) | >80% |
| Recovery (A4) | ≥80% |
| Overall | ≥85% |
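
The table mixes strict (`>`) and inclusive (`≥`) comparisons, which matters at the boundary. The sketch below encodes the targets as written. It assumes that report scores in the 0 to 1 range correspond to these percentages and that each archetype row maps to the scenario of the same name; neither is stated explicitly, and `meets_target` is a hypothetical helper.

```python
import operator

# (comparison, threshold) pairs copied from the Success Criteria table,
# with percentages expressed as fractions (assumption: score 0.85 == 85%).
TARGETS = {
    "grounding-test": (operator.gt, 0.90),     # Grounding (A1)    >90%
    "substitution-test": (operator.gt, 0.85),  # Substitution (A2) >85%
    "distractor-test": (operator.gt, 0.80),    # Distractor (A3)   >80%
    "recovery-test": (operator.ge, 0.80),      # Recovery (A4)     >=80%
    "overall": (operator.ge, 0.85),            # Overall           >=85%
}


def meets_target(metric: str, score: float) -> bool:
    """Check a score against the target for the given metric."""
    compare, threshold = TARGETS[metric]
    return compare(score, threshold)


# A score of exactly 0.80 misses the strict target but meets the inclusive one.
print(meets_target("distractor-test", 0.80))  # prints: False
print(meets_target("recovery-test", 0.80))    # prints: True
```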
## Related Commands
- `/eval-workflow` - Test multi-agent workflows
- `/eval-report` - Generate quality report
- `aiwg lint agents` - Static validation
Evaluate agent: $ARGUMENTS
## References
- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md - aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/god-session.md - Single-responsibility rules that agents are evaluated against
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md - Concrete success criteria and threshold definitions
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md - SDLC agent catalog available for evaluation
- @$AIWG_ROOT/docs/cli-reference.md - CLI reference for `aiwg lint agents`
Related Skills
gate-evaluation
Validate phase gate criteria with multi-agent review and generate pass/fail reports
eval-workflow
Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance
eval-report
Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations
eval-loop
Configure and run the isolated eval loop pattern — generate, evaluate, refine until pass threshold met
aiwg-orchestrate
Route structured artifact work to AIWG workflows via MCP with zero parent context cost
venv-manager
Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.
pytest-runner
Execute Python tests with pytest, supporting fixtures, markers, coverage, and parallel execution. Use for Python test automation.
vitest-runner
Execute JavaScript/TypeScript tests with Vitest, supporting coverage, watch mode, and parallel execution. Use for JS/TS test automation.
eslint-checker
Run ESLint for JavaScript/TypeScript code quality and style enforcement. Use for static analysis and auto-fixing.
repo-analyzer
Analyze GitHub repositories for structure, documentation, dependencies, and contribution patterns. Use for codebase understanding and health assessment.
pr-reviewer
Review GitHub pull requests for code quality, security, and best practices. Use for automated PR feedback and approval workflows.
YouTube Acquisition
yt-dlp patterns for acquiring content from YouTube and video platforms