eval-agent

Run evaluation tests against an agent to assess quality and archetype resistance

104 stars

Best use case

eval-agent is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

It is a strong fit for teams already working in Codex.

Teams using eval-agent should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/eval-agent/SKILL.md --create-dirs "https://raw.githubusercontent.com/jmagly/aiwg/main/.agents/skills/eval-agent/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/eval-agent/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How eval-agent Compares

| Feature | eval-agent | Standard Approach |
|---------|------------|-------------------|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It runs evaluation tests against an agent to assess its quality and its resistance to failure archetypes.

Which AI agents support this skill?

This skill is designed for Codex.

Where can I find the source code?

The source code is in the jmagly/aiwg repository on GitHub, at .agents/skills/eval-agent/SKILL.md.

SKILL.md Source

# Agent Evaluation

Run automated evaluation tests against an agent.

## Research Foundation

- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for failure archetype detection

## Usage

```bash
/eval-agent security-architect
/eval-agent architecture-designer --category archetype
/eval-agent test-engineer --scenario grounding-test --verbose
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| agent-name | Yes | Agent to evaluate |

## Options

| Option | Default | Description |
|--------|---------|-------------|
| --category | all | Test category: archetype, performance, quality |
| --scenario | all | Specific scenario to run |
| --verbose | false | Show detailed test output |
| --output | stdout | Output file for results |
| --strict | false | Fail on any test failure |
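
The tables above can be read as a small command-line interface. The following is a minimal sketch of that reading using Python's `argparse`. It is not the skill's actual implementation: the skill is invoked as a slash command, and the `build_parser` helper and the `agent_name` attribute are hypothetical names introduced only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the Arguments and Options tables."""
    parser = argparse.ArgumentParser(prog="eval-agent")
    # Required positional argument: the agent to evaluate.
    parser.add_argument("agent_name", help="Agent to evaluate")
    # Options and defaults copied from the Options table.
    parser.add_argument(
        "--category",
        default="all",
        choices=["all", "archetype", "performance", "quality"],
        help="Test category",
    )
    parser.add_argument("--scenario", default="all", help="Specific scenario to run")
    parser.add_argument("--verbose", action="store_true", help="Show detailed test output")
    parser.add_argument("--output", default="stdout", help="Output file for results")
    parser.add_argument("--strict", action="store_true", help="Fail on any test failure")
    return parser


# Equivalent of: /eval-agent test-engineer --scenario grounding-test --verbose
args = build_parser().parse_args(
    ["test-engineer", "--scenario", "grounding-test", "--verbose"]
)
print(args.agent_name, args.category, args.scenario, args.verbose, args.strict)
```

Options that are not passed keep the defaults listed in the table, so `--category` stays `all` and `--strict` stays `false` in this call.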

## Test Categories

### archetype

Tests for Roig (2025) failure archetypes:

- `grounding-test` - Archetype 1: Premature action
- `substitution-test` - Archetype 2: Over-helpfulness
- `distractor-test` - Archetype 3: Context pollution
- `recovery-test` - Archetype 4: Fragile execution

### performance

- `latency-test` - Response time benchmarks
- `token-test` - Token efficiency
- `parallel-test` - Concurrent execution correctness

### quality

- `output-format` - Output structure validation
- `tool-usage` - Appropriate tool selection
- `scope-adherence` - Stays within defined scope
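
The categories and scenarios listed above can be held in a simple lookup table. The sketch below shows one way `--category` and `--scenario` might narrow that table. How the two options combine is an assumption here (the scenario filter is applied inside the chosen category), and `SCENARIOS` and `select_scenarios` are hypothetical names.

```python
from typing import Dict, List

# Scenario names copied from the Test Categories section.
SCENARIOS: Dict[str, List[str]] = {
    "archetype": [
        "grounding-test",
        "substitution-test",
        "distractor-test",
        "recovery-test",
    ],
    "performance": ["latency-test", "token-test", "parallel-test"],
    "quality": ["output-format", "tool-usage", "scope-adherence"],
}


def select_scenarios(category: str = "all", scenario: str = "all") -> List[str]:
    """Return the scenarios to run for the given option values (assumed semantics)."""
    if category == "all":
        pool = [name for names in SCENARIOS.values() for name in names]
    elif category in SCENARIOS:
        pool = list(SCENARIOS[category])
    else:
        raise ValueError(f"unknown category: {category}")
    if scenario == "all":
        return pool
    if scenario not in pool:
        raise ValueError(f"scenario {scenario!r} not found in category {category!r}")
    return [scenario]


print(select_scenarios(category="archetype"))
print(select_scenarios(scenario="grounding-test"))
```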

## Process

1. **Load Agent**: Read agent definition
2. **Select Scenarios**: Based on --category or --scenario
3. **Setup Environment**: Create test workspace
4. **Execute Tests**: Run agent against each scenario
5. **Validate Results**: Check assertions
6. **Generate Report**: Output results
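
Steps 4 to 6 can be sketched as a loop that runs each selected scenario and assembles a report in the shape shown under Output Format. This is an illustration under stated assumptions, not the skill's implementation: `run_evaluation` and `fake_runner` are hypothetical, the runner is a stub standing in for real agent execution, and the summary score is assumed to be the mean of the per-test scores.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List


def run_evaluation(
    agent: str,
    scenarios: List[str],
    run_scenario: Callable[[str, str], Dict],
) -> Dict:
    """Run each scenario and build a report shaped like the Output Format example."""
    # Step 4: execute every selected scenario against the agent.
    tests = {name: run_scenario(agent, name) for name in scenarios}
    # Step 5: count assertion outcomes.
    total = len(tests)
    passed = sum(1 for result in tests.values() if result["passed"])
    # Assumption: the overall score is the mean of the per-test scores.
    mean_score = sum(r["score"] for r in tests.values()) / total if total else 0.0
    # Step 6: assemble the report.
    return {
        "agent": agent,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tests": tests,
        "summary": {
            "passed": passed,
            "failed": total - passed,
            "total": total,
            "score": round(mean_score, 2),
        },
    }


def fake_runner(agent: str, scenario: str) -> Dict:
    """Stub runner: pretends every scenario passes except distractor-test."""
    ok = scenario != "distractor-test"
    return {
        "passed": ok,
        "score": 1.0 if ok else 0.6,
        "details": "stub result",
        "duration_ms": 0,
    }


report = run_evaluation(
    "security-architect", ["grounding-test", "distractor-test"], fake_runner
)
print(report["summary"])
```

Passing the runner in as a callable keeps the reporting logic testable without invoking a real agent.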

## Output Format

```json
{
  "agent": "security-architect",
  "timestamp": "2025-01-15T10:30:00Z",
  "tests": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "details": "Read tool called before Edit",
      "duration_ms": 5000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.6,
      "details": "Used staging data in output",
      "evidence": ["Found 'staging' in response"],
      "duration_ms": 3000
    }
  },
  "summary": {
    "passed": 3,
    "failed": 1,
    "total": 4,
    "score": 0.85
  }
}
```
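
A report in this shape is straightforward to post-process. The sketch below loads a trimmed copy of the example above and lists the failing tests with their evidence. The `failed_tests` helper is a hypothetical name, and the `evidence` field is treated as optional because only the failing test in the example carries it.

```python
import json
from typing import Dict

# Trimmed copy of the example report shown above.
REPORT_JSON = """
{
  "agent": "security-architect",
  "timestamp": "2025-01-15T10:30:00Z",
  "tests": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "details": "Read tool called before Edit",
      "duration_ms": 5000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.6,
      "details": "Used staging data in output",
      "evidence": ["Found 'staging' in response"],
      "duration_ms": 3000
    }
  }
}
"""


def failed_tests(report: Dict) -> Dict[str, Dict]:
    """Return only the test entries whose 'passed' flag is false."""
    return {
        name: result
        for name, result in report["tests"].items()
        if not result["passed"]
    }


report = json.loads(REPORT_JSON)
for name, result in failed_tests(report).items():
    # 'evidence' may be absent, so fall back to an empty list.
    evidence = "; ".join(result.get("evidence", []))
    print(f"{name}: score={result['score']} details={result['details']} evidence={evidence}")
```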

## Examples

```bash
# Full evaluation
/eval-agent architecture-designer

# Archetype tests only
/eval-agent architecture-designer --category archetype

# Single scenario with verbose output
/eval-agent test-engineer --scenario grounding-test --verbose

# Save results
/eval-agent security-architect --output .aiwg/reports/security-eval.json

# Strict mode (fails on any test failure)
/eval-agent devops-engineer --strict
```

## Success Criteria

| Metric | Target |
|--------|--------|
| Grounding (A1) | >90% |
| Substitution (A2) | >85% |
| Distractor (A3) | >80% |
| Recovery (A4) | ≥80% |
| Overall | ≥85% |
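
The targets above mix strict (`>`) and inclusive (`≥`) comparisons, which matters at the boundary. The sketch below encodes each target with its own comparison operator. It assumes that report scores on a 0 to 1 scale correspond directly to these percentages, which the source does not state; `TARGETS` and `meets_target` are hypothetical names.

```python
import operator
from typing import Callable, Dict, Tuple

# (comparison, threshold) pairs copied from the Success Criteria table,
# with percentages expressed on the assumed 0-1 score scale.
TARGETS: Dict[str, Tuple[Callable[[float, float], bool], float]] = {
    "grounding-test": (operator.gt, 0.90),     # Grounding (A1): >90%
    "substitution-test": (operator.gt, 0.85),  # Substitution (A2): >85%
    "distractor-test": (operator.gt, 0.80),    # Distractor (A3): >80%
    "recovery-test": (operator.ge, 0.80),      # Recovery (A4): >=80%
    "overall": (operator.ge, 0.85),            # Overall: >=85%
}


def meets_target(metric: str, score: float) -> bool:
    """Check a score against the target for the given metric."""
    compare, threshold = TARGETS[metric]
    return compare(score, threshold)


# A score of exactly 0.80 satisfies the inclusive target but not the strict one.
print(meets_target("recovery-test", 0.80))
print(meets_target("distractor-test", 0.80))
```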

## Related Commands

- `/eval-workflow` - Test multi-agent workflows
- `/eval-report` - Generate quality report
- `aiwg lint agents` - Static validation

Evaluate agent: $ARGUMENTS

## References

- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/god-session.md — Single-responsibility rules that agents are evaluated against
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success criteria and threshold definitions
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC agent catalog available for evaluation
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint agents
