Codex

eval-report

Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations

104 stars

Best use case

eval-report is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

It is a strong fit for teams already working in Codex.

Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations

Teams using eval-report should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/eval-report/SKILL.md --create-dirs "https://raw.githubusercontent.com/jmagly/aiwg/main/.agents/skills/eval-report/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/eval-report/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How eval-report Compares

Feature / Agenteval-reportStandard Approach
Platform SupportCodexLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations

Which AI agents support this skill?

This skill is designed for Codex.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Evaluation Report

Generate a quality report from accumulated evaluation results.

## Research Foundation

- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for real agentic task evaluation

## Usage

```bash
/eval-report
/eval-report --output .aiwg/reports/quality-report.md
/eval-report --compare previous-report.json
/eval-report --mode sdlc --format json
```

## Options

| Option | Default | Description |
|--------|---------|-------------|
| --output | stdout | Output file path |
| --compare | none | Previous report to diff against |
| --mode | all | Agent category: sdlc, marketing, forensics, all |
| --format | markdown | Output format: markdown, json |
| --since | none | Only include results after this date (ISO 8601) |
| --threshold | 0.85 | Score below this triggers a warning |

## Process

1. **Collect Results**: Read all `eval-*.json` files from `.aiwg/reports/`
2. **Aggregate Scores**: Compute per-agent and per-archetype scores
3. **Detect Regressions**: Compare against --compare baseline if provided
4. **Rank Agents**: Sort by overall score, flag below-threshold agents
5. **Build Recommendations**: Surface specific agents and archetypes needing attention
6. **Output Report**: Write markdown or JSON to --output or stdout

## Report Sections

### Summary Dashboard

Overall health at a glance — total agents tested, aggregate score, regression count.

### By Archetype

Pass rates per Roig (2025) failure archetype across all agents.

### Agents Needing Attention

Agents below the --threshold, with consecutive-failure streaks flagged.

### Regression Analysis

When --compare is provided: agents whose scores dropped since the baseline.

### Recommendations

Prioritized action list: which agents to review, which archetypes to harden.

## Output Format (Markdown)

```markdown
# Agent Quality Report

**Generated**: 2026-04-01T10:30:00Z
**Agents Tested**: 58
**Overall Score**: 87%
**Regressions**: 2

## By Archetype

| Archetype | Pass Rate | Trend |
|-----------|-----------|-------|
| #1 Grounding | 92% | ↑ |
| #2 Substitution | 88% | → |
| #3 Distractor | 78% | ↓ |
| #4 Recovery | 90% | ↑ |

## Agents Needing Attention

| Agent | Score | Consecutive Failures | Issue |
|-------|-------|---------------------|-------|
| data-analyst | 72% | 3 | distractor-test |
| api-designer | 79% | 1 | latency regression (+40%) |

## Recommendations

1. Review `data-analyst` context filtering — failed distractor-test 3 consecutive runs
2. Investigate `api-designer` tool selection — latency regression
3. Increase distractor-test scenarios for marketing agents (78% pass rate below 80% target)
```

## Output Format (JSON)

```json
{
  "generated": "2026-04-01T10:30:00Z",
  "summary": {
    "agents_tested": 58,
    "overall_score": 0.87,
    "regressions": 2
  },
  "by_archetype": {
    "grounding": 0.92,
    "substitution": 0.88,
    "distractor": 0.78,
    "recovery": 0.90
  },
  "agents_needing_attention": [
    {"agent": "data-analyst", "score": 0.72, "consecutive_failures": 3, "issue": "distractor-test"}
  ],
  "recommendations": [
    "Review data-analyst context filtering"
  ]
}
```

## Examples

```bash
# Standard report to stdout
/eval-report

# Save to file
/eval-report --output .aiwg/reports/quality-$(date +%Y%m%d).md

# Compare against baseline
/eval-report --compare .aiwg/reports/quality-20260301.json

# JSON for CI consumption
/eval-report --format json --threshold 0.80

# SDLC agents only
/eval-report --mode sdlc
```

## Related Commands

- `/eval-agent` - Test individual agents
- `/eval-workflow` - Test multi-agent workflows
- `aiwg lint agents` - Static validation

Generate evaluation report: $ARGUMENTS

## References

- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete threshold and scoring requirements
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC framework context for agent evaluation scope
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for evaluation-related commands