eval-report
Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations
Best use case
eval-report is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
It is a strong fit for teams already working in Codex.
Teams using eval-report should expect more consistent output, faster repeated runs, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/eval-report/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How eval-report Compares
| Feature / Agent | eval-report | Standard Approach |
|---|---|---|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations
Which AI agents support this skill?
This skill is designed for Codex.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Evaluation Report
Generate a quality report from accumulated evaluation results.
## Research Foundation
- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for real agentic task evaluation
## Usage
```bash
/eval-report
/eval-report --output .aiwg/reports/quality-report.md
/eval-report --compare previous-report.json
/eval-report --mode sdlc --format json
```
## Options
| Option | Default | Description |
|--------|---------|-------------|
| --output | stdout | Output file path |
| --compare | none | Previous report to diff against |
| --mode | all | Agent category: sdlc, marketing, forensics, all |
| --format | markdown | Output format: markdown, json |
| --since | none | Only include results after this date (ISO 8601) |
| --threshold | 0.85 | Score below this triggers a warning |
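As a rough illustration of how `--since` and `--threshold` might behave, the sketch below filters results by an ISO 8601 cutoff and flags low scores. The `timestamp` field and the helper names are assumptions made for this example; the source does not specify the result schema or how the skill applies these options internally.

```python
from datetime import datetime, timezone


def parse_iso8601(value):
    """Parse an ISO 8601 date or timestamp into a timezone-aware datetime."""
    # Older Python versions reject a trailing "Z", so normalize it to an explicit offset.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Assumption: values without an offset are treated as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def filter_since(results, since):
    """Keep results strictly after the cutoff; keep everything when no cutoff is given."""
    if since is None:
        return list(results)
    cutoff = parse_iso8601(since)
    # "timestamp" is a hypothetical field name, not confirmed by the source.
    return [r for r in results if parse_iso8601(r["timestamp"]) > cutoff]


def below_threshold(score, threshold=0.85):
    """Return True when a score falls below the warning threshold."""
    return score < threshold
```

With the default threshold, a score of exactly 0.85 does not trigger a warning, since the option is described as flagging scores below the value.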
## Process
1. **Collect Results**: Read all `eval-*.json` files from `.aiwg/reports/`
2. **Aggregate Scores**: Compute per-agent and per-archetype scores
3. **Detect Regressions**: Compare against the `--compare` baseline, if one is provided
4. **Rank Agents**: Sort by overall score, flag below-threshold agents
5. **Build Recommendations**: Surface specific agents and archetypes needing attention
6. **Output Report**: Write markdown or JSON to --output or stdout
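The collection, aggregation, and ranking steps above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: it assumes each `eval-*.json` file holds one object with `agent`, `archetype`, and `score` fields, and that scores are combined with a plain mean. None of that is confirmed by the source.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def collect_results(report_dir):
    """Load every eval-*.json file in the directory, in filename order."""
    results = []
    for path in sorted(Path(report_dir).glob("eval-*.json")):
        with path.open(encoding="utf-8") as handle:
            results.append(json.load(handle))
    return results


def aggregate(results, key):
    """Average scores grouped by the given field (hypothetically 'agent' or 'archetype')."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result[key]].append(result["score"])
    return {name: mean(scores) for name, scores in grouped.items()}


def rank_agents(agent_scores, threshold=0.85):
    """Sort agents by score, highest first, marking those below the threshold."""
    ranked = sorted(agent_scores.items(), key=lambda item: item[1], reverse=True)
    return [(agent, score, score < threshold) for agent, score in ranked]
```

For example, `rank_agents(aggregate(collect_results(".aiwg/reports"), "agent"))` would yield one `(agent, score, flagged)` tuple per agent under these assumptions.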
## Report Sections
### Summary Dashboard
Overall health at a glance — total agents tested, aggregate score, regression count.
### By Archetype
Pass rates for each Roig (2025) failure archetype, aggregated across all agents.
### Agents Needing Attention
Agents below the --threshold, with consecutive-failure streaks flagged.
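One way to read "consecutive-failure streak" is the number of failed runs at the end of an agent's history. The sketch below takes that interpretation. The chronological pass/fail lists and the function names are hypothetical; the output keys mirror the `agents_needing_attention` entries shown in the JSON output format.

```python
def failure_streak(outcomes):
    """Count failures at the end of a chronological list of pass/fail booleans."""
    streak = 0
    for passed in reversed(outcomes):
        if passed:
            break
        streak += 1
    return streak


def agents_needing_attention(scores, histories, threshold=0.85):
    """List agents scoring below the threshold, lowest score first."""
    flagged = []
    for agent, score in scores.items():
        if score < threshold:
            flagged.append({
                "agent": agent,
                "score": score,
                # Assumption: an agent with no recorded history has a streak of 0.
                "consecutive_failures": failure_streak(histories.get(agent, [])),
            })
    return sorted(flagged, key=lambda entry: entry["score"])
```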
### Regression Analysis
When --compare is provided: agents whose scores dropped since the baseline.
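A baseline comparison of this kind might look like the sketch below. It assumes both the current run and the baseline can be reduced to a mapping from agent name to score, which the documented JSON output does not guarantee. The `tolerance` parameter is also an illustrative addition, not a documented option.

```python
def find_regressions(current, baseline, tolerance=0.0):
    """Return agents whose score dropped by more than the tolerance, largest drop first."""
    regressions = []
    for agent, score in current.items():
        previous = baseline.get(agent)
        if previous is None:
            # Agents missing from the baseline are new, so there is nothing to compare.
            continue
        drop = previous - score
        if drop > tolerance:
            regressions.append({
                "agent": agent,
                "previous": previous,
                "current": score,
                "drop": round(drop, 4),
            })
    return sorted(regressions, key=lambda entry: entry["drop"], reverse=True)
```

Skipping agents that are absent from the baseline keeps newly added agents from being reported as regressions.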
### Recommendations
Prioritized action list: which agents to review, which archetypes to harden.
## Output Format (Markdown)
```markdown
# Agent Quality Report
**Generated**: 2026-04-01T10:30:00Z
**Agents Tested**: 58
**Overall Score**: 87%
**Regressions**: 2
## By Archetype
| Archetype | Pass Rate | Trend |
|-----------|-----------|-------|
| #1 Grounding | 92% | ↑ |
| #2 Substitution | 88% | → |
| #3 Distractor | 78% | ↓ |
| #4 Recovery | 90% | ↑ |
## Agents Needing Attention
| Agent | Score | Consecutive Failures | Issue |
|-------|-------|---------------------|-------|
| data-analyst | 72% | 3 | distractor-test |
| api-designer | 79% | 1 | latency regression (+40%) |
## Recommendations
1. Review `data-analyst` context filtering — failed distractor-test 3 consecutive runs
2. Investigate `api-designer` tool selection — latency regression
3. Increase distractor-test scenarios for marketing agents (78% pass rate below 80% target)
```
## Output Format (JSON)
```json
{
  "generated": "2026-04-01T10:30:00Z",
  "summary": {
    "agents_tested": 58,
    "overall_score": 0.87,
    "regressions": 2
  },
  "by_archetype": {
    "grounding": 0.92,
    "substitution": 0.88,
    "distractor": 0.78,
    "recovery": 0.90
  },
  "agents_needing_attention": [
    {"agent": "data-analyst", "score": 0.72, "consecutive_failures": 3, "issue": "distractor-test"}
  ],
  "recommendations": [
    "Review data-analyst context filtering"
  ]
}
```
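For CI consumption, a pipeline step could read the JSON report and turn it into an exit code. The sketch below uses only the `summary` fields shown above. The gating policy (fail on any regression or on a low overall score) and the `min_overall` parameter are assumptions for illustration, not behavior defined by the skill.

```python
import json


def ci_exit_code(report_text, min_overall=0.80):
    """Map a JSON report to a process exit code: 0 to pass, 1 to fail the build."""
    summary = json.loads(report_text)["summary"]
    # Assumed policy: any regression fails the build.
    if summary["regressions"] > 0:
        return 1
    # Assumed policy: an overall score under the minimum also fails the build.
    if summary["overall_score"] < min_overall:
        return 1
    return 0
```

A wrapper script could pass the result to `sys.exit()` after reading the file written by `--output`.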
## Examples
```bash
# Standard report to stdout
/eval-report
# Save to file
/eval-report --output .aiwg/reports/quality-$(date +%Y%m%d).md
# Compare against baseline
/eval-report --compare .aiwg/reports/quality-20260301.json
# JSON for CI consumption
/eval-report --format json --threshold 0.80
# SDLC agents only
/eval-report --mode sdlc
```
## Related Commands
- `/eval-agent` - Test individual agents
- `/eval-workflow` - Test multi-agent workflows
- `aiwg lint agents` - Static validation
Generate evaluation report: $ARGUMENTS
## References
- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete threshold and scoring requirements
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC framework context for agent evaluation scope
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for evaluation-related commands
Related Skills
uat-report
Generate UAT completion report with tool coverage matrix, pass/fail metrics, and regression detection
sdlc-reports
Generate SDLC reports including iteration status, metrics dashboards, and executive summaries across phases
regression-report
Generate comprehensive regression analysis reports combining bisect, baseline, and metrics data with actionable recommendations
provenance-report
Generate provenance coverage dashboard and statistics
mention-report
Generate traceability report from @-mentions
grade-report
Generate corpus-wide GRADE quality distribution report
gate-evaluation
Validate phase gate criteria with multi-agent review and generate pass/fail reports
forensics-report
Generate forensic investigation report
eval-workflow
Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance
eval-loop
Configure and run the isolated eval loop pattern — generate, evaluate, refine until pass threshold met
eval-agent
Run evaluation tests against an agent to assess quality and archetype resistance
cost-report
Generate a cost and token-spending report for the current or most recent workflow session