eval-workflow
Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance
Best use case
eval-workflow is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
It is a strong fit for teams already working in Codex.
Teams using eval-workflow should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/eval-workflow/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How eval-workflow Compares
| Feature / Agent | eval-workflow | Standard Approach |
|---|---|---|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance
Which AI agents support this skill?
This skill is designed for Codex.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Top AI Agents for Productivity
See the top AI agent skills for productivity, workflow automation, operational systems, documentation, and everyday task execution.
SKILL.md Source
# Workflow Evaluation
Run automated evaluation tests against a multi-agent workflow.
## Research Foundation
- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for real agentic task evaluation
## Usage
```bash
/eval-workflow flow-security-review-cycle
/eval-workflow flow-inception-to-elaboration --scenario distractor-test
/eval-workflow flow-deploy-to-production --verbose --strict
```
## Arguments
| Argument | Required | Description |
|----------|----------|-------------|
| workflow-name | Yes | Workflow (flow command) to evaluate |
## Options
| Option | Default | Description |
|--------|---------|-------------|
| --scenario | all | Specific scenario to run |
| --verbose | false | Show detailed test output |
| --output | stdout | Output file for results |
| --strict | false | Fail on any test failure |
| --timeout | 300 | Maximum seconds per scenario |
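To make the defaults in the two tables concrete, here is a minimal sketch that models them with Python's `argparse`. This is not how the skill itself parses its input (the slash command is interpreted by the agent); the `build_parser` helper and the `workflow_name` attribute are assumptions made for illustration only.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented argument and options."""
    parser = argparse.ArgumentParser(prog="eval-workflow")
    parser.add_argument("workflow_name", help="Workflow (flow command) to evaluate")
    parser.add_argument("--scenario", default="all", help="Specific scenario to run")
    parser.add_argument("--verbose", action="store_true", help="Show detailed test output")
    parser.add_argument("--output", default="stdout", help="Output file for results")
    parser.add_argument("--strict", action="store_true", help="Fail on any test failure")
    parser.add_argument("--timeout", type=int, default=300, help="Maximum seconds per scenario")
    return parser

# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["flow-deploy-to-production", "--strict", "--output", ".aiwg/reports/deploy-eval.json"]
)
print(args.scenario, args.strict, args.timeout)
```

Options that are not passed fall back to the defaults listed in the table, so `--scenario` resolves to `all` and `--timeout` to `300` here.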
## What Gets Evaluated
### Orchestration Quality
- **Agent coordination**: Parallel agents launched correctly in a single message
- **Handoff fidelity**: Artifacts pass correctly between phases
- **Gate enforcement**: Phase gates checked before transition
### Archetype Resistance
- `grounding-test` — Archetype 1: Premature action without reading state
- `distractor-test` — Archetype 3: Context pollution from irrelevant artifacts
- `recovery-test` — Archetype 4: Fragile execution when subagent fails
### Output Validation
- Required artifacts created in correct `.aiwg/` paths
- Document structure matches templates
- Traceability links intact
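The first output check, artifact presence, could be sketched roughly as follows. The artifact paths in `REQUIRED` are hypothetical placeholders, not paths defined by the skill, and the sketch covers presence only; template structure and traceability checks are not attempted.

```python
import tempfile
from pathlib import Path

def missing_artifacts(workspace: Path, required: list[str]) -> list[str]:
    """Return the required relative paths that do not exist under workspace."""
    return [rel for rel in required if not (workspace / rel).is_file()]

# Hypothetical artifact paths, for illustration only.
REQUIRED = [".aiwg/security/threat-model.md", ".aiwg/reports/gate-report.md"]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    created = root / REQUIRED[0]
    created.parent.mkdir(parents=True)
    created.write_text("# Threat Model\n")  # only the first artifact is created
    result = missing_artifacts(root, REQUIRED)

print(result)
```

An empty list would correspond to the 100% artifact creation target; anything else names the files a scenario failed to produce.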
## Process
1. **Load Workflow**: Read flow command definition
2. **Select Scenarios**: Based on --scenario flag or all applicable
3. **Setup Workspace**: Create isolated `.aiwg/working/` test space
4. **Execute Flow**: Run workflow against each scenario
5. **Validate Outputs**: Check artifact presence, structure, and content
6. **Generate Report**: Output results with pass/fail per assertion
7. **Cleanup**: Remove test workspace
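The steps above can be sketched as a small driver loop. This is an illustration under stated assumptions, not the skill's implementation: `run_flow` is a hypothetical callable standing in for flow execution and validation (steps 4 and 5), a temporary directory stands in for the isolated `.aiwg/working/` space, and step 1 (loading the flow definition) is omitted.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable

# Scenario names taken from the Archetype Resistance list.
ALL_SCENARIOS = ["grounding-test", "distractor-test", "recovery-test"]

def select_scenarios(scenario: str = "all") -> list[str]:
    """Step 2: honour the --scenario value, or run every known scenario."""
    if scenario == "all":
        return list(ALL_SCENARIOS)
    if scenario not in ALL_SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario}")
    return [scenario]

def evaluate(workflow: str, scenario: str,
             run_flow: Callable[[str, str, Path], dict[str, bool]]) -> dict:
    """Steps 3 to 7, using a caller-supplied (hypothetical) flow runner."""
    results = {}
    for name in select_scenarios(scenario):
        workspace = Path(tempfile.mkdtemp(prefix="aiwg-working-"))  # step 3
        try:
            assertions = run_flow(workflow, name, workspace)        # steps 4-5
        finally:
            shutil.rmtree(workspace, ignore_errors=True)            # step 7
        results[name] = {"passed": all(assertions.values()), "assertions": assertions}
    return {"workflow": workflow, "scenarios": results}             # step 6

def fake_runner(workflow: str, scenario: str, workspace: Path) -> dict[str, bool]:
    """Stub that fails the distractor scenario so the report shows both outcomes."""
    return {"workspace-exists": workspace.is_dir(),
            "not-distractor": scenario != "distractor-test"}

report = evaluate("flow-security-review-cycle", "all", fake_runner)
print({name: r["passed"] for name, r in report["scenarios"].items()})
```

Cleanup sits in a `finally` block so the workspace is removed even when a scenario raises, which keeps one failed run from polluting the next.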
## Output Format
```json
{
  "workflow": "flow-security-review-cycle",
  "timestamp": "2026-04-01T10:30:00Z",
  "scenarios": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "assertions": [
        {"name": "threat-model-created", "passed": true},
        {"name": "security-gate-run", "passed": true}
      ],
      "duration_ms": 45000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.7,
      "assertions": [
        {"name": "correct-assets-only", "passed": false, "evidence": "Distractor file referenced in output"}
      ],
      "duration_ms": 38000
    }
  },
  "summary": {
    "passed": 4,
    "failed": 1,
    "total": 5,
    "score": 0.80
  }
}
```
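A report in this shape is straightforward to post-process. The sketch below loads a trimmed copy of the sample and lists every failed assertion with its evidence. The `failed_assertions` helper is hypothetical, and treating `--strict` as "non-zero exit status on any failure" is one plausible reading of the option, not documented behaviour.

```python
import json

# Trimmed copy of the sample report above.
REPORT = json.loads("""
{
  "workflow": "flow-security-review-cycle",
  "scenarios": {
    "grounding-test": {"passed": true, "score": 1.0, "assertions": [
      {"name": "threat-model-created", "passed": true},
      {"name": "security-gate-run", "passed": true}]},
    "distractor-test": {"passed": false, "score": 0.7, "assertions": [
      {"name": "correct-assets-only", "passed": false,
       "evidence": "Distractor file referenced in output"}]}
  }
}
""")

def failed_assertions(report: dict) -> list[tuple[str, str, str]]:
    """Return (scenario, assertion name, evidence) for each failed assertion."""
    failures = []
    for scenario, result in report["scenarios"].items():
        for assertion in result["assertions"]:
            if not assertion["passed"]:
                failures.append((scenario, assertion["name"], assertion.get("evidence", "")))
    return failures

failures = failed_assertions(REPORT)
for scenario, name, evidence in failures:
    print(f"{scenario}: {name} ({evidence})")

# Assumed reading of --strict: any failure maps to a non-zero exit status.
exit_code = 1 if failures else 0
```

The `evidence` field appears only on the failed assertion in the sample, so the helper reads it with `.get` and falls back to an empty string.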
## Examples
```bash
# Full evaluation of a workflow
/eval-workflow flow-security-review-cycle
# Single scenario with verbose output
/eval-workflow flow-inception-to-elaboration --scenario grounding-test --verbose
# Strict mode with output saved
/eval-workflow flow-deploy-to-production --strict --output .aiwg/reports/deploy-eval.json
```
## Success Criteria
| Metric | Target |
|--------|--------|
| Artifact creation | 100% |
| Grounding compliance | >90% |
| Distractor resistance | >80% |
| Recovery success | ≥80% |
| Overall | ≥85% |
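The table mixes strict (`>`) and inclusive (`≥`) targets, which matters at the boundary. A minimal sketch of checking scores against it follows; the metric keys and the sample scores are made up for illustration, and reading "100%" as "at least 1.0" is an assumption.

```python
import operator

# (comparison, threshold) per metric, as fractions, copied from the table above.
TARGETS = {
    "artifact_creation": (operator.ge, 1.00),      # "100%" read as >= 1.0 (assumption)
    "grounding_compliance": (operator.gt, 0.90),
    "distractor_resistance": (operator.gt, 0.80),
    "recovery_success": (operator.ge, 0.80),
    "overall": (operator.ge, 0.85),
}

def unmet_targets(scores: dict[str, float]) -> list[str]:
    """Return the metrics whose score misses its target; absent metrics count as 0."""
    return [metric for metric, (compare, threshold) in TARGETS.items()
            if not compare(scores.get(metric, 0.0), threshold)]

# Made-up scores for illustration.
scores = {
    "artifact_creation": 1.0,
    "grounding_compliance": 0.95,
    "distractor_resistance": 0.80,
    "recovery_success": 0.80,
    "overall": 0.88,
}
print(unmet_targets(scores))
```

With these sample scores, a value of exactly 0.80 satisfies the inclusive recovery target but misses the strict distractor target.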
## Related Commands
- `/eval-agent` - Test individual agents
- `/eval-report` - Generate aggregate quality report
- `aiwg lint agents` - Static validation
Evaluate workflow: $ARGUMENTS
## References
- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/subagent-scoping.md — Parallel agent coordination rules evaluated in workflows
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success thresholds and test criteria
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC flow commands available for workflow evaluation
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint and eval commands
Related Skills
research-workflow
Execute multi-stage research workflows
gate-evaluation
Validate phase gate criteria with multi-agent review and generate pass/fail reports
eval-report
Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations
eval-loop
Configure and run the isolated eval loop pattern — generate, evaluate, refine until pass threshold met
eval-agent
Run evaluation tests against an agent to assess quality and archetype resistance
approval-workflow
Route marketing assets through multi-stakeholder approval chains with status tracking and escalation
aiwg-orchestrate
Route structured artifact work to AIWG workflows via MCP with zero parent context cost
venv-manager
Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.
pytest-runner
Execute Python tests with pytest, supporting fixtures, markers, coverage, and parallel execution. Use for Python test automation.
vitest-runner
Execute JavaScript/TypeScript tests with Vitest, supporting coverage, watch mode, and parallel execution. Use for JS/TS test automation.
eslint-checker
Run ESLint for JavaScript/TypeScript code quality and style enforcement. Use for static analysis and auto-fixing.
repo-analyzer
Analyze GitHub repositories for structure, documentation, dependencies, and contribution patterns. Use for codebase understanding and health assessment.