skill-forge-eval
Run evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns executor, grader, comparator, and analyzer sub-agents for parallel evaluation. Generates eval_metadata.json, grading.json, and feedback reports. Use when user says "eval skill", "test skill", "run evals", "evaluate skill", "skill evals", "test skill quality", "run skill tests", or "skill evaluation".
Best use case
skill-forge-eval is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Run evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns executor, grader, comparator, and analyzer sub-agents for parallel evaluation. Generates eval_metadata.json, grading.json, and feedback reports. Use when user says "eval skill", "test skill", "run evals", "evaluate skill", "skill evals", "test skill quality", "run skill tests", or "skill evaluation".
Teams using skill-forge-eval should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/skill-forge-eval/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How skill-forge-eval Compares
| Feature / Agent | skill-forge-eval | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Run evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns executor, grader, comparator, and analyzer sub-agents for parallel evaluation. Generates eval_metadata.json, grading.json, and feedback reports. Use when user says "eval skill", "test skill", "run evals", "evaluate skill", "skill evals", "test skill quality", "run skill tests", or "skill evaluation".
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
SKILL.md Source
# Skill Evaluation Pipeline
Run structured evaluations against Claude Code skills to verify triggering,
correctness, and quality using a multi-agent pipeline.
## Process
### Step 1: Define Eval Set
Accept eval definitions from:
- **Path to eval set JSON**: `evals/evals.json` or user-specified file
- **Inline prompts**: User provides eval queries directly
- **Auto-generated**: Generate from skill description (see Step 1b)
**Eval set JSON schema:**
```json
{
"skill_name": "my-skill",
"skill_path": "./my-skill",
"evals": [
{
"eval_id": 0,
"eval_name": "descriptive-name",
"prompt": "The user's task prompt",
"input_files": [],
"assertions": [
{
"name": "output-has-score",
"check": "Output contains a numeric score between 0-100",
"weight": 1.0
}
],
"should_trigger": true
}
]
}
```
#### Step 1b: Auto-Generate Eval Set
If no eval set exists, generate one:
1. Read the skill's SKILL.md description and instructions
2. Run `python scripts/generate_eval_set.py <skill-path>` to produce a starter set
3. Present the generated set to the user for review and editing
4. User approves or modifies before proceeding
### Step 2: Set Up Workspace
Create the eval workspace **outside** the skill directory to avoid confusing eval
artifacts with skill files. Use a sibling directory or a dedicated location:
```
eval-workspace/
iteration-1/
eval-0/
eval_metadata.json # Assertions and config for this eval
with_skill/
outputs/ # Skill execution outputs
timing.json # Token count + duration
grading.json # Assertion results + evidence
baseline/
outputs/
timing.json
grading.json
eval-1/
eval_metadata.json
with_skill/
outputs/
timing.json
grading.json
baseline/
outputs/
timing.json
grading.json
benchmark.json # Aggregated metrics
benchmark.md # Human-readable report
```
For each eval directory, create `eval_metadata.json` from the eval set entry:
```json
{
"eval_id": 0,
"eval_name": "descriptive-name",
"prompt": "The user's task prompt",
"assertions": [...],
"should_trigger": true
}
```
### Step 3: Execute Eval Runs
For each eval in the set, spawn two parallel runs:
**With-skill run** (delegate to `agents/skill-forge-executor.md`):
```
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the assertions check>
```
**Baseline run** (delegate to `agents/skill-forge-executor.md`):
- For new skills: run without the skill loaded
- For improved skills: run with the previous version (snapshot it first)
Save timing data to `timing.json` in each run directory:
```json
{
"total_tokens": 84852,
"duration_ms": 23332,
"total_duration_seconds": 23.3
}
```
### Step 4: Grade Results
Delegate to `agents/skill-forge-grader.md` for each completed run:
1. Grade against assertions defined in eval_metadata.json
2. Save results to `grading.json` per run:
```json
{
"eval_id": 0,
"run_type": "with_skill",
"assertions": [
{
"name": "output-has-score",
"passed": true,
"evidence": "Found score: 87/100 on line 14"
}
],
"pass_rate": 1.0
}
```
### Step 5: Aggregate and Analyze
1. Run `python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>`
2. This produces `benchmark.json` and `benchmark.md` with:
- Pass rate per eval (with_skill vs baseline)
- Average time and token usage
- Improvement ratio (with_skill / baseline)
3. Delegate to `agents/skill-forge-analyzer.md` to:
- Surface patterns that aggregate stats might hide
- Identify consistently failing assertion types
- Flag regressions from previous iterations
### Step 6: Present Results
Generate a summary report:
```markdown
# Eval Report: [skill-name] — Iteration [N]
## Overall
| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Pass Rate | X% | Y% | +Z% |
| Avg Time | Xs | Ys | -Zs |
| Avg Tokens | X | Y | -Z |
## Per-Eval Results
| Eval | With Skill | Baseline | Status |
|------|-----------|----------|--------|
| eval-0 | PASS | FAIL | Improved |
| eval-1 | PASS | PASS | Maintained |
## Patterns & Insights
[From analyzer agent]
## Recommendations
[Specific improvements based on failures]
```
### Step 7: Collect Feedback
Save user feedback to `feedback.json`:
```json
{
"reviews": [
{
"run_id": "eval-0-with_skill",
"feedback": "the chart is missing axis labels",
"timestamp": "2026-03-06T12:00:00Z"
}
],
"status": "complete"
}
```
Pass feedback to `/skill-forge evolve` for the next iteration.
## Advanced: Blind Comparison
For rigorous A/B testing between skill versions:
1. Delegate to `agents/skill-forge-comparator.md`
2. Pass two directories: `eval-<ID>/with_skill/outputs/` and `eval-<ID>/baseline/outputs/`
3. Comparator assigns random labels (Version A / Version B) so it cannot know which is new
4. Rates each output on assertion criteria from `eval_metadata.json`
5. Returns preference scores without knowing which is "new" vs "old"
## Error Handling
- **Executor timeout**: If a run exceeds 5 minutes, terminate and mark as `"timed_out": true` in timing.json
- **Executor failure**: If a run crashes, save the error to `error.txt` in the run directory and continue with remaining evals
- **Grading failure**: If grading cannot determine pass/fail, mark assertion as `"passed": null` with evidence explaining why
- **Missing files**: If timing.json or grading.json is missing after a run, flag the eval as incomplete in the report
- **Partial completion**: Always aggregate and report whatever results are available — do not block on one failed eval
## Quality Gates
Before marking an eval run as complete:
- [ ] All evals executed (with_skill + baseline)
- [ ] Timing data captured for every run
- [ ] All assertions graded with evidence
- [ ] Benchmark aggregated with pass rate, time, tokens
- [ ] Analyzer patterns documented
- [ ] Results presented to userRelated Skills
skill-forge-review
Audit and validate existing Claude Code skills for quality, triggering accuracy, structure compliance, and best practices. Scores skills on a 0-100 scale and provides prioritized improvement recommendations. Use when user says "review skill", "audit skill", "check skill", "validate skill", or "skill quality".
skill-forge-publish
Package and distribute Claude Code skills for sharing via GitHub, Claude.ai uploads, or team deployment. Creates install scripts, documentation, and .skill packages. Use when user says "publish skill", "share skill", "package skill", "distribute skill", or "release skill".
skill-forge-plan
Architecture and design planning for new Claude Code skills. Guides through use case definition, complexity tier selection, sub-skill decomposition, and file structure planning. Use when user says "plan skill", "design skill", "skill architecture", or "skill planning".
skill-forge-evolve
Improve and iterate on existing Claude Code skills based on usage feedback, test results, or changing requirements. Handles under/over-triggering fixes, instruction refinement, new sub-skill addition, and architecture evolution. Use when user says "improve skill", "fix skill", "skill not triggering", "skill triggers too much", "update skill", or "evolve skill".
skill-forge-convert
Convert Claude Code skills to work on OpenAI Codex, Google Gemini CLI, Google Antigravity, and Cursor. Analyzes platform-specific features, generates target files (openai.yaml, AGENTS.md, GEMINI.md, .mdc rules), adapts frontmatter, converts MCP config, and produces compatibility reports. Use when user says "convert skill", "port skill", "multi-platform", "skill for codex", "skill for gemini", "skill for antigravity", "skill for cursor", "cross-platform skill", "convert to codex", "convert to gemini", "convert to antigravity", or "convert to cursor".
skill-forge-build
Scaffold and build Claude Code skills from plans or descriptions. Generates SKILL.md files, sub-skills, scripts, references, agents, and templates following the Agent Skills standard. Use when user says "build skill", "scaffold skill", "generate skill", "create SKILL.md", or "implement skill".
skill-forge-benchmark
Benchmark Claude Code skill performance with variance analysis, tracking pass rate, execution time, and token usage across iterations. Runs multiple trials per eval for statistical reliability, aggregates results into benchmark.json, and generates comparison reports between skill versions. Use when user says "benchmark skill", "measure skill performance", "skill metrics", "compare skill versions", "skill performance", "track skill improvement", "skill regression test", or "skill A/B test".
skill-forge
Ultimate Claude Code skill creator and architect. Designs, scaffolds, builds, reviews, evolves, and publishes production-grade Claude Code skills following the Agent Skills open standard and 3-layer architecture (directive, orchestration, execution). Handles single-file skills, multi-skill orchestrators with sub-skills and subagents, MCP-enhanced workflows, and full skill ecosystems. Industry detection for skill domain. Triggers on: "create skill", "build skill", "new skill", "skill creator", "skill builder", "skill-forge", "design skill", "scaffold skill", "review skill", "improve skill", "publish skill", "skill architecture", "convert skill", "port skill", "multi-platform", "cross-platform", "eval skill", "test skill", "benchmark skill", "skill evals", "measure skill", "skill performance", "skill A/B test".
llm-evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
evaluation
Build evaluation frameworks for agent systems. Use when testing agent performance systematically, validating context engineering choices, or measuring improvements over time.
azure-mgmt-arizeaiobservabilityeval-dotnet
Azure Resource Manager SDK for Arize AI Observability and Evaluation (.NET).
ml-model-eval-benchmark
Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.