skill-forge-eval

Run evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns executor, grader, comparator, and analyzer sub-agents for parallel evaluation. Generates eval_metadata.json, grading.json, and feedback reports. Use when user says "eval skill", "test skill", "run evals", "evaluate skill", "skill evals", "test skill quality", "run skill tests", or "skill evaluation".

39 stars

by AgriciDaniel


Best use case

skill-forge-eval is best used when you need a repeatable way to evaluate a Claude Code skill's triggering accuracy, workflow correctness, and output quality, rather than a one-off spot check.

Teams using skill-forge-eval should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/skill-forge-eval/SKILL.md --create-dirs "https://raw.githubusercontent.com/AgriciDaniel/skill-forge/main/skills/skill-forge-eval/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/skill-forge-eval/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How skill-forge-eval Compares

Feature / Agent	skill-forge-eval	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

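What does this skill do?

It runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. It spawns executor, grader, comparator, and analyzer sub-agents and produces eval_metadata.json, grading.json, and feedback reports.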

Where can I find the source code?

The source code is in the AgriciDaniel/skill-forge repository on GitHub, under skills/skill-forge-eval.

Related Guides

AI Agents for Coding

Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.

Best AI Skills for Claude

Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.

SKILL.md Source

# Skill Evaluation Pipeline

Run structured evaluations against Claude Code skills to verify triggering,
correctness, and quality using a multi-agent pipeline.

## Process

### Step 1: Define Eval Set

Accept eval definitions from:
- **Path to eval set JSON**: `evals/evals.json` or user-specified file
- **Inline prompts**: User provides eval queries directly
- **Auto-generated**: Generate from skill description (see Step 1b)

**Eval set JSON schema:**
```json
{
  "skill_name": "my-skill",
  "skill_path": "./my-skill",
  "evals": [
    {
      "eval_id": 0,
      "eval_name": "descriptive-name",
      "prompt": "The user's task prompt",
      "input_files": [],
      "assertions": [
        {
          "name": "output-has-score",
          "check": "Output contains a numeric score between 0 and 100",
          "weight": 1.0
        }
      ],
      "should_trigger": true
    }
  ]
}
```
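
Before running anything, it can help to sanity-check an eval set against the schema above. The following is a minimal sketch, not part of the skill itself: the function name `validate_eval_set` is hypothetical, and the choice of which keys are required (and that `input_files` and `weight` are optional) is an assumption inferred from the example.

```python
# Hypothetical helper; the required-key sets are assumptions based on the example schema.
REQUIRED_EVAL_KEYS = {"eval_id", "eval_name", "prompt", "assertions", "should_trigger"}
REQUIRED_ASSERTION_KEYS = {"name", "check"}


def validate_eval_set(eval_set: dict) -> list:
    """Return a list of problems found; an empty list means the set looks usable."""
    problems = []
    for key in ("skill_name", "skill_path", "evals"):
        if key not in eval_set:
            problems.append(f"missing top-level key: {key}")

    seen_ids = set()
    for index, entry in enumerate(eval_set.get("evals", [])):
        missing = REQUIRED_EVAL_KEYS - set(entry)
        if missing:
            problems.append(f"evals[{index}] is missing keys: {sorted(missing)}")
        if "eval_id" in entry:
            if entry["eval_id"] in seen_ids:
                problems.append(f"evals[{index}] has a duplicate eval_id: {entry['eval_id']}")
            seen_ids.add(entry["eval_id"])

        for a_index, assertion in enumerate(entry.get("assertions", [])):
            where = f"evals[{index}].assertions[{a_index}]"
            missing = REQUIRED_ASSERTION_KEYS - set(assertion)
            if missing:
                problems.append(f"{where} is missing keys: {sorted(missing)}")
            # Assumption: weight defaults to 1.0 and must be a non-negative number.
            weight = assertion.get("weight", 1.0)
            if isinstance(weight, bool) or not isinstance(weight, (int, float)) or weight < 0:
                problems.append(f"{where} has an invalid weight: {weight!r}")
    return problems
```

Calling `validate_eval_set(json.load(open("evals/evals.json")))` and printing the returned list gives the user something concrete to fix before the review in Step 1b.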

#### Step 1b: Auto-Generate Eval Set

If no eval set exists, generate one:
1. Read the skill's SKILL.md description and instructions
2. Run `python scripts/generate_eval_set.py <skill-path>` to produce a starter set
3. Present the generated set to the user for review and editing
4. User approves or modifies before proceeding

### Step 2: Set Up Workspace

Create the eval workspace **outside** the skill directory to avoid confusing eval
artifacts with skill files. Use a sibling directory or a dedicated location:

```
eval-workspace/
  iteration-1/
    eval-0/
      eval_metadata.json        # Assertions and config for this eval
      with_skill/
        outputs/                # Skill execution outputs
        timing.json             # Token count + duration
        grading.json            # Assertion results + evidence
      baseline/
        outputs/
        timing.json
        grading.json
    eval-1/
      eval_metadata.json
      with_skill/
        outputs/
        timing.json
        grading.json
      baseline/
        outputs/
        timing.json
        grading.json
    benchmark.json              # Aggregated metrics
    benchmark.md                # Human-readable report
```

For each eval directory, create `eval_metadata.json` from the eval set entry:
```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name",
  "prompt": "The user's task prompt",
  "assertions": [...],
  "should_trigger": true
}
```
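
The workspace layout above can be created programmatically. This is a sketch under stated assumptions: `create_workspace` is a hypothetical helper, and it assumes `eval_metadata.json` carries exactly the five fields shown in the example (so `input_files` is left out).

```python
import json
from pathlib import Path

RUN_TYPES = ("with_skill", "baseline")
# Assumption: these are the only fields copied from the eval set entry.
METADATA_KEYS = ("eval_id", "eval_name", "prompt", "assertions", "should_trigger")


def create_workspace(eval_set: dict, workspace: Path, iteration: int = 1) -> Path:
    """Create iteration-<N>/eval-<ID>/{with_skill,baseline}/outputs and eval_metadata.json."""
    iteration_dir = workspace / f"iteration-{iteration}"
    for entry in eval_set["evals"]:
        eval_dir = iteration_dir / f"eval-{entry['eval_id']}"
        for run_type in RUN_TYPES:
            (eval_dir / run_type / "outputs").mkdir(parents=True, exist_ok=True)
        metadata = {key: entry[key] for key in METADATA_KEYS}
        (eval_dir / "eval_metadata.json").write_text(
            json.dumps(metadata, indent=2), encoding="utf-8"
        )
    return iteration_dir
```

Passing a `workspace` path that sits next to the skill directory, rather than inside it, keeps eval artifacts separate from skill files as the step requires.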

### Step 3: Execute Eval Runs

For each eval in the set, spawn two parallel runs:

**With-skill run** (delegate to `agents/skill-forge-executor.md`):
```
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the assertions check>
```

**Baseline run** (delegate to `agents/skill-forge-executor.md`):
- For new skills: run without the skill loaded
- For improved skills: run with the previous version (snapshot it first)

Save timing data to `timing.json` in each run directory:
```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```
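
As an illustration of how the three timing fields relate, here is a small sketch. `write_timing` is a hypothetical helper; it assumes the start and finish times are seconds from a clock such as `time.monotonic()` and that the token count is supplied by whatever runs the executor, since the source does not say where that number comes from.

```python
import json
from pathlib import Path


def write_timing(run_dir: Path, started: float, finished: float, total_tokens: int) -> dict:
    """Write timing.json for one run; started/finished are clock readings in seconds."""
    duration_ms = round((finished - started) * 1000)
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # Same duration, rounded to one decimal place for readability.
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "timing.json").write_text(json.dumps(timing, indent=2), encoding="utf-8")
    return timing
```

With a run lasting 23.332 seconds and 84852 tokens, this reproduces the values in the example above.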

### Step 4: Grade Results

Delegate to `agents/skill-forge-grader.md` for each completed run:

1. Grade against assertions defined in `eval_metadata.json`
2. Save results to `grading.json` per run:
```json
{
  "eval_id": 0,
  "run_type": "with_skill",
  "assertions": [
    {
      "name": "output-has-score",
      "passed": true,
      "evidence": "Found score: 87/100 on line 14"
    }
  ],
  "pass_rate": 1.0
}
```
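
The `pass_rate` field can be derived from the assertion results. The sketch below is one plausible reading, not the grader's confirmed behavior: `build_grading` is a hypothetical helper, the rate is unweighted, and assertions graded `null` (undetermined) are assumed to be excluded from the denominator.

```python
def build_grading(eval_id: int, run_type: str, assertions: list) -> dict:
    """Assemble a grading.json payload from [{"name", "passed", "evidence"}, ...]."""
    # Assumption: undetermined assertions (passed is None) do not count for or against.
    decided = [a for a in assertions if a["passed"] is not None]
    if decided:
        pass_rate = sum(1 for a in decided if a["passed"]) / len(decided)
    else:
        pass_rate = None  # nothing could be graded
    return {
        "eval_id": eval_id,
        "run_type": run_type,
        "assertions": assertions,
        "pass_rate": pass_rate,
    }
```

If the `weight` values from the eval set are meant to influence the score, the sum would need to be weighted instead; the source does not specify this.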

### Step 5: Aggregate and Analyze

1. Run `python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>`
2. This produces `benchmark.json` and `benchmark.md` with:
   - Pass rate per eval (with_skill vs baseline)
   - Average time and token usage
   - Improvement ratio (with_skill / baseline)

3. Delegate to `agents/skill-forge-analyzer.md` to:
   - Surface patterns that aggregate stats might hide
   - Identify consistently failing assertion types
   - Flag regressions from previous iterations
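
To make the aggregated metrics concrete, here is a rough sketch of the arithmetic. It is not the code of `scripts/aggregate_benchmark.py`, whose internals are not shown here; `aggregate_iteration` and the output key names are hypothetical, and runs with a missing `grading.json` or `timing.json` are assumed to be skipped.

```python
import json
from pathlib import Path
from statistics import mean

RUN_TYPES = ("with_skill", "baseline")


def _load(path: Path):
    """Return parsed JSON, or None if the file does not exist."""
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else None


def _average(values):
    return mean(values) if values else None


def aggregate_iteration(iteration_dir: Path) -> dict:
    """Average pass rate, time, and tokens per run type, plus the improvement ratio."""
    summary = {}
    for run_type in RUN_TYPES:
        pass_rates, seconds, tokens = [], [], []
        for eval_dir in sorted(iteration_dir.glob("eval-*")):
            grading = _load(eval_dir / run_type / "grading.json")
            timing = _load(eval_dir / run_type / "timing.json")
            if grading is None or timing is None:
                continue  # incomplete run: skip rather than block the report
            if grading.get("pass_rate") is not None:
                pass_rates.append(grading["pass_rate"])
            seconds.append(timing["duration_ms"] / 1000)
            tokens.append(timing["total_tokens"])
        summary[run_type] = {
            "runs": len(tokens),
            "avg_pass_rate": _average(pass_rates),
            "avg_seconds": _average(seconds),
            "avg_tokens": _average(tokens),
        }
    with_rate = summary["with_skill"]["avg_pass_rate"]
    base_rate = summary["baseline"]["avg_pass_rate"]
    # Improvement ratio = with_skill / baseline; undefined when the baseline is 0 or missing.
    summary["improvement_ratio"] = (
        with_rate / base_rate if with_rate is not None and base_rate else None
    )
    return summary
```

A ratio above 1.0 means the skill passes more assertions than the baseline; leaving it undefined for a zero baseline avoids reporting a misleading infinite improvement.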

### Step 6: Present Results

Generate a summary report:

```markdown
# Eval Report: [skill-name] — Iteration [N]

## Overall
| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Pass Rate | X% | Y% | +Z% |
| Avg Time | Xs | Ys | -Zs |
| Avg Tokens | X | Y | -Z |

## Per-Eval Results
| Eval | With Skill | Baseline | Status |
|------|-----------|----------|--------|
| eval-0 | PASS | FAIL | Improved |
| eval-1 | PASS | PASS | Maintained |

## Patterns & Insights
[From analyzer agent]

## Recommendations
[Specific improvements based on failures]
```
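
The per-eval table in the template can be rendered from pass/fail pairs. This is an illustrative sketch only: `status_label` and `render_per_eval_table` are hypothetical names, and the labels "Regressed" and "Still failing" are assumptions, since the template only shows "Improved" and "Maintained".

```python
def status_label(with_skill_passed: bool, baseline_passed: bool) -> str:
    """Map a (with_skill, baseline) pass/fail pair to a Status cell."""
    if with_skill_passed and not baseline_passed:
        return "Improved"
    if with_skill_passed and baseline_passed:
        return "Maintained"
    if baseline_passed:
        return "Regressed"      # assumed label: not shown in the template
    return "Still failing"      # assumed label: not shown in the template


def render_per_eval_table(results: dict) -> str:
    """results maps an eval name to (with_skill_passed, baseline_passed)."""
    lines = [
        "| Eval | With Skill | Baseline | Status |",
        "|------|-----------|----------|--------|",
    ]
    for name, (with_skill, baseline) in results.items():
        lines.append(
            f"| {name} | {'PASS' if with_skill else 'FAIL'} "
            f"| {'PASS' if baseline else 'FAIL'} | {status_label(with_skill, baseline)} |"
        )
    return "\n".join(lines)
```

For example, `render_per_eval_table({"eval-0": (True, False), "eval-1": (True, True)})` yields the two rows shown in the template.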

### Step 7: Collect Feedback

Save user feedback to `feedback.json`:
```json
{
  "reviews": [
    {
      "run_id": "eval-0-with_skill",
      "feedback": "the chart is missing axis labels",
      "timestamp": "2026-03-06T12:00:00Z"
    }
  ],
  "status": "complete"
}
```

Pass feedback to `/skill-forge evolve` for the next iteration.
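
A minimal sketch of appending a review to `feedback.json` follows. `add_review` is a hypothetical helper; the only `status` value the source shows is `"complete"`, so the default here is an assumption and other values are not known.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def add_review(feedback_path: Path, run_id: str, feedback: str, status: str = "complete") -> dict:
    """Append one review to feedback.json, creating the file if needed."""
    if feedback_path.exists():
        data = json.loads(feedback_path.read_text(encoding="utf-8"))
    else:
        data = {"reviews": []}
    data["reviews"].append({
        "run_id": run_id,  # e.g. "eval-0-with_skill"
        "feedback": feedback,
        # UTC timestamp in the same ISO 8601 form as the example.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    })
    data["status"] = status
    feedback_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return data
```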

## Advanced: Blind Comparison

For rigorous A/B testing between skill versions:
1. Delegate to `agents/skill-forge-comparator.md`
2. Pass two directories: `eval-<ID>/with_skill/outputs/` and `eval-<ID>/baseline/outputs/`
3. The comparator assigns random labels (Version A / Version B) so it cannot tell which output is new
4. It rates each output on the assertion criteria from `eval_metadata.json`
5. It returns preference scores for Version A and Version B, still blind to which is "new" and which is "old"
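
The random labeling step can be sketched as follows. This is an assumption about how the blinding might work, not the comparator agent's actual logic; `assign_blind_labels` is a hypothetical helper. Note that the directory names themselves (`with_skill`, `baseline`) reveal the identity, so in practice the contents would have to be presented under the neutral labels, for example by copying them to neutrally named directories.

```python
import random
from typing import Optional


def assign_blind_labels(with_skill_dir: str, baseline_dir: str,
                        rng: Optional[random.Random] = None):
    """Randomly map the two output directories to 'Version A' and 'Version B'.

    Returns (blinded, answer_key): blinded maps label -> directory, and
    answer_key maps label -> run type. Keep answer_key away from the comparator.
    """
    rng = rng or random.Random()
    candidates = [("with_skill", with_skill_dir), ("baseline", baseline_dir)]
    rng.shuffle(candidates)  # random order decides which run becomes Version A
    labels = ("Version A", "Version B")
    blinded = {label: path for label, (_, path) in zip(labels, candidates)}
    answer_key = {label: run_type for label, (run_type, _) in zip(labels, candidates)}
    return blinded, answer_key
```

After the comparator returns its preference scores, `answer_key` is what lets the orchestrator translate "Version A" and "Version B" back into with-skill and baseline results.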

## Error Handling

- **Executor timeout**: If a run exceeds 5 minutes, terminate it and record `"timed_out": true` in `timing.json`
- **Executor failure**: If a run crashes, save the error to `error.txt` in the run directory and continue with the remaining evals
- **Grading failure**: If grading cannot determine pass/fail, mark the assertion as `"passed": null`, with evidence explaining why
- **Missing files**: If `timing.json` or `grading.json` is missing after a run, flag the eval as incomplete in the report
- **Partial completion**: Always aggregate and report whatever results are available; do not block on one failed eval
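
The missing-files rule lends itself to a simple check before reporting. The sketch below assumes the workspace layout from Step 2; `find_incomplete_runs` is a hypothetical helper and the shape of its return value is illustrative.

```python
from pathlib import Path

RUN_TYPES = ("with_skill", "baseline")
REQUIRED_FILES = ("timing.json", "grading.json")


def find_incomplete_runs(iteration_dir: Path) -> dict:
    """Map 'eval-<ID>/<run_type>' to details for every run that should be flagged."""
    incomplete = {}
    for eval_dir in sorted(iteration_dir.glob("eval-*")):
        for run_type in RUN_TYPES:
            run_dir = eval_dir / run_type
            missing = [name for name in REQUIRED_FILES if not (run_dir / name).exists()]
            crashed = (run_dir / "error.txt").exists()
            if missing or crashed:
                incomplete[f"{eval_dir.name}/{run_type}"] = {
                    "missing": missing,
                    "crashed": crashed,  # error.txt is written on executor failure
                }
    return incomplete
```

An empty result means every run has both files and no recorded crash; anything else can be listed in the report while the remaining evals are still aggregated.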

## Quality Gates

Before marking an eval run as complete:
- [ ] All evals executed (with_skill + baseline)
- [ ] Timing data captured for every run
- [ ] All assertions graded with evidence
- [ ] Benchmark aggregated with pass rate, time, tokens
- [ ] Analyzer patterns documented
- [ ] Results presented to user

Related Skills

skill-forge-review

from AgriciDaniel/skill-forge

Audit and validate existing Claude Code skills for quality, triggering accuracy, structure compliance, and best practices. Scores skills on a 0-100 scale and provides prioritized improvement recommendations. Use when user says "review skill", "audit skill", "check skill", "validate skill", or "skill quality".

skill-forge-publish

from AgriciDaniel/skill-forge

Package and distribute Claude Code skills for sharing via GitHub, Claude.ai uploads, or team deployment. Creates install scripts, documentation, and .skill packages. Use when user says "publish skill", "share skill", "package skill", "distribute skill", or "release skill".

skill-forge-plan

from AgriciDaniel/skill-forge

Architecture and design planning for new Claude Code skills. Guides through use case definition, complexity tier selection, sub-skill decomposition, and file structure planning. Use when user says "plan skill", "design skill", "skill architecture", or "skill planning".

skill-forge-evolve

from AgriciDaniel/skill-forge

Improve and iterate on existing Claude Code skills based on usage feedback, test results, or changing requirements. Handles under/over-triggering fixes, instruction refinement, new sub-skill addition, and architecture evolution. Use when user says "improve skill", "fix skill", "skill not triggering", "skill triggers too much", "update skill", or "evolve skill".

skill-forge-convert

from AgriciDaniel/skill-forge

Convert Claude Code skills to work on OpenAI Codex, Google Gemini CLI, Google Antigravity, and Cursor. Analyzes platform-specific features, generates target files (openai.yaml, AGENTS.md, GEMINI.md, .mdc rules), adapts frontmatter, converts MCP config, and produces compatibility reports. Use when user says "convert skill", "port skill", "multi-platform", "skill for codex", "skill for gemini", "skill for antigravity", "skill for cursor", "cross-platform skill", "convert to codex", "convert to gemini", "convert to antigravity", or "convert to cursor".

skill-forge-build

from AgriciDaniel/skill-forge

Scaffold and build Claude Code skills from plans or descriptions. Generates SKILL.md files, sub-skills, scripts, references, agents, and templates following the Agent Skills standard. Use when user says "build skill", "scaffold skill", "generate skill", "create SKILL.md", or "implement skill".

skill-forge-benchmark

from AgriciDaniel/skill-forge

Benchmark Claude Code skill performance with variance analysis, tracking pass rate, execution time, and token usage across iterations. Runs multiple trials per eval for statistical reliability, aggregates results into benchmark.json, and generates comparison reports between skill versions. Use when user says "benchmark skill", "measure skill performance", "skill metrics", "compare skill versions", "skill performance", "track skill improvement", "skill regression test", or "skill A/B test".

skill-forge

from AgriciDaniel/skill-forge

Ultimate Claude Code skill creator and architect. Designs, scaffolds, builds, reviews, evolves, and publishes production-grade Claude Code skills following the Agent Skills open standard and 3-layer architecture (directive, orchestration, execution). Handles single-file skills, multi-skill orchestrators with sub-skills and subagents, MCP-enhanced workflows, and full skill ecosystems. Industry detection for skill domain. Triggers on: "create skill", "build skill", "new skill", "skill creator", "skill builder", "skill-forge", "design skill", "scaffold skill", "review skill", "improve skill", "publish skill", "skill architecture", "convert skill", "port skill", "multi-platform", "cross-platform", "eval skill", "test skill", "benchmark skill", "skill evals", "measure skill", "skill performance", "skill A/B test".

llm-evaluation

31392

from sickn33/antigravity-awesome-skills

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

evaluation

31392

from sickn33/antigravity-awesome-skills

Build evaluation frameworks for agent systems. Use when testing agent performance systematically, validating context engineering choices, or measuring improvements over time.

azure-mgmt-arizeaiobservabilityeval-dotnet

31392

from sickn33/antigravity-awesome-skills

Azure Resource Manager SDK for Arize AI Observability and Evaluation (.NET).

ml-model-eval-benchmark

3891

from openclaw/skills

Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.

Machine Learning