skill-forge-benchmark

Benchmark Claude Code skill performance with variance analysis, tracking pass rate, execution time, and token usage across iterations. Runs multiple trials per eval for statistical reliability, aggregates results into benchmark.json, and generates comparison reports between skill versions. Use when user says "benchmark skill", "measure skill performance", "skill metrics", "compare skill versions", "skill performance", "track skill improvement", "skill regression test", or "skill A/B test".

39 stars

Best use case

skill-forge-benchmark is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Benchmark Claude Code skill performance with variance analysis, tracking pass rate, execution time, and token usage across iterations. Runs multiple trials per eval for statistical reliability, aggregates results into benchmark.json, and generates comparison reports between skill versions. Use when user says "benchmark skill", "measure skill performance", "skill metrics", "compare skill versions", "skill performance", "track skill improvement", "skill regression test", or "skill A/B test".

Teams using skill-forge-benchmark should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/skill-forge-benchmark/SKILL.md --create-dirs "https://raw.githubusercontent.com/AgriciDaniel/skill-forge/main/skills/skill-forge-benchmark/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/skill-forge-benchmark/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How skill-forge-benchmark Compares

Feature / Agentskill-forge-benchmarkStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Benchmark Claude Code skill performance with variance analysis, tracking pass rate, execution time, and token usage across iterations. Runs multiple trials per eval for statistical reliability, aggregates results into benchmark.json, and generates comparison reports between skill versions. Use when user says "benchmark skill", "measure skill performance", "skill metrics", "compare skill versions", "skill performance", "track skill improvement", "skill regression test", or "skill A/B test".

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Skill Benchmarking & Performance Tracking

Measure and compare skill performance across iterations with statistical
rigor using multiple trials, variance analysis, and trend tracking.

## Process

### Step 1: Define Benchmark Configuration

Accept configuration as:
- **Existing eval set**: Path to `evals/evals.json` (from `/skill-forge eval`)
- **Benchmark config**: Custom config with trial count and thresholds

**Benchmark config schema:**
```json
{
  "skill_name": "my-skill",
  "skill_path": "./my-skill",
  "eval_set_path": "./evals/evals.json",
  "trials_per_eval": 3,
  "baseline_type": "no_skill",
  "previous_benchmark": null,
  "thresholds": {
    "min_pass_rate": 0.8,
    "max_avg_tokens": 100000,
    "max_avg_duration_seconds": 120,
    "min_improvement_ratio": 1.0
  }
}
```

### Step 2: Execute Benchmark Runs

For each eval, run `trials_per_eval` times (default: 3) to get reliable metrics:

1. Execute with-skill runs (3x per eval)
2. Execute baseline runs (3x per eval)
3. Capture per-run: pass/fail, token count, duration
4. Save each run's `timing.json` and `grading.json`

Use `agents/skill-forge-executor.md` for parallel execution where possible.

### Step 3: Aggregate Results

Run `python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>`:

**Output `benchmark.json` schema:**
```json
{
  "skill_name": "my-skill",
  "iteration": 1,
  "timestamp": "2026-03-06T12:00:00Z",
  "summary": {
    "total_evals": 10,
    "with_skill": {
      "pass_rate": 0.87,
      "pass_rate_std": 0.05,
      "avg_tokens": 45000,
      "avg_duration_seconds": 34.2
    },
    "baseline": {
      "pass_rate": 0.60,
      "pass_rate_std": 0.08,
      "avg_tokens": 62000,
      "avg_duration_seconds": 52.1
    },
    "improvement_ratio": 1.45,
    "token_savings_ratio": 0.73,
    "time_savings_ratio": 0.66
  },
  "per_eval": [
    {
      "eval_id": 0,
      "eval_name": "basic-trigger",
      "with_skill": {"pass_rate": 1.0, "avg_tokens": 30000, "avg_duration_seconds": 20.1},
      "baseline": {"pass_rate": 0.67, "avg_tokens": 50000, "avg_duration_seconds": 45.0},
      "trials": 3
    }
  ],
  "thresholds_met": {
    "min_pass_rate": true,
    "max_avg_tokens": true,
    "max_avg_duration_seconds": true,
    "min_improvement_ratio": true
  }
}
```

### Step 4: Compare with Previous Iterations

If `previous_benchmark` is provided or prior `iteration-<N-1>` exists:

1. Load previous `benchmark.json`
2. Calculate delta per metric:
   - Pass rate change
   - Token usage change
   - Duration change
   - New regressions (evals that passed before but fail now)
   - New improvements (evals that failed before but pass now)

### Step 5: Generate Benchmark Report

```markdown
# Benchmark Report: [skill-name]

## Iteration [N] vs [N-1]

### Summary
| Metric | Current | Previous | Delta | Threshold | Status |
|--------|---------|----------|-------|-----------|--------|
| Pass Rate | 87% | 78% | +9% | >= 80% | PASS |
| Avg Tokens | 45K | 52K | -13% | <= 100K | PASS |
| Avg Time | 34s | 41s | -17% | <= 120s | PASS |
| Improvement | 1.45x | 1.30x | +0.15x | >= 1.0x | PASS |

### Regressions (Action Required)
| Eval | Previous | Current | Notes |
|------|----------|---------|-------|
| eval-5 | PASS | FAIL | Output missing required section |

### Improvements
| Eval | Previous | Current | Notes |
|------|----------|---------|-------|
| eval-3 | FAIL | PASS | Error handling now works |

### Per-Eval Detail
[Full breakdown table]

### Variance Analysis
| Eval | Pass Rate | Std Dev | Trials | Reliability |
|------|-----------|---------|--------|-------------|
| eval-0 | 100% | 0.00 | 3 | High |
| eval-1 | 67% | 0.47 | 3 | Low (investigate) |

### Recommendations
[Based on regressions, low-reliability evals, and threshold failures]
```

### Step 6: Threshold Gating

If any threshold fails:
1. Flag as **FAIL** with specific threshold details
2. List which evals caused the failure
3. Recommend running `/skill-forge evolve` to address issues
4. Do NOT approve for publish until thresholds pass

## Error Handling

- **Flaky trials**: If a trial times out or crashes, exclude it from variance calculation and note `"trials_completed"` vs `"trials_requested"` in per-eval results
- **Insufficient trials**: If fewer than 2 trials complete for an eval, flag variance as `"unreliable"` in the report
- **Missing baseline**: If baseline runs fail entirely, report with-skill results only and skip improvement_ratio
- **Threshold edge cases**: If pass_rate equals the threshold exactly, treat as PASS

## Integration with Other Sub-Skills

- **skill-forge-eval**: Provides the eval set and grading infrastructure
- **skill-forge-evolve**: Receives benchmark failures as improvement targets
- **skill-forge-publish**: Requires benchmark pass (score >= thresholds) before publish
- **skill-forge-review**: Can include benchmark summary in review report

Related Skills

skill-forge-review

39
from AgriciDaniel/skill-forge

Audit and validate existing Claude Code skills for quality, triggering accuracy, structure compliance, and best practices. Scores skills on a 0-100 scale and provides prioritized improvement recommendations. Use when user says "review skill", "audit skill", "check skill", "validate skill", or "skill quality".

skill-forge-publish

39
from AgriciDaniel/skill-forge

Package and distribute Claude Code skills for sharing via GitHub, Claude.ai uploads, or team deployment. Creates install scripts, documentation, and .skill packages. Use when user says "publish skill", "share skill", "package skill", "distribute skill", or "release skill".

skill-forge-plan

39
from AgriciDaniel/skill-forge

Architecture and design planning for new Claude Code skills. Guides through use case definition, complexity tier selection, sub-skill decomposition, and file structure planning. Use when user says "plan skill", "design skill", "skill architecture", or "skill planning".

skill-forge-evolve

39
from AgriciDaniel/skill-forge

Improve and iterate on existing Claude Code skills based on usage feedback, test results, or changing requirements. Handles under/over-triggering fixes, instruction refinement, new sub-skill addition, and architecture evolution. Use when user says "improve skill", "fix skill", "skill not triggering", "skill triggers too much", "update skill", or "evolve skill".

skill-forge-eval

39
from AgriciDaniel/skill-forge

Run evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns executor, grader, comparator, and analyzer sub-agents for parallel evaluation. Generates eval_metadata.json, grading.json, and feedback reports. Use when user says "eval skill", "test skill", "run evals", "evaluate skill", "skill evals", "test skill quality", "run skill tests", or "skill evaluation".

skill-forge-convert

39
from AgriciDaniel/skill-forge

Convert Claude Code skills to work on OpenAI Codex, Google Gemini CLI, Google Antigravity, and Cursor. Analyzes platform-specific features, generates target files (openai.yaml, AGENTS.md, GEMINI.md, .mdc rules), adapts frontmatter, converts MCP config, and produces compatibility reports. Use when user says "convert skill", "port skill", "multi-platform", "skill for codex", "skill for gemini", "skill for antigravity", "skill for cursor", "cross-platform skill", "convert to codex", "convert to gemini", "convert to antigravity", or "convert to cursor".

skill-forge-build

39
from AgriciDaniel/skill-forge

Scaffold and build Claude Code skills from plans or descriptions. Generates SKILL.md files, sub-skills, scripts, references, agents, and templates following the Agent Skills standard. Use when user says "build skill", "scaffold skill", "generate skill", "create SKILL.md", or "implement skill".

skill-forge

39
from AgriciDaniel/skill-forge

Ultimate Claude Code skill creator and architect. Designs, scaffolds, builds, reviews, evolves, and publishes production-grade Claude Code skills following the Agent Skills open standard and 3-layer architecture (directive, orchestration, execution). Handles single-file skills, multi-skill orchestrators with sub-skills and subagents, MCP-enhanced workflows, and full skill ecosystems. Industry detection for skill domain. Triggers on: "create skill", "build skill", "new skill", "skill creator", "skill builder", "skill-forge", "design skill", "scaffold skill", "review skill", "improve skill", "publish skill", "skill architecture", "convert skill", "port skill", "multi-platform", "cross-platform", "eval skill", "test skill", "benchmark skill", "skill evals", "measure skill", "skill performance", "skill A/B test".

Compensation & Salary Benchmarking Planner

3891
from openclaw/skills

Build data-driven compensation structures that attract talent without overpaying. Covers base salary bands, equity/bonus frameworks, geographic differentials, and total rewards packaging.

HR & Compensation Management

ml-model-eval-benchmark

3891
from openclaw/skills

Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.

Machine Learning

benchmark

63951
from garrytan/gstack

Performance regression detection using the browse daemon. Establishes baselines for page load times, Core Web Vitals, and resource sizes. Compares before/after on every PR. Tracks performance trends over time. Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals", "bundle size", "load time". (gstack) Voice triggers (speech-to-text aliases): "speed test", "check performance".

plugin-forge

24269
from davila7/claude-code-templates

Create and manage Claude Code plugins with proper structure, manifests, and marketplace integration. Use when creating plugins for a marketplace, adding plugin components (commands, agents, hooks), bumping plugin versions, or working with plugin.json/marketplace.json manifests.