skill-forge-benchmark
Benchmark Claude Code skill performance with variance analysis, tracking pass rate, execution time, and token usage across iterations. Runs multiple trials per eval for statistical reliability, aggregates results into benchmark.json, and generates comparison reports between skill versions. Use when user says "benchmark skill", "measure skill performance", "skill metrics", "compare skill versions", "skill performance", "track skill improvement", "skill regression test", or "skill A/B test".
Best use case
skill-forge-benchmark is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using skill-forge-benchmark should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/skill-forge-benchmark/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How skill-forge-benchmark Compares
| Feature / Agent | skill-forge-benchmark | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
# Skill Benchmarking & Performance Tracking
Measure and compare skill performance across iterations with statistical
rigor using multiple trials, variance analysis, and trend tracking.
## Process
### Step 1: Define Benchmark Configuration
Accept configuration as:
- **Existing eval set**: Path to `evals/evals.json` (from `/skill-forge eval`)
- **Benchmark config**: Custom config with trial count and thresholds
**Benchmark config schema:**
```json
{
  "skill_name": "my-skill",
  "skill_path": "./my-skill",
  "eval_set_path": "./evals/evals.json",
  "trials_per_eval": 3,
  "baseline_type": "no_skill",
  "previous_benchmark": null,
  "thresholds": {
    "min_pass_rate": 0.8,
    "max_avg_tokens": 100000,
    "max_avg_duration_seconds": 120,
    "min_improvement_ratio": 1.0
  }
}
```
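As a rough illustration of how such a config might be consumed, the sketch below parses it and fills in missing optional fields. The function name `load_benchmark_config`, the choice of which fields are required, and the use of the example values as defaults are all assumptions made for this sketch; the skill itself does not specify them.

```python
import json

# Assumed defaults: these simply reuse the example values from the schema above.
DEFAULT_THRESHOLDS = {
    "min_pass_rate": 0.8,
    "max_avg_tokens": 100000,
    "max_avg_duration_seconds": 120,
    "min_improvement_ratio": 1.0,
}

# Assumption: these three fields are treated as required in this sketch.
REQUIRED_FIELDS = ("skill_name", "skill_path", "eval_set_path")


def load_benchmark_config(text: str) -> dict:
    """Parse a benchmark config (JSON text) and apply defaults (hypothetical helper)."""
    config = json.loads(text)
    for key in REQUIRED_FIELDS:
        if key not in config:
            raise ValueError(f"missing required field: {key}")

    config.setdefault("trials_per_eval", 3)
    config.setdefault("baseline_type", "no_skill")
    config.setdefault("previous_benchmark", None)

    # Merge user thresholds over the assumed defaults.
    thresholds = dict(DEFAULT_THRESHOLDS)
    thresholds.update(config.get("thresholds", {}))
    config["thresholds"] = thresholds

    if config["trials_per_eval"] < 1:
        raise ValueError("trials_per_eval must be at least 1")
    return config
```

Merging thresholds key by key lets a config override a single limit, such as `min_pass_rate`, without restating the others.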
### Step 2: Execute Benchmark Runs
For each eval, run `trials_per_eval` times (default: 3) to get reliable metrics:
1. Execute with-skill runs (`trials_per_eval` times per eval; 3 by default)
2. Execute baseline runs (`trials_per_eval` times per eval; 3 by default)
3. Capture per-run: pass/fail, token count, duration
4. Save each run's `timing.json` and `grading.json`
Use `agents/skill-forge-executor.md` for parallel execution where possible.
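The trial loop described above could look roughly like the following sketch. It is not the executor agent's actual implementation: `run_once` is a hypothetical callable standing in for whatever really executes an eval, and its return shape (`passed`, `tokens`) and the `status` field are assumptions.

```python
import time
from typing import Callable


def run_trials(eval_case: dict, run_once: Callable[[dict], dict], trials: int = 3) -> list[dict]:
    """Run one eval `trials` times, recording pass/fail, tokens, and duration.

    `run_once` is a hypothetical stand-in for the real executor; it is
    assumed to return {"passed": bool, "tokens": int}.
    """
    results = []
    for trial in range(trials):
        start = time.perf_counter()
        try:
            outcome = run_once(eval_case)
            status = "completed"
        except Exception as exc:
            # A crashed trial is still recorded so it can be excluded later.
            outcome = {"passed": False, "tokens": 0, "error": str(exc)}
            status = "crashed"
        results.append({
            "trial": trial,
            "status": status,
            "passed": bool(outcome.get("passed", False)),
            "tokens": int(outcome.get("tokens", 0)),
            "duration_seconds": time.perf_counter() - start,
        })
    return results
```

Calling this once with a with-skill runner and once with a baseline runner yields the two sets of per-run records that the aggregation step needs.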
### Step 3: Aggregate Results
Run `python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>`:
**Output `benchmark.json` schema:**
```json
{
  "skill_name": "my-skill",
  "iteration": 1,
  "timestamp": "2026-03-06T12:00:00Z",
  "summary": {
    "total_evals": 10,
    "with_skill": {
      "pass_rate": 0.87,
      "pass_rate_std": 0.05,
      "avg_tokens": 45000,
      "avg_duration_seconds": 34.2
    },
    "baseline": {
      "pass_rate": 0.60,
      "pass_rate_std": 0.08,
      "avg_tokens": 62000,
      "avg_duration_seconds": 52.1
    },
    "improvement_ratio": 1.45,
    "token_savings_ratio": 0.73,
    "time_savings_ratio": 0.66
  },
  "per_eval": [
    {
      "eval_id": 0,
      "eval_name": "basic-trigger",
      "with_skill": {"pass_rate": 1.0, "avg_tokens": 30000, "avg_duration_seconds": 20.1},
      "baseline": {"pass_rate": 0.67, "avg_tokens": 50000, "avg_duration_seconds": 45.0},
      "trials": 3
    }
  ],
  "thresholds_met": {
    "min_pass_rate": true,
    "max_avg_tokens": true,
    "max_avg_duration_seconds": true,
    "min_improvement_ratio": true
  }
}
```
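The example values suggest how the summary ratios relate to the two groups: 0.87 / 0.60 ≈ 1.45, 45000 / 62000 ≈ 0.73, and 34.2 / 52.1 ≈ 0.66. The sketch below assumes each ratio is simply with-skill divided by baseline. This is inferred from the sample numbers and is not taken from `aggregate_benchmark.py`. The helper names and the per-run record fields are hypothetical.

```python
from statistics import mean, pstdev


def summarize_runs(runs: list[dict]) -> dict:
    """Aggregate per-run records into pass rate, spread, and averages.

    Assumes each run looks like
    {"passed": bool, "tokens": int, "duration_seconds": float}.
    """
    outcomes = [1.0 if r["passed"] else 0.0 for r in runs]
    return {
        "pass_rate": round(mean(outcomes), 2),
        # Population standard deviation of the 0/1 outcomes (an assumption).
        "pass_rate_std": round(pstdev(outcomes), 2),
        "avg_tokens": round(mean(r["tokens"] for r in runs)),
        "avg_duration_seconds": round(mean(r["duration_seconds"] for r in runs), 1),
    }


def safe_ratio(numerator: float, denominator: float):
    """Return numerator / denominator rounded to 2 places, or None if undefined."""
    return round(numerator / denominator, 2) if denominator else None


def compare_groups(with_skill: dict, baseline: dict) -> dict:
    """Compute the three summary ratios as with-skill divided by baseline."""
    return {
        "improvement_ratio": safe_ratio(with_skill["pass_rate"], baseline["pass_rate"]),
        "token_savings_ratio": safe_ratio(with_skill["avg_tokens"], baseline["avg_tokens"]),
        "time_savings_ratio": safe_ratio(
            with_skill["avg_duration_seconds"], baseline["avg_duration_seconds"]
        ),
    }
```

Under this reading, a `token_savings_ratio` below 1.0 means the skill used fewer tokens than the baseline, and an `improvement_ratio` above 1.0 means it passed more often.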
### Step 4: Compare with Previous Iterations
If `previous_benchmark` is provided or a prior `iteration-<N-1>` exists:
1. Load previous `benchmark.json`
2. Calculate delta per metric:
   - Pass rate change
   - Token usage change
   - Duration change
   - New regressions (evals that passed before but fail now)
   - New improvements (evals that failed before but pass now)
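A minimal sketch of this comparison, operating on two loaded `benchmark.json` dictionaries, might look like the following. The function name and the `pass_cutoff` parameter are hypothetical: the skill does not say at what per-eval pass rate an eval counts as "passed", so the default here (at least half of the trials) is an assumption.

```python
def pct_change(current: float, previous: float):
    """Relative change in percent, or None when the previous value is zero."""
    if not previous:
        return None
    return round(100 * (current - previous) / previous, 1)


def compare_benchmarks(current: dict, previous: dict, pass_cutoff: float = 0.5) -> dict:
    """Compute metric deltas and per-eval regressions/improvements.

    `pass_cutoff` is an assumed definition of "passing": an eval passes
    when its with-skill pass rate is at or above this value.
    """
    cur = current["summary"]["with_skill"]
    prev = previous["summary"]["with_skill"]

    prev_by_id = {e["eval_id"]: e for e in previous["per_eval"]}
    regressions, improvements = [], []
    for entry in current["per_eval"]:
        old = prev_by_id.get(entry["eval_id"])
        if old is None:
            continue  # new eval: nothing to compare against
        passed_before = old["with_skill"]["pass_rate"] >= pass_cutoff
        passes_now = entry["with_skill"]["pass_rate"] >= pass_cutoff
        if passed_before and not passes_now:
            regressions.append(entry["eval_id"])
        elif passes_now and not passed_before:
            improvements.append(entry["eval_id"])

    return {
        # Absolute difference in pass rate (0.09 means 9 percentage points).
        "pass_rate_change": round(cur["pass_rate"] - prev["pass_rate"], 4),
        "token_change_pct": pct_change(cur["avg_tokens"], prev["avg_tokens"]),
        "duration_change_pct": pct_change(
            cur["avg_duration_seconds"], prev["avg_duration_seconds"]
        ),
        "regressions": regressions,
        "improvements": improvements,
    }
```

Evals that exist only in the current iteration are skipped, since there is nothing to compare them against.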
### Step 5: Generate Benchmark Report
```markdown
# Benchmark Report: [skill-name]
## Iteration [N] vs [N-1]
### Summary
| Metric | Current | Previous | Delta | Threshold | Status |
|--------|---------|----------|-------|-----------|--------|
| Pass Rate | 87% | 78% | +9 pp | >= 80% | PASS |
| Avg Tokens | 45K | 52K | -13% | <= 100K | PASS |
| Avg Time | 34s | 41s | -17% | <= 120s | PASS |
| Improvement | 1.45x | 1.30x | +0.15x | >= 1.0x | PASS |
### Regressions (Action Required)
| Eval | Previous | Current | Notes |
|------|----------|---------|-------|
| eval-5 | PASS | FAIL | Output missing required section |
### Improvements
| Eval | Previous | Current | Notes |
|------|----------|---------|-------|
| eval-3 | FAIL | PASS | Error handling now works |
### Per-Eval Detail
[Full breakdown table]
### Variance Analysis
| Eval | Pass Rate | Std Dev | Trials | Reliability |
|------|-----------|---------|--------|-------------|
| eval-0 | 100% | 0.00 | 3 | High |
| eval-1 | 67% | 0.47 | 3 | Low (investigate) |
### Recommendations
[Based on regressions, low-reliability evals, and threshold failures]
```
### Step 6: Threshold Gating
If any threshold fails:
1. Flag as **FAIL** with specific threshold details
2. List which evals caused the failure
3. Recommend running `/skill-forge evolve` to address issues
4. Do NOT approve for publish until thresholds pass
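The gating rule can be sketched as below, assuming the `summary` and `thresholds` shapes shown in the schemas above. The function names are hypothetical. Inclusive comparisons are used so that a value exactly on the threshold passes, and the improvement check is skipped when no ratio is available (for example, when there are no baseline results).

```python
def check_thresholds(summary: dict, thresholds: dict) -> dict:
    """Return a {threshold_name: bool} map in the style of `thresholds_met`."""
    with_skill = summary["with_skill"]
    checks = {
        # Inclusive comparisons: a value equal to the threshold counts as a pass.
        "min_pass_rate": with_skill["pass_rate"] >= thresholds["min_pass_rate"],
        "max_avg_tokens": with_skill["avg_tokens"] <= thresholds["max_avg_tokens"],
        "max_avg_duration_seconds": (
            with_skill["avg_duration_seconds"] <= thresholds["max_avg_duration_seconds"]
        ),
    }
    # Only check the improvement ratio when one was computed.
    ratio = summary.get("improvement_ratio")
    if ratio is not None:
        checks["min_improvement_ratio"] = ratio >= thresholds["min_improvement_ratio"]
    return checks


def gate(checks: dict) -> str:
    """Overall verdict: FAIL if any single threshold check failed."""
    return "PASS" if all(checks.values()) else "FAIL"
```

The names of the failed entries in the returned map give the "specific threshold details" that the failure report calls for.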
## Error Handling
- **Flaky trials**: If a trial times out or crashes, exclude it from variance calculation and note `"trials_completed"` vs `"trials_requested"` in per-eval results
- **Insufficient trials**: If fewer than 2 trials complete for an eval, flag variance as `"unreliable"` in the report
- **Missing baseline**: If baseline runs fail entirely, report with-skill results only and skip improvement_ratio
- **Threshold edge cases**: If pass_rate equals the threshold exactly, treat as PASS
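The flaky-trial and insufficient-trial rules can be illustrated with a short sketch. It assumes each trial record carries a `status` field, where `"completed"` marks a usable trial, and a boolean `passed`. Both field names, the function name, and the use of population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def eval_variance(trials: list[dict], trials_requested: int) -> dict:
    """Compute per-eval pass rate and spread, excluding flaky trials.

    Trials whose `status` is not "completed" (timeouts, crashes) are
    dropped before any statistics are computed.
    """
    completed = [t for t in trials if t["status"] == "completed"]
    outcomes = [1.0 if t["passed"] else 0.0 for t in completed]

    result = {
        "trials_requested": trials_requested,
        "trials_completed": len(completed),
        "pass_rate": round(mean(outcomes), 2) if outcomes else None,
    }
    if len(completed) < 2:
        # Too few data points for a meaningful spread.
        result["pass_rate_std"] = "unreliable"
    else:
        result["pass_rate_std"] = round(pstdev(outcomes), 2)
    return result
```

Population standard deviation is chosen here because it reproduces the report's example of a 67% pass rate over 3 trials with a standard deviation of 0.47.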
## Integration with Other Sub-Skills
- **skill-forge-eval**: Provides the eval set and grading infrastructure
- **skill-forge-evolve**: Receives benchmark failures as improvement targets
- **skill-forge-publish**: Requires a benchmark pass (all thresholds met) before publish
- **skill-forge-review**: Can include benchmark summary in review report
Related Skills
skill-forge-review
Audit and validate existing Claude Code skills for quality, triggering accuracy, structure compliance, and best practices. Scores skills on a 0-100 scale and provides prioritized improvement recommendations. Use when user says "review skill", "audit skill", "check skill", "validate skill", or "skill quality".
skill-forge-publish
Package and distribute Claude Code skills for sharing via GitHub, Claude.ai uploads, or team deployment. Creates install scripts, documentation, and .skill packages. Use when user says "publish skill", "share skill", "package skill", "distribute skill", or "release skill".
skill-forge-plan
Architecture and design planning for new Claude Code skills. Guides through use case definition, complexity tier selection, sub-skill decomposition, and file structure planning. Use when user says "plan skill", "design skill", "skill architecture", or "skill planning".
skill-forge-evolve
Improve and iterate on existing Claude Code skills based on usage feedback, test results, or changing requirements. Handles under/over-triggering fixes, instruction refinement, new sub-skill addition, and architecture evolution. Use when user says "improve skill", "fix skill", "skill not triggering", "skill triggers too much", "update skill", or "evolve skill".
skill-forge-eval
Run evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns executor, grader, comparator, and analyzer sub-agents for parallel evaluation. Generates eval_metadata.json, grading.json, and feedback reports. Use when user says "eval skill", "test skill", "run evals", "evaluate skill", "skill evals", "test skill quality", "run skill tests", or "skill evaluation".
skill-forge-convert
Convert Claude Code skills to work on OpenAI Codex, Google Gemini CLI, Google Antigravity, and Cursor. Analyzes platform-specific features, generates target files (openai.yaml, AGENTS.md, GEMINI.md, .mdc rules), adapts frontmatter, converts MCP config, and produces compatibility reports. Use when user says "convert skill", "port skill", "multi-platform", "skill for codex", "skill for gemini", "skill for antigravity", "skill for cursor", "cross-platform skill", "convert to codex", "convert to gemini", "convert to antigravity", or "convert to cursor".
skill-forge-build
Scaffold and build Claude Code skills from plans or descriptions. Generates SKILL.md files, sub-skills, scripts, references, agents, and templates following the Agent Skills standard. Use when user says "build skill", "scaffold skill", "generate skill", "create SKILL.md", or "implement skill".
skill-forge
Ultimate Claude Code skill creator and architect. Designs, scaffolds, builds, reviews, evolves, and publishes production-grade Claude Code skills following the Agent Skills open standard and 3-layer architecture (directive, orchestration, execution). Handles single-file skills, multi-skill orchestrators with sub-skills and subagents, MCP-enhanced workflows, and full skill ecosystems. Industry detection for skill domain. Triggers on: "create skill", "build skill", "new skill", "skill creator", "skill builder", "skill-forge", "design skill", "scaffold skill", "review skill", "improve skill", "publish skill", "skill architecture", "convert skill", "port skill", "multi-platform", "cross-platform", "eval skill", "test skill", "benchmark skill", "skill evals", "measure skill", "skill performance", "skill A/B test".
Compensation & Salary Benchmarking Planner
Build data-driven compensation structures that attract talent without overpaying. Covers base salary bands, equity/bonus frameworks, geographic differentials, and total rewards packaging.
ml-model-eval-benchmark
Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.
benchmark
Performance regression detection using the browse daemon. Establishes baselines for page load times, Core Web Vitals, and resource sizes. Compares before/after on every PR. Tracks performance trends over time. Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals", "bundle size", "load time". (gstack) Voice triggers (speech-to-text aliases): "speed test", "check performance".
plugin-forge
Create and manage Claude Code plugins with proper structure, manifests, and marketplace integration. Use when creating plugins for a marketplace, adding plugin components (commands, agents, hooks), bumping plugin versions, or working with plugin.json/marketplace.json manifests.