skill-forge-evolve
Improve and iterate on existing Claude Code skills based on usage feedback, test results, or changing requirements. Handles under/over-triggering fixes, instruction refinement, new sub-skill addition, and architecture evolution. Use when user says "improve skill", "fix skill", "skill not triggering", "skill triggers too much", "update skill", or "evolve skill".
Best use case
skill-forge-evolve is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using skill-forge-evolve should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/skill-forge-evolve/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How skill-forge-evolve Compares
| Feature / Agent | skill-forge-evolve | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
# Skill Evolution & Improvement
## Process
### Step 1: Diagnose the Issue
Ask the user or analyze logs to identify the problem category:
**Category A: Triggering Issues**
- Under-triggering: Skill doesn't activate when it should
- Over-triggering: Skill activates when it shouldn't
- Mis-triggering: Wrong sub-skill activates
**Category B: Execution Issues**
- Incomplete workflows: Skill stops before finishing
- Incorrect output: Results don't match expectations
- Missing error handling: Failures not handled gracefully
- Performance: Too slow or too many tokens used
**Category C: Architecture Issues**
- Missing capability: New use case not covered
- Scale issues: Skill too large, needs decomposition
- Cross-reference problems: Links to non-existent files
**Category D: Quality Issues**
- Inconsistent results: Different outputs for same input
- Vague instructions: Claude interprets differently each time
- Missing examples: No concrete guidance
### Step 2: Apply Fix by Category
#### Fix: Under-Triggering
1. Read current description
2. Identify missing trigger phrases
3. Add domain keywords and paraphrases
4. Add file type mentions if relevant
5. Test with 5 queries that should now trigger
**Common causes:**
- Description too generic
- Missing common paraphrases
- Technical jargon without lay terms
**Fix template:**
```yaml
# Before (under-triggers)
description: Analyzes code quality
# After (specific triggers)
description: >
  Static code analysis and quality assessment. Checks code style,
  complexity, security vulnerabilities, and test coverage. Use when
  user says "code review", "code quality", "lint", "static analysis",
  "code smell", "code audit", or "check my code".
```
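Steps 2 and 3 of this fix can be approximated with a small check that lists which expected trigger phrases are absent from a description. This is only a sketch: `missing_triggers` is a hypothetical helper, and a plain substring match is an assumption that does not model how Claude actually decides to activate a skill.

```python
def missing_triggers(description, expected_phrases):
    """Return the expected phrases that never appear in the description.

    Matching is a case-insensitive substring test, which is only a rough
    proxy for real triggering behavior.
    """
    text = description.lower()
    return [p for p in expected_phrases if p.lower() not in text]


before = "Analyzes code quality"
after = (
    "Static code analysis and quality assessment. Checks code style, "
    "complexity, security vulnerabilities, and test coverage. Use when "
    'user says "code review", "code quality", "lint", "static analysis", '
    '"code smell", "code audit", or "check my code".'
)
phrases = ["code review", "lint", "static analysis", "check my code"]

print(missing_triggers(before, phrases))  # all four phrases are missing
print(missing_triggers(after, phrases))   # []
```

An empty result only shows that the phrases are present; the five test queries in step 5 are still needed to confirm the skill triggers.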
#### Fix: Over-Triggering
1. Read current description
2. Identify why unrelated queries trigger it
3. Add negative triggers ("Do NOT use for...")
4. Make description more specific
5. Test with 5 queries that should NOT trigger
**Fix template:**
```yaml
# Before (over-triggers)
description: Processes documents for review
# After (specific + negative triggers)
description: >
  Processes PDF legal documents for contract clause extraction and
  compliance review. Use for legal contracts, NDAs, terms of service.
  Do NOT use for general document editing, formatting, or non-legal PDFs.
```
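To confirm that a rewritten description really carries negative triggers, a short script can pull out the "Do NOT use for" clause. The sketch below assumes the exclusions are written as a single comma-separated sentence in exactly that wording; `negative_triggers` is a hypothetical helper, not part of skill-forge.

```python
import re


def negative_triggers(description):
    """Extract the items listed after 'Do NOT use for', or [] if absent."""
    match = re.search(r"Do NOT use for\s+(.+?)\.(?:\s|$)", description)
    if not match:
        return []
    # Split on commas (with an optional "or") and on a bare " or ".
    parts = re.split(r",\s*(?:or\s+)?|\s+or\s+", match.group(1))
    return [p.strip() for p in parts if p.strip()]


before = "Processes documents for review"
after = (
    "Processes PDF legal documents for contract clause extraction and "
    "compliance review. Use for legal contracts, NDAs, terms of service. "
    "Do NOT use for general document editing, formatting, or non-legal PDFs."
)

print(negative_triggers(before))  # []
print(negative_triggers(after))
# ['general document editing', 'formatting', 'non-legal PDFs']
```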
#### Fix: Execution Issues
1. Identify the failing step in the workflow
2. Add explicit validation gates between steps
3. Add error handling with clear recovery instructions
4. Add "If X fails, then Y" fallback paths
5. Consider adding a script for fragile operations
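When a fragile operation is moved into a script (step 5), the validation gates and "If X fails, then Y" fallbacks from steps 2 to 4 might look roughly like this. It is a minimal sketch: `run_with_gates`, the `(name, step, gate, fallback)` tuple layout, and the toy steps are all assumptions made for illustration.

```python
def run_with_gates(state, steps):
    """Run steps in order; each result must pass its gate before continuing.

    steps is a list of (name, step, gate, fallback) tuples. If a step raises
    or its gate rejects the result, the fallback (when given) is tried once.
    """
    for name, step, gate, fallback in steps:
        try:
            candidate = step(dict(state))
        except Exception:
            candidate = None
        if candidate is None or not gate(candidate):
            if fallback is None:
                raise RuntimeError(f"step {name!r} failed and has no fallback")
            candidate = fallback(dict(state))
            if not gate(candidate):
                raise RuntimeError(f"fallback for step {name!r} failed its gate")
        state = candidate
    return state


def parse(state):
    state["value"] = int(state["raw"])  # raises ValueError on bad input
    return state


def parse_fallback(state):
    state["value"] = 0  # recovery path: use a safe default
    return state


def double(state):
    state["value"] *= 2
    return state


steps = [
    ("parse", parse, lambda s: "value" in s, parse_fallback),
    ("double", double, lambda s: isinstance(s["value"], int), None),
]

print(run_with_gates({"raw": "21"}, steps)["value"])    # 42
print(run_with_gates({"raw": "oops"}, steps)["value"])  # 0
```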
#### Fix: Quality Issues
1. Replace vague instructions with specific ones
2. Add concrete examples of expected input/output
3. Add explicit "do this, not that" comparisons
4. Add quality check steps before final output
5. Consider adding a validation script
### Step 3: Iteration Workspace Protocol
Use structured workspaces to track improvements across iterations:
```
eval-workspace/
  iteration-1/              # First version
    eval-0/with_skill/      # Eval results
    eval-0/baseline/
    benchmark.json          # Aggregated metrics
    benchmark.md            # Human-readable report
    feedback.json           # User feedback
  iteration-2/              # After first improvement
    eval-0/with_skill/
    eval-0/baseline/
    benchmark.json
    benchmark.md
    feedback.json
```
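The directory skeleton above can be created ahead of a run with a few lines of `pathlib`. This sketch assumes the layout shown in the tree and only creates the directories; `benchmark.json`, `benchmark.md`, and `feedback.json` are left for the eval and benchmark steps to write. `scaffold_iteration` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def scaffold_iteration(workspace, n, evals=1):
    """Create iteration-<n>/eval-<i>/{with_skill,baseline}/ under workspace."""
    iteration = Path(workspace) / f"iteration-{n}"
    for i in range(evals):
        for variant in ("with_skill", "baseline"):
            (iteration / f"eval-{i}" / variant).mkdir(parents=True, exist_ok=True)
    return iteration


# Demonstrate in a throwaway directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    iteration = scaffold_iteration(Path(tmp) / "eval-workspace", 2)
    created = sorted(
        p.relative_to(iteration).as_posix()
        for p in iteration.rglob("*")
        if p.is_dir()
    )

print(created)  # ['eval-0', 'eval-0/baseline', 'eval-0/with_skill']
```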
**The iteration loop:**
1. Apply the fix to the skill
2. Run `/skill-forge eval <path>`, writing results into `iteration-<N+1>/`
3. Run `/skill-forge benchmark <path>` with `--previous iteration-<N>/`
4. Review benchmark comparison for regressions
5. Collect user feedback into `feedback.json`
6. Read feedback and iterate (back to Step 2)
**Stop iterating when:**
- User says they're happy
- All feedback is empty (everything looks good)
- Benchmark shows no meaningful progress between iterations
- Pass rate meets the defined thresholds
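The stop conditions can be written down as one function so that every iteration is judged the same way. In this sketch the `threshold` and `min_gain` values are placeholders chosen for illustration (the source does not define them), and `stop_reason` is a hypothetical helper that takes one pass rate per iteration and the latest feedback notes.

```python
def stop_reason(pass_rates, feedback, threshold, min_gain, user_happy=False):
    """Return why iteration should stop, or None to keep going.

    pass_rates: one pass rate per iteration, oldest first.
    feedback:   feedback notes from the latest iteration ("" means no comment).
    """
    if user_happy:
        return "user is happy"
    if feedback and all(not note.strip() for note in feedback):
        return "all feedback is empty"
    if pass_rates and pass_rates[-1] >= threshold:
        return "pass rate meets threshold"
    if len(pass_rates) >= 2 and pass_rates[-1] - pass_rates[-2] < min_gain:
        return "no meaningful progress"
    return None


# Assumed values: stop at a 90% pass rate or when gains drop below 2 points.
print(stop_reason([0.60, 0.75], ["tighten step 2"], 0.9, 0.02))        # None
print(stop_reason([0.60, 0.75, 0.76], ["tighten step 2"], 0.9, 0.02))  # no meaningful progress
print(stop_reason([0.60, 0.92], ["tighten step 2"], 0.9, 0.02))        # pass rate meets threshold
```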
### Step 3b: Self-Annealing Loop
For quick fixes without full eval pipeline:
```
1. Apply the fix
2. Test with the original failing case
3. Test with 3 other cases (regression check)
4. If fix works:
   -> Update the directive/SKILL.md
   -> Document the learning in references or SKILL.md
5. If fix fails:
   -> Diagnose why
   -> Try alternative approach
   -> Repeat
```
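The loop above can be sketched as a function that tries candidate fixes until one passes both the original failing case and the regression cases. Everything in the example is a stand-in: `self_anneal` and `passes` are hypothetical, and the toy `passes` treats a fix as a set of trigger keywords, which is far simpler than a real skill test.

```python
def self_anneal(candidate_fixes, failing_case, regression_cases, passes):
    """Return the first fix that passes the failing case and every regression case."""
    for fix in candidate_fixes:
        if passes(fix, failing_case) and all(passes(fix, c) for c in regression_cases):
            return fix  # fix works: record it in SKILL.md or references
    return None  # every fix failed: diagnose again and propose new candidates


def passes(fix, case):
    """Toy check: the fix 'triggers' when any of its keywords is in the query."""
    query, should_trigger = case
    return any(keyword in query for keyword in fix) == should_trigger


fixes = [{"code"}, {"code review", "lint"}]
failing_case = ("please lint this file", True)
regression_cases = [
    ("do a code review", True),
    ("write a poem about code", False),
    ("book a flight", False),
]

chosen = self_anneal(fixes, failing_case, regression_cases, passes)
print(sorted(chosen))  # ['code review', 'lint']
```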
### Step 3c: Description Optimization Loop
For triggering issues (Category A), use the automated optimization loop:
1. Generate trigger eval set: `python scripts/generate_eval_set.py <path>`
2. Review and refine the eval set with the user
3. Run optimization: `python scripts/optimize_description.py <path> --eval-set evals.json`
4. Review the train/test split scores and improvement suggestions
5. Apply suggested description changes
6. Re-run optimization to measure improvement
7. Select the description with the highest **test score**, not the train score, to avoid overfitting
8. Iterate up to 5 times or until test score plateaus
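Steps 7 and 8 amount to a small selection rule over the scores from each round. The sketch below is not the logic of `optimize_description.py`; `Candidate`, `select_description`, the `min_gain` plateau cutoff, and the scores are all assumptions used to show why the test score, rather than the train score, decides the winner.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    description: str
    train_score: float
    test_score: float


def select_description(history, max_rounds=5, min_gain=0.01):
    """Pick the best test score among the rounds run before the plateau.

    history must be non-empty; each entry is one optimization round.
    """
    considered = [history[0]]
    for candidate in history[1:max_rounds]:
        best_so_far = max(c.test_score for c in considered)
        considered.append(candidate)
        if candidate.test_score - best_so_far < min_gain:
            break  # test score plateaued (or regressed): stop iterating
    return max(considered, key=lambda c: c.test_score)


history = [
    Candidate("v1", train_score=0.70, test_score=0.60),
    Candidate("v2", train_score=0.85, test_score=0.72),
    Candidate("v3", train_score=0.95, test_score=0.68),  # overfits the train set
]

print(select_description(history).description)  # v2
```

Here `v3` has the best train score but a lower test score than `v2`, so `v2` is kept.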
### Step 4: Architecture Evolution
When a skill outgrows its tier:
**Tier 1 -> Tier 2** (needs scripts):
1. Identify the fragile/deterministic operation
2. Create script in `scripts/`
3. Update SKILL.md to reference the script
4. Test script independently
**Tier 2 -> Tier 3** (needs sub-skills):
1. Identify distinct workflows that can be separated
2. Extract each into its own `skills/{parent}-{child}/SKILL.md`
3. Update main SKILL.md with routing table
4. Move shared knowledge to `references/`
5. Update install.sh
**Tier 3 -> Tier 4** (needs agents):
1. Identify workflows that can run in parallel
2. Create agent definitions in `agents/`
3. Update the audit/full-analysis sub-skill to delegate to agents
4. Test parallel execution
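A quick way to see which tier a skill currently occupies is to look for the directories each tier introduces. This sketch assumes `scripts/`, `skills/`, and `agents/` all live directly inside the skill directory and that a skill with none of them is Tier 1; `detect_tier` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def detect_tier(skill_dir):
    """Guess the tier from which directories exist (assumed layout)."""
    skill_dir = Path(skill_dir)
    if (skill_dir / "agents").is_dir():
        return 4  # agent definitions present
    if (skill_dir / "skills").is_dir():
        return 3  # sub-skills present
    if (skill_dir / "scripts").is_dir():
        return 2  # helper scripts present
    return 1      # SKILL.md only


with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "my-skill"
    skill.mkdir()
    (skill / "SKILL.md").write_text("# demo\n")
    tier_before = detect_tier(skill)
    (skill / "scripts").mkdir()  # Tier 1 -> Tier 2: a script was added
    tier_after = detect_tier(skill)

print(tier_before, tier_after)  # 1 2
```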
### Step 5: Version Management
After evolution:
1. Update `metadata.version` in frontmatter (if present)
2. Add learning to SKILL.md or reference file
3. Update any affected cross-references
4. Re-run validation: `python scripts/validate_skill.py <path>`
5. Test full workflow end-to-end
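Step 1 can be scripted if the frontmatter stores a semantic version. The sketch below assumes a `version: X.Y.Z` line indented under `metadata:` inside a leading `---` block, and it bumps only the patch number; `bump_patch` is a hypothetical helper, and a file without such a field is returned unchanged.

```python
import re


def bump_patch(skill_md):
    """Increment the patch part of an indented `version: X.Y.Z` in the frontmatter."""
    front = re.match(r"\A---\n(.*?)\n---\n", skill_md, flags=re.DOTALL)
    if not front:
        return skill_md  # no frontmatter: nothing to update

    def _bump(m):
        return f"{m.group(1)}{m.group(2)}.{m.group(3)}.{int(m.group(4)) + 1}"

    new_front, count = re.subn(
        r"^(\s+version:\s*[\"']?)(\d+)\.(\d+)\.(\d+)",
        _bump,
        front.group(1),
        count=1,
        flags=re.MULTILINE,
    )
    if count == 0:
        return skill_md  # no metadata.version present
    return skill_md[: front.start(1)] + new_front + skill_md[front.end(1):]


doc = "---\nname: demo\nmetadata:\n  version: 1.2.3\n---\n# Demo\n"
print(bump_patch(doc))
```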
## Common Evolution Patterns
### Pattern: Adding Industry Detection
When a skill needs to adapt behavior by user type:
```markdown
## Industry Detection
Detect user type from context:
- **Type A**: [signals] -> [behavior]
- **Type B**: [signals] -> [behavior]
```
### Pattern: Adding Quality Gates
When output quality is inconsistent:
```markdown
## Quality Gates
Before delivering output:
- [ ] [Check 1]
- [ ] [Check 2]
- [ ] [Check 3]
```
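If some of the checks in a quality gate are mechanical, they can be moved into a validation script. This is a sketch under stated assumptions: the three gates are invented examples, and `failed_gates` is a hypothetical helper that returns the names of the checks an output does not satisfy.

```python
def failed_gates(output, gates):
    """Return the names of the gates that the output fails, in declaration order."""
    return [name for name, check in gates.items() if not check(output)]


# Example gates only; a real skill would define checks for its own output.
gates = {
    "non-empty": lambda text: bool(text.strip()),
    "has summary heading": lambda text: "## Summary" in text,
    "under 200 words": lambda text: len(text.split()) < 200,
}

draft = "## Findings\nTwo issues found."
print(failed_gates(draft, gates))  # ['has summary heading']
```

An empty list means every gate passed and the output can be delivered.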
### Pattern: Adding Scoring
When users need measurable output:
```markdown
## Scoring (0-100)
| Category | Weight |
|----------|--------|
| Category A | 30% |
| Category B | 30% |
| Category C | 20% |
| Category D | 20% |
```
Related Skills
skill-forge-review
Audit and validate existing Claude Code skills for quality, triggering accuracy, structure compliance, and best practices. Scores skills on a 0-100 scale and provides prioritized improvement recommendations. Use when user says "review skill", "audit skill", "check skill", "validate skill", or "skill quality".
skill-forge-publish
Package and distribute Claude Code skills for sharing via GitHub, Claude.ai uploads, or team deployment. Creates install scripts, documentation, and .skill packages. Use when user says "publish skill", "share skill", "package skill", "distribute skill", or "release skill".
skill-forge-plan
Architecture and design planning for new Claude Code skills. Guides through use case definition, complexity tier selection, sub-skill decomposition, and file structure planning. Use when user says "plan skill", "design skill", "skill architecture", or "skill planning".
skill-forge-eval
Run evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns executor, grader, comparator, and analyzer sub-agents for parallel evaluation. Generates eval_metadata.json, grading.json, and feedback reports. Use when user says "eval skill", "test skill", "run evals", "evaluate skill", "skill evals", "test skill quality", "run skill tests", or "skill evaluation".
skill-forge-convert
Convert Claude Code skills to work on OpenAI Codex, Google Gemini CLI, Google Antigravity, and Cursor. Analyzes platform-specific features, generates target files (openai.yaml, AGENTS.md, GEMINI.md, .mdc rules), adapts frontmatter, converts MCP config, and produces compatibility reports. Use when user says "convert skill", "port skill", "multi-platform", "skill for codex", "skill for gemini", "skill for antigravity", "skill for cursor", "cross-platform skill", "convert to codex", "convert to gemini", "convert to antigravity", or "convert to cursor".
skill-forge-build
Scaffold and build Claude Code skills from plans or descriptions. Generates SKILL.md files, sub-skills, scripts, references, agents, and templates following the Agent Skills standard. Use when user says "build skill", "scaffold skill", "generate skill", "create SKILL.md", or "implement skill".
skill-forge-benchmark
Benchmark Claude Code skill performance with variance analysis, tracking pass rate, execution time, and token usage across iterations. Runs multiple trials per eval for statistical reliability, aggregates results into benchmark.json, and generates comparison reports between skill versions. Use when user says "benchmark skill", "measure skill performance", "skill metrics", "compare skill versions", "skill performance", "track skill improvement", "skill regression test", or "skill A/B test".
skill-forge
Ultimate Claude Code skill creator and architect. Designs, scaffolds, builds, reviews, evolves, and publishes production-grade Claude Code skills following the Agent Skills open standard and 3-layer architecture (directive, orchestration, execution). Handles single-file skills, multi-skill orchestrators with sub-skills and subagents, MCP-enhanced workflows, and full skill ecosystems. Industry detection for skill domain. Triggers on: "create skill", "build skill", "new skill", "skill creator", "skill builder", "skill-forge", "design skill", "scaffold skill", "review skill", "improve skill", "publish skill", "skill architecture", "convert skill", "port skill", "multi-platform", "cross-platform", "eval skill", "test skill", "benchmark skill", "skill evals", "measure skill", "skill performance", "skill A/B test".
evolve
Evolve SDK development for TypeScript and Python. Use when building applications with Evolve to run AI agents (Claude, Codex, Gemini, Qwen, Kimi, OpenCode) in secure sandboxes. Triggers: (1) Creating Evolve applications, (2) Configuring agents with skills, Composio, MCP servers, (3) Using Swarm abstractions (map, filter, reduce, bestOf/best_of, verify), (4) Building Pipelines, (5) Structured output with schemas, (6) Session management, streaming, observability, (7) Checkpointing, storage & StorageClient, (8) Cost tracking (per-run and per-session spend), (9) Historical sessions & trace download via sessions() client.
plugin-forge
Create and manage Claude Code plugins with proper structure, manifests, and marketplace integration. Use when creating plugins for a marketplace, adding plugin components (commands, agents, hooks), bumping plugin versions, or working with plugin.json/marketplace.json manifests.
torchforge-rl-training
Provides guidance for PyTorch-native agentic RL using torchforge, Meta's library separating infra from algorithms. Use when you want clean RL abstractions, easy algorithm experimentation, or scalable training with Monarch and TorchTitan.
exploiting-server-side-request-forgery
Identifying and exploiting SSRF vulnerabilities to access internal services, cloud metadata, and restricted network resources during authorized penetration tests.