toolkit-evolution

Closed-loop toolkit self-improvement: diagnose, propose, critique, build, test, evolve.

290 stars

Best use case

toolkit-evolution is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Closed-loop toolkit self-improvement: diagnose, propose, critique, build, test, evolve.

Teams using toolkit-evolution should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/toolkit-evolution/SKILL.md --create-dirs "https://raw.githubusercontent.com/notque/claude-code-toolkit/main/skills/toolkit-evolution/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/toolkit-evolution/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How toolkit-evolution Compares

Feature / Agenttoolkit-evolutionStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Closed-loop toolkit self-improvement: diagnose, propose, critique, build, test, evolve.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Toolkit Evolution

Schedulable (nightly) or manually-invoked pipeline that drives continuous improvement of the toolkit itself. Chains existing skills into a full closed-loop improvement cycle: diagnose problems from evidence, propose solutions, critique them from multiple perspectives, build the winners, A/B test against baselines, and promote winners via PR.

This is the nightly sibling of `auto-dream`. Auto-dream (2:07 AM) consolidates memories, graduates learnings, and prunes stale data. Toolkit-evolution (3:07 AM) diagnoses gaps, proposes features, builds and tests improvements. They feed each other: dream's graduated learnings inform evolution's diagnosis; evolution's results become dream's input for consolidation.

## When to invoke

- User says "evolve toolkit", "improve the system", "self-improve", "what should we improve"
- Cron job weekly (Sunday 3 AM) via wrapper script
- Manual trigger with optional focus area: `/evolve routing`, `/evolve hooks`

## Instructions

### Phase 1: DIAGNOSE -- Find improvement opportunities

**Goal**: Identify 5-10 evidence-backed improvement opportunities from multiple data sources.

**Step 1: Query the learning database for recent failures and routing mismatches**

```bash
python3 ~/.claude/scripts/learning-db.py search "routing decision mismatch reroute" --min-confidence 0.3 --limit 20
python3 ~/.claude/scripts/learning-db.py search "error pattern failure bug" --min-confidence 0.3 --limit 20
python3 ~/.claude/scripts/learning-db.py search "skill gap missing improvement" --min-confidence 0.3 --limit 20
```

Look for: recurring failures, routing mismatches where the user had to reroute, skills that consistently underperform, error patterns without automated fixes.

**Step 2: Scan recent git history for patterns**

```bash
# Frequent fixes to same areas suggest chronic issues
git log --oneline --since="2 weeks ago" | head -40

# Files changed most frequently (churn = potential problems)
git log --since="2 weeks ago" --pretty=format: --name-only | sort | uniq -c | sort -rn | head -20
```

**Step 3: Check auto-dream reports for accumulated insights**

```bash
ls -t ~/.claude/state/dream-* 2>/dev/null | head -5
# Read the most recent dream report for synthesized insights
```

If dream reports exist, read the latest one -- it contains cross-session patterns and graduation candidates that may point to improvement opportunities.

**Step 4: Narrow by focus area (if provided)**

If the user specified a focus area (e.g., "routing", "hooks", "agents"), filter all findings to that domain. If no focus area, analyze broadly.

**Step 5: Compile opportunity list**

Output a numbered list of 5-10 improvement opportunities. Each entry must include:
- **What**: One-sentence description of the problem or gap
- **Evidence**: Which data source surfaced it (learning DB entry, git churn, dream report)
- **Impact**: Estimated user impact (High/Medium/Low)

**Gate**: At least 3 evidence-backed opportunities identified. If fewer than 3, expand the time window or broaden the data sources. Do not proceed with speculative opportunities that lack evidence.

---

### Phase 2: PROPOSE -- Generate concrete solutions

**Goal**: Transform opportunities into actionable proposals with clear scope.

**Step 1: Generate proposals**

For each opportunity from Phase 1, propose 1-2 concrete solutions. Each proposal must be actionable:
- "Add anti-pattern X to agent Y's prompt" (not "improve agent Y")
- "Create a reference file for Z in skill W" (not "enhance skill W")
- "Modify Phase 3 of skill V to include check for Q" (not "make skill V better")

**Step 2: Estimate effort**

| Effort | Definition |
|--------|-----------|
| Small | Single file edit, <30 lines changed |
| Medium | 2-5 files, new reference or script, <200 lines |
| Large | New skill or agent, multiple components, >200 lines |

**Step 3: Check for duplicates**

```bash
# Verify proposals don't duplicate existing capabilities
cat skills/INDEX.json | python3 -c "import sys,json; idx=json.load(sys.stdin); [print(k,'-',v.get('description','')) for k,v in idx.get('skills',{}).items()]" 2>/dev/null || echo "INDEX.json parse failed -- check manually"
```

Drop any proposal that duplicates an existing skill or capability. If an existing skill could be extended instead, frame the proposal as an extension.

**Step 4: Rank proposals**

Rank by: (Impact score) x (1 / Effort score), where High=3, Medium=2, Low=1 and Small=1, Medium=2, Large=3. Higher rank = better return on investment.

Output: ranked list of 5-10 proposals, each with:
- **Proposal**: 2-4 sentence description
- **Scope**: What files/skills are affected
- **Effort**: Small/Medium/Large
- **Expected outcome**: What measurably improves

**Gate**: All proposals are concrete (specific files/skills named), non-duplicative (verified against INDEX.json), and ranked. Proceed with the top 5.

---

### Phase 3: CRITIQUE -- Multi-persona evaluation

**Goal**: Evaluate proposals from multiple perspectives to surface blind spots.

**Step 1: Check for multi-persona-critique skill**

```bash
test -f skills/multi-persona-critique/SKILL.md && echo "AVAILABLE" || echo "NOT AVAILABLE"
```

**Step 2a: If multi-persona-critique is available**

Invoke it with the ranked proposals:

```
Skill(skill="multi-persona-critique", args="Evaluate these toolkit improvement proposals: {proposals}")
```

Collect consensus ratings and proceed to Step 3.

**Step 2b: If multi-persona-critique is NOT available -- use inline fallback**

Dispatch 3 parallel agents, each with a distinct evaluator perspective:

**Pragmatist**: "You are a pragmatist engineer. Evaluate each proposal on: implementation feasibility, time-to-value, and risk of unintended side effects. Rate each STRONG/MODERATE/WEAK with one sentence of justification."

**Purist**: "You are an architecture purist. Evaluate each proposal on: design coherence with existing toolkit patterns, long-term maintainability, and whether it solves a root cause vs a symptom. Rate each STRONG/MODERATE/WEAK with one sentence of justification."

**User Advocate**: "You are a user advocate. Evaluate each proposal on: how often real users hit this problem, whether the solution reduces friction, and whether it introduces new complexity users must learn. Rate each STRONG/MODERATE/WEAK with one sentence of justification."

**Step 3: Synthesize consensus**

For each proposal, compute a consensus score:
- STRONG = 3, MODERATE = 2, WEAK = 1
- Average across personas
- Score >= 2.5 = STRONG consensus
- Score 1.5-2.4 = MODERATE consensus
- Score < 1.5 = WEAK consensus (shelve)

Output: ranked proposals with consensus scores, key concerns from each persona, and notable disagreements.

**Gate**: All personas have reported. Synthesis complete. At least 1 proposal rated STRONG. If no STRONG proposals, revisit Phase 2 with the critique feedback, or report to user that no high-confidence improvements were found this cycle.

---

### Phase 4: BUILD -- Implement winners

**Goal**: Implement the top 1-3 STRONG-rated proposals on isolated feature branches.

**Constraint**: Maximum 3 implementations per cycle. Focus over breadth -- doing 3 well beats doing 7 poorly.

**Step 1: Select winners**

Take the top 1-3 proposals rated STRONG by consensus. If fewer than 3 are STRONG, take only the STRONG ones -- do not pad with MODERATE proposals.

**Step 2: Dispatch implementation agents**

For each winner, dispatch an appropriate agent in an isolated context:

| Proposal type | Implementation approach |
|--------------|----------------------|
| New skill | Use skill-creator methodology: draft SKILL.md, create references, structure directory |
| Skill modification | Read the target skill, apply the specific change, validate structure |
| New hook | Create hook script, register in settings.json (deploy hook files BEFORE registering) |
| Routing change | Update routing tables, verify with routing-table-updater |
| New reference file | Write the reference, add pointer in the parent skill's SKILL.md |
| Agent modification | Edit agent prompt, preserve frontmatter and routing metadata |

Each implementation must:
- Create a feature branch: `feat/evolve-{proposal-slug}`
- Make the minimal set of changes described in the proposal
- Commit with a descriptive message explaining what and why

**Step 3: Validate implementations**

For each implementation:
```bash
# Verify the skill structure if a skill was created/modified
python3 -m scripts.skill_eval.quick_validate skills/{skill-name} 2>/dev/null || echo "Validation script not available -- manual review"

# Verify no syntax errors in Python scripts
python3 -m py_compile {script} 2>/dev/null

# Verify shell scripts
bash -n {script} 2>/dev/null
```

**Gate**: All implementations committed on feature branches. Basic validation passed. Proceed to testing.

---

### Phase 5: VALIDATE -- A/B test implementations

**Goal**: Empirically verify that each implementation improves outcomes vs baseline.

**Step 1: Create test cases**

For each implementation, create 3-5 realistic test prompts that exercise the changed behavior. These should be the kind of input that would trigger the relevant skill or agent in production.

**Step 2: Run comparisons**

Use the skill-eval or agent-comparison methodology:
- **Baseline**: Run test prompts against the current (unmodified) toolkit
- **Candidate**: Run test prompts against the branch with the implementation

If skill-eval's evaluation modes are available:
```bash
python3 -m scripts.skill_eval.run_eval \
  --eval-set test-cases.json \
  --skill-path skills/{skill-name} \
  --runs-per-query 3 \
  --verbose
```

If automated comparison is not available, fall back to:
- Run each test prompt manually with and without the change
- Use a grader agent to score both outputs on relevant dimensions (correctness, completeness, actionability)

**Step 3: Evaluate results**

Win condition for each implementation:
- 60%+ of test cases show improvement on at least one dimension
- No dimension regressed by more than 1 point (on a 5-point scale)
- No new failures introduced

Mark each implementation as WIN or LOSS.

**Gate**: All implementations tested. Win/loss determined for each. Evidence recorded.

---

### Phase 6: EVOLVE -- Promote winners and record learnings

**Goal**: Ship winners via PR, record all outcomes in the learning database.

**Step 1: Handle winners (WIN status)**

For each winning implementation, create a PR, run pr-review, and merge:

```bash
git push -u origin feat/evolve-{proposal-slug}
gh pr create \
  --title "feat: {short description of improvement}" \
  --body "## Summary
- Evolution cycle proposal: {proposal description}
- Consensus score: {score} (Pragmatist: {rating}, Purist: {rating}, User Advocate: {rating})
- A/B result: {win rate}% improvement across {N} test cases

## Changes
{list of specific changes}

## Test Results
| Test Case | Baseline | Candidate | Delta |
|-----------|----------|-----------|-------|
| ... | ... | ... | ... |

## Evolution Cycle
This PR was generated and validated by the toolkit-evolution skill."
```

After creating the PR, run pr-review to validate, then merge:

```bash
# Review the PR (catches issues the A/B test may have missed)
# Use the pr-workflow skill's review capability
gh pr merge {pr-number} --squash --delete-branch
```

The multi-persona critique + A/B testing gate is the review. If a proposal passed both with STRONG consensus and WIN status, it has been validated more rigorously than most human reviews. Auto-merge is safe because the validation happened before this step, not after.

**Step 2: Handle losers (LOSS status)**

Record what was tried and why it failed:

```bash
python3 ~/.claude/scripts/learning-db.py learn \
  --topic "evolution-result" \
  "Failed proposal: {description}. Hypothesis: {what we expected}. Result: {what happened}. Lesson: {what we learned}."
```

Failed experiments are valuable data -- they prevent the same idea from being re-proposed in future cycles.

**Step 3: Record the full cycle**

```bash
python3 ~/.claude/scripts/learning-db.py learn \
  --topic "evolution-cycle" \
  "toolkit-evolution cycle: {N} proposals evaluated, {M} built, {W} winners, {L} losses. Top win: {description}. Focus: {area or 'general'}."
```

**Step 4: Write evolution report**

Write a dated report using the template in `references/evolution-report-template.md`:

```bash
# Write to project-local evolution-reports directory (gitignored)
# Path: evolution-reports/evolution-report-{YYYY-MM-DD}.md
```

Read the template, fill in all sections with data from this cycle, and write the report.

**Gate**: Winners merged. Learnings recorded for all proposals (wins and losses). Evolution report written. Cycle complete.

---

## Scheduling

### Manual invocation

```
/evolve
/evolve routing
/do evolve toolkit
/do evolve hooks
```

### Cron setup (nightly)

Runs nightly at 3:07 AM, after auto-dream (2:07 AM) finishes consolidating learnings:

```bash
python3 ~/.claude/scripts/crontab-manager.py add \
  --tag "toolkit-evolution" \
  --schedule "7 3 * * *" \
  --command "/home/feedgen/claude-code-toolkit/scripts/toolkit-evolution-cron.sh --execute >> /home/feedgen/claude-code-toolkit/cron-logs/toolkit-evolution/cron.log 2>&1"
```

Schedule uses 3:07 AM (off-minute per cron best practice, 1 hour after auto-dream). Budget set to $5.00 per run.

---

## Anti-Patterns

- **Improving without measuring** -- every change must have a baseline and A/B result. "It looks better" is not evidence.
- **Merging without validation** -- every winner must pass multi-persona critique (STRONG consensus) AND A/B testing (WIN status) before merge. The validation gates are the review.
- **Ignoring negative results** -- failed experiments are valuable data. Record them in the learning DB so the same idea is not re-proposed.
- **Improving everything at once** -- max 3 implementations per cycle. Focus compounds; scatter dissipates.
- **Running without diagnosis** -- do not propose solutions without evidence of problems first. Solutions looking for problems create phantom work.
- **Proposing duplicates** -- always check INDEX.json before proposing a new skill or capability. Extend existing skills when possible.

---

## Error Handling

### Error: "learning-db.py not found"
Cause: Learning database scripts not installed.
Solution: The learning DB is at `~/.claude/scripts/learning-db.py`. If missing, skip learning DB queries and rely on git log + dream reports for diagnosis. Record this gap as an improvement opportunity.

### Error: "No dream reports found"
Cause: Auto-dream has not run yet or state directory is empty.
Solution: Proceed without dream insights. Use git log and learning DB as primary data sources. Note that enabling auto-dream would improve future evolution cycles.

### Error: "No STRONG proposals after critique"
Cause: All proposals received MODERATE or WEAK consensus.
Solution: Report to the user that no high-confidence improvements were found this cycle. Record the proposals and critique feedback in the evolution report for future reference.

### Error: "A/B test inconclusive"
Cause: Test cases don't discriminate between baseline and candidate.
Solution: Review test case quality. Non-discriminating tests ("file exists") provide false signal. Write tests that exercise the specific behavior the proposal changes. If still inconclusive after better tests, shelve the proposal.

### Error: "Feature branch conflicts"
Cause: Multiple evolution implementations touch the same files.
Solution: Reduce to 1 implementation per cycle when conflicts arise. Alternatively, sequence implementations so later ones branch from earlier ones.

---

## Cost Estimate

A full evolution cycle runs all 6 phases and may dispatch multiple subagents. Estimated cost:
- Diagnosis + Proposal: ~$0.15 (reading files, querying DBs)
- Critique: ~$0.30 (3 persona agents evaluating proposals)
- Build: ~$0.50-1.50 (1-3 implementation agents)
- Validate: ~$0.50-1.50 (A/B test runs)
- Evolve: ~$0.10 (PR creation, learning DB writes)

Total: ~$1.50-3.50 per cycle. Budget capped at $5.00 via wrapper script.
Nightly cost at full utilization: ~$45-105/month. Cycles with no STRONG proposals exit early (diagnosis + proposal only: ~$0.45).

---

## References

- `references/evolution-report-template.md` -- Template for the evolution report
- `skills/auto-dream/SKILL.md` -- Nightly sibling: memory consolidation and learning graduation
- `skills/skill-eval/SKILL.md` -- Skill testing and benchmarking
- `skills/multi-persona-critique/SKILL.md` -- Multi-persona evaluation (may not exist yet; inline fallback provided)
- `skills/skill-creator/SKILL.md` -- Skill creation methodology
- `skills/agent-comparison/SKILL.md` -- A/B testing methodology
- `skills/headless-cron-creator/SKILL.md` -- Cron job creation patterns

Related Skills

x-api

290
from notque/claude-code-toolkit

Post tweets, build threads, upload media via the X API.

worktree-agent

290
from notque/claude-code-toolkit

Mandatory rules for agents in git worktree isolation.

workflow

290
from notque/claude-code-toolkit

Structured multi-phase workflows: review, debug, refactor, deploy, create, research, and more.

workflow-help

290
from notque/claude-code-toolkit

Interactive guide to workflow system: agents, skills, routing, execution patterns.

wordpress-uploader

290
from notque/claude-code-toolkit

WordPress REST API integration for posts and media uploads.

wordpress-live-validation

290
from notque/claude-code-toolkit

Validate published WordPress posts in browser via Playwright.

with-anti-rationalization

290
from notque/claude-code-toolkit

Anti-rationalization enforcement for maximum-rigor task execution.

voice-writer

290
from notque/claude-code-toolkit

Unified voice content generation pipeline with mandatory validation and joy-check. 8-phase pipeline: LOAD, GROUND, GENERATE, VALIDATE, REFINE, JOY-CHECK, OUTPUT, CLEANUP. Use when writing articles, blog posts, or any content that uses a voice profile. Use for "write article", "blog post", "write in voice", "generate content", "draft article", "write about".

voice-validator

290
from notque/claude-code-toolkit

Critique-and-rewrite loop for voice fidelity validation.

vitest-runner

290
from notque/claude-code-toolkit

Run Vitest tests and parse results into actionable output.

video-editing

290
from notque/claude-code-toolkit

Video editing pipeline: cut footage, assemble clips via FFmpeg and Remotion.

verification-before-completion

290
from notque/claude-code-toolkit

Defense-in-depth verification before declaring any task complete.