devtu-self-evolve

Orchestrate the full ToolUniverse self-improvement cycle: discover APIs, create tools, test with researcher personas, fix issues, optimize skills, and push via git. References and dispatches to all other devtu skills. Use when asked to: run the self-improvement loop, do a debug/test round, expand tool coverage, improve tool quality, or evolve ToolUniverse.

1,202 stars

bymims-harvard

View on GitHub Installation ↓

Best use case

devtu-self-evolve is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using devtu-self-evolve should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/devtu-self-evolve/SKILL.md --create-dirs "https://raw.githubusercontent.com/mims-harvard/ToolUniverse/main/skills/devtu-self-evolve/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/devtu-self-evolve/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How devtu-self-evolve Compares

Feature / Agent	devtu-self-evolve	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

AI Agents for Coding

Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.

Cursor vs Codex for AI Workflows

Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.

SKILL.md Source

# ToolUniverse Self-Evolution Orchestrator

Coordinates the full development lifecycle by dispatching to specialized devtu skills.

## The Cycle

```
Discover → Create → Test → Fix → Optimize → Ship → Repeat
```

Each phase maps to a dedicated skill:

| Phase | Skill | What it does |
|-------|-------|-------------|
| **Discover** | `devtu-auto-discover-apis` | Gap analysis, web search for APIs, batch discovery |
| **Create** | `devtu-create-tool` | Build tool class + JSON config + test examples |
| **Test** | *(this skill)* | Launch researcher persona agents to find issues |
| **Fix** | `devtu-fix-tool` | Diagnose failures, implement fixes, validate |
| **Optimize** | `devtu-optimize-skills` | Improve skill reports, evidence handling, UX |
| **Optimize** | `devtu-optimize-descriptions` | Improve tool JSON descriptions for clarity |
| **Docs** | `devtu-docs-quality` | Validate documentation accuracy |
| **Ship** | `devtu-github` | Branch, commit, push, create PR |

## Quick Start

Pick an entry point based on what's needed:

- **"Run a test round"** → jump to [Testing Phase](#testing-phase)
- **"Expand coverage"** → invoke `Skill(skill="devtu-auto-discover-apis")`
- **"Create a new tool"** → invoke `Skill(skill="devtu-create-tool")`
- **"Fix a broken tool"** → invoke `Skill(skill="devtu-fix-tool")`
- **"Improve skills"** → invoke `Skill(skill="devtu-optimize-skills")`
- **"Full cycle"** → follow all phases below in order

---

## Phase 1: Discovery (optional)

Invoke `Skill(skill="devtu-auto-discover-apis")` to:
1. Run gap analysis on current tool categories
2. Search for life science APIs in underrepresented domains
3. Score and prioritize APIs by coverage, reliability, documentation

## Phase 2: Tool Creation (optional)

Invoke `Skill(skill="devtu-create-tool")` for each new API:
1. Create Python tool class implementing the API
2. Create JSON config with parameters, descriptions, test examples
3. Register in `_lazy_registry_static.py` and `default_config.py`
4. Validate: `python -m tooluniverse.cli test <ToolName>`

## Phase 3: Testing Phase

This is the core testing loop, run directly by this skill.

### Setup

1. Check for open PRs: `gh pr list --state open`
2. If unmerged PR → use that branch; if merged → new branch from `origin/main`
3. Rebase: `git fetch origin && git rebase origin/main`

### Researcher Persona Agents

Launch 2 agents per round (A + B) using the Agent tool with these parameters:

**Each agent gets:**
- Domain specialty (oncology, genomics, pharmacology, etc.)
- Research question (specific biological question)
- 5-7 test scenarios exercising different tools
- Instructions to report issues with severity (HIGH/MEDIUM/LOW)
- Issue IDs: `Feature-{round}{letter}-{num}` (e.g., `Feature-59A-001`)

**Agent prompt template** — see [references/persona-template.md](references/persona-template.md)

### Verification (CRITICAL)

Before implementing ANY agent-reported issue, verify via CLI:
```bash
python3 -m tooluniverse.cli run <ToolName> '<json_args>'
```

**50%+ of agent reports are false positives** from MCP interface confusion. Only fix verified issues.

### Fix Principles

1. **Prevent, don't recover** — fix root cause, not symptoms
2. **Validate at input** — reject bad params early with clear guidance
3. **Distinguish "no data" from "bad query"** — different messages for each
4. **Fix the abstraction** — don't add alias lists that grow forever

Anti-patterns: hint text instead of validation, parameter aliases instead of fixing naming, post-hoc probing instead of pre-validation.

### Skill Usefulness Testing (NEW — beyond tool testing)

Standard testing verifies tools work. Usefulness testing verifies skills actually solve scientist problems. Run this after standard testing:

1. **Pick a real research question** that the skill claims to answer (not a tool-level test)
2. **Launch an agent** following the skill workflow on the real question
3. **Assess honestly**: Does the skill produce an actionable answer, or just a data dump?

**Score 1-10 rubric**:
- 1-3: Tool catalog — lists tools without interpretation
- 4-6: Data collector — gathers data but doesn't help combine/interpret
- 7-8: Reasoning framework — guides interpretation with tables/scoring/synthesis
- 9-10: Decision engine — produces concrete, defensible recommendations

**Common failure patterns found in usefulness tests**:

| Pattern | Score Impact | Fix |
|---------|-------------|-----|
| "Call A, then B, then C" without explaining what to DO with results | -3 | Add interpretation tables |
| Tool params wrong (tool works but skill documents wrong names) | -2 | Verify ALL tool params via `get_tool_info()` |
| Promises data the API can't deliver (e.g., DepMap CRISPR scores) | -2 | Be honest about limitations; add computational procedure workaround |
| No synthesis phase at the end | -2 | Add "so what?" phase that combines all evidence |
| No evidence grading | -1 | Add T1-T4 or similar confidence tiers |
| No computational procedures for things tools can't do | -1 | Add Python code blocks using scipy/pandas/numpy |

**When tools can't help, add computational procedures**: Some analyses need Python code, not API calls. Skills should include working code blocks for:
- Statistical testing (scipy.stats, FDR correction)
- Data analysis from downloaded files (pandas + CSV from DepMap, TCGA, etc.)
- Scoring algorithms (ACMG classification, viability scores)
- Sequence analysis (Biopython)

See `devtu-optimize-skills` Patterns 14-15 for full guidance.

## Phase 4: Fix & Commit

1. Implement verified fixes (see [references/bug-patterns.md](references/bug-patterns.md) for code-level patterns)
2. **Run code-simplifier**: `Skill(skill="simplify")` — always after writing or modifying code
3. Lint: `ruff check src/tooluniverse/<file>.py`
4. Verify syntax: `python -c "from tooluniverse.<module> import <Class>"`
5. Test: `python -m tooluniverse.cli run <Tool> '<json>'`
6. Pre-commit hook pattern: stage → commit (fails, reformats) → re-stage → commit
7. Push: `git push origin <branch>`

> Also see `Skill(skill="devtu-code-optimization")` for reusable fix patterns and anti-patterns.

## Phase 5: Optimize (optional)

After fixes are stable:
- `Skill(skill="devtu-optimize-descriptions")` — improve tool descriptions
- `Skill(skill="devtu-optimize-skills")` — improve research skill quality
- `Skill(skill="devtu-docs-quality")` — validate docs accuracy

## Phase 6: Ship

Invoke `Skill(skill="devtu-github")` or manually:
1. Rebase: `git fetch origin && git stash && git rebase origin/main && git stash pop`
2. `git push --force-with-lease origin <branch>`
3. Create or update PR: `gh pr create` / verify with `gh pr view <N> --json mergeable`
4. Verify `"mergeable": "MERGEABLE"` before reporting done

**GitHub repo**: `mims-harvard/ToolUniverse` — always verify with `git remote -v` before pushing.

---

## Git Rules (CRITICAL)

- **NEVER push to main** — all work on feature branches
- **NEVER have multiple open fix PRs** — keep adding to current branch
- **Always rebase before push**: `git fetch origin && git rebase origin/main`
- **Commit message format**: no "BUG" terminology, use "Feature" or "Fix"
- **No AI attribution** in commits

## Common Issue Categories

| Category | Signal |
|----------|--------|
| Silent parameter miss | Wrong-field check; param ignored |
| Always-fires conditional | `.get("field")` on wrong type |
| Silent normalization | Auto-transform not disclosed |
| Wrong notation/case | Gene fusions, Title Case names |
| Substring match | Short symbol returns multiple targets |
| try/except indent | Mismatched → SyntaxError |

Full patterns → [references/bug-patterns.md](references/bug-patterns.md)

## Round Tracking

After each round: advance counter, update patterns file, keep this SKILL.md under 150 lines.

**Current round: 127** (rounds completed: 52-126)

Related Skills

devtu-optimize-skills

1202

from mims-harvard/ToolUniverse

Optimize ToolUniverse skills for better report quality, evidence handling, and user experience. Apply patterns like tool verification, foundation data layers, disambiguation-first, evidence grading, quantified completeness, and report-only output. Use when reviewing skills, improving existing skills, or creating new ToolUniverse research skills.

devtu-optimize-descriptions

1202

from mims-harvard/ToolUniverse

Optimize tool descriptions in ToolUniverse JSON configs for clarity and usability. Reviews descriptions for missing prerequisites, unexpanded abbreviations, unclear parameters, and missing usage guidance. Use when reviewing tool descriptions, improving API documentation, or when user asks to check if tools are easy to understand.

devtu-github

1202

from mims-harvard/ToolUniverse

GitHub workflow for ToolUniverse - push code safely by moving temp files, activating pre-commit hooks, running tests, and cleaning staged files. Use when pushing to GitHub, fixing CI failures, or cleaning up before commits.

devtu-fix-tool

1202

from mims-harvard/ToolUniverse

Fix failing ToolUniverse tools by diagnosing test failures, identifying root causes, implementing fixes, and validating solutions. Use when ToolUniverse tools fail tests, return errors, have schema validation issues, or when asked to debug or fix tools in the ToolUniverse framework.

devtu-docs-quality

1202

from mims-harvard/ToolUniverse

TOP PRIORITY skill — find and immediately fix or remove every piece of wrong, outdated, or redundant information in ToolUniverse docs. Wrong code, broken links, incorrect counts, and overlapping instructions must be fixed or removed — never left in place. Runs five phases: (D) static method scan, (C) live code execution, (A) automated validation, (B) ToolUniverse audit, (E) less-is-more simplification. Core philosophy: each concept appears exactly once; remove don't add; no emojis; single setup entry point. Use when reviewing docs, before releases, after API changes, or when asked to audit, fix, or simplify documentation.

devtu-create-tool

1202

from mims-harvard/ToolUniverse

Create new scientific tools for ToolUniverse framework with proper structure, validation, and testing. Use when users need to add tools to ToolUniverse, implement new API integrations, create tool wrappers for scientific databases/services, expand ToolUniverse capabilities, or follow ToolUniverse contribution guidelines. Supports creating tool classes, JSON configurations, validation, error handling, and test examples.

devtu-code-optimization

1202

from mims-harvard/ToolUniverse

Code quality patterns and guidelines for ToolUniverse tool development. Apply when writing, fixing, or refactoring tool Python code in the ToolUniverse project. Encodes lessons from 80+ debug rounds. Use alongside devtu-fix-tool and devtu-self-evolve. Triggers: implementing tool fixes, writing new tool classes, reviewing tool code quality, checking schema correctness, looking up API-specific bug fixes.

devtu-auto-discover-apis

1202

from mims-harvard/ToolUniverse

Automatically discover life science APIs online, create ToolUniverse tools, validate them, and prepare integration PRs. Performs gap analysis to identify missing tool categories, web searches for APIs, automated tool creation using devtu-create-tool patterns, validation with devtu-fix-tool, and git workflow management. Use when expanding ToolUniverse coverage, adding new API integrations, or systematically discovering scientific resources.

tooluniverse

1202

from mims-harvard/ToolUniverse

Router skill for ToolUniverse tasks. First checks if specialized tooluniverse skills (105+ skills covering disease/drug/target research, gene-disease associations, clinical decision support, genomics, epigenomics, proteomics, comparative genomics, chemical safety, toxicology, systems biology, and more) can solve the problem, then falls back to general strategies for using 2300+ scientific tools. Covers tool discovery, multi-hop queries, comprehensive research workflows, disambiguation, evidence grading, and report generation. Use when users need to research any scientific topic, find biological data, or explore drug/target/disease relationships. ALSO USE for any biology, medicine, chemistry, pharmacology, or life science question — even simple factoid questions like "how many X in protein Y", "what drug interacts with Z", "what gene causes disease W", or "translate this sequence". These questions benefit from database lookups (UniProt, PubMed, ChEMBL, ClinVar, GWAS Catalog, etc.) rather than answering from memory alone. When in doubt about a scientific fact, USE THIS SKILL to verify against real databases.

tooluniverse-variant-to-mechanism

1202

from mims-harvard/ToolUniverse

End-to-end variant-to-mechanism analysis: given a genetic variant (rsID or coordinates), trace its functional impact from regulatory context (GWAS, eQTL, RegulomeDB, ENCODE) through target gene identification (GTEx, OpenTargets L2G) to downstream pathway and disease biology (STRING, Reactome, GO enrichment, disease associations). Produces an evidence-graded mechanistic narrative linking genotype to phenotype. Use when asked "how does this variant cause disease?", "what is the mechanism of rs7903146?", "trace variant to pathway", or "connect this GWAS hit to biology".

tooluniverse-variant-interpretation

1202

from mims-harvard/ToolUniverse

Systematic clinical variant interpretation from raw variant calls to ACMG-classified recommendations with structural impact analysis. Aggregates evidence from ClinVar, gnomAD, CIViC, UniProt, and PDB across ACMG criteria. Produces pathogenicity scores (0-100), clinical recommendations, and treatment implications. Use when interpreting genetic variants, classifying variants of uncertain significance (VUS), performing ACMG variant classification, or translating variant calls to clinical actionability.

tooluniverse-variant-functional-annotation

1202

from mims-harvard/ToolUniverse

Comprehensive functional annotation of protein variants — pathogenicity, population frequency, structural context, and clinical significance. Integrates ProtVar (map_variant, get_function, get_population) for protein-level mapping and structural context, ClinVar for clinical classifications, gnomAD for population frequency with ancestry data, CADD for deleteriousness scores, and ClinGen for gene-disease validity. Produces a structured variant annotation report with evidence grading. Use when asked about protein variant impact, missense variant pathogenicity, ProtVar annotation, variant functional context, or combining population and structural evidence for a variant.