agent-improvement
Self-improvement loop for multi-agent workflows. Diagnose failures, improve tool descriptions, and learn from success/failure patterns.
Best use case
agent-improvement is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using agent-improvement should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/agent-improvement/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How agent-improvement Compares
| Feature / Agent | agent-improvement | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Self-improvement loop for multi-agent workflows. Diagnose failures, improve tool descriptions, and learn from success/failure patterns.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Agent Self-Improvement
## Purpose
Enable continuous improvement of multi-agent workflows through:
- Failure pattern analysis
- Tool description optimization
- Success pattern recognition
- Performance benchmarking
**Reference**: Anthropic reported a 40% decrease in task completion time after using an LLM to rewrite tool descriptions.
## Improvement Cycle
```
┌───────────────────────────────────────────────────┐
│                                                   │
│  1. COLLECT                                       │
│     └── Gather traces from completed sessions     │
│                                                   │
│  2. ANALYZE                                       │
│     └── Identify failure patterns & bottlenecks   │
│                                                   │
│  3. DIAGNOSE                                      │
│     └── Use LLM to understand root causes         │
│                                                   │
│  4. IMPROVE                                       │
│     └── Update tool descriptions & agent prompts  │
│                                                   │
│  5. VALIDATE                                      │
│     └── Test improvements on similar tasks        │
│                                                   │
│  6. DEPLOY                                        │
│     └── Roll out to all agents                    │
│                                                   │
└───────────────────────────────────────────────────┘
```
## Data Collection
### Success/Failure Patterns
Store in `.temp/improvement/patterns/`:
```json
{
  "pattern_id": "pat_001",
  "type": "failure|success",
  "frequency": 5,
  "context": {
    "task_type": "ui_component_creation",
    "agent": "mobile-ui-specialist",
    "phase": "implementation"
  },
  "description": "Agent often misses accessibility labels",
  "examples": [
    {
      "session_id": "sess_abc",
      "file": "StationCard.tsx",
      "issue": "Missing accessibilityLabel on TouchableOpacity"
    }
  ],
  "proposed_fix": "Add explicit reminder in agent prompt",
  "status": "identified|proposed|implemented|validated"
}
```
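A minimal sketch of how such a record might be written and read back, assuming one JSON file per pattern named after its `pattern_id` and placed under `failures/` or `successes/` as in the Storage Structure section. The helper names `save_pattern` and `load_patterns` are hypothetical and not part of the skill.

```python
import json
from pathlib import Path

# Hypothetical helpers: the skill only specifies the JSON shape and the directory.
PATTERN_ROOT = Path(".temp/improvement/patterns")


def save_pattern(pattern: dict, root: Path = PATTERN_ROOT) -> Path:
    """Write one pattern record to <root>/failures/ or <root>/successes/."""
    subdir = "failures" if pattern["type"] == "failure" else "successes"
    target = root / subdir / f"{pattern['pattern_id']}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(pattern, indent=2), encoding="utf-8")
    return target


def load_patterns(kind: str, root: Path = PATTERN_ROOT) -> list[dict]:
    """Load every record of one kind ("failures" or "successes"), sorted by file name."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted((root / kind).glob("*.json"))
    ]
```

Keeping one file per pattern means the shell commands in Quick Commands keep working without any index file.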
### Tool Usage Patterns
```json
{
  "tool": "read",
  "usage_count": 1523,
  "success_rate": 0.98,
  "avg_duration_ms": 45,
  "common_errors": [
    {
      "error": "File not found",
      "frequency": 23,
      "cause": "Path alias not resolved"
    }
  ],
  "improvement_opportunities": [
    "Add path alias resolution hint to tool description"
  ]
}
```
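The summary above could be derived from raw tool-call events roughly as follows. This is a sketch under assumptions: the event shape (`tool`, `ok`, `duration_ms`, `error`) and the function name `summarize_tool_usage` are hypothetical, and the `cause` and `improvement_opportunities` fields are left out because they come from the diagnosis step rather than from counting.

```python
from collections import Counter


def summarize_tool_usage(tool: str, events: list[dict]) -> dict:
    """Aggregate raw call events for one tool into the summary shape above.

    Each event is assumed (hypothetically) to look like:
    {"tool": "read", "ok": True, "duration_ms": 40, "error": None}
    """
    calls = [e for e in events if e["tool"] == tool]
    if not calls:
        return {"tool": tool, "usage_count": 0, "success_rate": None,
                "avg_duration_ms": None, "common_errors": []}
    # Count error messages only for the calls that failed.
    errors = Counter(e["error"] for e in calls if not e["ok"])
    return {
        "tool": tool,
        "usage_count": len(calls),
        "success_rate": round(sum(e["ok"] for e in calls) / len(calls), 2),
        "avg_duration_ms": round(sum(e["duration_ms"] for e in calls) / len(calls)),
        "common_errors": [
            {"error": msg, "frequency": n} for msg, n in errors.most_common()
        ],
    }
```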
## Analysis Operations
### 1. Failure Analysis
**Input**: Session traces with failures
**Output**: Categorized failure patterns
```markdown
## Failure Analysis Report
### Category 1: Agent Boundary Violations
- Frequency: 12 occurrences
- Pattern: UI agent attempting to modify services
- Root Cause: Task boundaries not clear in delegation
- Fix: Add explicit "DO NOT" list to delegation template
### Category 2: Missing Dependencies
- Frequency: 8 occurrences
- Pattern: UI agent starts before types available
- Root Cause: Dependency order not enforced
- Fix: Add dependency check before spawning
### Category 3: Tool Misuse
- Frequency: 5 occurrences
- Pattern: Using grep instead of read for known files
- Root Cause: Tool descriptions don't clarify when to use each
- Fix: Update tool descriptions with decision criteria
```
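The frequency counts in a report like this can be produced by grouping labeled failures. The sketch below assumes each failure already carries a `category` label (for example, assigned during the DIAGNOSE step) and a `session_id`; both field names and the function name are hypothetical.

```python
from collections import defaultdict


def categorize_failures(failures: list[dict]) -> list[dict]:
    """Group labeled failures by category and sort by frequency, highest first."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for failure in failures:
        grouped[failure["category"]].append(failure)
    report = [
        {
            "category": name,
            "frequency": len(items),
            # Distinct sessions in which this category was observed.
            "sessions": sorted({f["session_id"] for f in items}),
        }
        for name, items in grouped.items()
    ]
    return sorted(report, key=lambda row: row["frequency"], reverse=True)
```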
### 2. Bottleneck Analysis
**Input**: Session metrics
**Output**: Performance bottlenecks
```markdown
## Bottleneck Analysis
### Bottleneck 1: Sequential Agent Spawning
- Impact: 40% time overhead
- Pattern: Agents spawned one at a time
- Fix: Spawn independent agents in parallel
### Bottleneck 2: Excessive Iterations
- Impact: 2x token usage
- Pattern: Average 3.2 iterations per task
- Fix: Improve initial task decomposition
### Bottleneck 3: Quality Gate Failures
- Impact: 25% rework
- Pattern: TypeScript errors on first integration
- Fix: Add pre-integration type check
```
## Improvement Actions
### Tool Description Updates
**Before:**
```
Read: Reads a file from the filesystem
```
**After:**
```
Read: Reads a file from the filesystem.
- Use when you know the exact file path
- Prefer over grep for reading specific known files
- Use path aliases (@components, @services)
- Returns line-numbered content
```
### Agent Prompt Updates
**Before:**
```
You are a mobile UI specialist...
```
**After:**
```
You are a mobile UI specialist...
CRITICAL REMINDERS:
- Always add accessibilityLabel to interactive elements
- Use memo() for components with complex props
- Check LINE_COLORS constant for subway line colors
```
### Delegation Template Updates
**Before:**
```
### Task Boundaries
- DO NOT modify services
```
**After:**
```
### Task Boundaries (EXPLICIT)
Files you CAN modify:
- src/components/**
- src/screens/**
Files you CANNOT modify:
- src/services/** (backend agent)
- src/models/** (shared types)
- **/__tests__/** (test agent)
STOP if you need to modify excluded files.
```
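An explicit boundary list like the one above can also be checked mechanically before an agent writes a file. The following is a sketch, not part of the skill: `may_modify` is a hypothetical helper, and it assumes the denied list takes precedence over the allowed list.

```python
from fnmatch import fnmatchcase

# Glob lists copied from the delegation template above.
ALLOWED = ["src/components/**", "src/screens/**"]
DENIED = ["src/services/**", "src/models/**", "**/__tests__/**"]


def may_modify(path: str, allowed=ALLOWED, denied=DENIED) -> bool:
    """Return True only if path matches an allowed glob and no denied glob.

    Note: fnmatchcase lets "*" match "/" as well, so "**" here simply means
    "anything"; this is looser than gitignore-style globbing.
    """
    if any(fnmatchcase(path, pattern) for pattern in denied):
        return False
    return any(fnmatchcase(path, pattern) for pattern in allowed)
```

Checking the denied list first means a test file inside `src/components/` is still rejected, which matches the intent that tests belong to the test agent.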
## Validation Protocol
### Before Deployment
1. **Identify test cases**
   - Find similar past tasks
   - Create synthetic test scenarios
2. **Run A/B comparison**
   - Original prompts vs improved prompts
   - Measure: success rate, iterations, tokens, time
3. **Quality threshold**
   - Must improve at least one metric
   - Must not regress any metric by >5%
### Validation Report
```markdown
## Improvement Validation
### Change: Added accessibility reminder to mobile-ui-specialist
### Test Results
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Accessibility issues | 12% | 2% | -83% |
| Success rate | 88% | 96% | +9% |
| Token usage | 45K | 47K | +4% |
### Verdict: APPROVE
Accessibility issues reduced significantly with minimal token overhead.
```
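The quality threshold can be expressed as a small check. This sketch assumes the 5% limit is relative to the baseline value (not percentage points), that baselines are nonzero, and that the caller states which metrics are better when lower; the function name `evaluate_change` is hypothetical.

```python
def evaluate_change(before: dict, after: dict, lower_is_better: set,
                    max_regression_pct: float = 5.0) -> str:
    """Return "APPROVE" or "REJECT" for an A/B comparison.

    A change passes when at least one metric improves and no metric
    regresses by more than max_regression_pct (relative to its baseline).
    Baseline values are assumed to be nonzero.
    """
    improved = False
    for name, old in before.items():
        change_pct = (after[name] - old) / old * 100
        # Normalize so a positive value always means "got better".
        gain_pct = -change_pct if name in lower_is_better else change_pct
        if gain_pct > 0:
            improved = True
        elif -gain_pct > max_regression_pct:
            return "REJECT"
    return "APPROVE" if improved else "REJECT"


# Figures from the validation report above.
verdict = evaluate_change(
    before={"accessibility_issues": 12, "success_rate": 88, "token_usage": 45_000},
    after={"accessibility_issues": 2, "success_rate": 96, "token_usage": 47_000},
    lower_is_better={"accessibility_issues", "token_usage"},
)
```

With the report's numbers, token usage rises by about 4.4%, which stays under the 5% limit, so the change is approved.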
## Storage Structure
```
.temp/improvement/
├── patterns/
│   ├── failures/
│   │   └── pat_{id}.json
│   └── successes/
│       └── pat_{id}.json
├── proposals/
│   └── prop_{id}.md
├── validations/
│   └── val_{id}.json
└── history/
    └── {date}/
        └── changes.json
```
```
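If the directories do not exist yet, they can be created up front. A minimal sketch, assuming the layout above; `init_storage` is a hypothetical helper, and the dated `history/{date}/` folders are left to be created when a change is recorded.

```python
from pathlib import Path

# Subdirectories taken from the storage layout above.
SUBDIRS = [
    "patterns/failures",
    "patterns/successes",
    "proposals",
    "validations",
    "history",
]


def init_storage(root: Path = Path(".temp/improvement")) -> list[Path]:
    """Create the directory layout; safe to call repeatedly."""
    created = []
    for sub in SUBDIRS:
        path = root / sub
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```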
## Integration with Workflow
### Periodic Review (Weekly)
```markdown
1. Aggregate traces from past week
2. Run failure analysis
3. Generate improvement proposals
4. Prioritize by impact × frequency
5. Implement top 3 improvements
6. Validate before merge
```
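Steps 4 and 5 of the weekly review amount to a sort and a slice. The sketch below assumes each proposal carries a numeric `impact` score (the scale is not defined in the source, so any consistent scale works) and a `frequency` count; `top_proposals` is a hypothetical name.

```python
def top_proposals(proposals: list[dict], limit: int = 3) -> list[dict]:
    """Rank proposals by impact × frequency and keep the highest-scoring ones."""
    return sorted(
        proposals,
        key=lambda p: p["impact"] * p["frequency"],
        reverse=True,
    )[:limit]
```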
### Continuous Learning (Per Session)
```markdown
1. After each session:
   - If failed: Add to failure patterns
   - If succeeded but slow: Add to bottleneck analysis
   - If succeeded optimally: Add to success patterns
2. Check pattern thresholds:
   - If failure pattern frequency > 5: Trigger improvement proposal
```
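The per-session routing and the threshold check can be sketched as two small functions. The session fields (`succeeded`, `duration_s`) and both function names are hypothetical, and the 600-second cutoff for "slow" is an assumption borrowed from the 10-minute completion target in Metrics to Track.

```python
FAILURE_THRESHOLD = 5  # the source rule is "frequency > 5"


def route_session(session: dict, slow_after_s: float = 600.0) -> str:
    """Pick the bucket a finished session belongs to."""
    if not session["succeeded"]:
        return "failure_patterns"
    if session["duration_s"] > slow_after_s:
        return "bottleneck_analysis"
    return "success_patterns"


def should_propose(pattern: dict) -> bool:
    """True once a failure pattern has been seen more than FAILURE_THRESHOLD times."""
    return pattern["type"] == "failure" and pattern["frequency"] > FAILURE_THRESHOLD
```

Note that the comparison is strict: a pattern with a frequency of exactly 5, like the `pat_001` example, does not yet trigger a proposal.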
## Metrics to Track
### Agent Performance
| Metric | Target | Current | Trend |
|--------|--------|---------|-------|
| Success rate | >95% | 92% | ↑ |
| Avg iterations | <2 | 2.3 | → |
| Token efficiency | <80K | 75K | ↓ |
| Time to complete | <10min | 12min | ↑ |
### Improvement Impact
| Change | Implemented | Impact |
|--------|-------------|--------|
| Accessibility reminder | 2025-01-01 | -83% issues |
| Tool description update | 2025-01-02 | +5% success |
| Delegation template | 2025-01-03 | -20% iterations |
## Best Practices
### 1. Small, Targeted Changes
- One improvement at a time
- Clear before/after comparison
- Rollback plan ready
### 2. Data-Driven Decisions
- Require frequency > 5 before acting
- Validate with real tasks
- Measure actual impact
### 3. Preserve What Works
- Don't change successful patterns
- Document why changes were made
- Keep history for rollback
### 4. Human Review
- Major changes require approval
- Edge cases need human judgment
- Balance automation with oversight
---
## Quick Commands
```bash
# View failure patterns
cat .temp/improvement/patterns/failures/*.json | jq '.description'
# Count failure patterns (use successes/ to count success patterns)
ls .temp/improvement/patterns/failures/ | wc -l
# View pending proposals
cat .temp/improvement/proposals/*.md
# Check improvement history
cat .temp/improvement/history/*/changes.json | jq '.'
```
---
**Version**: 1.0 | **Last Updated**: 2025-01-04