agent-orchestration-improve-agent
Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.
Best use case
agent-orchestration-improve-agent is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using agent-orchestration-improve-agent should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/agent-orchestration-improve-agent/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How agent-orchestration-improve-agent Compares
| Feature / Agent | agent-orchestration-improve-agent | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Agent Performance Optimization Workflow

Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.

[Extended thinking: Agent optimization requires a data-driven approach combining performance metrics, user feedback analysis, and advanced prompt engineering techniques. Success depends on systematic evaluation, targeted improvements, and rigorous testing with rollback capabilities for production safety.]

## Use this skill when

- Improving an existing agent's performance or reliability
- Analyzing failure modes, prompt quality, or tool usage
- Running structured A/B tests or evaluation suites
- Designing iterative optimization workflows for agents

## Do not use this skill when

- You are building a brand-new agent from scratch
- There are no metrics, feedback, or test cases available
- The task is unrelated to agent performance or prompt quality

## Instructions

1. Establish baseline metrics and collect representative examples.
2. Identify failure modes and prioritize high-impact fixes.
3. Apply prompt and workflow improvements with measurable goals.
4. Validate with tests and roll out changes in controlled stages.

## Safety

- Avoid deploying prompt changes without regression testing.
- Roll back quickly if quality or safety metrics regress.

## Phase 1: Performance Analysis and Baseline Metrics

Comprehensive analysis of agent performance using context-manager for historical data collection.

### 1.1 Gather Performance Data

```
Use: context-manager
Command: analyze-agent-performance $ARGUMENTS --days 30
```

Collect metrics including:

- Task completion rate (successful vs failed tasks)
- Response accuracy and factual correctness
- Tool usage efficiency (correct tools, call frequency)
- Average response time and token consumption
- User satisfaction indicators (corrections, retries)
- Hallucination incidents and error patterns

### 1.2 User Feedback Pattern Analysis

Identify recurring patterns in user interactions:

- **Correction patterns**: Where users consistently modify outputs
- **Clarification requests**: Common areas of ambiguity
- **Task abandonment**: Points where users give up
- **Follow-up questions**: Indicators of incomplete responses
- **Positive feedback**: Successful patterns to preserve

### 1.3 Failure Mode Classification

Categorize failures by root cause:

- **Instruction misunderstanding**: Role or task confusion
- **Output format errors**: Structure or formatting issues
- **Context loss**: Long conversation degradation
- **Tool misuse**: Incorrect or inefficient tool selection
- **Constraint violations**: Safety or business rule breaches
- **Edge case handling**: Unusual input scenarios

### 1.4 Baseline Performance Report

Generate quantitative baseline metrics:

```
Performance Baseline:
- Task Success Rate: [X%]
- Average Corrections per Task: [Y]
- Tool Call Efficiency: [Z%]
- User Satisfaction Score: [1-10]
- Average Response Latency: [Xms]
- Token Efficiency Ratio: [X:Y]
```

## Phase 2: Prompt Engineering Improvements

Apply advanced prompt optimization techniques using prompt-engineer agent.

### 2.1 Chain-of-Thought Enhancement

Implement structured reasoning patterns:

```
Use: prompt-engineer
Technique: chain-of-thought-optimization
```

- Add explicit reasoning steps: "Let's approach this step-by-step..."
- Include self-verification checkpoints: "Before proceeding, verify that..."
- Implement recursive decomposition for complex tasks
- Add reasoning trace visibility for debugging

### 2.2 Few-Shot Example Optimization

Curate high-quality examples from successful interactions:

- **Select diverse examples** covering common use cases
- **Include edge cases** that previously failed
- **Show both positive and negative examples** with explanations
- **Order examples** from simple to complex
- **Annotate examples** with key decision points

Example structure:

```
Good Example:
Input: [User request]
Reasoning: [Step-by-step thought process]
Output: [Successful response]
Why this works: [Key success factors]

Bad Example:
Input: [Similar request]
Output: [Failed response]
Why this fails: [Specific issues]
Correct approach: [Fixed version]
```

### 2.3 Role Definition Refinement

Strengthen agent identity and capabilities:

- **Core purpose**: Clear, single-sentence mission
- **Expertise domains**: Specific knowledge areas
- **Behavioral traits**: Personality and interaction style
- **Tool proficiency**: Available tools and when to use them
- **Constraints**: What the agent should NOT do
- **Success criteria**: How to measure task completion

### 2.4 Constitutional AI Integration

Implement self-correction mechanisms:

```
Constitutional Principles:
1. Verify factual accuracy before responding
2. Self-check for potential biases or harmful content
3. Validate output format matches requirements
4. Ensure response completeness
5. Maintain consistency with previous responses
```

Add critique-and-revise loops:

- Initial response generation
- Self-critique against principles
- Automatic revision if issues detected
- Final validation before output

### 2.5 Output Format Tuning

Optimize response structure:

- **Structured templates** for common tasks
- **Dynamic formatting** based on complexity
- **Progressive disclosure** for detailed information
- **Markdown optimization** for readability
- **Code block formatting** with syntax highlighting
- **Table and list generation** for data presentation

## Phase 3: Testing and Validation

Comprehensive testing framework with A/B comparison.

### 3.1 Test Suite Development

Create representative test scenarios:

```
Test Categories:
1. Golden path scenarios (common successful cases)
2. Previously failed tasks (regression testing)
3. Edge cases and corner scenarios
4. Stress tests (complex, multi-step tasks)
5. Adversarial inputs (potential breaking points)
6. Cross-domain tasks (combining capabilities)
```

### 3.2 A/B Testing Framework

Compare original vs improved agent:

```
Use: parallel-test-runner
Config:
  - Agent A: Original version
  - Agent B: Improved version
  - Test set: 100 representative tasks
  - Metrics: Success rate, speed, token usage
  - Evaluation: Blind human review + automated scoring
```

Statistical significance testing:

- Minimum sample size: 100 tasks per variant
- Confidence level: 95% (p < 0.05)
- Effect size calculation (Cohen's d)
- Power analysis for future tests

### 3.3 Evaluation Metrics

Comprehensive scoring framework:

**Task-Level Metrics:**

- Completion rate (binary success/failure)
- Correctness score (0-100% accuracy)
- Efficiency score (steps taken vs optimal)
- Tool usage appropriateness
- Response relevance and completeness

**Quality Metrics:**

- Hallucination rate (factual errors per response)
- Consistency score (alignment with previous responses)
- Format compliance (matches specified structure)
- Safety score (constraint adherence)
- User satisfaction prediction

**Performance Metrics:**

- Response latency (time to first token)
- Total generation time
- Token consumption (input + output)
- Cost per task (API usage fees)
- Memory/context efficiency

### 3.4 Human Evaluation Protocol

Structured human review process:

- Blind evaluation (evaluators don't know version)
- Standardized rubric with clear criteria
- Multiple evaluators per sample (inter-rater reliability)
- Qualitative feedback collection
- Preference ranking (A vs B comparison)

## Phase 4: Version Control and Deployment

Safe rollout with monitoring and rollback capabilities.

### 4.1 Version Management

Systematic versioning strategy:

```
Version Format: agent-name-v[MAJOR].[MINOR].[PATCH]
Example: customer-support-v2.3.1

MAJOR: Significant capability changes
MINOR: Prompt improvements, new examples
PATCH: Bug fixes, minor adjustments
```

Maintain version history:

- Git-based prompt storage
- Changelog with improvement details
- Performance metrics per version
- Rollback procedures documented

### 4.2 Staged Rollout

Progressive deployment strategy:

1. **Alpha testing**: Internal team validation (5% traffic)
2. **Beta testing**: Selected users (20% traffic)
3. **Canary release**: Gradual increase (20% → 50% → 100%)
4. **Full deployment**: After success criteria met
5. **Monitoring period**: 7-day observation window

### 4.3 Rollback Procedures

Quick recovery mechanism:

```
Rollback Triggers:
- Success rate drops >10% from baseline
- Critical errors increase >5%
- User complaints spike
- Cost per task increases >20%
- Safety violations detected

Rollback Process:
1. Detect issue via monitoring
2. Alert team immediately
3. Switch to previous stable version
4. Analyze root cause
5. Fix and re-test before retry
```

### 4.4 Continuous Monitoring

Real-time performance tracking:

- Dashboard with key metrics
- Anomaly detection alerts
- User feedback collection
- Automated regression testing
- Weekly performance reports

## Success Criteria

Agent improvement is successful when:

- Task success rate improves by ≥15%
- User corrections decrease by ≥25%
- No increase in safety violations
- Response time remains within 10% of baseline
- Cost per task doesn't increase >5%
- Positive user feedback increases

## Post-Deployment Review

After 30 days of production use:

1. Analyze accumulated performance data
2. Compare against baseline and targets
3. Identify new improvement opportunities
4. Document lessons learned
5. Plan next optimization cycle

## Continuous Improvement Cycle

Establish regular improvement cadence:

- **Weekly**: Monitor metrics and collect feedback
- **Monthly**: Analyze patterns and plan improvements
- **Quarterly**: Major version updates with new capabilities
- **Annually**: Strategic review and architecture updates

Remember: Agent optimization is an iterative process. Each cycle builds upon previous learnings, gradually improving performance while maintaining stability and safety.
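
The measurable bullets under Success Criteria can be checked mechanically once baseline and candidate metrics are available. The following is a minimal Python sketch, not part of the skill itself: the `Metrics` class, its field names, and `check_success_criteria` are hypothetical, and it assumes each percentage threshold means a change relative to the baseline value (the source does not say whether relative change or percentage points is intended). It also treats "within 10% of baseline" as "no more than 10% slower". The "positive user feedback increases" criterion is left out because the source gives no threshold for it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Aggregate metrics for one agent version (hypothetical field names)."""
    success_rate: float          # fraction of tasks completed, 0.0 to 1.0
    corrections_per_task: float  # average user corrections per task
    safety_violations: int       # count over the observation window
    latency_ms: float            # average response latency
    cost_per_task: float         # average API cost per task


def relative_change(baseline: float, current: float) -> float:
    """Return (current - baseline) / baseline; assumes baseline > 0."""
    return (current - baseline) / baseline


def check_success_criteria(baseline: Metrics, candidate: Metrics) -> dict[str, bool]:
    """Return one boolean per measurable Success Criteria bullet.

    Assumption: thresholds are relative changes versus the baseline.
    """
    return {
        "success_rate_up_15pct": relative_change(
            baseline.success_rate, candidate.success_rate) >= 0.15,
        "corrections_down_25pct": relative_change(
            baseline.corrections_per_task, candidate.corrections_per_task) <= -0.25,
        "no_new_safety_violations":
            candidate.safety_violations <= baseline.safety_violations,
        # Assumption: getting faster is acceptable, only slowdowns count.
        "latency_within_10pct": relative_change(
            baseline.latency_ms, candidate.latency_ms) <= 0.10,
        "cost_within_5pct": relative_change(
            baseline.cost_per_task, candidate.cost_per_task) <= 0.05,
    }


# Illustrative numbers only; they are not taken from any real evaluation.
baseline = Metrics(success_rate=0.60, corrections_per_task=2.0,
                   safety_violations=1, latency_ms=800.0, cost_per_task=0.040)
candidate = Metrics(success_rate=0.72, corrections_per_task=1.4,
                    safety_violations=1, latency_ms=840.0, cost_per_task=0.044)

results = check_success_criteria(baseline, candidate)
failed = [name for name, ok in results.items() if not ok]
print(failed)  # → ['cost_within_5pct']
```

Returning a dictionary of named booleans rather than a single pass/fail value keeps the failing criterion visible, which is useful when deciding whether to iterate on the prompt or roll back.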
Related Skills
continuous-improvement-focus
Emphasizes continuous improvement by suggesting process improvements and looking for opportunities to simplify and optimize code and workflows. This rule promotes a culture of ongoing refinement.
beads-orchestration
Multi-agent orchestration for GitHub Issues using BEADS task tracking
apache-airflow-orchestration
Complete guide for Apache Airflow orchestration including DAGs, operators, sensors, XComs, task dependencies, dynamic workflows, and production deployment
workflow-orchestration
Design and implement DAG-based workflows with parallel execution, retries, and error handling. Use when building complex multi-step agent workflows.
ptc-orchestration
Activate when user needs multi-URL scraping, browser automation pipelines, or efficient tool orchestration to reduce API round-trips and context usage.
self-improvement
Zoe's self-improvement system - learns from corrections and user preferences
sk-prompt-improver
Prompt engineering specialist that transforms vague requests into structured, scored AI prompts using 7 proven frameworks (RCAF, COSTAR, RACE, CIDI, TIDD-EC, CRISPE, CRAFT), DEPTH thinking methodology, and CLEAR scoring across text modes.
prompt-improver
Improve prompts for AI agents and Telegram bots using OpenAI's prompt engineering best practices. Analyzes clarity, specificity, context, and output format. Returns structured improvements.
claude-improve-config
Self-reflect on the current session to identify mistakes and propose improvements to .claude configuration (CLAUDE.md, hooks, skills).
ai-orchestration-feedback-loop
Multi-AI engineering loop orchestrating Claude, Codex, and Gemini for comprehensive validation. USE WHEN (1) mission-critical features requiring multi-perspective validation, (2) complex architectural decisions needing diverse AI viewpoints, (3) security-sensitive code requiring deep analysis, (4) user explicitly requests multi-AI review or triple-AI loop. DO NOT USE for simple features or single-file changes. MODES - Triple-AI (full coverage), Dual-AI Codex-Claude (security/logic), Dual-AI Gemini-Claude (UX/creativity).
agents-md-improver
Keeps repo-local agent instructions consistent by proposing updates to AGENTS.md when a user corrects the coding agent or asks to change AGENTS.md, CLAUDE.md, .claude/CLAUDE.md, or GEMINI.md.
agentic-orchestration
Patterns for multi-agent coordination, task decomposition, handoffs, and workflow orchestration. Best practices for building and managing agent systems.