agent-orchestration-improve-agent

Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.

16 stars

Best use case

agent-orchestration-improve-agent is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.

Teams using agent-orchestration-improve-agent should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/agent-orchestration-improve-agent/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/agent-orchestration-improve-agent/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/agent-orchestration-improve-agent/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How agent-orchestration-improve-agent Compares

Feature / Agentagent-orchestration-improve-agentStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Agent Performance Optimization Workflow

Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.

[Extended thinking: Agent optimization requires a data-driven approach combining performance metrics, user feedback analysis, and advanced prompt engineering techniques. Success depends on systematic evaluation, targeted improvements, and rigorous testing with rollback capabilities for production safety.]

## Use this skill when

- Improving an existing agent's performance or reliability
- Analyzing failure modes, prompt quality, or tool usage
- Running structured A/B tests or evaluation suites
- Designing iterative optimization workflows for agents

## Do not use this skill when

- You are building a brand-new agent from scratch
- There are no metrics, feedback, or test cases available
- The task is unrelated to agent performance or prompt quality

## Instructions

1. Establish baseline metrics and collect representative examples.
2. Identify failure modes and prioritize high-impact fixes.
3. Apply prompt and workflow improvements with measurable goals.
4. Validate with tests and roll out changes in controlled stages.

## Safety

- Avoid deploying prompt changes without regression testing.
- Roll back quickly if quality or safety metrics regress.

## Phase 1: Performance Analysis and Baseline Metrics

Comprehensive analysis of agent performance using context-manager for historical data collection.

### 1.1 Gather Performance Data

```
Use: context-manager
Command: analyze-agent-performance $ARGUMENTS --days 30
```

Collect metrics including:

- Task completion rate (successful vs failed tasks)
- Response accuracy and factual correctness
- Tool usage efficiency (correct tools, call frequency)
- Average response time and token consumption
- User satisfaction indicators (corrections, retries)
- Hallucination incidents and error patterns

### 1.2 User Feedback Pattern Analysis

Identify recurring patterns in user interactions:

- **Correction patterns**: Where users consistently modify outputs
- **Clarification requests**: Common areas of ambiguity
- **Task abandonment**: Points where users give up
- **Follow-up questions**: Indicators of incomplete responses
- **Positive feedback**: Successful patterns to preserve

### 1.3 Failure Mode Classification

Categorize failures by root cause:

- **Instruction misunderstanding**: Role or task confusion
- **Output format errors**: Structure or formatting issues
- **Context loss**: Long conversation degradation
- **Tool misuse**: Incorrect or inefficient tool selection
- **Constraint violations**: Safety or business rule breaches
- **Edge case handling**: Unusual input scenarios

### 1.4 Baseline Performance Report

Generate quantitative baseline metrics:

```
Performance Baseline:
- Task Success Rate: [X%]
- Average Corrections per Task: [Y]
- Tool Call Efficiency: [Z%]
- User Satisfaction Score: [1-10]
- Average Response Latency: [Xms]
- Token Efficiency Ratio: [X:Y]
```

## Phase 2: Prompt Engineering Improvements

Apply advanced prompt optimization techniques using prompt-engineer agent.

### 2.1 Chain-of-Thought Enhancement

Implement structured reasoning patterns:

```
Use: prompt-engineer
Technique: chain-of-thought-optimization
```

- Add explicit reasoning steps: "Let's approach this step-by-step..."
- Include self-verification checkpoints: "Before proceeding, verify that..."
- Implement recursive decomposition for complex tasks
- Add reasoning trace visibility for debugging

### 2.2 Few-Shot Example Optimization

Curate high-quality examples from successful interactions:

- **Select diverse examples** covering common use cases
- **Include edge cases** that previously failed
- **Show both positive and negative examples** with explanations
- **Order examples** from simple to complex
- **Annotate examples** with key decision points

Example structure:

```
Good Example:
Input: [User request]
Reasoning: [Step-by-step thought process]
Output: [Successful response]
Why this works: [Key success factors]

Bad Example:
Input: [Similar request]
Output: [Failed response]
Why this fails: [Specific issues]
Correct approach: [Fixed version]
```

### 2.3 Role Definition Refinement

Strengthen agent identity and capabilities:

- **Core purpose**: Clear, single-sentence mission
- **Expertise domains**: Specific knowledge areas
- **Behavioral traits**: Personality and interaction style
- **Tool proficiency**: Available tools and when to use them
- **Constraints**: What the agent should NOT do
- **Success criteria**: How to measure task completion

### 2.4 Constitutional AI Integration

Implement self-correction mechanisms:

```
Constitutional Principles:
1. Verify factual accuracy before responding
2. Self-check for potential biases or harmful content
3. Validate output format matches requirements
4. Ensure response completeness
5. Maintain consistency with previous responses
```

Add critique-and-revise loops:

- Initial response generation
- Self-critique against principles
- Automatic revision if issues detected
- Final validation before output

### 2.5 Output Format Tuning

Optimize response structure:

- **Structured templates** for common tasks
- **Dynamic formatting** based on complexity
- **Progressive disclosure** for detailed information
- **Markdown optimization** for readability
- **Code block formatting** with syntax highlighting
- **Table and list generation** for data presentation

## Phase 3: Testing and Validation

Comprehensive testing framework with A/B comparison.

### 3.1 Test Suite Development

Create representative test scenarios:

```
Test Categories:
1. Golden path scenarios (common successful cases)
2. Previously failed tasks (regression testing)
3. Edge cases and corner scenarios
4. Stress tests (complex, multi-step tasks)
5. Adversarial inputs (potential breaking points)
6. Cross-domain tasks (combining capabilities)
```

### 3.2 A/B Testing Framework

Compare original vs improved agent:

```
Use: parallel-test-runner
Config:
  - Agent A: Original version
  - Agent B: Improved version
  - Test set: 100 representative tasks
  - Metrics: Success rate, speed, token usage
  - Evaluation: Blind human review + automated scoring
```

Statistical significance testing:

- Minimum sample size: 100 tasks per variant
- Confidence level: 95% (p < 0.05)
- Effect size calculation (Cohen's d)
- Power analysis for future tests

### 3.3 Evaluation Metrics

Comprehensive scoring framework:

**Task-Level Metrics:**

- Completion rate (binary success/failure)
- Correctness score (0-100% accuracy)
- Efficiency score (steps taken vs optimal)
- Tool usage appropriateness
- Response relevance and completeness

**Quality Metrics:**

- Hallucination rate (factual errors per response)
- Consistency score (alignment with previous responses)
- Format compliance (matches specified structure)
- Safety score (constraint adherence)
- User satisfaction prediction

**Performance Metrics:**

- Response latency (time to first token)
- Total generation time
- Token consumption (input + output)
- Cost per task (API usage fees)
- Memory/context efficiency

### 3.4 Human Evaluation Protocol

Structured human review process:

- Blind evaluation (evaluators don't know version)
- Standardized rubric with clear criteria
- Multiple evaluators per sample (inter-rater reliability)
- Qualitative feedback collection
- Preference ranking (A vs B comparison)

## Phase 4: Version Control and Deployment

Safe rollout with monitoring and rollback capabilities.

### 4.1 Version Management

Systematic versioning strategy:

```
Version Format: agent-name-v[MAJOR].[MINOR].[PATCH]
Example: customer-support-v2.3.1

MAJOR: Significant capability changes
MINOR: Prompt improvements, new examples
PATCH: Bug fixes, minor adjustments
```

Maintain version history:

- Git-based prompt storage
- Changelog with improvement details
- Performance metrics per version
- Rollback procedures documented

### 4.2 Staged Rollout

Progressive deployment strategy:

1. **Alpha testing**: Internal team validation (5% traffic)
2. **Beta testing**: Selected users (20% traffic)
3. **Canary release**: Gradual increase (20% → 50% → 100%)
4. **Full deployment**: After success criteria met
5. **Monitoring period**: 7-day observation window

### 4.3 Rollback Procedures

Quick recovery mechanism:

```
Rollback Triggers:
- Success rate drops >10% from baseline
- Critical errors increase >5%
- User complaints spike
- Cost per task increases >20%
- Safety violations detected

Rollback Process:
1. Detect issue via monitoring
2. Alert team immediately
3. Switch to previous stable version
4. Analyze root cause
5. Fix and re-test before retry
```

### 4.4 Continuous Monitoring

Real-time performance tracking:

- Dashboard with key metrics
- Anomaly detection alerts
- User feedback collection
- Automated regression testing
- Weekly performance reports

## Success Criteria

Agent improvement is successful when:

- Task success rate improves by ≥15%
- User corrections decrease by ≥25%
- No increase in safety violations
- Response time remains within 10% of baseline
- Cost per task doesn't increase >5%
- Positive user feedback increases

## Post-Deployment Review

After 30 days of production use:

1. Analyze accumulated performance data
2. Compare against baseline and targets
3. Identify new improvement opportunities
4. Document lessons learned
5. Plan next optimization cycle

## Continuous Improvement Cycle

Establish regular improvement cadence:

- **Weekly**: Monitor metrics and collect feedback
- **Monthly**: Analyze patterns and plan improvements
- **Quarterly**: Major version updates with new capabilities
- **Annually**: Strategic review and architecture updates

Remember: Agent optimization is an iterative process. Each cycle builds upon previous learnings, gradually improving performance while maintaining stability and safety.

Related Skills

continuous-improvement-focus

16
from diegosouzapw/awesome-omni-skill

Emphasizes continuous improvement by suggesting process improvements and looking for opportunities to simplify and optimize code and workflows. This rule promotes a culture of ongoing refinement.

beads-orchestration

16
from diegosouzapw/awesome-omni-skill

Multi-agent orchestration for GitHub Issues using BEADS task tracking

apache-airflow-orchestration

16
from diegosouzapw/awesome-omni-skill

Complete guide for Apache Airflow orchestration including DAGs, operators, sensors, XComs, task dependencies, dynamic workflows, and production deployment

workflow-orchestration

16
from diegosouzapw/awesome-omni-skill

Design and implement DAG-based workflows with parallel execution, retries, and error handling. Use when building complex multi-step agent workflows.

ptc-orchestration

16
from diegosouzapw/awesome-omni-skill

Activate when user needs multi-URL scraping, browser automation pipelines, or efficient tool orchestration to reduce API round-trips and context usage.

self-improvement

16
from diegosouzapw/awesome-omni-skill

Zoe's self-improvement system - learns from corrections and user preferences

sk-prompt-improver

16
from diegosouzapw/awesome-omni-skill

Prompt engineering specialist that transforms vague requests into structured, scored AI prompts using 7 proven frameworks (RCAF, COSTAR, RACE, CIDI, TIDD-EC, CRISPE, CRAFT), DEPTH thinking methodology, and CLEAR scoring across text modes.

prompt-improver

16
from diegosouzapw/awesome-omni-skill

Improve prompts for AI agents and Telegram bots using OpenAI's prompt engineering best practices. Analyzes clarity, specificity, context, and output format. Returns structured improvements.

claude-improve-config

16
from diegosouzapw/awesome-omni-skill

Self-reflect on the current session to identify mistakes and propose improvements to .claude configuration (CLAUDE.md, hooks, skills).

ai-orchestration-feedback-loop

16
from diegosouzapw/awesome-omni-skill

Multi-AI engineering loop orchestrating Claude, Codex, and Gemini for comprehensive validation. USE WHEN (1) mission-critical features requiring multi-perspective validation, (2) complex architectural decisions needing diverse AI viewpoints, (3) security-sensitive code requiring deep analysis, (4) user explicitly requests multi-AI review or triple-AI loop. DO NOT USE for simple features or single-file changes. MODES - Triple-AI (full coverage), Dual-AI Codex-Claude (security/logic), Dual-AI Gemini-Claude (UX/creativity).

agents-md-improver

16
from diegosouzapw/awesome-omni-skill

Keeps repo-local agent instructions consistent by proposing updates to AGENTS.md when a user corrects the coding agent or asks to change AGENTS.md, CLAUDE.md, .claude/CLAUDE.md, or GEMINI.md.

agentic-orchestration

16
from diegosouzapw/awesome-omni-skill

Patterns for multi-agent coordination, task decomposition, handoffs, and workflow orchestration. Best practices for building and managing agent systems.