performance-tracking

Track agent, skill, and model performance metrics for optimization. Use when measuring agent success rates, tracking model latency, analyzing routing effectiveness, or optimizing cost-per-task. Trigger keywords - "performance", "metrics", "tracking", "success rate", "agent performance", "model latency", "cost tracking", "optimization", "routing metrics".
by MadAppGang
Best use case

performance-tracking is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using performance-tracking can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.
Installation

Claude Code / Cursor / Codex
$ curl -o ~/.claude/skills/performance-tracking/SKILL.md --create-dirs "https://raw.githubusercontent.com/MadAppGang/claude-code/main/plugins/multimodel/skills/performance-tracking/SKILL.md"
Manual Installation
Download SKILL.md from GitHub
Place it in .claude/skills/performance-tracking/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill
How performance-tracking Compares

Feature / Agent	performance-tracking	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A
Frequently Asked Questions

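What does this skill do?

It tracks agent, skill, and model performance metrics so you can measure agent success rates, model latency, routing effectiveness, and cost per task.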
Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source

# Performance Tracking

**Version:** 1.0.0
**Purpose:** Track agent, skill, and model performance metrics for continuous optimization
**Status:** Production Ready

## Overview

Performance tracking transforms workflows from "fire and forget" to **data-driven optimization systems**. By measuring what actually works, you can route tasks more effectively, identify failing patterns early, and reduce costs.

This skill provides battle-tested patterns for:
- **Agent success tracking** (completion rates, confidence scores, task type affinity)
- **Skill effectiveness** (activation counts, success correlation, usage patterns)
- **Model performance** (latency, cost, quality, provider comparison)
- **Routing optimization** (tier distribution, routing accuracy, cost efficiency)
- **Historical analysis** (trend detection, degradation alerts, pattern discovery)

Performance tracking enables **continuous improvement** by providing the data needed to make informed decisions about agent selection, model choice, and workflow routing.

### Why Track Performance

**Optimize Routing:**
- Identify which agents excel at specific task types
- Route complex tasks to high-confidence agents
- Avoid agents with low success rates for critical work

**Identify Failing Agents:**
- Detect agents with <70% success rate
- Alert when agent performance degrades
- Replace or retrain underperforming agents

**Reduce Costs:**
- Find cost-effective model alternatives
- Identify expensive agents with low success rates
- Optimize tier thresholds based on actual performance

**Improve Quality:**
- Track correlation between confidence scores and success
- Identify patterns in successful implementations
- Learn which models produce best results for task types

### What We Track

**Agent Metrics:**
- Total runs, success/failure counts
- Average confidence scores
- Task type distribution
- Last used timestamp
- Individual execution history

**Skill Metrics:**
- Activation counts per skill
- Last activation timestamps
- Success correlation (when skill active, what's success rate?)
- Co-activation patterns

**Model Metrics:**
- Total runs, success/failure counts
- Average latency (response time)
- Total cost (cumulative spend)
- Cost per successful task
- Last used timestamp

**Routing Metrics:**
- Tier distribution (how often each tier selected)
- Routing accuracy (did tier match complexity?)
- Cost efficiency (tier1 vs tier4 cost ratio)
- Decision history with outcomes

### Integration with task-complexity-router

The performance tracker provides critical feedback to the task-complexity-router:

```
Routing Feedback Loop:

1. Router selects tier based on complexity
   → task-complexity-router analyzes task
   → Routes to tier2 (medium complexity)

2. Agent executes task
   → Records: tier=2, agent=ui-developer, result=success

3. Performance tracker updates metrics
   → tier2 usage +1
   → ui-developer success +1
   → Confidence in tier2 routing increases

4. Future routing decisions informed by history
   → Router sees tier2 has 85% success rate
   → Router sees ui-developer excels at UI tasks
   → Router confidently routes similar tasks to tier2
```

---

## Metrics Schema

### JSON Structure (Version 1.0.0)

Store performance metrics in `.claude/agent-performance.json`:

```json
{
  "schemaVersion": "1.0.0",
  "lastUpdated": "2026-01-28T15:30:00Z",
  "agents": {
    "ui-developer": {
      "totalRuns": 42,
      "successCount": 38,
      "failureCount": 4,
      "avgConfidence": 0.85,
      "lastUsed": "2026-01-28T15:30:00Z",
      "taskTypes": {
        "implement-component": 15,
        "fix-styling": 12,
        "refactor-ui": 8,
        "review-code": 7
      },
      "history": [
        {
          "timestamp": "2026-01-28T15:30:00Z",
          "taskType": "implement-component",
          "result": "success",
          "confidence": 0.90,
          "duration": 45000,
          "tier": 2,
          "model": "claude-sonnet-4-5-20250929"
        },
        {
          "timestamp": "2026-01-28T14:20:00Z",
          "taskType": "fix-styling",
          "result": "success",
          "confidence": 0.85,
          "duration": 30000,
          "tier": 1,
          "model": "claude-sonnet-4-5-20250929"
        }
      ]
    },
    "backend-developer": {
      "totalRuns": 28,
      "successCount": 25,
      "failureCount": 3,
      "avgConfidence": 0.88,
      "lastUsed": "2026-01-28T14:00:00Z",
      "taskTypes": {
        "implement-api": 12,
        "fix-bug": 8,
        "database-migration": 5,
        "write-tests": 3
      },
      "history": []
    }
  },
  "skills": {
    "multi-model-validation": {
      "activations": 15,
      "lastActivated": "2026-01-28T15:00:00Z",
      "successCorrelation": 0.92,
      "coActivations": {
        "quality-gates": 12,
        "error-recovery": 8
      }
    },
    "task-complexity-router": {
      "activations": 68,
      "lastActivated": "2026-01-28T15:30:00Z",
      "successCorrelation": 0.85,
      "coActivations": {
        "multi-agent-coordination": 45,
        "hierarchical-coordinator": 30
      }
    }
  },
  "models": {
    "claude-sonnet-4-5-20250929": {
      "totalRuns": 120,
      "successCount": 108,
      "failureCount": 12,
      "avgLatency": 2500,
      "totalCost": 0.45,
      "lastUsed": "2026-01-28T15:30:00Z",
      "taskTypePerformance": {
        "code-review": { "success": 25, "failure": 2 },
        "implementation": { "success": 40, "failure": 5 },
        "testing": { "success": 20, "failure": 3 }
      }
    },
    "x-ai/grok-code-fast-1": {
      "totalRuns": 35,
      "successCount": 30,
      "failureCount": 5,
      "avgLatency": 1800,
      "totalCost": 0.08,
      "lastUsed": "2026-01-28T13:00:00Z",
      "taskTypePerformance": {
        "code-review": { "success": 18, "failure": 2 },
        "implementation": { "success": 12, "failure": 3 }
      }
    }
  },
  "routing": {
    "tierDistribution": {
      "tier1": 45,
      "tier2": 30,
      "tier3": 15,
      "tier4": 8
    },
    "decisions": [
      {
        "timestamp": "2026-01-28T15:30:00Z",
        "taskType": "implement-component",
        "complexity": "medium",
        "selectedTier": 2,
        "agent": "ui-developer",
        "result": "success",
        "cost": 0.003
      },
      {
        "timestamp": "2026-01-28T14:20:00Z",
        "taskType": "fix-styling",
        "complexity": "low",
        "selectedTier": 1,
        "agent": "ui-developer",
        "result": "success",
        "cost": 0.001
      }
    ]
  }
}
```

### Schema Field Definitions

**Agent Metrics:**
- `totalRuns`: Total task executions
- `successCount`: Tasks completed successfully
- `failureCount`: Tasks that failed or required retry
- `avgConfidence`: Rolling average of agent confidence scores (0.0-1.0)
- `lastUsed`: ISO-8601 timestamp of last execution
- `taskTypes`: Distribution of task types (understand agent specialization)
- `history`: Array of recent executions (max 100 entries, FIFO)

**Skill Metrics:**
- `activations`: Total times skill was triggered
- `lastActivated`: ISO-8601 timestamp
- `successCorrelation`: Success rate when this skill is active (0.0-1.0)
- `coActivations`: Skills frequently activated together (detect patterns)

**Model Metrics:**
- `totalRuns`: Total executions
- `successCount`/`failureCount`: Outcome tracking
- `avgLatency`: Average response time in milliseconds
- `totalCost`: Cumulative spend in USD
- `lastUsed`: ISO-8601 timestamp
- `taskTypePerformance`: Success/failure breakdown by task type

**Routing Metrics:**
- `tierDistribution`: Count of tasks routed to each tier
- `decisions`: Array of routing decisions with outcomes (max 100, FIFO)
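
A minimal sketch of loading and saving this file in Python. The helper names (`empty_metrics`, `load_metrics`, `save_metrics`) are hypothetical and not part of the skill; only the top-level keys and the file location come from the schema above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical helpers; the skill itself does not define these names.
DEFAULT_PATH = Path(".claude/agent-performance.json")


def now_iso() -> str:
    """Current UTC time as an ISO-8601 string with a trailing Z."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def empty_metrics() -> dict:
    """Return an empty document with the top-level keys of schema 1.0.0."""
    return {
        "schemaVersion": "1.0.0",
        "lastUpdated": now_iso(),
        "agents": {},
        "skills": {},
        "models": {},
        "routing": {"tierDistribution": {}, "decisions": []},
    }


def load_metrics(path: Path = DEFAULT_PATH) -> dict:
    """Load the metrics file, or start fresh if it does not exist yet."""
    if not path.exists():
        return empty_metrics()
    return json.loads(path.read_text(encoding="utf-8"))


def save_metrics(metrics: dict, path: Path = DEFAULT_PATH) -> None:
    """Stamp lastUpdated and write the file, creating the directory if needed."""
    metrics["lastUpdated"] = now_iso()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
```

Passing the path as a parameter keeps the helpers usable in tests without touching the real project file.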

---

## Tracking Patterns

### Pattern 1: Capturing Agent Performance

**After Agent Completes Task:**

```
Execution Flow:

1. Agent executes task
   Agent: ui-developer
   Input: "Implement login form component"
   Result: Success
   Confidence: 0.90
   Duration: 45 seconds
   Tier: 2
   Model: claude-sonnet-4-5-20250929

2. Update agent metrics
   Read: .claude/agent-performance.json
   Update:
     agents["ui-developer"].totalRuns += 1
     agents["ui-developer"].successCount += 1
     agents["ui-developer"].avgConfidence = rolling_avg(0.90)
     agents["ui-developer"].lastUsed = NOW
     agents["ui-developer"].taskTypes["implement-component"] += 1
     agents["ui-developer"].history.push({
       timestamp: NOW,
       taskType: "implement-component",
       result: "success",
       confidence: 0.90,
       duration: 45000,
       tier: 2,
       model: "claude-sonnet-4-5-20250929"
     })
   Trim history if > 100 entries
   Write: .claude/agent-performance.json

3. Calculate derived metrics
   Success rate: successCount / totalRuns = 38/42 = 90.5%
   Avg duration: sum(history.duration) / history.length
   Task affinity: taskTypes sorted by count
```
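
The update step above could be sketched in Python as follows. `record_agent_run` is a hypothetical name, and `rolling_avg` is interpreted here as an incrementally updated cumulative mean, which is an assumption: the flow does not define it.

```python
MAX_HISTORY = 100  # history cap from the schema field definitions


def record_agent_run(metrics: dict, agent: str, entry: dict) -> dict:
    """Apply one execution record (a dict shaped like a history entry)."""
    stats = metrics.setdefault("agents", {}).setdefault(agent, {
        "totalRuns": 0, "successCount": 0, "failureCount": 0,
        "avgConfidence": 0.0, "lastUsed": None, "taskTypes": {}, "history": [],
    })
    stats["totalRuns"] += 1
    if entry["result"] == "success":
        stats["successCount"] += 1
    else:
        stats["failureCount"] += 1
    # Assumption: rolling_avg is the cumulative mean, updated incrementally.
    n = stats["totalRuns"]
    stats["avgConfidence"] += (entry["confidence"] - stats["avgConfidence"]) / n
    stats["lastUsed"] = entry["timestamp"]
    task_type = entry["taskType"]
    stats["taskTypes"][task_type] = stats["taskTypes"].get(task_type, 0) + 1
    stats["history"].append(entry)
    # FIFO trim: keep only the newest MAX_HISTORY entries.
    del stats["history"][:-MAX_HISTORY]
    return stats
```

The same function handles failures: pass an entry with `"result": "failure"` and the failure counter is incremented instead.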

**After Agent Fails:**

```
Failure Flow:

1. Agent fails task
   Agent: backend-developer
   Input: "Implement complex payment flow"
   Result: Failure (error, timeout, or low quality)
   Confidence: 0.65
   Tier: 3

2. Update failure metrics
   agents["backend-developer"].totalRuns += 1
   agents["backend-developer"].failureCount += 1
   agents["backend-developer"].avgConfidence = rolling_avg(0.65)
   agents["backend-developer"].history.push({
     timestamp: NOW,
     taskType: "implement-api",
     result: "failure",
     confidence: 0.65,
     duration: 120000,
     tier: 3,
     error: "Exceeded max iterations"
   })

3. Check for degradation
   If failureCount / totalRuns > 0.30:
     Alert: "backend-developer has >30% failure rate"
     Recommendation: "Review recent failures, retrain, or replace"
```

### Pattern 2: Tracking Model Performance

**After Model Execution:**

```
Execution Flow:

1. Model completes task
   Model: x-ai/grok-code-fast-1
   Task: Code review
   Latency: 1800ms
   Cost: $0.002
   Result: Success

2. Update model metrics
   models["x-ai/grok-code-fast-1"].totalRuns += 1
   models["x-ai/grok-code-fast-1"].successCount += 1
   models["x-ai/grok-code-fast-1"].avgLatency = rolling_avg(1800)
   models["x-ai/grok-code-fast-1"].totalCost += 0.002
   models["x-ai/grok-code-fast-1"].lastUsed = NOW
   models["x-ai/grok-code-fast-1"].taskTypePerformance["code-review"].success += 1

3. Compare model performance
   Claude Sonnet: avgLatency=2500ms, cost=$0.45 (120 runs)
   Grok Fast: avgLatency=1800ms, cost=$0.08 (35 runs)

   Analysis:
   - Grok is 28% faster (1800ms vs 2500ms)
   - Grok is 39% cheaper per run ($0.0023 vs $0.0038)
   - Both have similar success rates (86% vs 90%)

   Recommendation:
   - Use Grok for cost-sensitive tasks
   - Use Claude for critical tasks (higher success rate)
```
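
The comparison in step 3 can be derived from the stored totals. This is a sketch with a hypothetical `compare_models` helper; it works on two entries shaped like the `models` records in the schema.

```python
def compare_models(a: dict, b: dict) -> dict:
    """Relative differences of model b versus model a (negative = b is lower)."""
    def rel(x: float, y: float) -> float:
        return (y - x) / x

    cost_a = a["totalCost"] / a["totalRuns"]  # cost per run, model a
    cost_b = b["totalCost"] / b["totalRuns"]  # cost per run, model b
    return {
        "latency": rel(a["avgLatency"], b["avgLatency"]),
        "costPerRun": rel(cost_a, cost_b),
        "successRateA": a["successCount"] / a["totalRuns"],
        "successRateB": b["successCount"] / b["totalRuns"],
    }
```

Comparing per-run cost rather than cumulative `totalCost` matters when the models have very different run counts, as they do in the example data (120 vs 35 runs).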

### Pattern 3: Recording Skill Activation

**After Skill Activation:**

```
Activation Flow:

1. Skill triggers
   Skill: multi-model-validation
   Context: User requested /review with 3 models

2. Update skill metrics
   skills["multi-model-validation"].activations += 1
   skills["multi-model-validation"].lastActivated = NOW

3. Track co-activation
   Active skills: ["multi-model-validation", "quality-gates"]
   skills["multi-model-validation"].coActivations["quality-gates"] += 1

4. Calculate success correlation
   Tasks with this skill active: 15
   Successful tasks: 14
   Success correlation: 14/15 = 93.3%

5. Pattern detection
   Observation: multi-model-validation + quality-gates = 100% success (12/12)
   Recommendation: Always pair these skills for high-quality reviews
```
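
A sketch of steps 2 and 3 in Python. `record_skill_activation` is a hypothetical helper; it assumes the caller knows which skills are active for the current task and counts each pair symmetrically.

```python
def record_skill_activation(metrics: dict, active_skills: list, timestamp: str) -> None:
    """Count one activation for every active skill and each co-active pair."""
    skills = metrics.setdefault("skills", {})
    for name in active_skills:
        stats = skills.setdefault(name, {
            "activations": 0, "lastActivated": None,
            "successCorrelation": 0.0, "coActivations": {},
        })
        stats["activations"] += 1
        stats["lastActivated"] = timestamp
        # Every other skill active at the same time counts as a co-activation.
        for other in active_skills:
            if other != name:
                co = stats["coActivations"]
                co[other] = co.get(other, 0) + 1
```

Updating `successCorrelation` needs the task outcome, so it would happen later, once the task result is known.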

### Pattern 4: Routing Decision Tracking

**After Routing Decision:**

```
Routing Flow:

1. Router selects tier
   Task: "Implement user profile page"
   Analysis: Medium complexity (multiple components, state management)
   Selected tier: 2
   Agent: ui-developer
   Model: claude-sonnet-4-5-20250929

2. Record routing decision
   routing.tierDistribution["tier2"] += 1
   routing.decisions.push({
     timestamp: NOW,
     taskType: "implement-component",
     complexity: "medium",
     selectedTier: 2,
     agent: "ui-developer",
     result: "pending"
   })

3. After task completes
   Update decision with result:
   routing.decisions[last].result = "success"
   routing.decisions[last].cost = 0.003

4. Trim decision history if > 100 entries
```
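
One way to implement the pending-then-update flow, sketched in Python. Addressing the entry as `decisions[last]` breaks if another decision is recorded before the task finishes, so this sketch adds an `id` field. That field and the helper names are assumptions and are not part of schema 1.0.0.

```python
import uuid

MAX_DECISIONS = 100  # cap from the schema field definitions


def open_decision(metrics: dict, decision: dict) -> str:
    """Record a pending routing decision and return its id.

    The "id" field is an assumption added here; it is not in schema 1.0.0.
    """
    routing = metrics.setdefault("routing", {})
    dist = routing.setdefault("tierDistribution", {})
    tier_key = f"tier{decision['selectedTier']}"
    dist[tier_key] = dist.get(tier_key, 0) + 1
    decision_id = uuid.uuid4().hex
    decisions = routing.setdefault("decisions", [])
    decisions.append({**decision, "id": decision_id, "result": "pending"})
    del decisions[:-MAX_DECISIONS]  # FIFO trim
    return decision_id


def close_decision(metrics: dict, decision_id: str, result: str, cost: float) -> bool:
    """Fill in the outcome; returns False if the entry is no longer present."""
    for d in metrics["routing"]["decisions"]:
        if d.get("id") == decision_id:
            d["result"] = result
            d["cost"] = cost
            return True
    return False
```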

### Pattern 5: Session-Level Aggregation

**End of Session Summary:**

```
Session Summary Flow:

1. Aggregate session metrics
   Session ID: 2026-01-28-session-15
   Duration: 2 hours
   Tasks executed: 15
   Success rate: 14/15 = 93.3%
   Total cost: $0.045
   Models used: Claude (12), Grok (3)

2. Create session snapshot
   File: ai-docs/performance-history/2026-01-28-session-15.json
   Content:
     {
       "sessionId": "2026-01-28-session-15",
       "startTime": "2026-01-28T13:00:00Z",
       "endTime": "2026-01-28T15:00:00Z",
       "duration": 7200000,
       "tasks": 15,
       "successRate": 0.933,
       "totalCost": 0.045,
       "modelUsage": { "claude": 12, "grok": 3 },
       "topAgents": ["ui-developer", "backend-developer"],
       "activeSkills": ["task-complexity-router", "multi-model-validation"]
     }

3. Update rolling metrics
   .claude/agent-performance.json (persistent)
   ai-docs/performance-history/ (snapshots)

4. Cleanup old snapshots
   Keep last 100 session snapshots
   Delete older entries
```
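
Steps 2 and 4 might look like this in Python. The helper names are hypothetical. The sketch parses the date and session number from the file name, because a plain alphabetical sort would place `session-10` before `session-2`.

```python
import json
import re
from pathlib import Path

SNAPSHOT_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})-session-(\d+)\.json$")
MAX_SNAPSHOTS = 100  # retention from the flow above


def write_snapshot(directory: Path, snapshot: dict) -> Path:
    """Write one session summary as {sessionId}.json."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{snapshot['sessionId']}.json"
    path.write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
    return path


def prune_snapshots(directory: Path, keep: int = MAX_SNAPSHOTS) -> list:
    """Delete the oldest snapshots beyond `keep`; returns the deleted names."""
    found = []
    for p in directory.glob("*-session-*.json"):
        m = SNAPSHOT_RE.match(p.name)
        if m:
            # Sort by date, then by numeric session index.
            found.append((m.group(1), int(m.group(2)), p))
    found.sort(key=lambda t: (t[0], t[1]))
    excess = found[:-keep] if keep > 0 else found
    for _, _, p in excess:
        p.unlink()
    return [p.name for _, _, p in excess]
```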

---

## File Location and Management

### Primary Performance File

**Location:** `.claude/agent-performance.json`

**Purpose:** Persistent, project-level performance tracking

**When to Update:**
- After every agent execution
- After every model execution
- After every skill activation
- After every routing decision

**Format:** JSON schema version 1.0.0 (see Metrics Schema section)

**Rotation:** Keep aggregate counters for the life of the file, but trim individual history arrays to 100 entries

### Session Snapshots

**Location:** `ai-docs/performance-history/`

**Purpose:** Point-in-time session summaries for historical analysis

**Naming:** `{YYYY-MM-DD}-session-{N}.json`

**Example:**
```
ai-docs/performance-history/
  2026-01-28-session-1.json
  2026-01-28-session-2.json
  2026-01-27-session-1.json
  ...
```

**Retention:** Keep last 100 sessions, delete older

### Integration with Existing Files

**Relationship with ai-docs/llm-performance.json:**

```
Comparison:

llm-performance.json (existing):
  - Model-specific performance
  - Cost tracking per model
  - Response time tracking
  - Used by multi-model-validation

agent-performance.json (new):
  - Agent-level metrics (multi-run aggregation)
  - Skill activation tracking
  - Routing decision history
  - Task type affinity

Integration:
  - agent-performance.json imports model data from llm-performance.json
  - Both files updated in parallel
  - llm-performance.json focuses on single-run details
  - agent-performance.json focuses on aggregate trends
```

**Migration Path:**

```
Step 1: Create .claude/agent-performance.json with schema 1.0.0
Step 2: Import historical data from llm-performance.json
Step 3: Update both files going forward
Step 4: Deprecate llm-performance.json after 6 months (optional)
```

### Data Cleanup and Rotation

**Automatic Cleanup:**

```
Cleanup Rules:

1. Agent history arrays
   Max entries: 100
   Strategy: FIFO (oldest removed first)
   Trigger: After every agent execution

2. Routing decision arrays
   Max entries: 100
   Strategy: FIFO
   Trigger: After every routing decision

3. Session snapshots
   Max files: 100
   Strategy: FIFO (delete oldest session files)
   Trigger: After every session ends

4. Skill co-activation maps
   Max entries per skill: 50
   Strategy: Keep top 50 by count
   Trigger: Weekly cleanup
```
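
Rule 4 is the only rule that is not plain FIFO. A small sketch with a hypothetical `trim_co_activations` helper, operating on one skill record from the schema:

```python
def trim_co_activations(stats: dict, keep: int = 50) -> None:
    """Keep only the `keep` most frequently co-activated skills."""
    co = stats.get("coActivations", {})
    top = sorted(co.items(), key=lambda kv: kv[1], reverse=True)[:keep]
    stats["coActivations"] = dict(top)
```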

**Manual Cleanup:**

```
When to manually reset:

1. After major workflow changes
   - Agent capabilities changed
   - New skills added
   - Routing logic updated
   → Reset metrics to start fresh

2. After agent retraining
   - Agent prompt updated
   - Agent model changed
   → Reset agent-specific metrics

3. After prolonged period (>6 months)
   - Metrics may be outdated
   → Archive old data, start fresh

How to reset:
  Backup: cp .claude/agent-performance.json .claude/agent-performance-backup-{DATE}.json
  Reset: echo '{"schemaVersion":"1.0.0","lastUpdated":"...","agents":{},...}' > .claude/agent-performance.json
```

---

## Using Metrics for Optimization

### Optimization 1: Identify Underperforming Agents

**Detection:**

```
Analyze agent success rates:

agents["ui-developer"]:
  successCount: 38
  totalRuns: 42
  success rate: 38/42 = 90.5% ✅ GOOD

agents["test-architect"]:
  successCount: 15
  totalRuns: 25
  success rate: 15/25 = 60% ❌ UNDERPERFORMING

Threshold: <70% success rate = underperforming
```
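
The detection step reduces to a scan over the `agents` map. A minimal sketch, with `underperforming_agents` as a hypothetical name and the 70% threshold taken from above:

```python
def underperforming_agents(metrics: dict, threshold: float = 0.70) -> dict:
    """Map agent name -> success rate for agents below the threshold."""
    flagged = {}
    for name, stats in metrics.get("agents", {}).items():
        total = stats.get("totalRuns", 0)
        if total == 0:
            continue  # no data yet, nothing to judge
        rate = stats.get("successCount", 0) / total
        if rate < threshold:
            flagged[name] = rate
    return flagged
```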

**Action:**

```
For test-architect (60% success):

1. Analyze failure patterns
   Review history entries where result="failure"
   Common failure reasons:
     - "Tests too brittle" (8 occurrences)
     - "Missing test coverage" (5 occurrences)
     - "Test timeout" (2 occurrences)

2. Identify root cause
   Pattern: test-architect struggles with async/timing tests
   Evidence: All timeout failures involved async code

3. Take action
   Option A: Retrain agent
     - Update prompt with async testing best practices
     - Add examples of proper async test patterns
     - Reset metrics after retraining

   Option B: Route differently
     - Route async test tasks to backend-developer (90% success on async)
     - Keep test-architect for synchronous unit tests

   Option C: Replace agent
     - Create new specialized-async-test-architect
     - Deprecate test-architect for async work
```

### Optimization 2: Find Cost-Effective Model Alternatives

**Analysis:**

```
Compare model cost-effectiveness:

Model: claude-sonnet-4-5-20250929
  Total cost: $0.45
  Total runs: 120
  Success count: 108
  Cost per task: $0.0038
  Cost per success: $0.0042
  Success rate: 90%

Model: x-ai/grok-code-fast-1
  Total cost: $0.08
  Total runs: 35
  Success count: 30
  Cost per task: $0.0023
  Cost per success: $0.0027
  Success rate: 86%

Model: google/gemini-2.5-flash
  Total cost: $0.02
  Total runs: 20
  Success count: 16
  Cost per task: $0.0010
  Cost per success: $0.0013
  Success rate: 80%

Cost Efficiency Ranking:
  1. Gemini Flash: $0.0013 per success (80% success rate)
  2. Grok Fast: $0.0027 per success (86% success rate)
  3. Claude Sonnet: $0.0042 per success (90% success rate)

Quality-Cost Tradeoff:
  - Gemini: 69% cheaper than Claude per success, but 10 points lower success rate
  - Grok: 36% cheaper than Claude per success, but 4 points lower success rate
```
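
The ranking above can be computed directly from the `models` section. A sketch, with `cost_efficiency` as a hypothetical helper name:

```python
def cost_efficiency(models: dict) -> list:
    """Rank models by cost per successful task (cheapest first)."""
    rows = []
    for name, m in models.items():
        if m["successCount"] == 0:
            continue  # cost per success is undefined without a success
        rows.append({
            "model": name,
            "costPerTask": m["totalCost"] / m["totalRuns"],
            "costPerSuccess": m["totalCost"] / m["successCount"],
            "successRate": m["successCount"] / m["totalRuns"],
        })
    return sorted(rows, key=lambda r: r["costPerSuccess"])
```

Ranking on `costPerSuccess` rather than `costPerTask` penalizes models whose failures have to be paid for again on retry.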

**Action:**

```
Optimization strategy:

Tier 1 (Simple tasks):
  Use: Gemini Flash
  Reason: Lowest cost, acceptable success rate for simple work
  Example: "Fix typo in comment", "Format code"

Tier 2 (Medium tasks):
  Use: Grok Fast
  Reason: Good balance of cost and quality
  Example: "Implement CRUD endpoint", "Add validation"

Tier 3 (Complex tasks):
  Use: Claude Sonnet
  Reason: Highest success rate justifies cost
  Example: "Design architecture", "Complex refactoring"

Tier 4 (Critical tasks):
  Use: Claude Sonnet + Multi-model validation
  Reason: Quality > cost for critical work
  Example: "Security review", "Production bug fix"

Expected savings:
  Current: 90% Claude usage × $0.0042 = $0.00378 avg per task
  Optimized: 20% Claude + 50% Grok + 30% Gemini = $0.00258 avg per task
  Savings: 32% cost reduction with minimal quality impact
```

### Optimization 3: Optimize Routing Tier Thresholds

**Analysis:**

```
Review tier distribution:

routing.tierDistribution:
  tier1: 45 tasks (45.9%)
  tier2: 30 tasks (30.6%)
  tier3: 15 tasks (15.3%)
  tier4: 8 tasks (8.2%)

Analyze tier accuracy:

Tier 1 (Simple):
  Tasks: 45
  Success: 42
  Failures: 3
  Success rate: 93.3% ✅
  Verdict: Well-calibrated

Tier 2 (Medium):
  Tasks: 30
  Success: 25
  Failures: 5
  Success rate: 83.3% ⚠️
  Verdict: Slightly low (target 90%)

Tier 3 (Complex):
  Tasks: 15
  Success: 12
  Failures: 3
  Success rate: 80.0% ⚠️
  Verdict: Too low (target 90%)

Tier 4 (Critical):
  Tasks: 8
  Success: 8
  Failures: 0
  Success rate: 100% ✅
  Verdict: Well-calibrated
```
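
Per-tier success rates are not stored directly; they can be derived from `routing.decisions`. A sketch with hypothetical helper names follows. Because that array is capped at 100 entries, the result reflects recent decisions only.

```python
from collections import defaultdict


def tier_success_rates(decisions: list) -> dict:
    """Success rate per tier, ignoring decisions still marked pending."""
    counts = defaultdict(lambda: {"success": 0, "total": 0})
    for d in decisions:
        if d.get("result") == "pending":
            continue
        c = counts[f"tier{d['selectedTier']}"]
        c["total"] += 1
        if d["result"] == "success":
            c["success"] += 1
    return {tier: c["success"] / c["total"] for tier, c in counts.items()}


def below_target(rates: dict, target: float = 0.90) -> list:
    """Tiers whose success rate misses the target, in sorted order."""
    return sorted(t for t, r in rates.items() if r < target)
```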

**Action:**

```
Adjust tier thresholds:

Current thresholds (task-complexity-router):
  tier1: complexity score 0-3
  tier2: complexity score 4-6
  tier3: complexity score 7-9
  tier4: complexity score 10+

Problem: tier2 and tier3 have lower success rates
Root cause: Tasks slightly too complex for assigned tier

Optimized thresholds:
  tier1: complexity score 0-2 (narrower range)
  tier2: complexity score 3-5 (shift down)
  tier3: complexity score 6-8 (shift down)
  tier4: complexity score 9+ (broader range)

Rationale:
  - Shift more borderline tasks to higher tiers
  - Accept slightly higher cost for better success rates
  - tier2/tier3 success should improve to 90%+

Expected impact:
  - tier1 usage: 45 → 35 tasks (fewer simple tasks)
  - tier2 usage: 30 → 32 tasks (more medium tasks)
  - tier3 usage: 15 → 18 tasks (more complex tasks)
  - tier4 usage: 8 → 13 tasks (more critical tasks)
  - Overall success rate: 89% → 92%
  - Overall cost: +15% (acceptable tradeoff for quality)
```

### Optimization 4: Detect Model-Task Affinity Patterns

**Analysis:**

```
Analyze task type performance by model:

Task type: code-review

Claude Sonnet:
  Success: 25, Failure: 2
  Success rate: 92.6% ✅

Grok Fast:
  Success: 18, Failure: 2
  Success rate: 90.0% ✅

Gemini Flash:
  Success: 10, Failure: 4
  Success rate: 71.4% ⚠️

→ Pattern: Claude and Grok excel at code review, Gemini struggles

Task type: implementation

Claude Sonnet:
  Success: 40, Failure: 5
  Success rate: 88.9% ✅

Grok Fast:
  Success: 12, Failure: 3
  Success rate: 80.0% ⚠️

Gemini Flash:
  Success: 6, Failure: 1
  Success rate: 85.7% ✅

→ Pattern: Claude best for implementation, Grok/Gemini acceptable

Task type: testing

Claude Sonnet:
  Success: 20, Failure: 3
  Success rate: 87.0% ✅

Grok Fast:
  Success: 0, Failure: 0
  Success rate: N/A

Gemini Flash:
  Success: 0, Failure: 0
  Success rate: N/A

→ Pattern: Only Claude has testing data (others not used for this)
```
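
Selecting a model by task-type affinity could be sketched as below. `best_model_for` is a hypothetical helper, and the `min_runs` guard is an added assumption so that a model with only one or two runs does not win on a lucky streak.

```python
def best_model_for(models: dict, task_type: str, min_runs: int = 5):
    """Pick the model with the highest success rate for a task type.

    Returns (model_name, success_rate), or (None, None) without enough data.
    """
    best_name, best_rate = None, -1.0
    for name, m in models.items():
        perf = m.get("taskTypePerformance", {}).get(task_type)
        if not perf:
            continue
        runs = perf["success"] + perf["failure"]
        if runs < min_runs:
            continue  # too little data to rank this model
        rate = perf["success"] / runs
        if rate > best_rate:
            best_name, best_rate = name, rate
    return best_name, (best_rate if best_name else None)
```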

**Action:**

```
Task-specific model routing:

code-review tasks:
  tier1: Grok Fast (90% success, low cost)
  tier2: Grok Fast (90% success, low cost)
  tier3: Claude Sonnet (93% success, high quality)
  tier4: Multi-model (Claude + Grok consensus)

implementation tasks:
  tier1: Gemini Flash (86% success, lowest cost)
  tier2: Grok Fast (80% success, medium cost)
  tier3: Claude Sonnet (89% success, highest quality)
  tier4: Claude Sonnet (89% success, proven)

testing tasks:
  tier1-4: Claude Sonnet (only model with proven testing capability)

Expected impact:
  - 25% cost savings on code reviews (use Grok instead of Claude)
  - 10% cost savings on implementation (use Gemini for simple)
  - Maintain quality (route by proven success rates)
```

### Optimization 5: Alert on Performance Degradation

**Detection:**

```
Monitor for degradation:

Week 1 (baseline):
  ui-developer success rate: 90.5%
  Average task duration: 45s

Week 2:
  ui-developer success rate: 88.2% (↓2.3%)
  Average task duration: 48s (↑3s)

Week 3:
  ui-developer success rate: 85.1% (↓5.4% from baseline)
  Average task duration: 52s (↑7s from baseline)

Week 4:
  ui-developer success rate: 78.3% (↓12.2% from baseline) 🚨
  Average task duration: 58s (↑13s from baseline) 🚨

Threshold exceeded:
  ❌ Success rate dropped >10% (78.3% vs 90.5%)
  ❌ Duration increased >20% (58s vs 45s)

→ ALERT: ui-developer performance degraded significantly
```
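
The two thresholds above can be checked with a small comparison function. This is a sketch: `degradation_alerts` and the window dictionaries are hypothetical. The success-rate drop is measured in absolute points and the duration increase relative to baseline, which matches how the figures above are computed.

```python
def degradation_alerts(baseline: dict, current: dict,
                       max_rate_drop: float = 0.10,
                       max_duration_increase: float = 0.20) -> list:
    """Compare a current window against the baseline window.

    Both dicts hold "successRate" (0.0-1.0) and "avgDuration" (seconds).
    """
    alerts = []
    drop = baseline["successRate"] - current["successRate"]
    if drop > max_rate_drop:
        alerts.append(f"success rate dropped {drop * 100:.1f} points")
    increase = (current["avgDuration"] - baseline["avgDuration"]) / baseline["avgDuration"]
    if increase > max_duration_increase:
        alerts.append(f"duration increased {increase:.0%}")
    return alerts
```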

**Action:**

```
Degradation response:

1. Investigate root cause
   Review recent history:
     - Task complexity increased? (Check taskTypes distribution)
     - Model changed? (Check model field in history)
     - Failures clustered around specific task type?

   Finding: All recent failures on "complex-state-management" tasks
   Root cause: New task type introduced, agent not trained for it

2. Take corrective action
   Option A: Retrain agent
     - Update prompt with state management patterns
     - Add examples of successful state management
     - Reset metrics after retraining

   Option B: Route differently
     - Route state management tasks to specialized agent
     - Keep ui-developer for simpler UI tasks

   Option C: Escalate to human
     - Alert: "ui-developer performance degraded"
     - Request: "Manual review of recent failures needed"

3. Monitor recovery
   Week 5 (after retraining):
     Success rate: 85.0% (recovering)
   Week 6:
     Success rate: 89.2% (near baseline)
   Week 7:
     Success rate: 91.0% (recovered ✅)
```

---

## Integration with Orchestration Plugin

### Integration 1: multi-model-validation

**How multi-model-validation records model performance:**

```
Multi-Model Review Flow:

1. Execute parallel review
   Models: [claude-sonnet, grok-fast, gemini-flash]
   Task: Code review of auth.ts

2. Collect model responses
   Each model returns:
     - Review findings
     - Confidence score
     - Latency
     - Cost

3. Record individual model performance
   For each model:
     models[modelId].totalRuns += 1
     models[modelId].avgLatency = rolling_avg(latency)
     models[modelId].totalCost += cost

4. Determine success/failure
   If review completed, including when it found critical issues → success (doing its job)
   If review crashed/errored → failure

5. Update success counts
   models[modelId].successCount += 1  (or failureCount)
   models[modelId].taskTypePerformance["code-review"].success += 1

6. Consolidate findings
   Generate consensus report
   Track which models agreed (co-occurrence patterns)

7. User feedback (optional)
   User rates review quality: "Helpful" | "Not helpful"
   Update successCorrelation for multi-model-validation skill
```

### Integration 2: task-complexity-router

**How task-complexity-router reads performance data:**

```
Routing Decision Flow:

1. Analyze task complexity
   Input: "Implement user authentication with OAuth"
   Analysis: Complex (multiple components, external API, security)
   Base tier: 3

2. Read performance history
   Load: .claude/agent-performance.json
   Check: routing.tierDistribution, routing.decisions (per-tier outcomes)

3. Adjust tier based on history
   tier3 historical success rate: 80% (below 90% target)
   tier4 historical success rate: 100%

   Decision: Bump to tier4 for higher success probability

4. Select agent based on task type affinity
   Task type: "implement-api"
   Candidates: backend-developer, full-stack-developer

   Check affinity:
     backend-developer.taskTypes["implement-api"]: 12 (high affinity)
     full-stack-developer.taskTypes["implement-api"]: 3 (low affinity)

   Decision: Select backend-developer (proven track record)

5. Select model based on tier + task type
   tier4 + implement-api:
     models[claude].taskTypePerformance["implementation"]: 89% success
     models[grok].taskTypePerformance["implementation"]: 80% success

   Decision: Select Claude (higher success rate for tier4)

6. Record routing decision
   routing.decisions.push({
     timestamp: NOW,
     taskType: "implement-api",
     complexity: "complex",
     selectedTier: 4,
     agent: "backend-developer",
     model: "claude-sonnet-4-5-20250929",
     result: "pending"
   })

7. After execution, update result
   routing.decisions[last].result = "success"
   routing.decisions[last].cost = 0.005
```
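
Steps 3 and 4 could be sketched as two small functions. The names are hypothetical, the 90% target comes from the flow above, and affinity is approximated by how often an agent has handled the task type, as in the example.

```python
def adjust_tier(base_tier: int, tier_rates: dict,
                target: float = 0.90, max_tier: int = 4) -> int:
    """Bump to the next tier while the current tier's history is below target."""
    tier = base_tier
    while tier < max_tier:
        rate = tier_rates.get(f"tier{tier}")
        if rate is None or rate >= target:
            break  # no history, or history meets the target
        tier += 1
    return tier


def pick_agent(agents: dict, candidates: list, task_type: str) -> str:
    """Choose the candidate that has handled this task type most often."""
    return max(
        candidates,
        key=lambda a: agents.get(a, {}).get("taskTypes", {}).get(task_type, 0),
    )
```

`tier_rates` is expected to map keys such as `"tier3"` to success rates between 0.0 and 1.0.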

### Integration 3: hierarchical-coordinator

**How hierarchical-coordinator tracks phase success:**

```
Phase Execution Tracking:

1. Execute workflow phases
   Phase 1: Planning (architect agent)
   Phase 2: Implementation (developer agent)
   Phase 3: Testing (tester agent)
   Phase 4: Review (reviewer agent)

2. Track phase-level metrics
   Create phase-specific tracking:

   agents["architect"].phasePerformance = {
     "planning": { success: 15, failure: 2 },
     "architecture": { success: 8, failure: 1 }
   }

   agents["developer"].phasePerformance = {
     "implementation": { success: 25, failure: 5 },
     "refactoring": { success: 10, failure: 2 }
   }

3. Detect phase-specific issues
   Analysis: developer has 17% failure rate on implementation phase (5/30)
   Overall: developer has 83% success rate (35/42)

   Insight: Most failures (5 of 7) come from the implementation phase

4. Optimize phase assignment
   Current: developer handles all implementation
   Optimized: Split by complexity
     - Simple implementation → junior-developer (cheaper)
     - Complex implementation → senior-developer (higher success)

5. Track coordinator effectiveness
   skills["hierarchical-coordinator"].activations += 1
   skills["hierarchical-coordinator"].successCorrelation = 0.92

   Insight: Workflows using coordinator have 92% success (vs 80% without)
```

### Integration 4: quality-gates

**How quality-gates uses performance thresholds:**

```
Quality Gate Decision:

1. Agent completes task
   Agent: ui-developer
   Task: "Implement dashboard component"
   Confidence: 0.75

2. Check agent performance history
   Load: agents["ui-developer"]
   Historical avg confidence: 0.85
   Current confidence: 0.75 (below average 🚨)

3. Apply quality gate
   Threshold: If confidence <= avg - 0.10, trigger validation

   Decision: 0.75 <= 0.75 (borderline, gate triggers)
   Action: Trigger designer validation (extra quality check)

4. Designer validates
   Result: Found 3 minor issues
   Verdict: Quality gate prevented low-quality work from proceeding

5. Update metrics
   Without gate: ui-developer would have 1 more failure
   With gate: Issues caught early, fixed before user sees

   Recompute skills["quality-gates"].successCorrelation with this task counted as a success
   (Success correlation is a 0.0-1.0 rate; it rises when the gate prevents failures)

6. Continuous improvement
   Pattern: Low-confidence tasks benefit from extra validation
   Threshold: Automatically adjust based on correlation data
   Future: If confidence < 0.80, always trigger validation
```
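
The gate check in step 3, sketched in Python. `needs_validation` is a hypothetical name. The sketch assumes the boundary case should trigger the gate, as the worked example implies, and rounds the threshold to avoid floating-point noise at that boundary.

```python
def needs_validation(confidence: float, avg_confidence: float,
                     margin: float = 0.10) -> bool:
    """True when confidence is at or below the agent's average minus the margin."""
    # Round so the boundary is compared cleanly despite float subtraction noise.
    threshold = round(avg_confidence - margin, 6)
    return confidence <= threshold
```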

---

## Best Practices

### Do

- ✅ **Track all agent executions** (success and failure provide learning signal)
- ✅ **Record model latency and cost** (optimize for cost-effectiveness)
- ✅ **Maintain execution history** (detect patterns and trends)
- ✅ **Set success rate thresholds** (<70% = investigate, <50% = replace)
- ✅ **Alert on performance degradation** (>10% drop from baseline)
- ✅ **Use task type affinity** (route tasks to agents with proven success)
- ✅ **Compare model cost-effectiveness** (cost per success, not just cost per task)
- ✅ **Track skill co-activation** (identify successful skill combinations)
- ✅ **Rotate history data** (keep last 100 entries, prevent unbounded growth)
- ✅ **Create session snapshots** (point-in-time analysis)
- ✅ **Integrate with routing** (feed performance data back to router)
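
As an illustration of the threshold and degradation rules in the list above, the sketch below classifies one agent record. It assumes the `totalRuns` / `successCount` fields used elsewhere in this document and interprets a ">10% drop" as 10 percentage points below a stored baseline; `assessAgent` and the `status` labels are hypothetical.

```javascript
// Illustrative only: thresholds come from the list above, names are assumptions.
function assessAgent(agent, baselineSuccessRate) {
  if (agent.totalRuns === 0) {
    return { successRate: null, status: "no-data", degraded: false };
  }
  const successRate = agent.successCount / agent.totalRuns;

  let status = "healthy";
  if (successRate < 0.5) status = "replace";
  else if (successRate < 0.7) status = "investigate";

  // Assumption: degradation is measured in percentage points against the baseline.
  const degraded = baselineSuccessRate - successRate > 0.10;
  return { successRate, status, degraded };
}

console.log(assessAgent({ totalRuns: 20, successCount: 13 }, 0.8));
```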

### Don't

- ❌ **Track only successes** (failures provide valuable learning signal)
- ❌ **Ignore degradation** (small drops compound into big problems)
- ❌ **Use stale data** (metrics older than 6 months may not reflect the current state)
- ❌ **Over-optimize on cost alone** (balance cost and quality)
- ❌ **Forget to update metrics** (incomplete data leads to poor decisions)
- ❌ **Store unbounded history** (trim arrays to prevent file bloat)
- ❌ **Mix session metrics** (isolate session data for cleaner analysis)
- ❌ **Ignore task type affinity** (agents specialize, use it)
- ❌ **Skip validation after major changes** (reset metrics when workflows change)

### Privacy Considerations

**What to Track:**
- Aggregate metrics (counts, averages, distributions)
- Task types (generic categories like "implement-component")
- Success/failure outcomes
- Model performance data
- Timing and cost data

**What NOT to Track:**
- User-specific data (usernames, emails)
- Sensitive code snippets
- API keys or credentials
- Personal information
- Business logic details

**Data Retention:**
- Keep aggregate metrics indefinitely (no PII)
- Rotate detailed history after 100 entries
- Delete session snapshots after 100 sessions
- Archive old metrics before major resets

### When to Reset Metrics

**Situations Requiring Reset:**

1. **Agent capabilities changed**
   - Prompt updated significantly
   - Agent model changed
   - Agent skills added/removed
   → Reset agent-specific metrics

2. **Workflow architecture changed**
   - New routing logic
   - New tier definitions
   - New skill combinations
   → Reset routing and skill metrics

3. **Model pricing changed**
   - Cost per token updated
   - New pricing tier
   → Reset cost calculations (keep counts)

4. **After prolonged period (>6 months)**
   - Metrics may be outdated
   - Workflow patterns changed
   → Archive and reset all metrics

**How to Reset:**

```bash
# Backup current metrics
cp .claude/agent-performance.json .claude/agent-performance-backup-$(date +%Y%m%d).json

# Reset to empty state (EOF is unquoted so the timestamp below is expanded)
cat > .claude/agent-performance.json << EOF
{
  "schemaVersion": "1.0.0",
  "lastUpdated": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "agents": {},
  "skills": {},
  "models": {},
  "routing": {
    "tierDistribution": {},
    "decisions": []
  }
}
EOF

# Archive old session snapshots
mkdir -p ai-docs/performance-history/archive-$(date +%Y%m%d)
mv ai-docs/performance-history/*.json ai-docs/performance-history/archive-$(date +%Y%m%d)/
```

### Metric Hygiene

**Regular Maintenance:**

```
Weekly:
  - Review top agents (ensure success rates >70%)
  - Check model cost trends (identify cost spikes)
  - Trim co-activation maps (keep top 50 per skill)

Monthly:
  - Analyze task type affinity changes
  - Compare model cost-effectiveness
  - Review tier distribution accuracy
  - Archive old session snapshots (keep last 100)

Quarterly:
  - Deep analysis of performance trends
  - Optimize routing thresholds
  - Identify underperforming patterns
  - Consider agent retraining or replacement

Annually:
  - Full metrics review and reset (if needed)
  - Archive all historical data
  - Update baseline success rates
  - Document lessons learned
```
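
The weekly "trim co-activation maps" task could look like the following sketch. It assumes each skill stores `coActivations` as a plain object mapping skill names to counts, as in the examples in this document; `trimCoActivations` is a hypothetical helper name.

```javascript
// Keep only the `limit` most frequent co-activations for one skill.
function trimCoActivations(coActivations, limit = 50) {
  const top = Object.entries(coActivations)
    .sort(([, countA], [, countB]) => countB - countA) // highest count first
    .slice(0, limit);
  return Object.fromEntries(top);
}

// Small hypothetical map, trimmed to its top 2 entries
console.log(trimCoActivations({ "quality-gates": 13, "error-recovery": 2, "task-routing": 7 }, 2));
```

Returning a new object rather than deleting keys in place keeps the original map available for archiving before it is overwritten.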

---

## Examples

### Example 1: Tracking a Multi-Model Review Session

**Scenario:** User requests `/review` with 3 models (Claude, Grok, Gemini)

**Execution:**

```
Step 1: Initialize session
  Session ID: 2026-01-28-session-15
  Start time: 15:00:00Z

Step 2: Execute multi-model review
  Models: [claude-sonnet, grok-fast, gemini-flash]
  Task: Code review of auth/login.ts (450 lines)

Step 3: Track individual model executions

  Model: claude-sonnet-4-5-20250929
    Start: 15:00:05Z
    End: 15:00:08Z
    Latency: 3000ms
    Cost: $0.003
    Result: Found 5 issues (2 CRITICAL, 3 HIGH)
    Outcome: Success

  Update metrics:
    models["claude-sonnet-4-5-20250929"].totalRuns = 121
    models["claude-sonnet-4-5-20250929"].successCount = 109
    models["claude-sonnet-4-5-20250929"].avgLatency = 2504ms  (running mean: (120 × 2500 + 3000) / 121)
    models["claude-sonnet-4-5-20250929"].totalCost = $0.453
    models["claude-sonnet-4-5-20250929"].taskTypePerformance["code-review"].success = 26

  Model: x-ai/grok-code-fast-1
    Start: 15:00:05Z
    End: 15:00:07Z
    Latency: 2000ms
    Cost: $0.002
    Result: Found 4 issues (2 CRITICAL, 2 HIGH)
    Outcome: Success

  Update metrics:
    models["x-ai/grok-code-fast-1"].totalRuns = 36
    models["x-ai/grok-code-fast-1"].successCount = 31
    models["x-ai/grok-code-fast-1"].avgLatency = 1806ms  (running mean: (35 × 1800 + 2000) / 36)
    models["x-ai/grok-code-fast-1"].totalCost = $0.082
    models["x-ai/grok-code-fast-1"].taskTypePerformance["code-review"].success = 19

  Model: google/gemini-2.5-flash
    Start: 15:00:05Z
    End: 15:00:06.5Z
    Latency: 1500ms
    Cost: $0.001
    Result: Found 3 issues (2 CRITICAL, 1 MEDIUM)
    Outcome: Success

  Update metrics:
    models["google/gemini-2.5-flash"].totalRuns = 21
    models["google/gemini-2.5-flash"].successCount = 17
    models["google/gemini-2.5-flash"].avgLatency = 1500ms  (running mean: (20 × 1500 + 1500) / 21)
    models["google/gemini-2.5-flash"].totalCost = $0.021
    models["google/gemini-2.5-flash"].taskTypePerformance["code-review"].success = 11

Step 4: Track skill activation
  skills["multi-model-validation"].activations = 16
  skills["multi-model-validation"].lastActivated = 15:00:08Z
  skills["multi-model-validation"].coActivations["quality-gates"] = 13

Step 5: Consolidate findings
  Consensus issues (all 3 models agreed):
    - CRITICAL: SQL injection vulnerability (UNANIMOUS)
    - CRITICAL: Missing authentication check (UNANIMOUS)

  Majority issues (2/3 models agreed):
    - HIGH: Insufficient input validation (Claude, Grok)
    - HIGH: Missing error handling (Claude, Grok)

  Divergent issues (1/3 models):
    - MEDIUM: Code duplication (Gemini only)

Step 6: Record session summary
  Session complete:
    Duration: 8 seconds
    Models used: 3
    Total cost: $0.006
    Issues found: 5 (2 unanimous, 2 majority, 1 divergent)
    Result: Success

  Create snapshot:
    ai-docs/performance-history/2026-01-28-session-15.json

Step 7: Update aggregate metrics
  Overall session success rate: 3/3 models successful = 100%
  Cost efficiency: $0.002 per model = good value
```
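
The per-model "Update metrics" steps can be sketched as one function. This is a minimal illustration assuming `avgLatency` is a plain running mean over all runs and that model records carry the `totalRuns`, `successCount`, `avgLatency`, and `totalCost` fields shown above; `recordModelRun` and the shape of `run` are hypothetical.

```javascript
// Returns an updated copy of a model record after one execution.
function recordModelRun(model, run) {
  const totalRuns = model.totalRuns + 1;
  return {
    ...model,
    totalRuns,
    successCount: model.successCount + (run.success ? 1 : 0),
    // Incremental mean: newAvg = oldAvg + (x - oldAvg) / n
    avgLatency: model.avgLatency + (run.latencyMs - model.avgLatency) / totalRuns,
    totalCost: model.totalCost + run.cost,
  };
}

const before = { totalRuns: 120, successCount: 108, avgLatency: 2500, totalCost: 0.45 };
const after = recordModelRun(before, { latencyMs: 3000, cost: 0.003, success: true });
console.log(after.totalRuns, after.successCount, Math.round(after.avgLatency)); // 121 109 2504
```

The incremental form avoids storing every latency sample, which fits the bounded-history approach described in Best Practices.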

**Insights from Tracking:**

```
Performance comparison:
  Fastest: Gemini (1500ms) - 50% lower latency than Claude (3000ms)
  Most thorough: Claude (5 issues) - Found 1 extra issue
  Best value: Gemini ($0.001, 3 issues) - Lowest cost, good coverage

Cost analysis:
  Total: $0.006 for 3-model review
  vs Single Claude: $0.003 (2x the cost, in exchange for two additional independent reviews)
  ROI: Found 2 CRITICAL issues all models agreed on = high confidence

Consensus validation:
  UNANIMOUS issues (100% confidence) → Fix immediately
  MAJORITY issues (67% confidence) → Fix before merge
  DIVERGENT issues (33% confidence) → Low priority (possible false positive)

Recommendation:
  Multi-model validation worth the cost for critical code (auth, payments, security)
  Single-model sufficient for non-critical code (UI components, docs)
```
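
The unanimous / majority / divergent split can be computed mechanically once findings from different models have been matched to shared issue identifiers. The sketch below assumes that matching has already happened (in practice it is the hard part) and uses hypothetical issue IDs; `classifyConsensus` is not part of any existing skill.

```javascript
// findingsByModel: { modelName: [issueId, ...] }
function classifyConsensus(findingsByModel) {
  const models = Object.keys(findingsByModel);
  const votes = new Map();
  for (const model of models) {
    // A Set guards against one model reporting the same issue twice
    for (const issue of new Set(findingsByModel[model])) {
      votes.set(issue, (votes.get(issue) ?? 0) + 1);
    }
  }

  const result = { unanimous: [], majority: [], divergent: [] };
  for (const [issue, count] of votes) {
    if (count === models.length) result.unanimous.push(issue);
    else if (count > models.length / 2) result.majority.push(issue);
    else result.divergent.push(issue);
  }
  return result;
}

console.log(classifyConsensus({
  claude: ["sql-injection", "missing-auth-check", "input-validation"],
  grok: ["sql-injection", "missing-auth-check", "input-validation"],
  gemini: ["sql-injection", "missing-auth-check", "code-duplication"],
}));
```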

---

### Example 2: Identifying Model Performance Differences

**Scenario:** After 100 tasks, compare model performance for optimization

**Execution:**

```
Step 1: Load performance data
  Read: .claude/agent-performance.json

Step 2: Extract model metrics

  Claude Sonnet:
    Total runs: 120
    Success: 108, Failure: 12
    Success rate: 90.0%
    Avg latency: 2500ms
    Total cost: $0.45
    Cost per task: $0.00375
    Cost per success: $0.00417

  Grok Fast:
    Total runs: 35
    Success: 30, Failure: 5
    Success rate: 85.7%
    Avg latency: 1800ms
    Total cost: $0.08
    Cost per task: $0.00229
    Cost per success: $0.00267

  Gemini Flash:
    Total runs: 20
    Success: 16, Failure: 4
    Success rate: 80.0%
    Avg latency: 1500ms
    Total cost: $0.02
    Cost per task: $0.00100
    Cost per success: $0.00125

Step 3: Analyze task type performance

  Code Review:
    Claude: 25 success, 2 failure = 92.6%
    Grok: 18 success, 2 failure = 90.0%
    Gemini: 10 success, 4 failure = 71.4%

    Winner: Claude (highest quality)
    Best value: Grok (90% at lower cost)

  Implementation:
    Claude: 40 success, 5 failure = 88.9%
    Grok: 12 success, 3 failure = 80.0%
    Gemini: 6 success, 1 failure = 85.7%

    Winner: Claude (highest quality)
    Surprising: Gemini performs well here (86% success)

  Testing:
    Claude: 20 success, 3 failure = 87.0%
    Grok: No data
    Gemini: No data

    Winner: Claude (only option)
    Action: Try Grok/Gemini for testing tasks to gather data

Step 4: Calculate cost-effectiveness by task type

  Code Review Cost-Effectiveness:
    Claude: $0.00417 per success, 92.6% quality
    Grok: $0.00267 per success, 90.0% quality (36% cheaper, -2.6% quality)
    Gemini: $0.00125 per success, 71.4% quality (70% cheaper, -21.2% quality)

    Recommendation: Use Grok for cost-effective reviews (minimal quality loss)

  Implementation Cost-Effectiveness:
    Claude: $0.00417 per success, 88.9% quality
    Grok: $0.00267 per success, 80.0% quality (36% cheaper, -8.9% quality)
    Gemini: $0.00125 per success, 85.7% quality (70% cheaper, -3.2% quality)

    Recommendation: Use Gemini for simple implementation (best value)

Step 5: Generate optimization plan

  Current usage (120 total tasks):
    Claude: 100 tasks (83%)
    Grok: 15 tasks (13%)
    Gemini: 5 tasks (4%)

  Optimized usage (maintain quality >85%):
    tier1 (Simple): Gemini (30% of tasks)
    tier2 (Medium): Grok (40% of tasks)
    tier3 (Complex): Claude (25% of tasks)
    tier4 (Critical): Claude + Multi-model (5% of tasks)

  Expected impact:
    Current avg cost: $0.00375 per task
    Optimized avg cost: $0.00240 per task
    Savings: 36% cost reduction

    Current avg success: 88.5%
    Optimized avg success: 86.2% (projected)
    Quality impact: -2.3% (acceptable tradeoff)

Step 6: Implement gradual rollout

  Week 1: Route 20% of tier1 tasks to Gemini
    Monitor: Success rate, cost savings
    Target: >80% success rate

  Week 2: Route 40% of tier2 tasks to Grok
    Monitor: Success rate, cost savings
    Target: >85% success rate

  Week 3: Evaluate results
    If successful: Increase percentages
    If unsuccessful: Rollback and investigate

Step 7: Track optimization results

  After 2 weeks:
    Gemini tier1 success: 82% ✅ (above 80% target)
    Grok tier2 success: 87% ✅ (above 85% target)
    Cost savings: 28% ✅ (approaching 36% target)

  Decision: Continue rollout
  Next: Route 50% tier1 to Gemini, 60% tier2 to Grok
```

**Insights from Analysis:**

```
Key findings:
  1. Grok is best value for code reviews (90% quality at 36% lower cost)
  2. Gemini surprisingly good for implementation (86% vs 89% Claude)
  3. Claude still best for critical work (92% code review success)
  4. Latency varies significantly (Gemini's 1500ms average is 40% lower than Claude's 2500ms)

Optimization strategy:
  - Use Gemini for simple, latency-sensitive tasks
  - Use Grok for medium-complexity, cost-sensitive tasks
  - Use Claude for critical, quality-sensitive tasks
  - Use multi-model for maximum confidence (despite cost)

Expected ROI:
  - 36% cost reduction (from $0.00375 to $0.00240 per task)
  - 2.3% quality tradeoff (from 88.5% to 86.2% success)
  - Worth it: Save $135 per 100,000 tasks with minimal quality impact
```
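
The per-model figures in Step 2 follow from three divisions. The sketch below shows that arithmetic, assuming model records with `totalRuns`, `successCount`, and `totalCost` fields; `costEffectiveness` and the short model keys are hypothetical names used for illustration.

```javascript
function costEffectiveness(model) {
  return {
    successRate: model.successCount / model.totalRuns,
    costPerTask: model.totalCost / model.totalRuns,
    // A model with no successes has no meaningful cost per success
    costPerSuccess: model.successCount > 0 ? model.totalCost / model.successCount : Infinity,
  };
}

const modelStats = {
  "claude-sonnet": { totalRuns: 120, successCount: 108, totalCost: 0.45 },
  "grok-fast": { totalRuns: 35, successCount: 30, totalCost: 0.08 },
  "gemini-flash": { totalRuns: 20, successCount: 16, totalCost: 0.02 },
};

// Rank models from cheapest to most expensive per successful task
const ranked = Object.entries(modelStats)
  .map(([name, stats]) => ({ name, ...costEffectiveness(stats) }))
  .sort((a, b) => a.costPerSuccess - b.costPerSuccess);

console.log(ranked.map((r) => `${r.name}: $${r.costPerSuccess.toFixed(5)} per success`));
```

Ranking on cost per success alone ignores quality; in the analysis above it is read alongside the per-task-type success rates.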

---

### Example 3: Optimizing Routing Based on Accumulated Data

**Scenario:** After 98 routing decisions, optimize tier thresholds

**Execution:**

```
Step 1: Load routing data
  Read: .claude/agent-performance.json
  Focus: routing.tierDistribution, routing.decisions

Step 2: Analyze tier distribution

  Current distribution:
    tier1: 45 tasks (45.9%)
    tier2: 30 tasks (30.6%)
    tier3: 15 tasks (15.3%)
    tier4: 8 tasks (8.2%)

  Skew analysis:
    Heavy on tier1 (46%) - Router prefers simple classification
    Light on tier4 (8%) - Router rarely escalates

Step 3: Calculate tier success rates

  tier1 (Simple tasks):
    Total: 45
    Success: 42, Failure: 3
    Success rate: 93.3% ✅
    Avg cost: $0.001
    Avg duration: 25s

  tier2 (Medium tasks):
    Total: 30
    Success: 25, Failure: 5
    Success rate: 83.3% ⚠️ (target: 90%)
    Avg cost: $0.002
    Avg duration: 45s

  tier3 (Complex tasks):
    Total: 15
    Success: 12, Failure: 3
    Success rate: 80.0% ⚠️ (target: 90%)
    Avg cost: $0.004
    Avg duration: 90s

  tier4 (Critical tasks):
    Total: 8
    Success: 8, Failure: 0
    Success rate: 100% ✅
    Avg cost: $0.008
    Avg duration: 120s

Step 4: Analyze tier2/tier3 failures

  tier2 failures (5 tasks):
    1. "Implement complex state management" (complexity: 6)
       - Should have been tier3 (underestimated)
    2. "Add authentication to API" (complexity: 6)
       - Should have been tier3 (security = critical)
    3. "Refactor component with hooks" (complexity: 5)
       - Correctly routed to tier2 (agent issue, not a routing error)
    4. "Implement drag-and-drop" (complexity: 6)
       - Should have been tier3 (complex interaction)
    5. "Add real-time updates" (complexity: 6)
       - Should have been tier3 (WebSocket complexity)

  Pattern: 4/5 failures were borderline tier2/tier3 (complexity 6)
  Root cause: tier2 upper threshold too high (should be 5, not 6)

  tier3 failures (3 tasks):
    1. "Design microservices architecture" (complexity: 9)
       - Should have been tier4 (architecture = critical)
    2. "Implement payment processing" (complexity: 9)
       - Should have been tier4 (money = critical)
    3. "Refactor authentication system" (complexity: 8)
       - Correctly routed, agent struggled with complexity

  Pattern: 2/3 failures should have been tier4 (complexity 9)
  Root cause: tier3 upper threshold too high (should be 8, not 9)

Step 5: Propose threshold adjustments

  Current thresholds:
    tier1: complexity 0-3
    tier2: complexity 4-6
    tier3: complexity 7-9
    tier4: complexity 10+

  Problem: Borderline tasks (6, 9) cause failures

  Optimized thresholds:
    tier1: complexity 0-2 (narrower, more confident)
    tier2: complexity 3-5 (shift down, avoid borderline 6)
    tier3: complexity 6-8 (shift down, avoid borderline 9)
    tier4: complexity 9+ (broader, include borderline cases)

  Rationale:
    - Move borderline complexity 6 from tier2 → tier3
    - Move borderline complexity 9 from tier3 → tier4
    - Accept roughly 17% higher cost for a projected gain of about 7 points in success rate

Step 6: Simulate new distribution

  Reclassify historical tasks with new thresholds:

  tier1 (0-2): 35 tasks (35.7%)
    Success rate: 34/35 = 97.1% ↑ (was 93.3%)

  tier2 (3-5): 32 tasks (32.7%)
    Success rate: 30/32 = 93.8% ↑ (was 83.3%)

  tier3 (6-8): 18 tasks (18.4%)
    Success rate: 17/18 = 94.4% ↑ (was 80.0%)

  tier4 (9+): 13 tasks (13.3%)
    Success rate: 13/13 = 100% ✓ (was 100%)

  Overall success rate: 94/98 = 95.9% ↑ (was 87/98 = 88.8%)

Step 7: Calculate cost impact

  Current avg cost: $0.00240 per task
  Optimized avg cost: $0.00281 per task (+17%)

  Cost breakdown:
    tier1 (35.7%): $0.001 × 0.357 = $0.00036
    tier2 (32.7%): $0.002 × 0.327 = $0.00065
    tier3 (18.4%): $0.004 × 0.184 = $0.00074
    tier4 (13.3%): $0.008 × 0.133 = $0.00106
    Total: $0.00281

  ROI calculation:
    Cost increase: +$0.00041 per task (+17%)
    Success increase: +7.1 points (from 88.8% to 95.9%)
    Failure reduction: 11 → 4 failures (64% reduction)

  Value: Preventing 7 failures per 98 tasks is worth the 17% cost increase

Step 8: Implement new thresholds

  Update task-complexity-router skill:
    OLD:
      if (complexity <= 3) return "tier1";
      if (complexity <= 6) return "tier2";
      if (complexity <= 9) return "tier3";
      return "tier4";

    NEW:
      if (complexity <= 2) return "tier1";
      if (complexity <= 5) return "tier2";
      if (complexity <= 8) return "tier3";
      return "tier4";

  Document change:
    Reason: Performance data showed borderline tasks caused failures
    Expected: ~7-point success rate improvement, ~17% cost increase
    Monitoring: Track next 100 tasks to validate improvement

Step 9: Monitor post-optimization

  After 50 tasks with new thresholds:
    tier1: 18 tasks, 18 success = 100% ✅
    tier2: 16 tasks, 15 success = 93.8% ✅
    tier3: 10 tasks, 9 success = 90.0% ✅
    tier4: 6 tasks, 6 success = 100% ✅

  Overall: 48/50 = 96.0% success ✅ (matches projection)
  Avg cost: $0.00280 ✅ (matches projection)

  Verdict: Optimization successful, keep new thresholds
```
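
The threshold change in Step 8 and the redistribution in Step 6 can be sketched together. This is an illustration under stated assumptions: each routing decision is assumed to carry a numeric `complexity` field, and `tierFor` / `replayDistribution` are hypothetical helpers rather than the task-complexity-router implementation. Replaying history shows where tasks would land, not whether they would have succeeded there.

```javascript
// Upper complexity bounds for tier1..tier3; anything above the last bound is tier4
const OLD_BOUNDS = [3, 6, 9];
const NEW_BOUNDS = [2, 5, 8];

function tierFor(complexity, upperBounds) {
  const index = upperBounds.findIndex((bound) => complexity <= bound);
  const tierNumber = index === -1 ? upperBounds.length + 1 : index + 1;
  return `tier${tierNumber}`;
}

// Count how many historical decisions fall into each tier under the given bounds
function replayDistribution(decisions, upperBounds) {
  const counts = {};
  for (const { complexity } of decisions) {
    const tier = tierFor(complexity, upperBounds);
    counts[tier] = (counts[tier] ?? 0) + 1;
  }
  return counts;
}

// Hypothetical sample of past decisions
const sampleDecisions = [{ complexity: 2 }, { complexity: 3 }, { complexity: 6 }, { complexity: 9 }];
console.log(replayDistribution(sampleDecisions, OLD_BOUNDS)); // { tier1: 2, tier2: 1, tier3: 1 }
console.log(replayDistribution(sampleDecisions, NEW_BOUNDS)); // { tier1: 1, tier2: 1, tier3: 1, tier4: 1 }
```

Keeping the bounds in an array makes later threshold changes a data edit rather than a code edit, which suits the "dynamic thresholds" idea listed under next steps.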

**Insights from Optimization:**

```
Key findings:
  1. Borderline complexity scores (6, 9) caused most failures
  2. Router was too aggressive in keeping tasks at lower tiers
  3. Small threshold adjustments (6→5, 9→8) had big impact

Optimization results:
  - Success rate: 88.8% → 96.0% (+7.2 points)
  - Failure rate: 11.2% → 4.0% (-64%)
  - Cost per task: $0.00240 → $0.00280 (+17%)
  - ROI: Strong (quality improvement worth cost increase)

Lessons learned:
  - Track tier success rates, not just overall success
  - Borderline cases benefit from tier escalation
  - Performance data reveals routing blind spots
  - Continuous monitoring enables iterative improvement

Next steps:
  - Continue monitoring for 100 more tasks
  - Consider dynamic thresholds (adjust based on live data)
  - Explore agent-specific routing (some agents handle complexity better)
```

---

## Troubleshooting

**Problem: agent-performance.json file growing too large**

**Cause:** History arrays not being trimmed

**Solution:** Implement automatic trimming after each update

```javascript
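const MAX_HISTORY = 100;

// Assumes `metrics` is the parsed contents of .claude/agent-performance.json
function updateAgentMetrics(agentId, execution) {
  // Create the record on first use so a new agent does not throw
  const agent = (metrics.agents[agentId] ??= {
    totalRuns: 0,
    successCount: 0,
    history: [],
  });

  // Update aggregates
  agent.totalRuns += 1;
  agent.successCount += execution.result === "success" ? 1 : 0;

  // Add to history
  agent.history.push(execution);

  // Trim to the most recent MAX_HISTORY entries (oldest dropped first)
  if (agent.history.length > MAX_HISTORY) {
    agent.history = agent.history.slice(-MAX_HISTORY);
  }
}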
```

---

**Problem: Metrics don't reflect recent changes**

**Cause:** Stale data from old workflows

**Solution:** Reset metrics after major changes

```bash
# Backup current metrics
cp .claude/agent-performance.json .claude/agent-performance-backup-$(date +%Y%m%d).json

# Reset relevant sections (keep models, reset agents)
# Edit .claude/agent-performance.json manually or use script
```
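
For the "use script" option, a selective reset could be sketched as below. It assumes the top-level schema shown in "How to Reset" (`agents`, `skills`, `models`, `routing`, `lastUpdated`); `resetSections` is a hypothetical helper, and the file handling shown in the trailing comment uses Node's standard `fs` module.

```javascript
// Empty-state factories for each resettable top-level section
const EMPTY_SECTIONS = {
  agents: () => ({}),
  skills: () => ({}),
  models: () => ({}),
  routing: () => ({ tierDistribution: {}, decisions: [] }),
};

// Returns a copy of `metrics` with the named sections emptied; others are kept
function resetSections(metrics, sections, now = new Date()) {
  const next = JSON.parse(JSON.stringify(metrics)); // deep copy, input untouched
  for (const section of sections) {
    if (!(section in EMPTY_SECTIONS)) {
      throw new Error(`Unknown section: ${section}`);
    }
    next[section] = EMPTY_SECTIONS[section]();
  }
  next.lastUpdated = now.toISOString();
  return next;
}

// Usage against the real file (run only after taking the backup above):
//   const fs = require("node:fs");
//   const path = ".claude/agent-performance.json";
//   const current = JSON.parse(fs.readFileSync(path, "utf8"));
//   fs.writeFileSync(path, JSON.stringify(resetSections(current, ["agents"]), null, 2));
```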

---

**Problem: Success rate calculations seem wrong**

**Cause:** Inconsistent result values ("success", "SUCCESS", "completed", etc.)

**Solution:** Normalize result values

```javascript
function normalizeResult(result) {
  // Compare case-insensitively so "Success", "SUCCESS" and "success" all match
  const successValues = ["success", "completed", "pass"];
  const failureValues = ["failure", "error", "fail"];
  const value = String(result).trim().toLowerCase();

  if (successValues.includes(value)) return "success";
  if (failureValues.includes(value)) return "failure";
  return "unknown";
}

// Use normalized values in metrics
const normalizedResult = normalizeResult(execution.result);
agent.successCount += normalizedResult === "success" ? 1 : 0;
agent.failureCount += normalizedResult === "failure" ? 1 : 0;
```

---

## Summary

Performance tracking enables **data-driven orchestration optimization** through:

- **Agent success tracking** (identify high-performers and underperformers)
- **Model performance comparison** (find cost-effective alternatives)
- **Skill effectiveness analysis** (discover successful patterns)
- **Routing optimization** (adjust tier thresholds based on actual results)
- **Historical trend detection** (alert on degradation, celebrate improvements)

Key metrics to monitor:
- Agent success rate (target >70%; investigate below 70%, consider replacement below 50%)
- Model cost-effectiveness (cost per success, not just cost per task)
- Routing tier accuracy (target >90% success per tier)
- Skill activation correlation (identify high-value skills)

Tracked consistently, these metrics let you tune orchestration workflows iteratively and make cost-versus-quality tradeoffs based on recorded results rather than guesswork.

---

**Inspired By:**
- `/review` command (multi-model performance tracking)
- `/dev` command (agent success rate monitoring)
- task-complexity-router skill (routing feedback loops)
- Production workflows (cost optimization, quality tracking)