error-recovery

Strategies for handling subagent failures with retry logic and escalation patterns.

25 stars

byComeOnOliver

View on GitHub Installation ↓

Best use case

error-recovery is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Strategies for handling subagent failures with retry logic and escalation patterns.

Teams using error-recovery should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/error-recovery/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/aiskillstore/marketplace/clouder0/error-recovery/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/error-recovery/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How error-recovery Compares

Feature / Agent	error-recovery	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Strategies for handling subagent failures with retry logic and escalation patterns.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Error Recovery Skill

Pattern for handling subagent failures gracefully with appropriate retry strategies.

## When to Load This Skill

- You are spawning subagents that may fail
- A subagent returned an error or unexpected output
- You need to decide whether to retry, escalate, or abort

## Failure Categories

| Category | Symptoms | Strategy |
|----------|----------|----------|
| **Transient** | Timeout, malformed output, parsing error | Simple Retry |
| **Context Gap** | "I don't have enough information", unclear task | Context Enhancement |
| **Complexity** | Partial completion, scope creep, tangents | Scope Reduction |
| **Boundary/Contract** | `status: blocked`, boundary_violation, contract_change | Escalation |
| **Fatal** | Repeated failures (3+), fundamental misunderstanding | Abort with Report |

## Retry Strategies

### Strategy 1: Simple Retry

For transient failures. Same prompt, up to 3 attempts.

```
# Track attempts
attempts: 0
max_attempts: 3

# On failure
IF attempts < max_attempts:
  attempts += 1
  Task(same_subagent_type, same_model, same_prompt)
ELSE:
  Mark as FAILED, move on
```

**Use when:**
- Output was malformed or truncated
- Timeout occurred
- Agent returned empty/null response

### Strategy 2: Context Enhancement

Add more information to help the agent succeed.

```
Task(
  subagent_type: "implementer",
  model: "sonnet",
  prompt: |
    ## PREVIOUS ATTEMPT FAILED

    Error: {error_message}
    Output received: {partial_output}

    ## ADDITIONAL CONTEXT

    Here is more information that may help:
    - Related file: @{additional_file_path}
    - Pattern to follow: {example_pattern}
    - Specific guidance: {clarification}

    ## ORIGINAL TASK

    {original_task_description}

    Output to: {output_path}
)
```

**Use when:**
- Agent said "I don't understand" or "unclear requirements"
- Agent made incorrect assumptions
- Agent asked questions in output

**Context to add:**
- Related code files the agent might need
- Similar implementations as examples
- Explicit clarification of ambiguous points
- Error message from previous attempt

### Strategy 3: Scope Reduction

Break the failing task into smaller, more manageable pieces.

```
# Original task failed
Task: "Implement full authentication system"

# Split into subtasks
Task(implementer, "Implement password hashing utility")
Task(implementer, "Implement session token generation")
Task(implementer, "Implement login endpoint")
Task(implementer, "Implement logout endpoint")
```

**Use when:**
- Agent completed partial work then failed
- Task description was too broad
- Agent went off on tangents
- Output shows confusion about scope

**Splitting guidelines:**
- Each subtask should be independently completable
- Each subtask should have clear boundaries
- Subtasks can run in parallel if no dependencies
- Recombine outputs after all subtasks complete

### Strategy 4: Escalation

Route to specialized agent for resolution.

```
# For boundary violations
Task(
  subagent_type: "contract-resolver",
  model: "sonnet",
  prompt: |
    A task is blocked due to boundary/contract issues.

    Blocked task output: memory/tasks/{task_id}/output.json
    Blocked reason: {blocked_reason}
    Current contracts: {contract_paths}

    Analyze impact and provide resolution.
    Output to: memory/contracts/resolution_{task_id}.json
)
```

**Escalation paths:**

| Failure Type | Escalate To | Action |
|--------------|-------------|--------|
| `blocked_reason: boundary_violation` | contract-resolver | Expand boundaries or redesign |
| `blocked_reason: contract_change` | contract-resolver | Modify contract, re-verify dependents |
| `blocked_reason: dependency_issue` | executor (self) | Re-check dependency status |
| Repeated implementation failures | architect | Reconsider design approach |

### Strategy 5: Abort with Report

When recovery is not possible, fail gracefully.

```json
{"tasks":[{"id":"{task_id}","status":"failed","failure_reason":"{specific reason}","attempts_made":3,"recovery_attempted":[{"strategy":"simple_retry","result":"same_error"},{"strategy":"context_enhancement","result":"different_error"},{"strategy":"scope_reduction","result":"subtasks_also_failed"}],"recommendation":"Task may need architectural redesign"}]}
```

**Use when:**
- 3+ retry attempts failed
- Different strategies all failed
- Fundamental misunderstanding of requirements
- Task is actually impossible given constraints

## Decision Tree

```
On Subagent Failure:
│
├─ Is output malformed/empty/timeout?
│  └─ YES → Strategy 1: Simple Retry (up to 3x)
│
├─ Did agent say "unclear" or ask questions?
│  └─ YES → Strategy 2: Context Enhancement
│
├─ Did agent complete partial work?
│  └─ YES → Strategy 3: Scope Reduction
│
├─ Is status "blocked" with boundary/contract reason?
│  └─ YES → Strategy 4: Escalation to contract-resolver
│
├─ Have we tried 3+ strategies already?
│  └─ YES → Strategy 5: Abort with Report
│
└─ Unknown error
   └─ Try Strategy 2 first, then escalate
```

## Retry State Tracking

Track retry attempts in the execution state file:

```json
{"tasks":[{"id":"task-001","status":"running","attempts":2,"last_error":"Timeout after 120s","retry_strategy":"simple_retry"},{"id":"task-002","status":"running","attempts":1,"last_error":"Needs access to src/config/db.ts","retry_strategy":"context_enhancement","context_added":["src/config/db.ts","src/types/config.ts"]}]}
```

## Integration with Executor Loop

```
# Enhanced execution loop
WHILE tasks remain incomplete:
  1. Read state file
  2. Find ready tasks
  3. Spawn ready tasks
  4. Check completed tasks:
     FOR each completed task:
       IF status == pre_complete:
         spawn verifier
       ELIF status == blocked:
         apply Strategy 4 (Escalation)
       ELIF status == failed:
         determine_failure_category()
         apply_appropriate_strategy()
         update_retry_state()
  5. Update state file
  6. IF all verified: EXIT
  7. IF all failed with no recovery: EXIT with failure report
```

## Principles

1. **Fail fast, recover smart** - Don't retry blindly; analyze the failure first
2. **Preserve partial work** - If agent completed 50%, don't discard it
3. **Escalate early** - Boundary/contract issues need resolver, not retries
4. **Track everything** - Log all attempts for reflection phase
5. **Know when to quit** - 3 failed strategies = abort, don't loop forever

Related Skills

planning-disaster-recovery

from ComeOnOliver/skillshub

Execute use when you need to work with backup and recovery. This skill provides backup automation and disaster recovery with comprehensive guidance and automation. Trigger with phrases like "create backups", "automate backups", or "implement disaster recovery".

monitoring-error-rates

from ComeOnOliver/skillshub

Monitor and analyze application error rates to improve reliability. Use when tracking errors in applications including HTTP errors, exceptions, and database issues. Trigger with phrases like "monitor error rates", "track application errors", or "analyze error patterns".

managing-database-recovery

from ComeOnOliver/skillshub

Process use when you need to work with database operations. This skill provides database management and optimization with comprehensive guidance and automation. Trigger with phrases like "manage database", "optimize database", or "configure database".

fathom-common-errors

from ComeOnOliver/skillshub

Diagnose and fix Fathom API errors including auth failures and missing data. Use when API calls fail, transcripts are empty, or webhooks are not firing. Trigger with phrases like "fathom error", "fathom not working", "fathom api failure", "fix fathom".

exa-common-errors

from ComeOnOliver/skillshub

Diagnose and fix Exa API errors by HTTP code and error tag. Use when encountering Exa errors, debugging failed requests, or troubleshooting integration issues. Trigger with phrases like "exa error", "fix exa", "exa not working", "debug exa", "exa 429", "exa 401".

evernote-common-errors

from ComeOnOliver/skillshub

Diagnose and fix common Evernote API errors. Use when encountering Evernote API exceptions, debugging failures, or troubleshooting integration issues. Trigger with phrases like "evernote error", "evernote exception", "fix evernote issue", "debug evernote", "evernote troubleshooting".

error-mapping-helper

from ComeOnOliver/skillshub

Error Mapping Helper - Auto-activating skill for API Integration. Triggers on: error mapping helper, error mapping helper Part of the API Integration skill category.

error-handler-middleware

from ComeOnOliver/skillshub

Error Handler Middleware - Auto-activating skill for Backend Development. Triggers on: error handler middleware, error handler middleware Part of the Backend Development skill category.

elevenlabs-common-errors

from ComeOnOliver/skillshub

Diagnose and fix ElevenLabs API errors by HTTP status code. Use when encountering ElevenLabs errors, debugging failed TTS/STS requests, or troubleshooting voice cloning and streaming issues. Trigger: "elevenlabs error", "fix elevenlabs", "elevenlabs not working", "debug elevenlabs", "elevenlabs 401", "elevenlabs 429", "elevenlabs 400".

documenso-common-errors

from ComeOnOliver/skillshub

Diagnose and resolve common Documenso API errors and issues. Use when encountering Documenso errors, debugging integration issues, or troubleshooting failed operations. Trigger with phrases like "documenso error", "documenso 401", "documenso failed", "fix documenso", "documenso not working".

deepgram-common-errors

from ComeOnOliver/skillshub

Diagnose and fix common Deepgram errors and issues. Use when troubleshooting Deepgram API errors, debugging transcription failures, or resolving integration issues. Trigger: "deepgram error", "deepgram not working", "fix deepgram", "deepgram troubleshoot", "transcription failed", "deepgram 401".

cursor-common-errors

from ComeOnOliver/skillshub

Troubleshoot common Cursor IDE errors: authentication, completion, indexing, API, and performance issues. Triggers on "cursor error", "cursor not working", "cursor issue", "cursor problem", "fix cursor", "cursor crash".