error-recovery

Strategies for handling subagent failures with retry logic and escalation patterns.

242 stars

Best use case

error-recovery is best used when you need a repeatable AI agent workflow instead of a one-off prompt. It is especially useful for teams that spawn subagents and need consistent strategies for handling their failures with retry logic and escalation patterns.

Users should expect a more consistent workflow output, faster repeated execution, and less time spent rewriting prompts from scratch.

Practical example

Example input

Use the "error-recovery" skill to help with this workflow task. Context: Strategies for handling subagent failures with retry logic and escalation patterns.

Example output

A structured workflow result with clearer steps, more consistent formatting, and an output that is easier to reuse in the next run.

When to use this skill

  • Use this skill when you want a reusable workflow rather than writing the same prompt again and again.

When not to use this skill

  • Do not use this when you only need a one-off answer and do not need a reusable workflow.
  • Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/error-recovery/SKILL.md --create-dirs "https://raw.githubusercontent.com/aiskillstore/marketplace/main/skills/clouder0/error-recovery/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/error-recovery/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How error-recovery Compares

| Feature / Agent | error-recovery | Standard Approach |
|-----------------|----------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It provides strategies for handling subagent failures with retry logic and escalation patterns.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Error Recovery Skill

Pattern for handling subagent failures gracefully with appropriate retry strategies.

## When to Load This Skill

- You are spawning subagents that may fail
- A subagent returned an error or unexpected output
- You need to decide whether to retry, escalate, or abort

## Failure Categories

| Category | Symptoms | Strategy |
|----------|----------|----------|
| **Transient** | Timeout, malformed output, parsing error | Simple Retry |
| **Context Gap** | "I don't have enough information", unclear task | Context Enhancement |
| **Complexity** | Partial completion, scope creep, tangents | Scope Reduction |
| **Boundary/Contract** | `status: blocked`, boundary_violation, contract_change | Escalation |
| **Fatal** | Repeated failures (3+), fundamental misunderstanding | Abort with Report |
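
The table above can be read as a lookup from failure category to strategy. A minimal sketch in Python follows; the enum, the dictionary, and the strategy names `escalation` and `abort_with_report` are illustrative assumptions, not identifiers defined by this skill.

```python
from enum import Enum


class FailureCategory(Enum):
    """The five categories from the table (enum values are hypothetical)."""
    TRANSIENT = "transient"
    CONTEXT_GAP = "context_gap"
    COMPLEXITY = "complexity"
    BOUNDARY_CONTRACT = "boundary_contract"
    FATAL = "fatal"


# Category -> strategy name, mirroring the Strategy column of the table.
STRATEGY_FOR = {
    FailureCategory.TRANSIENT: "simple_retry",
    FailureCategory.CONTEXT_GAP: "context_enhancement",
    FailureCategory.COMPLEXITY: "scope_reduction",
    FailureCategory.BOUNDARY_CONTRACT: "escalation",
    FailureCategory.FATAL: "abort_with_report",
}


def strategy_for(category: FailureCategory) -> str:
    """Return the recovery strategy name for a failure category."""
    return STRATEGY_FOR[category]
```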

## Retry Strategies

### Strategy 1: Simple Retry

For transient failures. Same prompt, up to 3 attempts.

```
# Track attempts
attempts: 0
max_attempts: 3

# On failure
IF attempts < max_attempts:
  attempts += 1
  Task(same_subagent_type, same_model, same_prompt)
ELSE:
  Mark as FAILED, move on
```

**Use when:**
- Output was malformed or truncated
- Timeout occurred
- Agent returned empty/null response
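
As a rough illustration, the retry loop might look like this in Python. It assumes the limit of 3 counts total runs including the first, and `run_task` is a hypothetical stand-in for spawning the subagent with the same type, model, and prompt.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # "up to 3 attempts"; assumed to include the first run


def simple_retry(run_task: Callable[[], Optional[str]],
                 max_attempts: int = MAX_ATTEMPTS) -> tuple[str, int, Optional[str]]:
    """Re-run the same task until it yields output or attempts run out.

    Returns (status, attempts_used, output).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            output = run_task()
        except TimeoutError:
            output = None  # treat a timeout like an empty response
        if output:  # any non-empty output counts as success in this sketch
            return "completed", attempt, output
    return "failed", max_attempts, None
```

A real executor would also validate the output format before accepting it, since malformed output is one of the triggers for this strategy.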

### Strategy 2: Context Enhancement

Add more information to help the agent succeed.

```
Task(
  subagent_type: "implementer",
  model: "sonnet",
  prompt: |
    ## PREVIOUS ATTEMPT FAILED

    Error: {error_message}
    Output received: {partial_output}

    ## ADDITIONAL CONTEXT

    Here is more information that may help:
    - Related file: @{additional_file_path}
    - Pattern to follow: {example_pattern}
    - Specific guidance: {clarification}

    ## ORIGINAL TASK

    {original_task_description}

    Output to: {output_path}
)
```

**Use when:**
- Agent said "I don't understand" or "unclear requirements"
- Agent made incorrect assumptions
- Agent asked questions in output

**Context to add:**
- Related code files the agent might need
- Similar implementations as examples
- Explicit clarification of ambiguous points
- Error message from previous attempt
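
One way to assemble such a prompt programmatically is sketched below. The function name and parameters are hypothetical; only the section headings come from the template above.

```python
def build_enhanced_prompt(original_task: str, error_message: str,
                          partial_output: str, extra_context: list[str],
                          output_path: str) -> str:
    """Assemble a retry prompt using the section layout of the template."""
    # Each extra context item becomes one bullet under ADDITIONAL CONTEXT.
    context_lines = "\n".join(f"- {item}" for item in extra_context)
    return (
        "## PREVIOUS ATTEMPT FAILED\n\n"
        f"Error: {error_message}\n"
        f"Output received: {partial_output}\n\n"
        "## ADDITIONAL CONTEXT\n\n"
        "Here is more information that may help:\n"
        f"{context_lines}\n\n"
        "## ORIGINAL TASK\n\n"
        f"{original_task}\n\n"
        f"Output to: {output_path}\n"
    )
```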

### Strategy 3: Scope Reduction

Break the failing task into smaller, more manageable pieces.

```
# Original task failed
Task: "Implement full authentication system"

# Split into subtasks
Task(implementer, "Implement password hashing utility")
Task(implementer, "Implement session token generation")
Task(implementer, "Implement login endpoint")
Task(implementer, "Implement logout endpoint")
```

**Use when:**
- Agent completed partial work then failed
- Task description was too broad
- Agent went off on tangents
- Output shows confusion about scope

**Splitting guidelines:**
- Each subtask should be independently completable
- Each subtask should have clear boundaries
- Subtasks can run in parallel if they have no dependencies on each other
- Recombine outputs after all subtasks complete
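
The last two guidelines (parallel execution, then recombination) could be sketched as follows. `run_one` is a hypothetical stand-in for spawning an implementer subagent, and the sketch is only valid when the subtasks are truly independent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_subtasks(subtasks: list[str],
                 run_one: Callable[[str], str]) -> dict[str, str]:
    """Run independent subtasks in parallel and collect outputs by description."""
    # max_workers must be at least 1, even for an empty subtask list.
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        outputs = list(pool.map(run_one, subtasks))  # map preserves input order
    # Recombine only after every subtask has finished.
    return dict(zip(subtasks, outputs))
```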

### Strategy 4: Escalation

Route to a specialized agent for resolution.

```
# For boundary violations
Task(
  subagent_type: "contract-resolver",
  model: "sonnet",
  prompt: |
    A task is blocked due to boundary/contract issues.

    Blocked task output: memory/tasks/{task_id}/output.json
    Blocked reason: {blocked_reason}
    Current contracts: {contract_paths}

    Analyze impact and provide resolution.
    Output to: memory/contracts/resolution_{task_id}.json
)
```

**Escalation paths:**

| Failure Type | Escalate To | Action |
|--------------|-------------|--------|
| `blocked_reason: boundary_violation` | contract-resolver | Expand boundaries or redesign |
| `blocked_reason: contract_change` | contract-resolver | Modify contract, re-verify dependents |
| `blocked_reason: dependency_issue` | executor (self) | Re-check dependency status |
| Repeated implementation failures | architect | Reconsider design approach |
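
The escalation table can be expressed as a small routing function. This is a sketch only: the function name, the `repeated_failures` flag, and the choice to raise on an unknown reason are assumptions, while the agent names come from the table.

```python
from typing import Optional

# blocked_reason -> agent to escalate to, as listed in the table above.
ESCALATION_PATHS = {
    "boundary_violation": "contract-resolver",
    "contract_change": "contract-resolver",
    "dependency_issue": "executor",
}


def escalation_target(blocked_reason: Optional[str],
                      repeated_failures: bool = False) -> str:
    """Pick the agent that should handle an escalated task."""
    if blocked_reason in ESCALATION_PATHS:
        return ESCALATION_PATHS[blocked_reason]
    if repeated_failures:
        return "architect"  # repeated implementation failures
    raise ValueError(f"no escalation path for reason: {blocked_reason!r}")
```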

### Strategy 5: Abort with Report

When recovery is not possible, fail gracefully.

```json
{
  "tasks": [
    {
      "id": "{task_id}",
      "status": "failed",
      "failure_reason": "{specific reason}",
      "attempts_made": 3,
      "recovery_attempted": [
        {"strategy": "simple_retry", "result": "same_error"},
        {"strategy": "context_enhancement", "result": "different_error"},
        {"strategy": "scope_reduction", "result": "subtasks_also_failed"}
      ],
      "recommendation": "Task may need architectural redesign"
    }
  ]
}
```

**Use when:**
- 3+ retry attempts failed
- Different strategies all failed
- Fundamental misunderstanding of requirements
- Task is actually impossible given constraints
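
A report in this shape could be generated with a helper along these lines. The function name and parameters are hypothetical; the field names follow the JSON example above.

```python
import json


def build_failure_report(task_id: str, failure_reason: str,
                         attempts_made: int,
                         recovery_attempted: list[dict],
                         recommendation: str) -> str:
    """Serialize an abort report using the fields from the example above."""
    report = {
        "tasks": [
            {
                "id": task_id,
                "status": "failed",
                "failure_reason": failure_reason,
                "attempts_made": attempts_made,
                "recovery_attempted": recovery_attempted,
                "recommendation": recommendation,
            }
        ]
    }
    return json.dumps(report, indent=2)
```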

## Decision Tree

```
On Subagent Failure:
│
├─ Is output malformed/empty/timeout?
│  └─ YES → Strategy 1: Simple Retry (up to 3x)
│
├─ Did agent say "unclear" or ask questions?
│  └─ YES → Strategy 2: Context Enhancement
│
├─ Did agent complete partial work?
│  └─ YES → Strategy 3: Scope Reduction
│
├─ Is status "blocked" with boundary/contract reason?
│  └─ YES → Strategy 4: Escalation to contract-resolver
│
├─ Have we tried 3+ strategies already?
│  └─ YES → Strategy 5: Abort with Report
│
└─ Unknown error
   └─ Try Strategy 2 first, then escalate
```
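
The tree can be walked top to bottom in code, with the first matching branch winning. In this sketch the keys of the `failure` dictionary (`malformed`, `empty`, `timeout`, `unclear`, `asked_questions`, `partial_work`) and the returned strategy names are illustrative assumptions; only `status` and `blocked_reason` appear elsewhere in this skill.

```python
def choose_strategy(failure: dict, strategies_tried: int = 0) -> str:
    """Return a strategy name by checking the branches in tree order."""
    if failure.get("malformed") or failure.get("empty") or failure.get("timeout"):
        return "simple_retry"            # Strategy 1, up to 3x
    if failure.get("unclear") or failure.get("asked_questions"):
        return "context_enhancement"     # Strategy 2
    if failure.get("partial_work"):
        return "scope_reduction"         # Strategy 3
    if (failure.get("status") == "blocked"
            and failure.get("blocked_reason") in ("boundary_violation", "contract_change")):
        return "escalation"              # Strategy 4, to contract-resolver
    if strategies_tried >= 3:
        return "abort_with_report"       # Strategy 5
    return "context_enhancement"         # unknown error: try Strategy 2 first
```

Because the branches are checked in order, a caller that wants the abort rule to take priority after repeated failures would need to test `strategies_tried` before calling this function.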

## Retry State Tracking

Track retry attempts in the execution state file:

```json
{
  "tasks": [
    {
      "id": "task-001",
      "status": "running",
      "attempts": 2,
      "last_error": "Timeout after 120s",
      "retry_strategy": "simple_retry"
    },
    {
      "id": "task-002",
      "status": "running",
      "attempts": 1,
      "last_error": "Needs access to src/config/db.ts",
      "retry_strategy": "context_enhancement",
      "context_added": ["src/config/db.ts", "src/types/config.ts"]
    }
  ]
}
```
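
Updating that file after each failed attempt might look like the sketch below. The function name and the location of the state file are assumptions; the field names match the example above.

```python
import json
from pathlib import Path


def record_attempt(state_path: Path, task_id: str,
                   error: str, strategy: str) -> dict:
    """Increment a task's attempt count and store the latest error and strategy."""
    state = json.loads(state_path.read_text())
    for task in state["tasks"]:
        if task["id"] == task_id:
            task["attempts"] = task.get("attempts", 0) + 1
            task["last_error"] = error
            task["retry_strategy"] = strategy
            break
    else:
        # The for-else branch runs only when no task matched.
        raise KeyError(task_id)
    state_path.write_text(json.dumps(state, indent=2))
    return state
```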

## Integration with Executor Loop

```
# Enhanced execution loop
WHILE tasks remain incomplete:
  1. Read state file
  2. Find ready tasks
  3. Spawn ready tasks
  4. Check completed tasks:
     FOR each completed task:
       IF status == pre_complete:
         spawn verifier
       ELIF status == blocked:
         apply Strategy 4 (Escalation)
       ELIF status == failed:
         determine_failure_category()
         apply_appropriate_strategy()
         update_retry_state()
  5. Update state file
  6. IF all verified: EXIT
  7. IF all failed with no recovery: EXIT with failure report
```
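
Step 4 and the two exit conditions of the loop can be sketched as two small functions. The action names, the `verified` status value, and the per-task `recoverable` flag are assumptions made for illustration; the pseudocode above does not define them.

```python
from typing import Optional


def next_action(task: dict) -> str:
    """Map a completed task's status to the loop's follow-up action (step 4)."""
    status = task.get("status")
    if status == "pre_complete":
        return "spawn_verifier"
    if status == "blocked":
        return "escalate"   # Strategy 4
    if status == "failed":
        return "recover"    # categorize, apply a strategy, update retry state
    return "wait"


def exit_state(tasks: list[dict]) -> Optional[str]:
    """Return "verified" or "failed" when the loop should exit, else None."""
    if tasks and all(t.get("status") == "verified" for t in tasks):
        return "verified"   # step 6
    if tasks and all(t.get("status") == "failed" and not t.get("recoverable", True)
                     for t in tasks):
        return "failed"     # step 7: exit with failure report
    return None
```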

## Principles

1. **Fail fast, recover smart** - Don't retry blindly; analyze the failure first
2. **Preserve partial work** - If agent completed 50%, don't discard it
3. **Escalate early** - Boundary/contract issues need resolver, not retries
4. **Track everything** - Log all attempts for reflection phase
5. **Know when to quit** - 3 failed strategies = abort, don't loop forever

Related Skills

fp-ts-errors

242
from aiskillstore/marketplace

Handle errors as values using fp-ts Either and TaskEither for cleaner, more predictable TypeScript code. Use when implementing error handling patterns with fp-ts.

error-handling-patterns

242
from aiskillstore/marketplace

Master error handling patterns across languages including exceptions, Result types, error propagation, and graceful degradation to build resilient applications. Use when implementing error handling, designing APIs, or improving application reliability.

error-diagnostics-smart-debug

242
from aiskillstore/marketplace

Use when working with error diagnostics smart debug

error-diagnostics-error-trace

242
from aiskillstore/marketplace

You are an error tracking and observability expert specializing in implementing comprehensive error monitoring solutions. Set up error tracking systems, configure alerts, implement structured logging,

error-diagnostics-error-analysis

242
from aiskillstore/marketplace

You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.

error-debugging-multi-agent-review

242
from aiskillstore/marketplace

Use when working with error debugging multi agent review

error-debugging-error-trace

242
from aiskillstore/marketplace

You are an error tracking and observability expert specializing in implementing comprehensive error monitoring solutions. Set up error tracking systems, configure alerts, implement structured logging, and ensure teams can quickly identify and resolve production issues.

error-debugging-error-analysis

242
from aiskillstore/marketplace

You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.

test-build-error

242
from aiskillstore/marketplace

Tests error visibility for build step failures

thiserror-expert

242
from aiskillstore/marketplace

Provides guidance on creating custom error types with thiserror, including proper derive macros, error messages, and source error chaining. Activates when users define error enums or work with thiserror.

error-handler-advisor

242
from aiskillstore/marketplace

Proactively reviews error handling patterns and suggests improvements using Result types, proper error propagation, and idiomatic patterns. Activates when users write error handling code or use unwrap/expect.

error-conversion-guide

242
from aiskillstore/marketplace

Guides users on error conversion patterns, From trait implementations, and the ? operator. Activates when users need to convert between error types or handle multiple error types in a function.