error-recovery

Standard recovery patterns for all squad agents. When something fails, adapt — don't just report the failure.

1,828 stars

by bradygaster

Best use case

error-recovery is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using error-recovery can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/error-recovery/SKILL.md --create-dirs "https://raw.githubusercontent.com/bradygaster/squad/main/packages/squad-cli/templates/skills/error-recovery/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/error-recovery/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How error-recovery Compares

Feature / Agent	error-recovery	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Standard recovery patterns for all squad agents. When something fails, adapt — don't just report the failure.

Where can I find the source code?

The source is in the bradygaster/squad repository on GitHub, at packages/squad-cli/templates/skills/error-recovery/SKILL.md.

SKILL.md Source

# Error Recovery Patterns

Standard recovery patterns for all squad agents. When something fails, **adapt** — don't just report the failure.

---

## 1. Retry with Backoff

**When:** Transient failures — API timeouts, rate limits, network errors, temporary service unavailability.

**Pattern:**
1. Wait briefly, then retry (start at 2s, double each attempt)
2. Maximum 3 retries before escalating
3. Log each attempt with the error received

**Example:** API call returns 429 Too Many Requests → wait 2s → retry → wait 4s → retry → wait 8s → retry → escalate if still failing.
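
The steps above can be sketched in Python as follows. This is a minimal illustration, not part of the squad codebase: `TransientError`, `retry_with_backoff`, and the injectable `sleep` parameter are hypothetical names chosen for this example. It assumes the caller wraps retryable failures (timeouts, 429 responses) in `TransientError`.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("error_recovery")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 429s)."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    initial_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures with doubling delays."""
    delay = initial_delay
    for attempt in range(max_retries + 1):  # one initial try plus the retries
        try:
            return operation()
        except TransientError as exc:
            # Log each attempt together with the error received.
            logger.warning("attempt %d failed: %s", attempt + 1, exc)
            if attempt == max_retries:
                raise  # retries exhausted: the caller should escalate
            sleep(delay)
            delay *= 2  # 2s, then 4s, then 8s
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Re-raising the last error leaves the escalation decision to the caller.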

---

## 2. Fallback Alternatives

**When:** Primary tool or approach fails and an alternative exists.

**Pattern:**
1. Attempt primary approach
2. On failure, identify alternative tool/method
3. Try the alternative with the same intent
4. Document which alternative was used and why

**Example:** Primary CLI tool fails → fall back to direct API call for the same operation.
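
One possible shape for this pattern is an ordered list of named approaches that all serve the same intent. The helper name `run_with_fallbacks` and the approach labels are assumptions made for this sketch. Real tools would replace the placeholder callables.

```python
import logging
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger("error_recovery")


def run_with_fallbacks(
    approaches: Sequence[Tuple[str, Callable[[], T]]],
) -> Tuple[str, T]:
    """Try each (name, approach) in order; return the first success."""
    errors: List[str] = []
    for name, approach in approaches:
        try:
            result = approach()
        except Exception as exc:
            logger.warning("%s failed: %s", name, exc)
            errors.append(f"{name}: {exc}")
            continue
        if errors:
            # Document which alternative was used and why.
            logger.info("used %s after failures: %s", name, "; ".join(errors))
        return name, result
    raise RuntimeError("all approaches failed: " + "; ".join(errors))
```

Returning the name of the approach that succeeded lets the agent record which alternative it used in its final output.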

---

## 3. Diagnose-and-Fix

**When:** Build failures, test failures, linting errors — structured errors with actionable output.

**Pattern:**
1. Read the full error output carefully
2. Identify the root cause from error messages
3. Attempt a targeted fix
4. Re-run to verify the fix
5. Maximum 3 fix-retry cycles before escalating

**Example:** Build fails with a type error → check for missing import → add it → rebuild.
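
The cycle above might look like the loop below. `RunResult`, `diagnose_and_fix`, and the two callbacks are hypothetical. In practice `run` would invoke the build or test command and `attempt_fix` would hold the agent's diagnosis logic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    ok: bool
    output: str  # full error output, read before attempting a fix


def diagnose_and_fix(
    run: Callable[[], RunResult],
    attempt_fix: Callable[[str], bool],
    max_cycles: int = 3,
) -> RunResult:
    """Re-run after each targeted fix, for at most `max_cycles` cycles."""
    result = run()
    cycles = 0
    while not result.ok and cycles < max_cycles:
        # attempt_fix returns False when no targeted fix can be identified.
        if not attempt_fix(result.output):
            break
        cycles += 1
        result = run()  # re-run to verify the fix
    return result  # if still not ok, the caller should escalate
```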

---

## 4. Escalate with Context

**When:** Recovery attempts have been exhausted, or the failure requires human judgment.

**Pattern:**
1. Summarize what was attempted and what failed
2. Include the exact error messages
3. State what you believe the root cause is
4. Suggest next steps or who might be able to help
5. Hand off to the coordinator or the appropriate specialist

**Example:** After 3 failed build attempts → "Build fails on line 42 with null reference. Tried X, Y, Z. Likely a design issue in the Foo module. Recommend the code owner review."
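
A hand-off message covering the five points above could be assembled from a small record type. The `Escalation` class and its field names are illustrative assumptions. The pattern only requires that the same information reaches the coordinator or specialist.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Escalation:
    task: str
    attempts: List[str]      # what was tried
    errors: List[str]        # exact error messages, unedited
    suspected_cause: str
    next_steps: str
    hand_off_to: str = "coordinator"

    def render(self) -> str:
        """Format the escalation as plain text for the recipient."""
        lines = [f"Escalating to {self.hand_off_to}: {self.task}", "Attempted:"]
        lines += [f"  - {attempt}" for attempt in self.attempts]
        lines.append("Errors (verbatim):")
        lines += [f"  - {error}" for error in self.errors]
        lines.append(f"Suspected root cause: {self.suspected_cause}")
        lines.append(f"Suggested next steps: {self.next_steps}")
        return "\n".join(lines)
```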

---

## 5. Graceful Degradation

**When:** A non-critical step fails but the overall task can still deliver value.

**Pattern:**
1. Determine if the failed step is critical to the task outcome
2. If non-critical, log the failure and continue
3. Deliver partial results with a clear note of what was skipped
4. Offer to retry the skipped step separately

**Example:** Generating a report with 5 sections — section 3 data source is unavailable → produce the report with 4 sections, note that section 3 was skipped and why.
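
As a rough sketch of the report example, the function below runs named steps in order, re-raises failures in steps marked critical, and records the rest as skipped. The name `run_with_degradation` and the `critical` set are assumptions for illustration only.

```python
import logging
from typing import Callable, Dict, Set, Tuple

logger = logging.getLogger("error_recovery")


def run_with_degradation(
    steps: Dict[str, Callable[[], str]],
    critical: Set[str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (completed, skipped); `skipped` maps step name to the reason."""
    completed: Dict[str, str] = {}
    skipped: Dict[str, str] = {}
    for name, step in steps.items():
        try:
            completed[name] = step()
        except Exception as exc:
            if name in critical:
                raise  # a critical step failed: do not deliver partial output
            logger.warning("skipping non-critical step %s: %s", name, exc)
            skipped[name] = str(exc)
    return completed, skipped
```

The `skipped` mapping gives the agent what it needs to note each omission and its reason, and to offer a separate retry later.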

---

## Applying These Patterns

Each agent should reference these patterns in the `## Error Recovery` section of its charter, tailored to its domain. The charter should list the agent's most common failure modes and map each one to the appropriate pattern above.
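
To make the mapping concrete, the sketch below renders a charter section from a table of failure modes. The charter layout, the `Pattern` enum, and the sample failure modes for a build-focused agent are all hypothetical. The source does not prescribe a format beyond the `## Error Recovery` heading.

```python
from enum import Enum
from typing import Dict


class Pattern(Enum):
    RETRY_WITH_BACKOFF = "Retry with Backoff"
    FALLBACK_ALTERNATIVES = "Fallback Alternatives"
    DIAGNOSE_AND_FIX = "Diagnose-and-Fix"
    ESCALATE_WITH_CONTEXT = "Escalate with Context"
    GRACEFUL_DEGRADATION = "Graceful Degradation"


def render_error_recovery_section(failure_modes: Dict[str, Pattern]) -> str:
    """Render a charter section mapping each failure mode to a pattern."""
    lines = ["## Error Recovery", ""]
    for failure, pattern in failure_modes.items():
        lines.append(f"- {failure}: {pattern.value}")
    return "\n".join(lines)


# Hypothetical failure modes for a build-focused agent.
build_agent_modes = {
    "Package registry timeout": Pattern.RETRY_WITH_BACKOFF,
    "Compile or lint error": Pattern.DIAGNOSE_AND_FIX,
    "Optional coverage upload fails": Pattern.GRACEFUL_DEGRADATION,
}
print(render_error_recovery_section(build_agent_modes))
```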
