error-recovery
Standard recovery patterns for all squad agents. When something fails, adapt — don't just report the failure.
Best use case
error-recovery is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Standard recovery patterns for all squad agents. When something fails, adapt — don't just report the failure.
Teams using error-recovery should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/error-recovery/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How error-recovery Compares
| Feature / Agent | error-recovery | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Standard recovery patterns for all squad agents. When something fails, adapt — don't just report the failure.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Error Recovery Patterns Standard recovery patterns for all squad agents. When something fails, **adapt** — don't just report the failure. --- ## 1. Retry with Backoff **When:** Transient failures — API timeouts, rate limits, network errors, temporary service unavailability. **Pattern:** 1. Wait briefly, then retry (start at 2s, double each attempt) 2. Maximum 3 retries before escalating 3. Log each attempt with the error received **Example:** API call returns 429 Too Many Requests → wait 2s → retry → wait 4s → retry → wait 8s → retry → escalate if still failing. --- ## 2. Fallback Alternatives **When:** Primary tool or approach fails and an alternative exists. **Pattern:** 1. Attempt primary approach 2. On failure, identify alternative tool/method 3. Try the alternative with the same intent 4. Document which alternative was used and why **Example:** Primary CLI tool fails → fall back to direct API call for the same operation. --- ## 3. Diagnose-and-Fix **When:** Build failures, test failures, linting errors — structured errors with actionable output. **Pattern:** 1. Read the full error output carefully 2. Identify the root cause from error messages 3. Attempt a targeted fix 4. Re-run to verify the fix 5. Maximum 3 fix-retry cycles before escalating **Example:** Build fails with a type error → check for missing import → add it → rebuild. --- ## 4. Escalate with Context **When:** Recovery attempts have been exhausted, or the failure requires human judgment. **Pattern:** 1. Summarize what was attempted and what failed 2. Include the exact error messages 3. State what you believe the root cause is 4. Suggest next steps or who might be able to help 5. Hand off to the coordinator or the appropriate specialist **Example:** After 3 failed build attempts → "Build fails on line 42 with null reference. Tried X, Y, Z. Likely a design issue in the Foo module. Recommend the code owner review." --- ## 5. Graceful Degradation **When:** A non-critical step fails but the overall task can still deliver value. **Pattern:** 1. Determine if the failed step is critical to the task outcome 2. If non-critical, log the failure and continue 3. Deliver partial results with a clear note of what was skipped 4. Offer to retry the skipped step separately **Example:** Generating a report with 5 sections — section 3 data source is unavailable → produce the report with 4 sections, note that section 3 was skipped and why. --- ## Applying These Patterns Each agent should reference these patterns in their charter's `## Error Recovery` section, tailored to their domain. The charter should list the agent's most common failure modes and map each to the appropriate pattern above. **Selection guide:** | Failure Type | Primary Pattern | Fallback Pattern | |---|---|---| | Network/API transient | Retry with Backoff | Escalate with Context | | Tool/dependency missing | Fallback Alternatives | Escalate with Context | | Build/test error | Diagnose-and-Fix | Escalate with Context | | Auth/permissions | Retry with Backoff | Escalate with Context | | Non-critical data missing | Graceful Degradation | — | | Unknown/novel error | Escalate with Context | — |
Related Skills
session-recovery
Find and resume interrupted Copilot CLI sessions using session_store queries
My Skill
No description provided.
rework-rate
Measure and interpret PR rework rate — the emerging 5th DORA metric
project-conventions
Core conventions and patterns for this codebase
tiered-memory
Three-tier agent memory model (hot/cold/wiki) for 20-55% context reduction per spawn
test-discipline
Update tests when changing APIs — no exceptions
Skill: Retro Enforcement
## Purpose
reflect
Learning capture system that extracts HIGH/MED/LOW confidence patterns from conversations to prevent repeating mistakes. Use after user corrections ("no", "wrong"), praise ("perfect", "exactly"), or when discovering edge cases. Complements .squad/agents/{agent}/history.md and .squad/decisions.md.
notification-routing
Route agent notifications to specific channels by type — prevent alert fatigue from single-channel flooding
iterative-retrieval
Max-3-cycle protocol for agent sub-tasks with WHY context and coordinator validation. Use when spawning sub-agents to complete scoped work.
docs-standards
Microsoft Style Guide + Squad-specific documentation patterns
{skill-name}
{what this skill teaches agents}