resilience-analysis

Assess error handling, isolation boundaries, and recovery mechanisms in agent frameworks. Use when (1) tracing error propagation paths, (2) evaluating sandboxing for code execution, (3) understanding retry and fallback mechanisms, (4) assessing production readiness, or (5) identifying failure modes and recovery patterns.

242 stars

Best use case

resilience-analysis is best used when you need a repeatable AI agent workflow instead of a one-off prompt. It is especially useful for teams working in multi.

Users should expect a more consistent workflow output, faster repeated execution, and less time spent rewriting prompts from scratch.

Practical example

Example input

Use the "resilience-analysis" skill to help with this workflow task. Context: Assess error handling, isolation boundaries, and recovery mechanisms in agent frameworks. Use when (1) tracing error propagation paths, (2) evaluating sandboxing for code execution, (3) understanding retry and fallback mechanisms, (4) assessing production readiness, or (5) identifying failure modes and recovery patterns.

Example output

A structured workflow result with clearer steps, more consistent formatting, and an output that is easier to reuse in the next run.

When to use this skill

  • Use this skill when you want a reusable workflow rather than writing the same prompt again and again.

When not to use this skill

  • Do not use this when you only need a one-off answer and do not need a reusable workflow.
  • Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/resilience-analysis/SKILL.md --create-dirs "https://raw.githubusercontent.com/aiskillstore/marketplace/main/skills/dowwie/resilience-analysis/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/resilience-analysis/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How resilience-analysis Compares

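| Feature / Agent | resilience-analysis | Standard Approach |
|-----------------|---------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |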

Frequently Asked Questions

What does this skill do?

Assess error handling, isolation boundaries, and recovery mechanisms in agent frameworks. Use when (1) tracing error propagation paths, (2) evaluating sandboxing for code execution, (3) understanding retry and fallback mechanisms, (4) assessing production readiness, or (5) identifying failure modes and recovery patterns.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Resilience Analysis

Assesses error handling and isolation boundaries.

## Process

1. **Trace error propagation** — Map exception flow from tools to agent
2. **Identify isolation** — Sandbox mechanisms for dangerous operations
3. **Catalog recovery** — Retry logic, fallbacks, circuit breakers
4. **Assess boundaries** — What crashes propagate vs. are contained

## Error Propagation Analysis

### Questions to Answer

1. Does a tool exception terminate the agent?
2. Are LLM API errors retried automatically?
3. Is parsing failure (malformed output) recoverable?
4. What happens when state updates fail?

### Propagation Patterns

**Crash Propagation (Dangerous)**
```python
def run_tool(self, tool, args):
    return tool.execute(args)  # Exception bubbles up
```

**Exception Wrapping**
```python
def run_tool(self, tool, args):
    try:
        return tool.execute(args)
    except Exception as e:
        raise ToolExecutionError(tool.name, e) from e
```

**Error Containment**
```python
def run_tool(self, tool, args):
    try:
        return ToolResult(success=True, output=tool.execute(args))
    except Exception as e:
        return ToolResult(success=False, error=str(e))
```
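
The wrapping and containment snippets above rely on names they do not define. The following is a minimal self-contained sketch of both patterns. `ToolExecutionError`, `ToolResult`, and `FlakyTool` are hypothetical stand-ins written for this illustration, not types from any particular framework.

```python
from dataclasses import dataclass
from typing import Any, Optional


class ToolExecutionError(Exception):
    """Hypothetical wrapper that records which tool failed."""

    def __init__(self, tool_name: str, cause: Exception):
        super().__init__(f"{tool_name} failed: {cause}")
        self.tool_name = tool_name


@dataclass
class ToolResult:
    """Hypothetical result envelope used by the containment pattern."""
    success: bool
    output: Any = None
    error: Optional[str] = None


class FlakyTool:
    """Stand-in tool that always fails, for illustration only."""
    name = "flaky"

    def execute(self, args):
        raise ValueError("bad input")


def run_wrapped(tool, args):
    # Exception wrapping: the caller still sees an exception, now with context.
    try:
        return tool.execute(args)
    except Exception as e:
        raise ToolExecutionError(tool.name, e) from e


def run_contained(tool, args):
    # Error containment: the failure becomes data the agent loop can inspect.
    try:
        return ToolResult(success=True, output=tool.execute(args))
    except Exception as e:
        return ToolResult(success=False, error=str(e))


contained = run_contained(FlakyTool(), {})
print(contained.success, contained.error)  # False bad input
```

When reading a framework, the distinguishing signal is the return type: wrapping keeps the failure in the exception channel, while containment moves it into the value the agent loop receives.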

### Propagation Map Template

```
User Input
    ↓
┌─────────────────────────────────────────┐
│ Agent Loop                              │
│   ↓                                     │
│ ┌─────────────────────────────────────┐ │
│ │ LLM Call                            │ │
│ │ • APIError → [Retry 3x / Propagate] │ │
│ │ • RateLimit → [Backoff / Propagate] │ │
│ │ • Timeout → [Retry / Propagate]     │ │
│ └─────────────────────────────────────┘ │
│   ↓                                     │
│ ┌─────────────────────────────────────┐ │
│ │ Output Parsing                      │ │
│ │ • ParseError → [Retry / Contained]  │ │
│ │ • ValidationError → [Contained]     │ │
│ └─────────────────────────────────────┘ │
│   ↓                                     │
│ ┌─────────────────────────────────────┐ │
│ │ Tool Execution                      │ │
│ │ • ToolError → [Feedback to LLM]     │ │
│ │ • Timeout → [Kill / Continue]       │ │
│ │ • SecurityError → [Propagate]       │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
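
One way to record a completed propagation map is as a small policy table that can be checked against the framework's actual behavior. The sketch below assumes hypothetical error classes and an `Action` enum invented for illustration. The rows mirror a few entries from the template, not any specific framework.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    FEEDBACK = "feedback"    # return the error text to the LLM
    PROPAGATE = "propagate"  # let the error end the run


# Hypothetical error types standing in for a framework's own classes.
class RateLimitError(Exception): ...
class ParseError(Exception): ...
class ToolError(Exception): ...
class SecurityError(Exception): ...


# One row per entry in the map; the first matching type wins.
POLICY = [
    (RateLimitError, Action.RETRY),
    (ParseError, Action.FEEDBACK),
    (ToolError, Action.FEEDBACK),
    (SecurityError, Action.PROPAGATE),
]


def classify(error: Exception) -> Action:
    """Look up the handling action; unknown errors propagate by default."""
    for error_type, action in POLICY:
        if isinstance(error, error_type):
            return action
    return Action.PROPAGATE


print(classify(ParseError("unexpected token")))  # Action.FEEDBACK
```

Defaulting unknown errors to `PROPAGATE` is a design choice in this sketch: it keeps unclassified failures visible instead of silently absorbing them.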

## Sandboxing Mechanisms

### Code Execution Isolation

| Mechanism | Safety Level | Performance | Complexity |
|-----------|-------------|-------------|------------|
| None | ⚠️ Dangerous | Fast | None |
| RestrictedPython | Medium | Fast | Low |
| AST Validation | Low | Fast | Medium |
| Subprocess | Medium | Overhead | Low |
| Docker/Container | High | High overhead | Medium |
| gVisor/Firecracker | Very High | Medium overhead | High |

### Detection Patterns

**No Sandboxing**
```python
exec(user_code)  # Direct execution
eval(expression)  # Direct eval
subprocess.run(cmd, shell=True)  # Shell injection risk
```

**Basic Sandboxing**
```python
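import ast

# RestrictedPython: compiles the source under restricted rules
from RestrictedPython import compile_restricted
code = compile_restricted(user_code, '<string>', 'exec')

# AST validation
# (has_dangerous_nodes and SecurityError are defined by the framework under review)
tree = ast.parse(user_code)
if has_dangerous_nodes(tree):
    raise SecurityError()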
```
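
The AST check above calls `has_dangerous_nodes` without showing it. A minimal sketch of what such a helper might look like follows. The blocklist is an assumption chosen for illustration and is far from complete.

```python
import ast

# Assumed blocklist for illustration; a real policy would need to be much stricter.
BLOCKED_CALLS = {"exec", "eval", "compile", "open", "__import__"}


def has_dangerous_nodes(tree: ast.AST) -> bool:
    """Return True if the tree imports modules or calls a blocked builtin by name."""
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return True
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            return True
    return False


print(has_dangerous_nodes(ast.parse("import os")))  # True
print(has_dangerous_nodes(ast.parse("x = 1 + 2")))  # False
```

Name-based checks like this only catch direct calls. Code that reaches the same builtins through attribute access or aliasing passes unnoticed, which is consistent with the "Low" safety rating for AST validation in the table above.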

**Process Isolation**
```python
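import subprocess

# Subprocess with limits
result = subprocess.run(
    ['python', '-c', user_code],
    timeout=30,  # Child is killed when the timeout expires
    capture_output=True,
    user='nobody'  # Drop privileges (POSIX only, Python 3.9+; parent must be allowed to switch users)
)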
```
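
A subprocess call only helps resilience if the parent also handles the timeout and a non-zero exit. The sketch below shows one way to turn both into data. `run_isolated` and its result dictionary are hypothetical names for this illustration, and privilege dropping is omitted so the example runs on any platform.

```python
import subprocess
import sys


def run_isolated(user_code: str, timeout: float = 30.0) -> dict:
    """Run code in a child interpreter and report the outcome as data."""
    try:
        proc = subprocess.run(
            # -I: isolated mode, ignores environment variables and user site-packages
            [sys.executable, "-I", "-c", user_code],
            timeout=timeout,
            capture_output=True,
            text=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return {"ok": False, "error": "timeout"}
    if proc.returncode != 0:
        return {"ok": False, "error": proc.stderr.strip()}
    return {"ok": True, "output": proc.stdout.strip()}


print(run_isolated("print(2 + 2)"))
```

A separate process contains crashes and runaway loops, but on its own it does not restrict file system or network access.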

**Container Isolation**
```python
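import docker
client = docker.from_env()
# With the default detach=False, run() blocks and returns the container's logs
output = client.containers.run(
    'python:3.11-slim',
    command=['python', '-c', user_code],
    mem_limit='256m',
    network_disabled=True,
    remove=True  # Delete the container once it exits
)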
```

## Recovery Patterns

### Retry Logic

```python
# Simple retry (`retry` and `exponential` stand for whatever decorator the framework uses)
@retry(max_attempts=3, backoff=exponential)
def call_llm(self, prompt):
    return self.client.generate(prompt)

# Retry with error feedback
def call_with_retry(self, prompt, max_retries=3):
    errors = []
    attempt_prompt = prompt
    for _ in range(max_retries):
        try:
            return self.llm.generate(attempt_prompt)
        except ParseError as e:
            errors.append(str(e))
            # Rebuild from the original prompt so earlier errors are not duplicated
            attempt_prompt = f"{prompt}\n\nPrevious errors: {errors}"
    raise MaxRetriesExceeded(errors)
```
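
The `@retry(...)` decorator above is shown without a definition. A minimal sketch of a decorator with exponential backoff follows. The parameter names (`base_delay`, `exceptions`, `sleep`) are assumptions for this illustration and do not match any specific retry library.

```python
import functools
import time


def retry(max_attempts=3, base_delay=0.5, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call, doubling the delay after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # Out of attempts: let the last error propagate
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


# Usage: record delays instead of sleeping so the example runs instantly.
delays = []
attempts = {"count": 0}


@retry(max_attempts=3, base_delay=1.0, exceptions=(ConnectionError,), sleep=delays.append)
def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = flaky_call()
print(result, delays)  # ok [1.0, 2.0]
```

Restricting `exceptions` to transient error types matters when assessing a framework: retrying on every exception can mask bugs and repeat non-idempotent calls.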

### Fallback Mechanisms

```python
def generate(self, prompt):
    try:
        return self.primary_llm.generate(prompt)
    except APIError:
        return self.fallback_llm.generate(prompt)
```
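
In the two-model fallback above, a failure in the fallback model surfaces without any record of the primary failure. The sketch below generalizes the pattern to an ordered list of providers and keeps the error trail. `APIError`, `generate_with_fallback`, and the two provider functions are hypothetical names for this illustration.

```python
class APIError(Exception):
    """Hypothetical provider error, standing in for a client library's own."""


def generate_with_fallback(providers, prompt):
    """Try each (name, callable) provider in order; raise only if all fail."""
    errors = []
    for name, generate in providers:
        try:
            return name, generate(prompt)
        except APIError as e:
            errors.append(f"{name}: {e}")  # Keep the trail for diagnostics
    raise APIError("all providers failed: " + "; ".join(errors))


def primary(prompt):
    raise APIError("503")


def secondary(prompt):
    return prompt.upper()


print(generate_with_fallback([("primary", primary), ("secondary", secondary)], "hi"))
# ('secondary', 'HI')
```

Returning the provider name alongside the output lets callers log how often the fallback path is taken.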

### Circuit Breaker

```python
import time

class CircuitOpen(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = 'closed'
        self.last_failure = None

    def call(self, func, *args):
        if self.state == 'open':
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = 'half-open'  # Allow one trial call through
            else:
                raise CircuitOpen()

        try:
            result = func(*args)
            self.failures = 0
            self.state = 'closed'
            return result
        except Exception:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.failure_threshold:
                self.state = 'open'  # Also re-opens after a failed trial call
            raise
```

## Output Template

```markdown
## Resilience Analysis: [Framework Name]

### Error Propagation Map

| Error Source | Error Type | Handling | Propagates? |
|--------------|-----------|----------|-------------|
| LLM API | RateLimitError | Retry 3x with backoff | No |
| LLM API | APIError | Retry 1x | Yes |
| Parser | ParseError | Feed back to LLM | No |
| Tool | Exception | Wrap and feed to LLM | No |
| Tool | Timeout | Kill process | No |
| State | ValidationError | Propagate | Yes |

### Sandboxing Assessment
- **Code Execution**: [Mechanism or None]
- **File System**: [Isolated/Restricted/Open]
- **Network**: [Blocked/Filtered/Open]
- **Resource Limits**: [Memory/CPU/Time limits]

### Recovery Mechanisms

| Pattern | Implementation | Location |
|---------|---------------|----------|
| Retry | Exponential backoff, 3 attempts | llm.py:L45 |
| Fallback | Secondary model | agent.py:L120 |
| Circuit Breaker | None | - |

### Risk Assessment
- **Critical Gaps**: [List any missing protections]
- **Production Ready**: [Yes/No/Needs work]
```

## Integration

- **Prerequisite**: `codebase-mapping` to identify execution code
- **Feeds into**: `antipattern-catalog` for error handling issues
- **Related**: `execution-engine-analysis` for async error handling
