chaos-engineering
Failure injection patterns, blast radius control, steady state hypothesis, and gameday planning for resilience testing.
Best use case
chaos-engineering is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using chaos-engineering should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/chaos-engineering/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How chaos-engineering Compares
| Feature / Agent | chaos-engineering | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
# Chaos Engineering
Systematic resilience testing to discover weaknesses before they cause outages.
## Steady State Hypothesis
```yaml
# Define BEFORE injecting chaos - what "normal" looks like
steady_state_hypothesis:
  title: "API serves traffic within SLO"
  probes:
    - name: "API response time p95 < 500ms"
      type: http
      url: "https://api.example.com/health"
      threshold: 500
    - name: "Error rate < 1%"
      type: prometheus
      # sum() on both sides so the ratio is computed across all status labels
      query: "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"
      threshold: 0.01
    - name: "Order processing queue depth < 100"
      type: cloudwatch
      metric: "ApproximateNumberOfMessagesVisible"
      threshold: 100
    - name: "Database connections < 80% capacity"
      type: prometheus
      query: "pg_stat_activity_count / pg_settings_max_connections"
      threshold: 0.8
```
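Each probe above reduces to the same check: read a number and compare it with a threshold. A minimal Python sketch of that evaluation follows, assuming each probe is wrapped in a callable that returns the current value. The `Probe` class and `violations` function are hypothetical names introduced here, not part of any chaos tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Probe:
    """One steady-state measurement (hypothetical helper, not a tool API)."""
    name: str
    threshold: float
    read: Callable[[], float]  # returns the current metric value


def violations(probes: List[Probe]) -> List[str]:
    """Return the names of probes whose reading is at or above threshold."""
    failed = []
    for probe in probes:
        value = probe.read()
        # Every probe in the hypothesis is a strict "less than" bound.
        if not value < probe.threshold:
            failed.append(probe.name)
    return failed


# Stubbed readings stand in for real HTTP/Prometheus/CloudWatch queries.
probes = [
    Probe("API response time p95 < 500ms", 500, lambda: 320.0),
    Probe("Error rate < 1%", 0.01, lambda: 0.02),
]
print(violations(probes))
```

An empty result means the steady state holds and injection may proceed; any entry means the system is already outside its normal range.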
## Failure Injection Patterns
Using Chaos Toolkit (chaostoolkit.org), `experiment.json`:
```json
{
  "title": "Database failover resilience",
  "description": "Verify app handles primary DB failover gracefully",
  "steady-state-hypothesis": {
    "title": "API responds normally",
    "probes": [
      {
        "name": "api-health",
        "type": "probe",
        "provider": {
          "type": "http",
          "url": "https://api.example.com/health",
          "timeout": 5
        },
        "tolerance": 200
      }
    ]
  },
  "method": [
    {
      "name": "failover-primary-db",
      "type": "action",
      "provider": {
        "type": "python",
        "module": "chaosaws.rds.actions",
        "func": "failover_db_cluster",
        "arguments": {
          "db_cluster_identifier": "prod-cluster"
        }
      },
      "pauses": {"after": 60}
    }
  ],
  "rollbacks": [
    {
      "name": "verify-db-recovered",
      "type": "probe",
      "provider": {
        "type": "python",
        "module": "chaosaws.rds.probes",
        "func": "cluster_status",
        "arguments": {
          "cluster_id": "prod-cluster"
        }
      },
      "tolerance": "available"
    }
  ]
}
```
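The experiment file encodes a fixed order of operations: check the hypothesis, run the method, check the hypothesis again, then run rollbacks. A simplified Python sketch of that flow follows, assuming the three steps are passed in as plain callables. `run_experiment` and its parameters are hypothetical, and this is not Chaos Toolkit's implementation.

```python
from typing import Callable


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Run one experiment and report the outcome as a short string."""
    # 1. Refuse to inject into a system that is already unhealthy.
    if not steady_state():
        return "aborted: steady state not met before injection"
    try:
        # 2. Inject the failure (the "method" section).
        inject()
        # 3. Re-check the hypothesis after the method has run.
        held = steady_state()
    finally:
        # 4. Always attempt rollback, even if injection raised.
        rollback()
    return "passed" if held else "deviated: weakness found"


# Stub callables stand in for real probes and actions.
log = []
outcome = run_experiment(
    steady_state=lambda: True,
    inject=lambda: log.append("inject"),
    rollback=lambda: log.append("rollback"),
)
print(outcome, log)
```

Putting the rollback in `finally` is a design choice in this sketch: a failed injection should still leave the system cleaned up.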
## Blast Radius Control
```python
# ALWAYS limit the impact of chaos experiments
from datetime import datetime

import httpx


class BlastRadiusController:
    """Control and limit chaos experiment impact."""

    def __init__(self, config: dict):
        self.max_affected_percentage = config.get('max_affected_pct', 5)
        self.max_duration_seconds = config.get('max_duration_s', 300)
        self.excluded_services = config.get('excluded', ['auth', 'payments'])
        self.kill_switch_url = config.get('kill_switch_url')

    def can_inject(self, target: str, scope: str) -> bool:
        # Never chaos-test critical services without explicit approval
        if target in self.excluded_services:
            return False
        # Never inject during peak hours
        hour = datetime.now().hour
        if 9 <= hour <= 17:  # Business hours, local time (adjust per timezone)
            return False
        # Never affect more than N% of instances
        if self.get_affected_percentage(target, scope) > self.max_affected_percentage:
            return False
        return True

    def get_affected_percentage(self, target: str, scope: str) -> float:
        total = self.get_total_instances(target)
        affected = self.get_affected_instances(target, scope)
        # Fail closed: an unknown or empty fleet counts as 100% affected
        return (affected / total) * 100 if total > 0 else 100

    def get_total_instances(self, target: str) -> int:
        """Instance count for the target; implement against your platform."""
        raise NotImplementedError

    def get_affected_instances(self, target: str, scope: str) -> int:
        """Instances the scope would hit; implement against your platform."""
        raise NotImplementedError

    async def emergency_stop(self) -> None:
        """Kill switch: immediately halt all chaos experiments."""
        # httpx.post is synchronous; AsyncClient makes the call awaitable
        async with httpx.AsyncClient() as client:
            await client.post(self.kill_switch_url, json={"action": "stop_all"})
```
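The percentage cap has a consequence for small fleets: with the default 5% limit, a service needs at least 20 instances before a single instance may be targeted. A short worked sketch of that arithmetic follows. `max_injectable` is a hypothetical helper, not part of the class above.

```python
import math


def max_injectable(total_instances: int, max_affected_pct: float) -> int:
    """Largest instance count whose share stays within the percentage cap."""
    if total_instances <= 0:
        return 0
    return math.floor(total_instances * max_affected_pct / 100)


for total in (10, 20, 40, 200):
    print(total, max_injectable(total, 5))
```

For fleets below the cutoff, the cap would have to be raised with explicit approval, or the experiment run in a larger staging pool.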
## Common Chaos Experiments
```yaml
# Experiment catalog - start with these
level_1_basic:
  - name: "Kill a single pod"
    tool: "kubectl delete pod <name>"
    validates: "Pod auto-recovery, health checks"
    blast_radius: "1 pod"
  - name: "CPU stress on one node"
    tool: "stress-ng --cpu 4 --timeout 60"
    validates: "Autoscaling, request routing"
    blast_radius: "1 node"
  - name: "Inject 500ms network latency"
    tool: "tc qdisc add dev eth0 root netem delay 500ms"
    validates: "Timeout handling, circuit breakers"
    blast_radius: "1 container"
level_2_intermediate:
  - name: "Kill entire availability zone"
    tool: "Chaos Toolkit / AWS FIS"
    validates: "Multi-AZ failover, data replication"
    blast_radius: "1 AZ"
  - name: "DNS resolution failure"
    tool: "iptables -A OUTPUT -p udp --dport 53 -j DROP"
    validates: "DNS caching, fallback resolution"
    blast_radius: "1 service"
  - name: "Disk fill to 95%"
    tool: "fallocate -l 50G /tmp/disk_fill"
    validates: "Disk space alerts, log rotation"
    blast_radius: "1 node"
level_3_advanced:
  - name: "Split brain network partition"
    tool: "Toxiproxy / Linux iptables"
    validates: "Consensus protocols, data consistency"
    blast_radius: "Cluster segment"
  - name: "Clock skew injection"
    tool: "timedatectl set-time +5min"
    validates: "Certificate validation, token expiry"
    blast_radius: "1 node"
```
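Every injection in the catalog needs a matching revert that runs even if the experiment fails midway. A minimal sketch for the network latency entry follows, assuming a Linux host with `tc` and root privileges. The `netem_delay` context manager and its injectable `run` parameter are hypothetical names introduced for illustration.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List


def _run(cmd: List[str]) -> None:
    """Default runner: execute the command and raise on non-zero exit."""
    subprocess.run(cmd, check=True)


@contextmanager
def netem_delay(
    device: str = "eth0",
    delay_ms: int = 500,
    run: Callable[[List[str]], None] = _run,
) -> Iterator[None]:
    """Add netem latency on entry and remove it on exit."""
    run(["tc", "qdisc", "add", "dev", device, "root", "netem",
         "delay", f"{delay_ms}ms"])
    try:
        yield
    finally:
        # Revert runs even when the body raises.
        run(["tc", "qdisc", "del", "dev", device, "root", "netem"])


# Dry run: record the commands instead of executing them.
issued: List[List[str]] = []
with netem_delay(run=issued.append):
    pass  # observe the system here
for cmd in issued:
    print(" ".join(cmd))
```

Passing the runner in as a parameter keeps the sketch testable without touching a real network interface.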
## Gameday Checklist
```markdown
## Pre-Gameday (1 week before)
- [ ] Define steady state hypothesis with measurable probes
- [ ] Identify blast radius and set hard limits
- [ ] Ensure kill switch is tested and accessible
- [ ] Notify on-call team and stakeholders
- [ ] Verify rollback procedures are documented and tested
- [ ] Set up monitoring dashboards for the experiment
- [ ] Run experiment in staging first
## During Gameday
- [ ] Verify steady state BEFORE injecting chaos
- [ ] Start with smallest blast radius, escalate gradually
- [ ] Monitor dashboards continuously during experiment
- [ ] Document observations in real-time (shared doc)
- [ ] If SLO violated: trigger kill switch immediately
- [ ] Time-box each experiment (max 5 minutes per injection)
## Post-Gameday
- [ ] Verify system returned to steady state
- [ ] Document findings: what broke, what recovered, what surprised
- [ ] Create action items for discovered weaknesses
- [ ] Update runbooks based on learnings
- [ ] Share results with broader engineering team
- [ ] Schedule fixes and re-test
```
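The time-box and kill-switch items can be enforced in code rather than by watching a clock. A minimal sketch follows, assuming an `slo_ok` callable that reports SLO health and a `kill_switch` callable that halts the injection. Both are hypothetical, as are the `clock` and `sleep` parameters used to make the loop testable.

```python
import time
from typing import Callable


def observe_injection(
    slo_ok: Callable[[], bool],
    kill_switch: Callable[[], None],
    max_seconds: float = 300.0,   # 5-minute time-box from the checklist
    poll_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the SLO until the time-box expires; stop early on violation."""
    deadline = clock() + max_seconds
    while clock() < deadline:
        if not slo_ok():
            kill_switch()
            return "stopped: SLO violated"
        sleep(poll_seconds)
    # Time-box reached: halt the injection regardless of health.
    kill_switch()
    return "stopped: time-box reached"


# Simulated clock so the example finishes instantly.
now = [0.0]
calls = []
result = observe_injection(
    slo_ok=lambda: now[0] < 20,   # SLO breaks after 20 simulated seconds
    kill_switch=lambda: calls.append("kill"),
    clock=lambda: now[0],
    sleep=lambda s: now.__setitem__(0, now[0] + s),
)
print(result, calls)
```

In this sketch the kill switch fires on both exit paths, so an injection never outlives its time-box.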
## Checklist
- [ ] Define steady state hypothesis before every experiment
- [ ] Never run chaos in production without a tested kill switch
- [ ] Start in staging, graduate to production with reduced blast radius
- [ ] Exclude critical services (auth, payments) unless specifically targeting them
- [ ] Time-box experiments (max 5 minutes injection, 30 minutes observation)
- [ ] Run during low-traffic windows, never during peak
- [ ] Document every experiment: hypothesis, method, observations, findings
- [ ] Automate recurring experiments in CI/CD pipeline
## Anti-Patterns
- Chaos without hypothesis: random breaking is not engineering
- No kill switch: unable to stop experiment when things go wrong
- Running in production first: always validate in staging
- Affecting too many instances: never exceed 5% without explicit approval
- Chaos during incidents: only inject chaos on healthy systems
- Not fixing findings: experiments without follow-up action items are wasted