chaos-engineering-fundamentals

Use when implementing chaos engineering, designing fault injection experiments, or building resilience testing practices. Covers chaos principles and experiment design.

16 stars

bydiegosouzapw

View on GitHub Installation ↓

Best use case

chaos-engineering-fundamentals is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Use when implementing chaos engineering, designing fault injection experiments, or building resilience testing practices. Covers chaos principles and experiment design.

Teams using chaos-engineering-fundamentals should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/chaos-engineering-fundamentals/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/design/chaos-engineering-fundamentals/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/chaos-engineering-fundamentals/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How chaos-engineering-fundamentals Compares

Feature / Agent	chaos-engineering-fundamentals	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Use when implementing chaos engineering, designing fault injection experiments, or building resilience testing practices. Covers chaos principles and experiment design.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Chaos Engineering Fundamentals

Principles and practices for chaos engineering - proactively discovering system weaknesses through controlled experiments.

## When to Use This Skill

- Implementing chaos engineering practices
- Designing fault injection experiments
- Building confidence in system resilience
- Discovering hidden failure modes
- Validating disaster recovery

## What is Chaos Engineering?

```text
Chaos Engineering = Proactive resilience testing

Traditional testing: "Does it work when everything is right?"
Chaos engineering: "Does it work when things go wrong?"

Principle:
Build confidence in the system's ability to withstand
turbulent conditions in production.

Not about breaking things randomly.
About controlled experiments to learn.
```

## The Chaos Engineering Loop

```text
┌─────────────────────────────────────────────────────────┐
│                CHAOS ENGINEERING LOOP                    │
│                                                          │
│   ┌─────────┐     ┌─────────┐     ┌─────────┐          │
│   │ Define  │────►│ Inject  │────►│ Observe │          │
│   │ Steady  │     │ Chaos   │     │ Results │          │
│   │ State   │     │         │     │         │          │
│   └─────────┘     └─────────┘     └────┬────┘          │
│        ▲                               │                │
│        │                               │                │
│        │          ┌─────────┐          │                │
│        └──────────│ Improve │◄─────────┘                │
│                   │ System  │                           │
│                   └─────────┘                           │
└─────────────────────────────────────────────────────────┘
```

## Core Principles

### 1. Build Hypothesis Around Steady State

```text
Steady State = Normal system behavior

Define measurable indicators:
- Request success rate: 99.9%
- Latency p99: < 200ms
- Orders processed/minute: > 100
- User sessions active: > 10,000

Hypothesis format:
"When [fault condition] occurs, the system will
maintain [steady state metrics] within [acceptable bounds]"

Example:
"When one database replica fails, request success rate
will remain above 99.5% and latency below 500ms"
```

### 2. Vary Real-World Events

```text
Inject realistic failures:

Infrastructure:
- Server crash
- Network partition
- Disk full
- CPU exhaustion
- Clock skew

Application:
- Service unavailable
- Slow responses
- Corrupted data
- Certificate expiry
- Resource exhaustion

Dependencies:
- Database failure
- Cache unavailable
- Third-party API down
- Message queue backup
```

### 3. Run Experiments in Production

```text
Why production?
- Real traffic patterns
- Real infrastructure
- Real dependencies
- Real monitoring

Start safe:
1. Begin in non-production
2. Graduate to canary
3. Progress to production
4. Expand blast radius gradually

Safety nets:
- Kill switch ready
- Rollback plan
- Limited blast radius
- Monitoring in place
```

### 4. Automate Experiments to Run Continuously

```text
One-time experiments find one-time bugs.
Continuous experiments catch regressions.

Automation goals:
- Run experiments regularly
- Integrate with CI/CD
- Catch new failure modes
- Validate changes

Example schedule:
- Critical paths: Daily
- Core services: Weekly
- Full system: Monthly
```

### 5. Minimize Blast Radius

```text
Control experiment impact:

Scope limitations:
- Single instance
- Percentage of traffic
- Specific region
- Test accounts only

Duration limits:
- Seconds to minutes
- Automatic termination
- Scheduled windows

Abort conditions:
- Error rate exceeds threshold
- Customer impact detected
- Manual kill switch
```

## Experiment Design

### Experiment Structure

```text
Experiment: [Name]
Date: [When]
Team: [Who]

## Hypothesis
When [fault is injected], the system will [expected behavior]
because [reasoning].

## Steady State Metrics
- [Metric 1]: [Expected value]
- [Metric 2]: [Expected value]

## Experiment Details
Fault Type: [What we're injecting]
Target: [Where we're injecting]
Magnitude: [How severe]
Duration: [How long]

## Blast Radius
- Affected services: [List]
- Affected users: [Percentage/count]
- Region/zone: [Scope]

## Abort Conditions
- [Condition 1] → Abort
- [Condition 2] → Abort

## Rollback Plan
1. [Step 1]
2. [Step 2]

## Results
Hypothesis: [Confirmed/Falsified]
Observations: [What we saw]
Action Items: [What to fix]
```

### Common Experiment Types

```text
1. Service Failure
   └── Kill instances, return errors

2. Network Failures
   └── Latency injection, packet loss, partitions

3. Resource Exhaustion
   └── CPU stress, memory pressure, disk full

4. Dependency Failures
   └── Database down, cache miss, API timeout

5. State Corruption
   └── Clock skew, data inconsistency

6. Traffic Surge
   └── Sudden load increase
```

## Fault Injection Patterns

### Infrastructure Faults

```text
Instance termination:
- Kill random instances
- Verify auto-scaling/recovery
- Netflix Chaos Monkey style

Zone/region failure:
- Simulate full zone outage
- Test failover to other zones
- Verify data consistency

Network partition:
- Split brain scenarios
- Cross-region communication failure
- Consensus algorithm behavior
```

### Application Faults

```text
Latency injection:
- Add artificial delay
- Test timeout handling
- Verify circuit breakers

Error injection:
- Return 500 errors
- Throw exceptions
- Test error handling paths

Resource leaks:
- Memory leaks
- Connection pool exhaustion
- File handle exhaustion
```

### Dependency Faults

```text
Database failures:
- Primary failover
- Replica lag
- Connection pool exhaustion

Cache failures:
- Cache miss scenarios
- Cache cluster failure
- Stampede protection

External API failures:
- Timeout
- Rate limiting
- Malformed responses
```

## Chaos Engineering Tools

### Open Source Tools

```text
Chaos Monkey (Netflix)
- Random instance termination
- AWS focused
- Part of Simian Army

Gremlin
- Comprehensive chaos platform
- Multiple attack types
- Enterprise features

Litmus
- Kubernetes native
- ChaosHub experiment library
- GitOps friendly

Chaos Mesh
- Kubernetes native
- Various fault types
- Dashboard included

Pumba
- Docker chaos testing
- Container-level faults
- CI/CD integration
```

### Cloud Provider Tools

```text
AWS:
- Fault Injection Simulator
- Native integration

Azure:
- Chaos Studio
- Azure-native experiments

GCP:
- No native tool (use Gremlin/Litmus)
```

## Implementation Strategy

### Maturity Model

```text
Level 0: Ad-hoc
- Manual testing
- No chaos practice
- Reactive to failures

Level 1: Beginning
- First experiments
- Non-production only
- Manual execution

Level 2: Intermediate
- Regular experiments
- Production experiments
- Some automation

Level 3: Advanced
- Continuous chaos
- Automated experiments
- Broad coverage

Level 4: Expert
- Chaos as code
- Integrated in CI/CD
- GameDays regular
```

### Getting Started

```text
Week 1-2: Foundation
- Identify critical paths
- Define steady state metrics
- Set up monitoring

Week 3-4: First Experiments
- Start with known failures
- Run in non-production
- Document learnings

Month 2: Expand
- Add more experiment types
- Move to production (carefully)
- Automate basic experiments

Month 3+: Mature
- Regular GameDays
- Continuous experiments
- Integrate with CI/CD
```

## GameDays

### What is a GameDay?

```text
GameDay = Planned chaos exercise

Like a fire drill for systems:
- Scheduled in advance
- Multiple failure scenarios
- Practice incident response
- Learn and improve
```

### GameDay Structure

```text
Before:
- Define objectives
- Plan scenarios
- Notify stakeholders
- Prepare runbooks
- Set up monitoring

During:
- Run scenarios
- Observe system behavior
- Practice incident response
- Document findings

After:
- Debrief meeting
- Document learnings
- Create action items
- Plan next GameDay
```

### GameDay Scenarios

```text
Scenario categories:

1. Infrastructure
   - Region failure
   - Network partition
   - Scaling limits

2. Application
   - Service outage
   - Deployment failure
   - Configuration error

3. Data
   - Database corruption
   - Backup restoration
   - Data center switch

4. Security
   - Credential rotation
   - Certificate expiry
   - Access revocation
```

## Safety and Guardrails

### Experiment Safety

```text
Before running chaos:

1. Have a hypothesis
   Know what you're testing

2. Limit blast radius
   Start small, expand gradually

3. Have abort conditions
   Automatic stops

4. Have rollback plan
   Know how to undo

5. Monitor everything
   Can't learn what you can't see

6. Communicate
   Team knows experiment is running
```

### Kill Switch

```text
Every experiment needs:

Manual kill switch:
- Instant termination
- Accessible to multiple people
- Tested before experiment

Automatic abort:
- Error rate threshold
- Latency threshold
- Customer impact detection

Notification:
- Alert when abort triggered
- Log reason for abort
```

## Measuring Success

```text
Chaos engineering success metrics:

1. Experiments run
   Are we doing chaos regularly?

2. Issues discovered
   Are we finding problems?

3. MTTR improvement
   Are we recovering faster?

4. Incident prevention
   Did chaos prevent production incidents?

5. Confidence level
   Does team trust the system more?
```

## Best Practices

```text
1. Start small
   Begin with simple experiments

2. Hypothesis first
   Know what you're testing

3. Automate gradually
   Manual first, then automate

4. Production eventually
   That's where real chaos lives

5. Blameless culture
   Findings are learnings, not failures

6. Regular GameDays
   Practice makes prepared

7. Share learnings
   Spread knowledge across teams
```

## Related Skills

- `resilience-patterns` - Building resilient systems
- `gameday-planning` - Detailed GameDay planning
- `incident-response` - Handling discovered issues

Related Skills

Chaos Engineering

from diegosouzapw/awesome-omni-skill

Design and execute controlled failure experiments to validate system resilience

Prompt Engineering Skill

from diegosouzapw/awesome-omni-skill

Craft effective prompts that get the best results from language models.

prompt-engineering-openai-api-f7c24501

from diegosouzapw/awesome-omni-skill

geo-fundamentals

from diegosouzapw/awesome-omni-skill

Tối ưu hóa công cụ tìm kiếm chuẩn AI (GEO) và chiến lược hiển thị tổng.

data-engineering-data-pipeline

from diegosouzapw/awesome-omni-skill

You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.

context-fundamentals

from diegosouzapw/awesome-omni-skill

Understand the components, mechanics, and constraints of context in agent systems. Use when designing agent architectures, debugging context-related failures, or optimizing context usage.

context-engineering

from diegosouzapw/awesome-omni-skill

Use when designing agent system prompts, optimizing RAG retrieval, or when context is too expensive or slow. Reduces tokens while maintaining quality through strategic positioning and attention-aware design.

Build Your Data Engineering Skill

from diegosouzapw/awesome-omni-skill

Create your LLMOps data engineering skill in one prompt, then learn to improve it throughout the chapter

ai-engineering-skill

from diegosouzapw/awesome-omni-skill

Practical guide for building production ML systems based on Chip Huyen's AI Engineering book. Use when users ask about model evaluation, deployment strategies, monitoring, data pipelines, feature engineering, cost optimization, or MLOps. Covers metrics, A/B testing, serving patterns, drift detection, and production best practices.

ai-data-engineering

from diegosouzapw/awesome-omni-skill

Data pipelines, feature stores, and embedding generation for AI/ML systems. Use when building RAG pipelines, ML feature serving, or data transformations. Covers feature stores (Feast, Tecton), embedding pipelines, chunking strategies, orchestration (Dagster, Prefect, Airflow), dbt transformations, data versioning (LakeFS), and experiment tracking (MLflow, W&B).

seo-fundamentals

from diegosouzapw/awesome-omni-skill

Core principles of SEO including E-E-A-T, Core Web Vitals, technical foundations, content quality, and how modern search engines evaluate pages.

Data Engineering Data Driven Feature

from diegosouzapw/awesome-omni-skill

World-class data science skill for statistical modeling, experimentation, causal inference, and advanced analytics. Expertise in Python (NumPy, Pandas, Scikit-learn), R, SQL, statistical methods, A/B testing, time series, and business intelligence. Includes experiment design, feature engineering, model evaluation, and stakeholder communication.