qe-chaos-resilience
Injects controlled faults (network partition, latency, process kill, disk pressure) into distributed systems and validates recovery behavior. Use when testing circuit breakers, failover paths, retry logic, or building confidence in system resilience through chaos engineering.
Best use case
qe-chaos-resilience is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using qe-chaos-resilience should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/qe-chaos-resilience/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How qe-chaos-resilience Compares
| Feature / Agent | qe-chaos-resilience | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Injects controlled faults (network partition, latency, process kill, disk pressure) into distributed systems and validates recovery behavior. Use when testing circuit breakers, failover paths, retry logic, or building confidence in system resilience through chaos engineering.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# QE Chaos Resilience
## Purpose
Guide the use of v3's chaos engineering capabilities including controlled fault injection, load/stress testing, resilience validation, and disaster recovery testing.
## Activation
- When testing system resilience
- When performing chaos experiments
- When load/stress testing
- When validating disaster recovery
- When testing circuit breakers
## Quick Start
```bash
# Run chaos experiment
aqe chaos run --experiment network-latency --target api-service
# Load test
aqe chaos load --scenario peak-traffic --duration 30m
# Stress test to breaking point
aqe chaos stress --endpoint /api/users --max-users 10000
# Test circuit breaker
aqe chaos circuit-breaker --service payment-service
```
## Agent Workflow
```typescript
// Chaos experiment
Task("Run chaos experiment", `
Execute controlled chaos on api-service:
- Inject 500ms network latency
- Monitor service health metrics
- Verify circuit breaker activation
- Measure recovery time
- Document findings
`, "qe-chaos-engineer")
// Load testing
Task("Performance load test", `
Run load test simulating Black Friday traffic:
- Ramp up to 10,000 concurrent users
- Maintain load for 30 minutes
- Monitor response times and error rates
- Identify bottlenecks
- Compare against SLAs
`, "qe-load-tester")
```
## Chaos Experiments
### 1. Fault Injection
```typescript
await chaosEngineer.injectFault({
  target: 'api-service',
  fault: {
    type: 'latency',
    parameters: {
      delay: '500ms',
      jitter: '100ms',
      percentage: 50
    }
  },
  duration: '5m',
  monitoring: {
    metrics: ['response_time', 'error_rate', 'throughput'],
    alerts: true
  },
  rollback: {
    automatic: true,
    trigger: 'error_rate > 10%'
  }
});
```
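The `rollback.trigger` string above (`'error_rate > 10%'`) implies a threshold comparison against a live metric. The following is a minimal sketch of how such an expression might be parsed and evaluated, assuming a simple `metric comparator number[unit]` grammar. `parseThreshold` and `isTriggered` are hypothetical helpers for illustration and are not part of the aqe API.

```typescript
// Hypothetical helpers for illustration; not part of the aqe API.
type Comparator = '>' | '>=' | '<' | '<=';

interface Threshold {
  metric: string;
  comparator: Comparator;
  value: number;
  unit: string; // e.g. '%', 'ms', 's', or '' when no unit is given
}

// Parses expressions such as 'error_rate > 10%' or 'p99_latency > 10s'.
function parseThreshold(expression: string): Threshold {
  const pattern = /^\s*([a-z_][a-z0-9_]*)\s*(>=|<=|>|<)\s*(\d+(?:\.\d+)?)\s*([a-z%]*)\s*$/i;
  const match = pattern.exec(expression);
  if (!match) {
    throw new Error(`Unsupported threshold expression: ${expression}`);
  }
  const [, metric, comparator, value, unit] = match;
  return { metric, comparator: comparator as Comparator, value: Number(value), unit };
}

// Returns true when the observed value (expressed in the threshold's unit)
// satisfies the comparison, i.e. when the rollback should fire.
function isTriggered(threshold: Threshold, observed: number): boolean {
  switch (threshold.comparator) {
    case '>': return observed > threshold.value;
    case '>=': return observed >= threshold.value;
    case '<': return observed < threshold.value;
    case '<=': return observed <= threshold.value;
  }
}
```

Under these assumptions, an observed error rate of 12 (percent) would fire the `'error_rate > 10%'` trigger, while exactly 10 would not, because the comparison is strict.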
### 2. Load Testing
```typescript
await loadTester.execute({
  scenario: 'peak-traffic',
  profile: {
    rampUp: '5m',
    steadyState: '30m',
    rampDown: '5m'
  },
  users: {
    initial: 100,
    target: 5000,
    pattern: 'linear'
  },
  assertions: {
    p95_latency: '<500ms',
    error_rate: '<1%',
    throughput: '>1000rps'
  }
});
```
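To make the `profile` and `users` settings above concrete, here is a rough sketch of how a `'linear'` pattern could map elapsed time to a concurrent-user count. The `RampProfile` shape and the `usersAt` function are assumptions made for illustration. So is the choice to ramp down linearly to zero users, which the configuration does not specify.

```typescript
// Hypothetical sketch; the real load tester may schedule users differently.
interface RampProfile {
  initialUsers: number;
  targetUsers: number;
  rampUpSeconds: number;
  steadyStateSeconds: number;
  rampDownSeconds: number;
}

// Returns the intended number of concurrent users at a given elapsed time.
function usersAt(profile: RampProfile, elapsedSeconds: number): number {
  const { initialUsers, targetUsers, rampUpSeconds, steadyStateSeconds, rampDownSeconds } = profile;
  const steadyEnd = rampUpSeconds + steadyStateSeconds;
  const totalSeconds = steadyEnd + rampDownSeconds;

  // Outside the test window there is no load.
  if (elapsedSeconds < 0 || elapsedSeconds > totalSeconds) return 0;

  // Ramp up: interpolate linearly from the initial to the target user count.
  if (elapsedSeconds < rampUpSeconds) {
    const fraction = elapsedSeconds / rampUpSeconds;
    return Math.round(initialUsers + (targetUsers - initialUsers) * fraction);
  }

  // Steady state: hold the target.
  if (elapsedSeconds <= steadyEnd) return targetUsers;

  // Ramp down: assumed to fall linearly from the target to zero.
  const remainingFraction = (totalSeconds - elapsedSeconds) / rampDownSeconds;
  return Math.round(targetUsers * remainingFraction);
}

// The 'peak-traffic' values from the configuration above, converted to seconds.
const peakTraffic: RampProfile = {
  initialUsers: 100,
  targetUsers: 5000,
  rampUpSeconds: 300,
  steadyStateSeconds: 1800,
  rampDownSeconds: 300,
};
```

With this profile, the sketch starts at 100 users, reaches 2,550 halfway through the ramp-up, and holds 5,000 from minute 5 to minute 35.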
### 3. Stress Testing
```typescript
await loadTester.stressTest({
  endpoint: '/api/checkout',
  strategy: 'step-increase',
  steps: [100, 500, 1000, 2000, 5000],
  stepDuration: '5m',
  findBreakingPoint: true,
  monitoring: {
    resourceUtilization: true,
    databaseConnections: true,
    memoryUsage: true
  }
});
```
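One way to read `findBreakingPoint` with a `'step-increase'` strategy is: report the lowest load step at which a health limit is exceeded. The sketch below illustrates that reading. The `StepResult` and `Limits` shapes, and the choice of error rate and p95 latency as the deciding metrics, are assumptions and do not come from the aqe implementation.

```typescript
// Hypothetical sketch; not the aqe implementation.
interface StepResult {
  users: number;
  errorRatePercent: number;
  p95LatencyMs: number;
}

interface Limits {
  maxErrorRatePercent: number;
  maxP95LatencyMs: number;
}

// Returns the lowest-load step whose metrics exceed a limit,
// or undefined when every step stays within the limits.
function findBreakingPoint(results: StepResult[], limits: Limits): StepResult | undefined {
  // Sort a copy by load so the input order does not matter.
  const ordered = [...results].sort((a, b) => a.users - b.users);
  return ordered.find(
    (step) =>
      step.errorRatePercent > limits.maxErrorRatePercent ||
      step.p95LatencyMs > limits.maxP95LatencyMs,
  );
}
```

An `undefined` result means the service held up through the highest step, which suggests extending `steps` rather than treating the last step as the limit.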
### 4. Resilience Validation
```typescript
await resilienceTester.validate({
  scenarios: [
    'database-failover',
    'cache-failure',
    'external-service-timeout',
    'pod-termination'
  ],
  expectations: {
    gracefulDegradation: true,
    automaticRecovery: true,
    dataIntegrity: true,
    recoveryTime: '<30s'
  }
});
```
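The `recoveryTime: '<30s'` expectation needs a definition of when recovery starts and ends. A minimal sketch, assuming recovery time is measured from the first unhealthy health-check sample to the first healthy sample after it; `HealthSample`, `recoveryTimeMs`, and `meetsRecoveryTarget` are hypothetical names used only for illustration.

```typescript
// Hypothetical sketch of one possible recovery-time definition.
interface HealthSample {
  timestampMs: number;
  healthy: boolean;
}

// Time from the first unhealthy sample to the first healthy sample after it.
// Returns undefined if the service never degraded or never recovered.
function recoveryTimeMs(samples: HealthSample[]): number | undefined {
  const ordered = [...samples].sort((a, b) => a.timestampMs - b.timestampMs);
  const firstDown = ordered.find((s) => !s.healthy);
  if (!firstDown) return undefined;
  const recovered = ordered.find((s) => s.healthy && s.timestampMs > firstDown.timestampMs);
  if (!recovered) return undefined;
  return recovered.timestampMs - firstDown.timestampMs;
}

// True only when a recovery was observed and it was faster than the target.
function meetsRecoveryTarget(samples: HealthSample[], targetMs: number): boolean {
  const elapsed = recoveryTimeMs(samples);
  return elapsed !== undefined && elapsed < targetMs;
}
```

Under this definition the measured value is only as precise as the sampling interval, so a 30-second target calls for health checks spaced well under 30 seconds apart.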
## Fault Types
| Fault | Description | Use Case |
|-------|-------------|----------|
| Latency | Add network delay | Test timeouts |
| Packet Loss | Drop network packets | Test retry logic |
| CPU Stress | Consume CPU | Test resource limits |
| Memory Pressure | Consume memory | Test OOM handling |
| Disk Full | Fill disk space | Test disk errors |
| Process Kill | Terminate process | Test recovery |
## Chaos Report
```typescript
interface ChaosReport {
  experiment: {
    name: string;
    target: string;
    fault: FaultConfig;
    duration: number;
  };
  results: {
    hypothesis: string;
    validated: boolean;
    metrics: {
      before: MetricSnapshot;
      during: MetricSnapshot;
      after: MetricSnapshot;
    };
    events: ChaosEvent[];
    recovery: {
      detected: boolean;
      time: number;
      automatic: boolean;
    };
  };
  findings: {
    severity: 'critical' | 'high' | 'medium' | 'low';
    description: string;
    recommendation: string;
  }[];
  artifacts: {
    logs: string;
    metrics: string;
    traces: string;
  };
}
```
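A report's `findings` array lends itself to a simple gate, for example failing a pipeline when any finding is at or above a chosen severity. The sketch below shows one way to do that using only the `findings` shape from the interface above. `highestSeverity` and `shouldFailPipeline` are hypothetical helpers, and the default threshold of `'high'` is an arbitrary choice for illustration.

```typescript
// Hypothetical helpers operating on the `findings` shape of ChaosReport.
type Severity = 'critical' | 'high' | 'medium' | 'low';

interface Finding {
  severity: Severity;
  description: string;
  recommendation: string;
}

// Ordered from most to least severe.
const SEVERITY_ORDER: Severity[] = ['critical', 'high', 'medium', 'low'];

// Returns the most severe level present, or undefined for an empty list.
function highestSeverity(findings: Finding[]): Severity | undefined {
  return SEVERITY_ORDER.find((level) => findings.some((f) => f.severity === level));
}

// True when at least one finding is at or above the threshold severity.
function shouldFailPipeline(findings: Finding[], threshold: Severity = 'high'): boolean {
  const worst = highestSeverity(findings);
  if (worst === undefined) return false;
  return SEVERITY_ORDER.indexOf(worst) <= SEVERITY_ORDER.indexOf(threshold);
}
```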
## Safety Controls
```yaml
safety:
  blast_radius:
    max_affected_pods: 1
    max_affected_percentage: 10
  abort_conditions:
    - error_rate > 50%
    - p99_latency > 10s
    - service_unavailable
  excluded_environments:
    - production-critical
  required_approvals:
    production: 2
    staging: 0
```
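The `blast_radius` and `required_approvals` settings above can be read as pre-flight checks that run before any fault is injected. Here is a minimal sketch of such checks, assuming that both blast-radius limits must hold at the same time and that an environment missing from `required_approvals` needs no approvals. All names are hypothetical and the actual enforcement logic may differ.

```typescript
// Hypothetical pre-flight checks; the actual enforcement logic may differ.
interface BlastRadius {
  maxAffectedPods: number;
  maxAffectedPercentage: number;
}

// Both limits must hold (an assumption): the absolute pod cap
// and the percentage of the target's pods.
function withinBlastRadius(limits: BlastRadius, affectedPods: number, totalPods: number): boolean {
  if (totalPods <= 0 || affectedPods < 0) return false;
  // Integer comparison avoids floating-point rounding at the boundary.
  const withinPercentage = affectedPods * 100 <= limits.maxAffectedPercentage * totalPods;
  return affectedPods <= limits.maxAffectedPods && withinPercentage;
}

// Environments not listed are assumed to need no approvals.
function hasRequiredApprovals(
  requiredApprovals: Record<string, number>,
  environment: string,
  approvals: number,
): boolean {
  const required = requiredApprovals[environment] ?? 0;
  return approvals >= required;
}
```

Under this reading, the configured limits (1 pod, 10%) allow killing one pod out of ten but reject killing one pod out of five, since that would affect 20% of the service.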
## SLA Validation
```typescript
await resilienceTester.validateSLA({
  slas: {
    availability: 99.9,
    p95_latency: 500,
    error_rate: 0.1
  },
  period: '30d',
  report: {
    breaches: true,
    trends: true,
    projections: true
  }
});
```
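It helps to translate the availability target into a concrete downtime budget. Assuming `availability: 99.9` is a percentage and `period: '30d'` means 30 calendar days, the allowed downtime is 0.1% of 43,200 minutes, or roughly 43.2 minutes. The helper below is an illustrative sketch of that arithmetic, not part of the aqe API.

```typescript
// Illustrative arithmetic only; assumes availability is given as a percentage.
function allowedDowntimeMinutes(availabilityPercent: number, periodDays: number): number {
  if (availabilityPercent < 0 || availabilityPercent > 100) {
    throw new RangeError('availabilityPercent must be between 0 and 100');
  }
  const totalMinutes = periodDays * 24 * 60;
  return (totalMinutes * (100 - availabilityPercent)) / 100;
}

// Compares observed downtime against the budget for the period.
function isAvailabilityBreached(
  availabilityPercent: number,
  periodDays: number,
  observedDowntimeMinutes: number,
): boolean {
  return observedDowntimeMinutes > allowedDowntimeMinutes(availabilityPercent, periodDays);
}
```

A single 45-minute outage in the period would therefore breach a 99.9% target, while a 30-minute outage would not.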
## Coordination
**Primary Agents**: qe-chaos-engineer, qe-load-tester, qe-resilience-tester
**Coordinator**: qe-chaos-coordinator
**Related Skills**: qe-performance, security-testing
Related Skills
chaos-engineering-resilience
Chaos engineering principles, controlled failure injection, resilience testing, and system recovery validation. Use when testing distributed systems, building confidence in fault tolerance, or validating disaster recovery.
qe-chaos-engineering-resilience
Chaos engineering principles, controlled failure injection, resilience testing, and system recovery validation. Use when testing distributed systems, building confidence in fault tolerance, or validating disaster recovery.
qe-visual-testing-advanced
Advanced visual regression testing with pixel-perfect comparison, AI-powered diff analysis, responsive design validation, and cross-browser visual consistency. Use when detecting UI regressions, validating designs, or ensuring visual consistency.
qe-verification-quality
Comprehensive truth scoring, code quality verification, and automatic rollback system with 0.95 accuracy threshold for ensuring high-quality agent outputs and codebase reliability.
qe-testability-scoring
AI-powered testability assessment using 10 principles of intrinsic testability with Playwright and optional Vibium integration. Evaluates web applications against Observability, Controllability, Algorithmic Simplicity, Transparency, Stability, Explainability, Unbugginess, Smallness, Decomposability, and Similarity. Use when assessing software testability, evaluating test readiness, identifying testability improvements, or generating testability reports.
qe-test-reporting-analytics
Advanced test reporting, quality dashboards, predictive analytics, trend analysis, and executive reporting for QE metrics. Use when communicating quality status, tracking trends, or making data-driven decisions.
qe-test-idea-rewriting
Transform passive 'Verify X' test descriptions into active, observable test actions. Use when test ideas lack specificity, use vague language, or fail quality validation. Converts to action-verb format for clearer, more testable descriptions.
qe-test-environment-management
Test environment provisioning, infrastructure as code for testing, Docker/Kubernetes for test environments, service virtualization, and cost optimization. Use when managing test infrastructure, ensuring environment parity, or optimizing testing costs.
qe-test-design-techniques
Systematic test design with boundary value analysis, equivalence partitioning, decision tables, state transition testing, and combinatorial testing. Use when designing comprehensive test cases, reducing redundant tests, or ensuring systematic coverage.
qe-test-data-management
Strategic test data generation, management, and privacy compliance. Use when creating test data, handling PII, ensuring GDPR/CCPA compliance, or scaling data generation for realistic testing scenarios.
qe-test-automation-strategy
Design and implement effective test automation with proper pyramid, patterns, and CI/CD integration. Use when building automation frameworks or improving test efficiency.
qe-technical-writing
Write clear, engaging technical content from real experience. Use when writing blog posts, documentation, tutorials, or technical articles.