incident-runbook-templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

31,392 stars
Complexity: easy

About this skill

This skill provides an AI agent with access to a structured set of production-ready templates for incident response runbooks. It empowers the agent to assist in creating, outlining, or explaining incident management processes, covering all critical phases from initial detection and triage to effective mitigation, final resolution, and essential communication protocols. The templates guide the AI in generating actionable steps and best practices for various incident scenarios, ensuring a consistent and robust approach to operational resilience.

Best use case

An AI agent needs to assist a user in drafting an incident response plan, explain incident management best practices, structure communication during an outage, or provide a checklist for resolving a critical system failure.

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

The AI agent will provide a structured output that outlines steps for incident response, offers relevant sections of runbook templates, suggests best practices for handling an incident, or helps formulate incident-related communications, all guided by the pre-defined runbook templates.

Practical example

Example input

Our primary production database is experiencing high latency and connection timeouts. Generate an initial incident response plan covering immediate triage, stakeholder communication, and potential mitigation steps based on standard runbook procedures.

Example output

### Incident Response Plan: Production Database High Latency

**1. Detection & Verification:**
*   **Confirm Metrics:** Check database dashboards for CPU, memory, I/O, and active connections. Verify latency spikes using monitoring tools.
*   **User Impact:** Cross-reference with user reports or error logs from dependent services.
*   **Initial Triage:** Engage on-call DBA and relevant engineering teams.

**2. Triage & Diagnosis:**
*   **Assess Severity:** Critical - customer-facing impact likely. Elevate to P1.
*   **Recent Changes:** Investigate recent deployments, configuration changes, or large data loads.
*   **Resource Utilization:** Check for rogue queries, connection floods, or resource contention.
*   **Error Logs:** Analyze database error logs for specific warnings or failures.

**3. Mitigation (Initial Steps):**
*   **Isolation:** If a specific query or service is identified as the cause, attempt to isolate or throttle it.
*   **Scaling (if applicable):** If resources are maxed out, consider scaling up/out the database instance or read replicas.
*   **Failover (if configured):** If the primary instance is unrecoverable or severely degraded, initiate failover to a healthy replica.
*   **Rollback:** If a recent change is confirmed as the root cause, initiate a rollback.

**4. Communication (Internal & External):**
*   **Internal (Slack/Email):** 
    *   **Initial Alert:** "P1 Incident: Production Database High Latency affecting [Service Name]. On-call team engaged. Updates to follow." 
    *   **Regular Updates:** Provide updates every 15-30 minutes on diagnosis progress and mitigation attempts.
*   **External (Status Page - if public impact):**
    *   **Initial:** "Investigating: We are currently investigating an issue impacting database performance, which may affect [Service Name]. More updates shortly."
    *   **Update:** "Monitoring: Our engineers have implemented [mitigation step, e.g., 'a temporary fix'] and are closely monitoring the database performance. Services may be recovering. Further updates will be provided."

**5. Resolution & Post-Mortem (Once mitigated):**
*   **Verify Resolution:** Ensure all metrics return to normal, and user impact is resolved.
*   **Document:** Record incident timeline, actions taken, and initial findings.
*   **Post-Mortem Initiation:** Schedule a post-mortem to identify root cause, permanent fixes, and preventive measures.

*Refer to `resources/implementation-playbook.md` for detailed examples on specific mitigation strategies for database incidents.*

When to use this skill

  • When you need to generate a structured incident response plan or runbook.
  • When seeking guidance on best practices for incident detection, triage, or mitigation.
  • When preparing communication strategies for various incident types.
  • When an AI agent needs to provide actionable steps for incident resolution.

When not to use this skill

  • When the task is unrelated to incident runbook templates or incident management.
  • When you need to execute direct commands on systems or tools outside the scope of runbook generation.
  • When the primary goal is not to structure or understand incident response processes.
  • When specific operational actions (e.g., restarting a service) are required, rather than planning or documentation assistance.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/incident-runbook-templates/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/incident-runbook-templates/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/incident-runbook-templates/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How incident-runbook-templates Compares

Feature / Agentincident-runbook-templatesStandard Approach
Platform SupportClaudeLimited / Varies
Context Awareness High Baseline
Installation ComplexityeasyN/A

Frequently Asked Questions

What does this skill do?

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

Which AI agents support this skill?

This skill is designed for Claude.

How difficult is it to install?

The installation complexity is rated as easy. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## Do not use this skill when

- The task is unrelated to incident runbook templates
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Use this skill when

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| **SEV1** | Complete outage, data loss | 15 min | Production down |
| **SEV2** | Major degradation | 30 min | Critical feature broken |
| **SEV3** | Minor impact | 2 hours | Non-critical bug |
| **SEV4** | Minimal impact | Next business day | Cosmetic issue |

### 2. Runbook Structure

```
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
```

## Runbook Templates

### Template 1: Service Outage Runbook

```markdown
# [Service Name] Outage Runbook

## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks
- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification
| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency
```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND duration > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(pid);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)
```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge
```bash
# Step 1: Check current request rate
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

## Verification Steps
```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```

## Rollback Procedures
```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

## Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |

## Communication Templates

### Initial Notification (Internal)
```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update
```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification
```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
```

### Template 2: Database Incident Runbook

```markdown
# Database Incident Runbook

## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```

## Replication Lag
```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical
```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
```

## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Resources

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)

Related Skills

incident-response-smart-fix

31392
from sickn33/antigravity-awesome-skills

[Extended thinking: This workflow implements a sophisticated debugging and resolution pipeline that leverages AI-assisted debugging tools and observability platforms to systematically diagnose and res

DevOps & InfrastructureClaudeGitHub Copilot

incident-responder

31392
from sickn33/antigravity-awesome-skills

Expert SRE incident responder specializing in rapid problem resolution, modern observability, and comprehensive incident management.

DevOps & InfrastructureClaude

linux-shell-scripting

31392
from sickn33/antigravity-awesome-skills

Provide production-ready shell script templates for common Linux system administration tasks including backups, monitoring, user management, log analysis, and automation. These scripts serve as building blocks for security operations and penetration testing environments.

DevOps & InfrastructureClaude

iterate-pr

31392
from sickn33/antigravity-awesome-skills

Iterate on a PR until CI passes. Use when you need to fix CI failures, address review feedback, or continuously push fixes until all checks are green. Automates the feedback-fix-push-wait cycle.

DevOps & InfrastructureClaude

istio-traffic-management

31392
from sickn33/antigravity-awesome-skills

Comprehensive guide to Istio traffic management for production service mesh deployments.

DevOps & InfrastructureClaude

expo-cicd-workflows

31392
from sickn33/antigravity-awesome-skills

Helps understand and write EAS workflow YAML files for Expo projects. Use this skill when the user asks about CI/CD or workflows in an Expo or EAS context, mentions .eas/workflows/, or wants help with EAS build pipelines or deployment automation.

DevOps & InfrastructureClaude

error-diagnostics-error-trace

31392
from sickn33/antigravity-awesome-skills

You are an error tracking and observability expert specializing in implementing comprehensive error monitoring solutions. Set up error tracking systems, configure alerts, implement structured logging,

DevOps & InfrastructureClaude

error-debugging-error-trace

31392
from sickn33/antigravity-awesome-skills

You are an error tracking and observability expert specializing in implementing comprehensive error monitoring solutions. Set up error tracking systems, configure alerts, implement structured logging, and ensure teams can quickly identify and resolve production issues.

DevOps & InfrastructureClaude

error-debugging-error-analysis

31392
from sickn33/antigravity-awesome-skills

You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.

DevOps & InfrastructureClaude

docker-expert

31392
from sickn33/antigravity-awesome-skills

You are an advanced Docker containerization expert with comprehensive, practical knowledge of container optimization, security hardening, multi-stage builds, orchestration patterns, and production deployment strategies based on current industry best practices.

DevOps & InfrastructureClaude

devops-troubleshooter

31392
from sickn33/antigravity-awesome-skills

Expert DevOps troubleshooter specializing in rapid incident response, advanced debugging, and modern observability.

DevOps & InfrastructureClaude

devops-deploy

31392
from sickn33/antigravity-awesome-skills

DevOps e deploy de aplicacoes — Docker, CI/CD com GitHub Actions, AWS Lambda, SAM, Terraform, infraestrutura como codigo e monitoramento.

DevOps & InfrastructureClaudeCursorGemini