Incident Response Playbook

Structured incident response for business and IT teams. Guides you through detection, triage, containment, resolution, and post-mortem — with auto-generated timelines and action items.

3,891 stars
Complexity: easy

About this skill

This AI agent skill provides a comprehensive framework for managing critical incidents across business and IT operations. When presented with an incident description or a proactive scenario, it systematically addresses key response phases: detection, triage, containment, resolution, and post-mortem analysis. Key capabilities include classifying incident severity (P1-P4) based on impact and urgency, generating a tailored response checklist for various incident types (e.g., outages, data breaches, vendor failures), and building a communication plan. It also dynamically creates a real-time timeline as updates are logged and produces a post-mortem template to facilitate root cause analysis and preventative measures. Users can leverage this skill both reactively, by describing an active incident, and proactively, to prepare response plans for potential scenarios. It aims to standardize incident management, reduce response times, and ensure all critical steps are followed consistently.

Best use case

The primary use case is to standardize and streamline the incident response process for IT and business teams. It benefits incident commanders, SREs, developers, and operations staff by providing an AI-driven assistant that helps organize and execute incident management, ensuring consistent communication and thorough post-incident analysis.

Users can expect a detailed incident response plan, including severity classification, a tailored checklist, a communication strategy, an evolving timeline, and a post-mortem template, all generated based on the provided incident context.

Practical example

Example input

Production API is returning 500 errors for 20% of requests. Started 10 minutes ago.

Example output

Incident P2: API Degradation. Initial actions: Verify impact, check recent deployments, prepare rollback. Communication: Notify engineering lead, create Slack channel. Timeline started. Post-mortem template ready, focusing on root cause analysis and prevention.

When to use this skill

  • When a service outage or performance degradation is detected and needs immediate attention.
  • For proactively planning response strategies for potential security breaches or data incidents.
  • To generate a structured communication plan for stakeholders during an active incident.
  • After an incident is resolved, to facilitate a thorough post-mortem analysis and learning process.

When not to use this skill

  • For trivial, self-resolving issues that do not warrant a formal incident response.
  • As a replacement for experienced human incident commanders; the skill is meant to assist them, not stand in for them.
  • For non-technical or personal issues that fall outside of business and IT operations.
  • If the incident requires highly specialized, ad-hoc human intervention that cannot be guided by a structured playbook.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/afrexai-incident-response/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/1kalin/afrexai-incident-response/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/afrexai-incident-response/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How Incident Response Playbook Compares

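| Feature / Agent | Incident Response Playbook | Standard Approach |
|-----------------|----------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Easy | N/A |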

Frequently Asked Questions

What does this skill do?

Structured incident response for business and IT teams. Guides you through detection, triage, containment, resolution, and post-mortem — with auto-generated timelines and action items.

How difficult is it to install?

The installation complexity is rated as easy. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Incident Response Playbook

Structured incident response for business and IT teams. Guides you through detection, triage, containment, resolution, and post-mortem — with auto-generated timelines and action items.

## What It Does

When triggered with an incident description, this skill:

1. **Classifies severity** (P1-P4) based on impact and urgency
2. **Generates a response checklist** tailored to incident type (outage, data breach, security event, service degradation, vendor failure)
3. **Builds a communication plan** — who to notify, when, what channels
4. **Creates a real-time timeline** as you log updates
5. **Produces a post-mortem template** with root cause analysis and prevention steps
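
To make the timeline step concrete, here is a minimal Python sketch of how logged updates could be collected and rendered in order. The `Timeline` class and its method names are hypothetical illustrations and are not defined by the skill itself; the timestamps in the usage example are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class Timeline:
    """Hypothetical in-memory incident timeline (illustration only)."""

    entries: List[Tuple[datetime, str]] = field(default_factory=list)

    def log(self, note: str, at: Optional[datetime] = None) -> None:
        # Default to the current UTC time when no timestamp is supplied.
        self.entries.append((at or datetime.now(timezone.utc), note))

    def render(self) -> str:
        # Sort by timestamp so updates logged late still appear in order.
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(f"{ts:%H:%M} UTC  {note}" for ts, note in ordered)


timeline = Timeline()
timeline.log("Rollback started", at=datetime(2024, 1, 1, 12, 20, tzinfo=timezone.utc))
timeline.log("500 errors detected", at=datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc))
print(timeline.render())
```

Sorting at render time rather than at insert time keeps `log` trivial and tolerates updates that are entered out of order during a busy incident.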

## Usage

Tell your agent about an incident:

> "Production API is returning 500 errors for 20% of requests. Started 10 minutes ago."

Or trigger proactively:

> "Create an incident response plan for a potential data breach scenario"

## Incident Types Covered

- **Service outages** — full or partial downtime
- **Security incidents** — breaches, unauthorized access, phishing
- **Data incidents** — corruption, loss, privacy violations
- **Vendor failures** — third-party SLA breaches
- **Performance degradation** — latency spikes, capacity issues

## Severity Matrix

| Level | Impact | Response Time | Escalation |
|-------|--------|---------------|------------|
| P1 - Critical | Business stopped | Immediate | Executive + all hands |
| P2 - High | Major feature down | < 30 min | Engineering lead + PM |
| P3 - Medium | Degraded experience | < 2 hours | On-call team |
| P4 - Low | Minor issue | Next business day | Ticket queue |
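
The matrix above can be expressed as a small lookup table. The following is a sketch only: the rows are copied from the matrix, but the `classify` function and its three boolean inputs are an assumed simplification, since the skill itself classifies severity from a free-text incident description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    level: str
    impact: str
    response_time: str
    escalation: str


# Rows copied from the severity matrix above.
SEVERITY_MATRIX = {
    "P1": Severity("P1 - Critical", "Business stopped", "Immediate", "Executive + all hands"),
    "P2": Severity("P2 - High", "Major feature down", "< 30 min", "Engineering lead + PM"),
    "P3": Severity("P3 - Medium", "Degraded experience", "< 2 hours", "On-call team"),
    "P4": Severity("P4 - Low", "Minor issue", "Next business day", "Ticket queue"),
}


def classify(business_stopped: bool, major_feature_down: bool, degraded: bool) -> Severity:
    """Return the highest matching severity (hypothetical helper).

    The boolean flags are an assumption made for illustration.
    """
    if business_stopped:
        return SEVERITY_MATRIX["P1"]
    if major_feature_down:
        return SEVERITY_MATRIX["P2"]
    if degraded:
        return SEVERITY_MATRIX["P3"]
    return SEVERITY_MATRIX["P4"]


result = classify(business_stopped=False, major_feature_down=True, degraded=True)
print(result.level, "->", result.escalation)
```

Checking the most severe condition first means an incident that matches several rows is always escalated at the highest applicable level.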

## Response Framework

### 1. Detection & Triage (First 5 minutes)
- Confirm the incident is real (not a false alarm)
- Classify severity using the matrix above
- Assign incident commander
- Open a dedicated communication channel

### 2. Containment (First 30 minutes)
- Identify blast radius — what's affected?
- Apply immediate mitigation (rollback, feature flag, scaling)
- Communicate status to stakeholders
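
The time boxes on the first two phases can be turned into concrete deadlines. This sketch assumes both windows are measured from the moment the incident starts, which the playbook does not state explicitly; `PHASE_WINDOWS` and `phase_deadlines` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

# Windows taken from the phase headings above; measuring both from the
# incident start time is an assumption.
PHASE_WINDOWS = {
    "Detection & Triage": timedelta(minutes=5),
    "Containment": timedelta(minutes=30),
}


def phase_deadlines(started_at: datetime) -> Dict[str, datetime]:
    """Map each time-boxed phase to the time by which it should be done."""
    return {phase: started_at + window for phase, window in PHASE_WINDOWS.items()}


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # illustrative start time
for phase, deadline in phase_deadlines(start).items():
    print(f"{phase}: by {deadline:%H:%M} UTC")
```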

### 3. Resolution
- Root cause investigation
- Implement fix with verification
- Monitor for recurrence
- Update all stakeholders

### 4. Post-Mortem (Within 48 hours)
- Timeline of events
- Root cause analysis (5 Whys)
- What went well / what didn't
- Action items with owners and deadlines
- Process improvements
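
As an illustration of the post-mortem structure, the sketch below renders an empty Markdown template from the five items listed above. The function name, the title format, and the `_TODO_` placeholders are assumptions; the skill's actual template may look different.

```python
# Section names copied from the Post-Mortem list above.
POSTMORTEM_SECTIONS = [
    "Timeline of events",
    "Root cause analysis (5 Whys)",
    "What went well / what didn't",
    "Action items with owners and deadlines",
    "Process improvements",
]


def postmortem_template(title: str, severity: str) -> str:
    """Build an empty Markdown post-mortem skeleton (hypothetical layout)."""
    lines = [f"# Post-Mortem: {title} ({severity})", ""]
    for section in POSTMORTEM_SECTIONS:
        # Each section gets a heading and a placeholder to fill in later.
        lines += [f"## {section}", "", "_TODO_", ""]
    return "\n".join(lines).rstrip() + "\n"


print(postmortem_template("API Degradation", "P2"))
```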

## Integration

Works with any monitoring stack. Feed alerts from PagerDuty, Datadog, Grafana, or manual reports.
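
One simple way to feed an alert to the agent is to flatten it into a one-line incident description like the one in the Usage section. The sketch below is tool-agnostic: `describe_alert` and its parameters are hypothetical and do not reflect the payload format of PagerDuty, Datadog, or Grafana.

```python
def describe_alert(source: str, summary: str, started_minutes_ago: int) -> str:
    """Turn alert fields into a one-line incident description.

    The three parameters are assumed fields; map them from whatever
    your monitoring tool actually provides.
    """
    return f"[{source}] {summary}. Started {started_minutes_ago} minutes ago."


message = describe_alert(
    source="manual report",
    summary="Production API is returning 500 errors for 20% of requests",
    started_minutes_ago=10,
)
print(message)
```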

## Pro Tip

Pair this with a full **AI Operations Context Pack** for your industry. Pre-built incident taxonomies, compliance-aware escalation paths, and automated stakeholder templates.

Browse packs: https://afrexai-cto.github.io/context-packs/

Free tools:
- AI Revenue Calculator: https://afrexai-cto.github.io/ai-revenue-calculator/
- Agent Setup Wizard: https://afrexai-cto.github.io/agent-setup/

Related Skills

Incident Postmortem Generator

3891
from openclaw/skills

Generate blameless incident postmortems from raw notes, Slack threads, or bullet points.

DevOps & Infrastructure

Post-Mortem & Incident Review Framework

3891
from openclaw/skills

Run structured post-mortems that actually prevent repeat failures. Blameless analysis, root cause identification, and action tracking.

DevOps & Infrastructure

botlearn-healthcheck

3891
from openclaw/skills

botlearn-healthcheck — BotLearn autonomous health inspector for OpenClaw instances across 5 domains (hardware, config, security, skills, autonomy); triggers on system check, health report, diagnostics, or scheduled heartbeat inspection.

DevOps & Infrastructure

afrexai-performance-engineering

3891
from openclaw/skills

Complete performance engineering system — profiling, optimization, load testing, capacity planning, and performance culture. Use when diagnosing slow applications, optimizing code/queries/infrastructure, load testing before launch, planning capacity, or building performance into CI/CD. Covers Node.js, Python, Go, Java, databases, APIs, and frontend.

DevOps & Infrastructure

OpenClaw Mastery — The Complete Agent Engineering & Operations System

3891
from openclaw/skills

> Built by AfrexAI — the team that runs 9+ production agents 24/7 on OpenClaw.

DevOps & Infrastructure

Legacy System Modernization Engine

3891
from openclaw/skills

Complete methodology for assessing, planning, and executing legacy system modernization — from monolith decomposition to cloud migration. Works for any tech stack, any scale.

DevOps & Infrastructure

Git Engineering & Repository Strategy

3891
from openclaw/skills

You are a Git Engineering expert. You help teams design branching strategies, implement code review workflows, manage monorepos, automate releases, and maintain healthy repository practices at scale.

DevOps & Infrastructure

Django Production Engineering

3891
from openclaw/skills

Complete methodology for building, scaling, and operating production Django applications. From project structure to deployment, security to performance — every decision framework a Django team needs.

DevOps & Infrastructure

IT Disaster Recovery Plan Generator

3891
from openclaw/skills

Build production-ready disaster recovery plans that actually get followed when things break.

DevOps & Infrastructure

afrexai-api-architect

3891
from openclaw/skills

Design, build, test, document, and secure production-grade APIs. Covers the full lifecycle from schema design through deployment, monitoring, and versioning. Use when designing new APIs, reviewing existing ones, generating OpenAPI specs, building test suites, or debugging production issues.

DevOps & Infrastructure

Agent Ops Runbook

3891
from openclaw/skills

Generate a production-ready operations runbook for deploying AI agents. Covers pre-deployment checklists, shadow mode → supervised → autonomous rollout stages, monitoring dashboards, rollback procedures, cost management, and incident response templates.

DevOps & Infrastructure

node-red-manager

3891
from openclaw/skills

Manage Node-RED instances via Admin API or CLI. Automate flow deployment, install nodes, and troubleshoot issues. Use when user wants to "build automation", "connect devices", or "fix node-red".

DevOps & Infrastructure