AI Agent Skill HUB

DevOps & Infrastructure

Post-Mortem & Incident Review Framework

Run structured post-mortems that actually prevent repeat failures. Blameless analysis, root cause identification, and action tracking.

3,891 stars

Complexity: easy

About this skill

This AI agent skill provides a structured framework for post-mortems and incident reviews. It guides the user or the agent through seven steps: an incident summary, a detailed timeline, a 5 Whys root cause analysis, scored contributing factors, what went well, action items with owners and deadlines, and a recurrence-prevention checklist. The goal is to move beyond incident reporting to analysis that identifies systemic issues rather than individual blame. The same framework applies to incidents, outages, missed deadlines, and project failures. It also includes an incident cost calculator and a quarterly review of recurring incident patterns.

Best use case

This skill is aimed at engineering teams, SREs, project managers, and organizational leaders who need to review adverse events systematically. It turns scattered incident data into an organized report that names root causes and contributing factors, and it tracks corrective actions through to completion so the same failure is less likely to recur.

A comprehensive, blameless post-mortem document identifying root causes, contributing factors, lessons learned, and clear action items to prevent future incidents.

Practical example

Example input

Run a post-mortem for the recent database outage. The incident ID is INC-2023-08-15-001, it started at 10:00 UTC and ended at 12:30 UTC. Severity was SEV-1, affecting all users. Detection was via customer reports. Please guide me through the process.

Example output

Okay, let's start with the Incident Summary. 

**1. Incident Summary**
Incident ID: INC-2023-08-15-001
Date/Time: 10:00 UTC → 12:30 UTC (Duration: 2.5 hours)
Severity: SEV-1 (revenue impact)
Impact: All users affected | [ask for revenue lost] | [ask whether SLA was breached]
Detection: Customer report
Detection Delay: [ask for time]

Now, let's compile the timeline...

When to use this skill

After any production incident, outage, or service degradation
After a missed deadline, failed launch, or lost deal
After any event costing >$5K or >4 hours of team time
Quarterly review of recurring incident patterns

When not to use this skill

For minor, non-impactful issues that don't require formal documentation
When immediate incident resolution is the sole priority and documentation can follow later
If the goal is simply to log an event without deeper analysis or action planning
When a real-time incident *management* tool is needed, rather than a review framework

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/afrexai-post-mortem/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/1kalin/afrexai-post-mortem/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/afrexai-post-mortem/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How Post-Mortem & Incident Review Framework Compares

Feature / Agent	Post-Mortem & Incident Review Framework	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	easy	N/A

SKILL.md Source

# Post-Mortem & Incident Review Framework

Run structured post-mortems that actually prevent repeat failures. Blameless analysis, root cause identification, and action tracking.

## When to Use
- After any production incident, outage, or service degradation
- After a missed deadline, failed launch, or lost deal
- After any event costing >$5K or >4 hours of team time
- Quarterly review of recurring incident patterns

## Post-Mortem Template

### 1. Incident Summary (Complete Within 24 Hours)
```
Incident ID: [AUTO-GENERATED]
Date/Time: [Start] → [End] (Duration: X hours)
Severity: SEV-1 (revenue impact) | SEV-2 (customer impact) | SEV-3 (internal impact)
Impact: [Users affected] | [Revenue lost] | [SLA breached Y/N]
Detection: How was it found? (Monitoring / Customer report / Internal discovery)
Detection Delay: Time from incident start → first alert
```
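
The `Duration` and `Detection Delay` fields can be derived from three timestamps rather than typed by hand. The following is a minimal sketch, not part of the skill itself: the function name `incident_durations` and the sample timestamps are illustrative assumptions.

```python
from datetime import datetime, timezone


def incident_durations(start: datetime, end: datetime, first_alert: datetime) -> dict:
    """Return the incident duration in hours and the detection delay in minutes."""
    if not (start <= first_alert <= end):
        raise ValueError("expected start <= first_alert <= end")
    return {
        "duration_hours": (end - start).total_seconds() / 3600,
        "detection_delay_minutes": (first_alert - start).total_seconds() / 60,
    }


# Hypothetical timestamps, chosen only for illustration.
start = datetime(2023, 8, 15, 10, 0, tzinfo=timezone.utc)
end = datetime(2023, 8, 15, 12, 30, tzinfo=timezone.utc)
first_alert = datetime(2023, 8, 15, 10, 20, tzinfo=timezone.utc)

summary = incident_durations(start, end, first_alert)
print(summary)
```

Using timezone-aware timestamps avoids ambiguity when responders work across time zones.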

### 2. Timeline (Minute-by-Minute for SEV-1, 15-min blocks for SEV-2/3)
```
HH:MM - Event description
HH:MM - First alert triggered
HH:MM - Team notified
HH:MM - Investigation started
HH:MM - Root cause identified
HH:MM - Fix deployed
HH:MM - Confirmed resolved
```
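
A timeline written in the `HH:MM - Event description` format above can be parsed to measure the gaps between milestones. This is a rough sketch under two assumptions: all events fall on the same day, and each description is unique. The helper names and the sample timeline are hypothetical.

```python
from datetime import datetime


def parse_timeline(text: str) -> list[tuple[datetime, str]]:
    """Parse 'HH:MM - description' lines into (time, description) pairs."""
    events = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        stamp, _, description = line.partition(" - ")
        events.append((datetime.strptime(stamp.strip(), "%H:%M"), description.strip()))
    return events


def minutes_between(events: list[tuple[datetime, str]], first: str, second: str) -> float:
    """Minutes elapsed from the event described by `first` to the one described by `second`."""
    times = {description: when for when, description in events}
    return (times[second] - times[first]).total_seconds() / 60


# Hypothetical timeline, for illustration only.
timeline = """
10:00 - Incident began
10:20 - First alert triggered
10:25 - Team notified
10:30 - Investigation started
11:40 - Root cause identified
12:15 - Fix deployed
12:30 - Confirmed resolved
"""

events = parse_timeline(timeline)
time_to_detect = minutes_between(events, "Incident began", "First alert triggered")
time_to_resolve = minutes_between(events, "First alert triggered", "Confirmed resolved")
print(time_to_detect, time_to_resolve)
```

An incident that crosses midnight would need full dates in each entry, not just `HH:MM`.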

### 3. Root Cause Analysis — 5 Whys
```
Why 1: [Direct cause]
Why 2: [Why did that happen?]
Why 3: [Why did THAT happen?]
Why 4: [Systemic cause]
Why 5: [Organizational/cultural root]
```

### 4. Contributing Factors
Score each factor 0-3 (0=not a factor, 3=primary contributor):

| Factor | Score | Notes |
|---|---|---|
| Missing/inadequate monitoring | | |
| Insufficient testing | | |
| Documentation gaps | | |
| Process not followed | | |
| Knowledge concentration (bus factor) | | |
| Capacity/scaling limits | | |
| Third-party dependency | | |
| Communication breakdown | | |
| Change management failure | | |
| Technical debt | | |
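
Once the table is filled in, the scores can be validated and ranked so that the primary contributors surface first. The sketch below is one possible way to do this; the function name `rank_factors` and the sample scores are assumptions, not part of the skill.

```python
def rank_factors(scores: dict[str, int]) -> list[tuple[str, int]]:
    """Validate 0-3 scores and return contributing factors, highest score first."""
    for factor, score in scores.items():
        if score not in (0, 1, 2, 3):
            raise ValueError(f"{factor}: score must be 0-3, got {score!r}")
    contributing = [(factor, score) for factor, score in scores.items() if score > 0]
    # sorted() is stable, so factors with equal scores keep their table order.
    return sorted(contributing, key=lambda item: item[1], reverse=True)


# Hypothetical scores, for illustration only.
scores = {
    "Missing/inadequate monitoring": 3,
    "Insufficient testing": 1,
    "Documentation gaps": 0,
    "Change management failure": 2,
}

ranked = rank_factors(scores)
for factor, score in ranked:
    print(score, factor)
```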

### 5. What Went Well
List 3-5 things that worked during the response:
- Fast detection? Good runbooks? Strong communication? Quick escalation?

### 6. Action Items
Every action MUST have an owner and deadline:

| # | Action | Owner | Deadline | Priority | Status |
|---|---|---|---|---|---|
| 1 | | | | P0/P1/P2 | Open |

**Priority definitions:**
- P0: Must complete before next business day
- P1: Must complete within 1 week
- P2: Must complete within 1 sprint/month
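
The priority definitions map directly onto due dates. The sketch below makes several assumptions that the skill does not specify: business days are Monday to Friday with no holiday calendar, P0 is due on the next business day, and P2 defaults to 30 days (adjust `p2_days` to match your sprint length). The function names are hypothetical.

```python
from datetime import date, timedelta


def next_business_day(day: date) -> date:
    """Return the first Monday-to-Friday date after `day` (holidays are ignored)."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def action_deadline(opened: date, priority: str, p2_days: int = 30) -> date:
    """Map a priority label to a due date, following the definitions above."""
    if priority == "P0":
        return next_business_day(opened)
    if priority == "P1":
        return opened + timedelta(weeks=1)
    if priority == "P2":
        return opened + timedelta(days=p2_days)
    raise ValueError(f"unknown priority: {priority!r}")


opened = date(2023, 8, 18)  # a Friday, chosen for illustration
for priority in ("P0", "P1", "P2"):
    print(priority, action_deadline(opened, priority))
```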

### 7. Recurrence Prevention
- [ ] Monitoring added/improved for this failure mode
- [ ] Runbook created/updated
- [ ] Test coverage added
- [ ] Architecture change needed? (If yes, create RFC)
- [ ] Training needed for team?

## Blameless Post-Mortem Rules
1. Focus on systems, not individuals
2. "What happened" not "who did it"
3. Assume everyone acted with best intentions and available information
4. The goal is learning, not punishment
5. If you find yourself writing someone's name next to a mistake, rewrite it as a process gap

## Incident Cost Calculator
```
Direct costs:
  Revenue lost during downtime: $___
  SLA credits issued: $___
  Emergency vendor/contractor costs: $___

Indirect costs:
  Engineering hours × loaded rate: ___ hrs × $___/hr = $___
  Customer churn risk (affected users × churn probability × LTV): $___
  Brand/reputation (estimate): $___

Total incident cost: $___
Cost per minute of downtime: $___
```
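
The calculator above is plain arithmetic, so it is straightforward to script. The following sketch mirrors the template's line items; the class name `IncidentCost`, its field names, and every sample figure are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class IncidentCost:
    revenue_lost: float
    sla_credits: float
    emergency_vendor: float
    engineering_hours: float
    loaded_hourly_rate: float
    affected_users: int
    churn_probability: float
    lifetime_value: float
    brand_estimate: float
    downtime_minutes: float

    def direct(self) -> float:
        """Revenue lost + SLA credits + emergency vendor/contractor costs."""
        return self.revenue_lost + self.sla_credits + self.emergency_vendor

    def indirect(self) -> float:
        """Engineering time + churn risk + brand/reputation estimate."""
        engineering = self.engineering_hours * self.loaded_hourly_rate
        churn_risk = self.affected_users * self.churn_probability * self.lifetime_value
        return engineering + churn_risk + self.brand_estimate

    def total(self) -> float:
        return self.direct() + self.indirect()

    def per_minute(self) -> float:
        return self.total() / self.downtime_minutes


# Hypothetical figures, for illustration only.
cost = IncidentCost(
    revenue_lost=12_000, sla_credits=3_000, emergency_vendor=0,
    engineering_hours=40, loaded_hourly_rate=150,
    affected_users=1_000, churn_probability=0.01, lifetime_value=500,
    brand_estimate=2_000, downtime_minutes=150,
)
print(f"Total incident cost: ${cost.total():,.2f}")
print(f"Cost per minute of downtime: ${cost.per_minute():,.2f}")
```

The brand and churn figures are estimates by nature, so it may be worth reporting the direct and indirect subtotals separately.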

## Quarterly Incident Review
Every quarter, analyze patterns across all post-mortems:

1. **Top 3 root cause categories** — Where should you invest in prevention?
2. **Mean time to detect (MTTD)** — Is monitoring improving?
3. **Mean time to resolve (MTTR)** — Is response getting faster?
4. **Action item completion rate** — Are you actually fixing things?
5. **Repeat incidents** — Same root cause twice = systemic failure
6. **Cost trend** — Total incident cost per quarter (should decrease)
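
The six review metrics can be computed from a small record kept per post-mortem. This is a minimal sketch: the `PostMortem` record, its field names, and the sample data are hypothetical, and it assumes detection and resolution times are both measured from incident start.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class PostMortem:
    root_cause_category: str
    minutes_to_detect: float
    minutes_to_resolve: float
    actions_total: int
    actions_done: int
    total_cost: float


def quarterly_review(post_mortems: list[PostMortem]) -> dict:
    """Summarize one quarter of post-mortems into the six review metrics."""
    if not post_mortems:
        raise ValueError("no post-mortems to review")
    categories = Counter(pm.root_cause_category for pm in post_mortems)
    actions_total = sum(pm.actions_total for pm in post_mortems)
    actions_done = sum(pm.actions_done for pm in post_mortems)
    return {
        "top_root_causes": categories.most_common(3),
        "mttd_minutes": mean(pm.minutes_to_detect for pm in post_mortems),
        "mttr_minutes": mean(pm.minutes_to_resolve for pm in post_mortems),
        "action_completion_rate": actions_done / actions_total if actions_total else None,
        # A category seen two or more times is flagged as a repeat.
        "repeat_root_causes": sorted(c for c, n in categories.items() if n >= 2),
        "total_cost": sum(pm.total_cost for pm in post_mortems),
    }


# Hypothetical quarter, for illustration only.
quarter = [
    PostMortem("monitoring", 20, 150, 4, 3, 28_000),
    PostMortem("change management", 5, 45, 2, 2, 4_000),
    PostMortem("monitoring", 35, 90, 4, 1, 9_000),
]
review = quarterly_review(quarter)
for metric, value in review.items():
    print(metric, value)
```

Comparing `total_cost`, `mttd_minutes`, and `mttr_minutes` against the previous quarter's output gives the trend the list above asks for.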

## Industry-Specific Post-Mortem Considerations

| Industry | Key Focus | Regulatory Requirement |
|---|---|---|
| Fintech | Transaction integrity, audit trail | SOX, PCI-DSS incident reporting |
| Healthcare | PHI exposure, patient safety | HIPAA breach notification (no later than 60 days after discovery) |
| SaaS | SLA compliance, data integrity | SOC 2 incident management |
| E-commerce | Order integrity, payment processing | PCI-DSS, consumer protection |
| Manufacturing | Safety incidents, production loss | OSHA reporting requirements |

---

## Go Deeper

Your post-mortems reveal where AI agents should be deployed first — the repetitive failures, the manual monitoring gaps, the processes that break under load.

- **Find your highest-cost gaps:** [AI Revenue Leak Calculator](https://afrexai-cto.github.io/ai-revenue-calculator/)
- **Industry-specific deployment playbooks:** [AfrexAI Context Packs — $47](https://afrexai-cto.github.io/context-packs/)
  - Pick 3: $97 | All 10: $197 | Everything: $247
- **Deploy your first agent:** [Agent Setup Wizard](https://afrexai-cto.github.io/agent-setup/)

*Built by [AfrexAI](https://afrexai-cto.github.io/context-packs/) — turning incident patterns into automation opportunities.*
