Post-Mortem & Incident Review Framework
Run structured post-mortems that actually prevent repeat failures. Blameless analysis, root cause identification, and action tracking.
About this skill
This AI agent skill provides a structured framework for post-mortems and incident reviews. It guides the user or the agent through an incident summary, a detailed timeline, a 5 Whys root cause analysis, contributing factors, what went well, and action items with owners and deadlines. The goal is to move beyond incident reporting to reviews that identify systemic issues rather than assign individual blame. Teams can apply the same process to incidents, outages, missed deadlines, and project failures, so that corrective actions are tracked and recurrence becomes less likely.
Best use case
The primary use case is for engineering teams, SREs, project managers, and organizational leaders who need to conduct systematic reviews after adverse events. It's invaluable for transforming chaotic incident data into an organized, analytical report that drives learning and prevents repeat failures. Teams benefit by standardizing their incident response and review process, improving communication, and ensuring robust follow-through on corrective actions.
The result is a blameless post-mortem document that identifies root causes, contributing factors, and lessons learned, with clear action items to prevent future incidents.
Practical example
Example input
Run a post-mortem for the recent database outage. The incident ID is INC-2023-08-15-001, it started at 10:00 UTC and ended at 12:30 UTC. Severity was SEV-1, affecting all users. Detection was via customer reports. Please guide me through the process.
Example output
Okay, let's start with the Incident Summary.

**1. Incident Summary**
Incident ID: INC-2023-08-15-001
Date/Time: 10:00 UTC → 12:30 UTC (Duration: 2.5 hours)
Severity: SEV-1 (revenue impact)
Impact: All users affected, [ask for revenue lost], SLA breached Y
Detection: Customer report
Detection Delay: [ask for time]

Now, let's compile the timeline...
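The Duration and Detection Delay fields in the summary are simple time arithmetic. The Python sketch below shows one way to derive them. The function name `incident_timing` is hypothetical, the date is assumed from the incident ID, and the 10:20 first-alert time is invented for illustration, since the example input does not give one.

```python
from datetime import datetime, timezone


def incident_timing(start, end, first_alert):
    """Return (duration_hours, detection_delay_minutes) for one incident."""
    if not (start <= first_alert <= end):
        raise ValueError("expected start <= first_alert <= end")
    duration_hours = (end - start).total_seconds() / 3600
    detection_delay_minutes = (first_alert - start).total_seconds() / 60
    return duration_hours, detection_delay_minutes


# Start and end come from the example input; the date is assumed from the
# incident ID and the first-alert time is a made-up value.
start = datetime(2023, 8, 15, 10, 0, tzinfo=timezone.utc)
end = datetime(2023, 8, 15, 12, 30, tzinfo=timezone.utc)
first_alert = datetime(2023, 8, 15, 10, 20, tzinfo=timezone.utc)

duration, delay = incident_timing(start, end, first_alert)
print(f"Duration: {duration} hours, detection delay: {delay:.0f} minutes")
# prints "Duration: 2.5 hours, detection delay: 20 minutes"
```

Using timezone-aware datetimes keeps the arithmetic correct when the timeline mixes sources that report in different zones.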
When to use this skill
- After any production incident, outage, or service degradation
- After a missed deadline, failed launch, or lost deal
- After any event costing >$5K or >4 hours of team time
- Quarterly review of recurring incident patterns
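The event-driven triggers above can be expressed as a small check. This is a minimal Python sketch, not part of the skill itself: the function and parameter names are hypothetical, and only the >$5K and >4 hours thresholds come from the list. The quarterly review is calendar-driven, so it is left out.

```python
def needs_post_mortem(
    is_production_incident=False,
    is_missed_commitment=False,
    cost_usd=0.0,
    team_hours=0.0,
):
    """Return True if any event-driven trigger from the list applies.

    is_production_incident: incident, outage, or service degradation.
    is_missed_commitment: missed deadline, failed launch, or lost deal.
    """
    return (
        is_production_incident
        or is_missed_commitment
        or cost_usd > 5000   # strictly more than $5K
        or team_hours > 4    # strictly more than 4 hours of team time
    )


print(needs_post_mortem(cost_usd=7500))               # True
print(needs_post_mortem(cost_usd=5000, team_hours=4)) # False
```

The comparisons are strict because the list says ">$5K" and ">4 hours"; a team that wants the boundary values to count would change them to `>=`.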
When not to use this skill
- For minor, non-impactful issues that don't require formal documentation
- When immediate incident resolution is the sole priority and documentation can follow later
- If the goal is simply to log an event without deeper analysis or action planning
- When a real-time incident *management* tool is needed, rather than a review framework
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/afrexai-post-mortem/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
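The manual steps can also be scripted. The Python sketch below assumes SKILL.md has already been downloaded to a local path; it does not fetch anything from GitHub. The helper name `install_skill` is hypothetical, and only the target path comes from the instructions above.

```python
from pathlib import Path
import shutil


def install_skill(downloaded_file, project_root):
    """Copy a downloaded SKILL.md into the project's skill directory."""
    target = (
        Path(project_root)
        / ".claude" / "skills" / "afrexai-post-mortem" / "SKILL.md"
    )
    # Create .claude/skills/afrexai-post-mortem/ if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(downloaded_file, target)
    return target


# Example call (paths are placeholders):
# install_skill("Downloads/SKILL.md", "path/to/your/project")
```

After the copy, restart the agent as described in the last step so the skill is picked up.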
How Post-Mortem & Incident Review Framework Compares
| Feature / Agent | Post-Mortem & Incident Review Framework | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Easy | N/A |
Frequently Asked Questions
What does this skill do?
It guides you through structured, blameless post-mortems: an incident summary, a timeline, a 5 Whys root cause analysis, contributing factors, and action items with owners and deadlines.
How difficult is it to install?
The installation complexity is rated as easy. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
Top AI Agents for Productivity
See the top AI agent skills for productivity, workflow automation, operational systems, documentation, and everyday task execution.
SKILL.md Source
# Post-Mortem & Incident Review Framework

Run structured post-mortems that actually prevent repeat failures. Blameless analysis, root cause identification, and action tracking.

## When to Use

- After any production incident, outage, or service degradation
- After a missed deadline, failed launch, or lost deal
- After any event costing >$5K or >4 hours of team time
- Quarterly review of recurring incident patterns

## Post-Mortem Template

### 1. Incident Summary (Complete Within 24 Hours)

```
Incident ID: [AUTO-GENERATED]
Date/Time: [Start] → [End] (Duration: X hours)
Severity: SEV-1 (revenue impact) | SEV-2 (customer impact) | SEV-3 (internal impact)
Impact: [Users affected] | [Revenue lost] | [SLA breached Y/N]
Detection: How was it found? (Monitoring / Customer report / Internal discovery)
Detection Delay: Time from incident start → first alert
```

### 2. Timeline (Minute-by-Minute for SEV-1, 15-min blocks for SEV-2/3)

```
HH:MM - Event description
HH:MM - First alert triggered
HH:MM - Team notified
HH:MM - Investigation started
HH:MM - Root cause identified
HH:MM - Fix deployed
HH:MM - Confirmed resolved
```

### 3. Root Cause Analysis: 5 Whys

```
Why 1: [Direct cause]
Why 2: [Why did that happen?]
Why 3: [Why did THAT happen?]
Why 4: [Systemic cause]
Why 5: [Organizational/cultural root]
```

### 4. Contributing Factors

Score each factor 0-3 (0=not a factor, 3=primary contributor):

| Factor | Score | Notes |
|---|---|---|
| Missing/inadequate monitoring | | |
| Insufficient testing | | |
| Documentation gaps | | |
| Process not followed | | |
| Knowledge concentration (bus factor) | | |
| Capacity/scaling limits | | |
| Third-party dependency | | |
| Communication breakdown | | |
| Change management failure | | |
| Technical debt | | |

### 5. What Went Well

List 3-5 things that worked during the response:

- Fast detection? Good runbooks? Strong communication? Quick escalation?

### 6. Action Items

Every action MUST have an owner and deadline:

| # | Action | Owner | Deadline | Priority | Status |
|---|---|---|---|---|---|
| 1 | | | | P0/P1/P2 | Open |

**Priority definitions:**

- P0: Must complete before next business day
- P1: Must complete within 1 week
- P2: Must complete within 1 sprint/month

### 7. Recurrence Prevention

- [ ] Monitoring added/improved for this failure mode
- [ ] Runbook created/updated
- [ ] Test coverage added
- [ ] Architecture change needed? (If yes, create RFC)
- [ ] Training needed for team?

## Blameless Post-Mortem Rules

1. Focus on systems, not individuals
2. "What happened" not "who did it"
3. Assume everyone acted with best intentions and available information
4. The goal is learning, not punishment
5. If you find yourself writing someone's name next to a mistake, rewrite it as a process gap

## Incident Cost Calculator

```
Direct costs:
  Revenue lost during downtime: $___
  SLA credits issued: $___
  Emergency vendor/contractor costs: $___

Indirect costs:
  Engineering hours × loaded rate: ___ hrs × $___/hr = $___
  Customer churn risk (affected users × churn probability × LTV): $___
  Brand/reputation (estimate): $___

Total incident cost: $___
Cost per minute of downtime: $___
```

## Quarterly Incident Review

Every quarter, analyze patterns across all post-mortems:

1. **Top 3 root cause categories:** Where should you invest in prevention?
2. **Mean time to detect (MTTD):** Is monitoring improving?
3. **Mean time to resolve (MTTR):** Is response getting faster?
4. **Action item completion rate:** Are you actually fixing things?
5. **Repeat incidents:** Same root cause twice = systemic failure
6. **Cost trend:** Total incident cost per quarter (should decrease)

## Industry-Specific Post-Mortem Considerations

| Industry | Key Focus | Regulatory Requirement |
|---|---|---|
| Fintech | Transaction integrity, audit trail | SOX, PCI-DSS incident reporting |
| Healthcare | PHI exposure, patient safety | HIPAA breach notification (60 days) |
| SaaS | SLA compliance, data integrity | SOC 2 incident management |
| E-commerce | Order integrity, payment processing | PCI-DSS, consumer protection |
| Manufacturing | Safety incidents, production loss | OSHA reporting requirements |

---

## Go Deeper

Your post-mortems reveal where AI agents should be deployed first: the repetitive failures, the manual monitoring gaps, the processes that break under load.

- **Find your highest-cost gaps:** [AI Revenue Leak Calculator](https://afrexai-cto.github.io/ai-revenue-calculator/)
- **Industry-specific deployment playbooks:** [AfrexAI Context Packs ($47)](https://afrexai-cto.github.io/context-packs/)
  - Pick 3: $97 | All 10: $197 | Everything: $247
- **Deploy your first agent:** [Agent Setup Wizard](https://afrexai-cto.github.io/agent-setup/)

*Built by [AfrexAI](https://afrexai-cto.github.io/context-packs/), turning incident patterns into automation opportunities.*
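The Incident Cost Calculator in the SKILL.md source is plain arithmetic, so it can be sketched in a few lines of Python. This is an illustration only: the class name, field names, and every figure in the example are hypothetical, and only the formulas (engineering hours × loaded rate, affected users × churn probability × LTV, and the direct/indirect split) come from the template.

```python
from dataclasses import dataclass


@dataclass
class IncidentCost:
    # Direct costs
    revenue_lost: float = 0.0
    sla_credits: float = 0.0
    emergency_vendor_costs: float = 0.0
    # Inputs for indirect costs
    engineering_hours: float = 0.0
    loaded_hourly_rate: float = 0.0
    affected_users: int = 0
    churn_probability: float = 0.0
    lifetime_value: float = 0.0
    reputation_estimate: float = 0.0

    def direct(self):
        return self.revenue_lost + self.sla_credits + self.emergency_vendor_costs

    def indirect(self):
        engineering = self.engineering_hours * self.loaded_hourly_rate
        churn_risk = self.affected_users * self.churn_probability * self.lifetime_value
        return engineering + churn_risk + self.reputation_estimate

    def total(self):
        return self.direct() + self.indirect()

    def per_minute(self, downtime_minutes):
        if downtime_minutes <= 0:
            raise ValueError("downtime_minutes must be positive")
        return self.total() / downtime_minutes


# Made-up figures for a 150-minute outage.
cost = IncidentCost(
    revenue_lost=12_000,
    sla_credits=3_000,
    engineering_hours=40,
    loaded_hourly_rate=100,
    affected_users=1_000,
    churn_probability=0.01,
    lifetime_value=500,
)
print(f"Total incident cost: ${cost.total():,.2f}")
print(f"Cost per minute of downtime: ${cost.per_minute(150):,.2f}")
# prints "Total incident cost: $24,000.00"
# prints "Cost per minute of downtime: $160.00"
```

Keeping direct and indirect costs as separate methods mirrors the template's two sections, which makes it easier to compare the totals across quarters for the cost trend review.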
Related Skills
Incident Postmortem Generator
Generate blameless incident postmortems from raw notes, Slack threads, or bullet points.
Incident Response Playbook
Structured incident response for business and IT teams. Guides you through detection, triage, containment, resolution, and post-mortem — with auto-generated timelines and action items.
botlearn-healthcheck
botlearn-healthcheck — BotLearn autonomous health inspector for OpenClaw instances across 5 domains (hardware, config, security, skills, autonomy); triggers on system check, health report, diagnostics, or scheduled heartbeat inspection.
afrexai-performance-engineering
Complete performance engineering system — profiling, optimization, load testing, capacity planning, and performance culture. Use when diagnosing slow applications, optimizing code/queries/infrastructure, load testing before launch, planning capacity, or building performance into CI/CD. Covers Node.js, Python, Go, Java, databases, APIs, and frontend.
OpenClaw Mastery — The Complete Agent Engineering & Operations System
> Built by AfrexAI — the team that runs 9+ production agents 24/7 on OpenClaw.
Legacy System Modernization Engine
Complete methodology for assessing, planning, and executing legacy system modernization — from monolith decomposition to cloud migration. Works for any tech stack, any scale.
Git Engineering & Repository Strategy
You are a Git Engineering expert. You help teams design branching strategies, implement code review workflows, manage monorepos, automate releases, and maintain healthy repository practices at scale.
Django Production Engineering
Complete methodology for building, scaling, and operating production Django applications. From project structure to deployment, security to performance — every decision framework a Django team needs.
IT Disaster Recovery Plan Generator
Build production-ready disaster recovery plans that actually get followed when things break.
afrexai-api-architect
Design, build, test, document, and secure production-grade APIs. Covers the full lifecycle from schema design through deployment, monitoring, and versioning. Use when designing new APIs, reviewing existing ones, generating OpenAPI specs, building test suites, or debugging production issues.
Agent Ops Runbook
Generate a production-ready operations runbook for deploying AI agents. Covers pre-deployment checklists, shadow mode → supervised → autonomous rollout stages, monitoring dashboards, rollback procedures, cost management, and incident response templates.
node-red-manager
Manage Node-RED instances via Admin API or CLI. Automate flow deployment, install nodes, and troubleshoot issues. Use when user wants to "build automation", "connect devices", or "fix node-red".