Incident Postmortem Generator
Generate blameless incident postmortems from raw notes, Slack threads, or bullet points.
About this skill
This AI agent skill automates the drafting of incident postmortem documents. It takes raw, messy incident details and turns them into a markdown-formatted report with seven sections: Executive Summary, Timeline, 5 Whys Root Cause Analysis, Impact Assessment, prioritized Action Items, Lessons Learned, and Prevention Measures. The structure follows Google/Atlassian SRE best practices, so teams get consistent documentation without formatting by hand or filling in a template. The skill keeps the language blameless and flags gaps the team should fill in before publishing. Because the first draft is generated automatically, teams can spend their time on analysis, validation, and prevention rather than on documentation.
Best use case
The primary use case for this skill is to assist engineering teams, Site Reliability Engineers (SREs), and DevOps professionals in quickly and consistently documenting production incidents. It's ideal for situations where an incident has occurred, and the team needs to generate a formal postmortem report to understand what happened, its impact, and what actions are needed to prevent recurrence. This skill helps ensure that all necessary information is captured in a structured, blameless format, facilitating effective incident reviews and continuous improvement.
The output is a comprehensive, structured, blameless incident postmortem in markdown format, complete with an executive summary, timeline, root cause analysis, impact assessment, action items, lessons learned, and prevention measures.
Practical example
Example input
prod went down at 2pm, bad deploy, rolled back at 2:45, ~500 users couldn't checkout, lost maybe $12k revenue
Example output
```markdown
# Incident Postmortem: Checkout Service Outage

**Date:** [Generated Date, e.g., 2024-03-15] | **Severity:** SEV1 | **Duration:** 45 minutes
**Author:** [Team Lead Placeholder]

## Executive Summary

On [Generated Date], the production checkout service experienced an outage lasting 45 minutes (14:00-14:45 UTC) due to a faulty deployment. Approximately 500 users were unable to complete purchases, resulting in an estimated revenue loss of $12,000. The issue was resolved by rolling back the problematic deployment.

## Timeline

- **[Generated Date] 14:00 UTC:** Production checkout service went down following a recent deployment.
- **[Generated Date] 14:15 UTC:** Diagnosis identified a 'bad deploy' as the root cause.
- **[Generated Date] 14:30 UTC:** Rollback procedure initiated.
- **[Generated Date] 14:45 UTC:** Service fully restored after successful rollback.

## Root Cause Analysis (5 Whys)

1. **Why did the checkout service go down?** A recent deployment introduced a critical bug.
2. **Why was the bug deployed to production?** [Investigation required to determine testing gaps or process failures].
3. **Why did the deployment process allow a buggy release?** [Investigation required to assess automated checks or human review].
4. **Why weren't issues detected pre-deployment?** [Investigation required into testing environment or coverage].
5. **Why were the previous safeguards insufficient?** [Investigation required into overall deployment strategy].

## Impact Assessment

- **Users Affected:** Approximately 500 users could not complete checkout.
- **Revenue Impact:** Estimated $12,000 lost revenue.
- **SLA Breach:** SEV1 incident, confirming a breach of service availability SLA.

## Action Items

- Review and enhance pre-deployment testing for the checkout service. **Owner:** [Placeholder], **Deadline:** [Placeholder], **Priority:** P0
- Implement automated canary deployments for critical services. **Owner:** [Placeholder], **Deadline:** [Placeholder], **Priority:** P1

## Lessons Learned

- **What worked:** Quick detection and efficient rollback minimized incident duration.
- **What didn't:** Insufficient pre-deployment validation allowed a breaking change to reach production.
- **What was lucky:** No data corruption occurred during the outage.

## Prevention Measures

1. Implement mandatory automated integration tests for critical user flows (High Effort/High Impact)
2. Establish a phased rollout strategy for all production deployments (Medium Effort/High Impact)
3. Conduct a post-mortem review of the deployment pipeline to identify failure points (Medium Effort/Medium Impact)
```
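Before publishing a generated draft, it can help to confirm that all seven sections named in the skill description are present. The following is a minimal sketch, not part of the skill itself: the `missing_sections` helper and its matching rule (a level-2 markdown heading that starts with the section name) are assumptions made for illustration.

```python
import re

# Section names come from the skill description; everything else in this
# sketch (function name, heading-matching rule) is a hypothetical choice.
REQUIRED_SECTIONS = [
    "Executive Summary",
    "Timeline",
    "Root Cause Analysis",
    "Impact Assessment",
    "Action Items",
    "Lessons Learned",
    "Prevention Measures",
]


def missing_sections(postmortem_md: str) -> list[str]:
    """Return the required section names that have no '## ' heading."""
    # Capture the text of every level-2 heading ("## Title").
    headings = re.findall(
        r"^##[ \t]+(.+?)[ \t]*$", postmortem_md, flags=re.MULTILINE
    )
    # A heading such as "Root Cause Analysis (5 Whys)" still counts,
    # because only the start of the heading has to match.
    return [
        name
        for name in REQUIRED_SECTIONS
        if not any(heading.startswith(name) for heading in headings)
    ]


draft = """# Incident Postmortem: Checkout Service Outage
## Executive Summary
...
## Timeline
...
## Root Cause Analysis (5 Whys)
...
"""

# Lists the sections that this partial draft still lacks.
print(missing_sections(draft))
```

A check like this only confirms that the headings exist; whether each section is accurate still needs human review.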
When to use this skill
- Immediately after resolving a production incident or outage.
- When you need a structured, blameless postmortem document quickly.
- To ensure consistent incident documentation practices across a team or organization.
- When working with fragmented incident details from chat logs, notes, or verbal accounts.
When not to use this skill
- For trivial issues that do not warrant a formal incident postmortem.
- When the investigation must rely solely on a human expert's in-depth, real-time judgment, without AI assistance.
- For drafting highly sensitive compliance or legal documents without direct human oversight and editing.
- If the primary goal is real-time incident response or troubleshooting, rather than post-incident documentation.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/afrexai-postmortem/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
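The manual steps above can also be scripted. This is a minimal sketch, assuming the file has already been downloaded somewhere on disk; the `install_skill` helper is hypothetical, and only the destination path is taken from the instructions.

```python
import shutil
from pathlib import Path

# Destination relative to the project root, as given in the manual steps.
SKILL_RELATIVE_PATH = Path(".claude/skills/afrexai-postmortem/SKILL.md")


def install_skill(downloaded_skill: Path, project_root: Path) -> Path:
    """Copy a downloaded SKILL.md into the project's skill directory."""
    destination = project_root / SKILL_RELATIVE_PATH
    # Create .claude/skills/afrexai-postmortem/ if it does not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(downloaded_skill, destination)
    return destination
```

For example, `install_skill(Path("SKILL.md"), Path.cwd())` would copy a file from the current directory into place. The agent still needs to be restarted afterwards.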
How Incident Postmortem Generator Compares
| Feature / Agent | Incident Postmortem Generator | Standard Approach |
|---|---|---|
| Platform Support | Claude Code, Cursor, Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Easy | N/A |
Frequently Asked Questions
What does this skill do?
Generate blameless incident postmortems from raw notes, Slack threads, or bullet points.
How difficult is it to install?
The installation complexity is rated as easy. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Incident Postmortem Generator

Generate blameless incident postmortems from raw notes, Slack threads, or bullet points.

## What It Does

Takes messy incident details and produces a structured postmortem document following Google/Atlassian SRE best practices.

## Usage

Provide incident details in any format (timeline bullets, Slack copy-paste, verbal notes) and the agent will produce:

1. **Executive Summary**: What happened, impact, duration, severity
2. **Timeline**: Minute-by-minute from detection to resolution
3. **Root Cause Analysis**: 5 Whys format, no finger-pointing
4. **Impact Assessment**: Users affected, revenue impact, SLA breach
5. **Action Items**: Prioritized fixes with owners and deadlines
6. **Lessons Learned**: What worked, what didn't, what was lucky
7. **Prevention Measures**: Systemic changes to prevent recurrence

## Instructions

When the user provides incident details:

1. Ask clarifying questions ONLY if critical info is missing (severity, duration, or resolution are the minimum)
2. Generate the full postmortem in markdown
3. Flag any gaps the team should fill in before publishing
4. Suggest 3-5 specific, actionable prevention measures ranked by effort/impact

### Formatting Rules

- Use ISO timestamps in timeline
- Bold severity level (SEV1-SEV4)
- Action items must have: description, owner placeholder, deadline placeholder, priority (P0-P3)
- Keep language blameless: "the deploy process" not "Bob deployed"

### Severity Guide

- **SEV1**: Revenue-impacting, all users affected, >1hr
- **SEV2**: Major feature down, >30% users, >30min
- **SEV3**: Degraded performance, <30% users
- **SEV4**: Minor issue, workaround available

## Example Input

```
prod went down at 2pm, bad deploy, rolled back at 2:45, ~500 users couldn't checkout, lost maybe $12k revenue
```

## Example Output Structure

```markdown
# Incident Postmortem: Checkout Service Outage

**Date:** 2026-02-22 | **Severity:** SEV1 | **Duration:** 45 minutes
**Author:** [Team Lead] | **Status:** Draft

## Executive Summary
...
```

## Pro Tip

Run this after every incident, even small ones. The pattern recognition across postmortems is where the real value lives. Teams that write postmortems for SEV3+ incidents catch systemic issues 3x faster.

---

Need help building incident response processes from scratch? Check out the [AfrexAI Context Packs](https://afrexai-cto.github.io/context-packs/): pre-built operational frameworks for SaaS, healthcare, legal, and 7 more industries. Starting at $47.
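The Formatting Rules in the SKILL.md source above ask for ISO timestamps and for action items with owner and deadline placeholders plus a P0-P3 priority. The sketch below shows one way those rules could be applied to the example input. It assumes UTC, assumes both times are in the afternoon, and borrows the date 2026-02-22 from the example output structure; all helper names are hypothetical and nothing here is part of the skill.

```python
from datetime import date, datetime, timezone

# Priority range taken from the Formatting Rules.
VALID_PRIORITIES = ("P0", "P1", "P2", "P3")


def to_iso_utc(day: date, clock: str) -> str:
    """Turn a note such as '2pm' or '2:45pm' into an ISO 8601 UTC timestamp."""
    fmt = "%I:%M%p" if ":" in clock else "%I%p"
    parsed = datetime.strptime(clock.strip().upper(), fmt)
    # Assumption: the notes are in UTC; real notes may need a time zone.
    stamp = datetime.combine(day, parsed.time(), tzinfo=timezone.utc)
    return stamp.isoformat()


def duration_minutes(start_iso: str, end_iso: str) -> int:
    """Whole minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end_iso) - datetime.fromisoformat(start_iso)
    return int(delta.total_seconds() // 60)


def format_action_item(
    description: str,
    priority: str,
    owner: str = "[Placeholder]",
    deadline: str = "[Placeholder]",
) -> str:
    """Render one action item with the four required fields."""
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"priority must be one of {VALID_PRIORITIES}")
    return (
        f"- {description} **Owner:** {owner}, "
        f"**Deadline:** {deadline}, **Priority:** {priority}"
    )


incident_day = date(2026, 2, 22)  # date borrowed from the example output
went_down = to_iso_utc(incident_day, "2pm")
rolled_back = to_iso_utc(incident_day, "2:45pm")  # "2:45" read as 2:45pm

print(went_down)    # 2026-02-22T14:00:00+00:00
print(rolled_back)  # 2026-02-22T14:45:00+00:00
print(duration_minutes(went_down, rolled_back))  # 45
print(format_action_item("Review pre-deployment testing.", "P0"))
```

Leaving owner and deadline as placeholders follows the rule that the team, not the generator, assigns them before publishing.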