Incident Postmortem Generator

Generate blameless incident postmortems from raw notes, Slack threads, or bullet points.

3,891 stars

Complexity: easy

About this skill

This AI agent skill drafts incident postmortem documents. It takes raw, messy incident details and turns them into a markdown report with seven sections: Executive Summary, Timeline, 5 Whys Root Cause Analysis, Impact Assessment, prioritized Action Items, Lessons Learned, and Prevention Measures.

The output follows a consistent structure modeled on Google/Atlassian SRE postmortem practices, so teams do not have to format the document or fill in a template by hand. The skill keeps the language blameless and flags gaps for the team to fill in before publishing. Because it automates only the first draft, reviewers can spend their time on analysis, validation, and preventative work rather than on documentation.

Best use case

The primary use case for this skill is to assist engineering teams, Site Reliability Engineers (SREs), and DevOps professionals in quickly and consistently documenting production incidents. It's ideal for situations where an incident has occurred, and the team needs to generate a formal postmortem report to understand what happened, its impact, and what actions are needed to prevent recurrence. This skill helps ensure that all necessary information is captured in a structured, blameless format, facilitating effective incident reviews and continuous improvement.

The skill produces a structured, blameless incident postmortem in markdown, with an executive summary, timeline, root cause analysis, impact assessment, action items, lessons learned, and prevention measures.

Practical example

Example input

prod went down at 2pm, bad deploy, rolled back at 2:45, ~500 users couldn't checkout, lost maybe $12k revenue

Example output

```markdown
# Incident Postmortem: Checkout Service Outage
**Date:** [Generated Date, e.g., 2024-03-15] | **Severity:** SEV1 | **Duration:** 45 minutes
**Author:** [Team Lead Placeholder]

## Executive Summary
On [Generated Date], the production checkout service experienced an outage lasting 45 minutes (14:00-14:45 UTC) due to a faulty deployment. Approximately 500 users were unable to complete purchases, resulting in an estimated revenue loss of $12,000. The issue was resolved by rolling back the problematic deployment.

## Timeline
- **[Generated Date] 14:00 UTC:** Production checkout service went down following a recent deployment.
- **[Generated Date] [Time to be confirmed] UTC:** The recent deployment was identified as the likely cause.
- **[Generated Date] [Time to be confirmed] UTC:** Rollback procedure initiated.
- **[Generated Date] 14:45 UTC:** Service fully restored after successful rollback.

## Root Cause Analysis (5 Whys)
1. **Why did the checkout service go down?** A recent deployment introduced a critical bug.
2. **Why was the bug deployed to production?** [Investigation required to determine testing gaps or process failures].
3. **Why did the deployment process allow a buggy release?** [Investigation required to assess automated checks or human review].
4. **Why weren't issues detected pre-deployment?** [Investigation required into testing environment or coverage].
5. **Why were the previous safeguards insufficient?** [Investigation required into overall deployment strategy].

## Impact Assessment
- **Users Affected:** Approximately 500 users could not complete checkout.
- **Revenue Impact:** Estimated $12,000 lost revenue.
- **SLA Breach:** [To be confirmed against the service availability SLA].

## Action Items
- Review and enhance pre-deployment testing for the checkout service. **Owner:** [Placeholder], **Deadline:** [Placeholder], **Priority:** P0
- Implement automated canary deployments for critical services. **Owner:** [Placeholder], **Deadline:** [Placeholder], **Priority:** P1

## Lessons Learned
- **What worked:** Quick detection and efficient rollback minimized incident duration.
- **What didn't:** Insufficient pre-deployment validation allowed a breaking change to reach production.
- **What was lucky:** No data corruption occurred during the outage.

## Prevention Measures
1. Implement mandatory automated integration tests for critical user flows (High Effort/High Impact)
2. Establish a phased rollout strategy for all production deployments (Medium Effort/High Impact)
3. Review the deployment pipeline to identify failure points (Medium Effort/Medium Impact)
```
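
The 45-minute duration and the 14:00 and 14:45 timestamps in the example follow directly from the two times in the raw notes. The arithmetic can be sketched in Python as below. This is only an illustration: the helper name `to_utc` is hypothetical, the date is borrowed from the example placeholder, and treating the times as UTC and "2:45" as pm are assumptions, since the raw notes state neither.

```python
from datetime import datetime, timezone

def to_utc(date_str: str, clock: str) -> datetime:
    """Parse a date plus a 12-hour clock time such as '2pm' or '2:45pm' as UTC."""
    fmt = "%Y-%m-%d %I:%M%p" if ":" in clock else "%Y-%m-%d %I%p"
    parsed = datetime.strptime(f"{date_str} {clock.upper()}", fmt)
    return parsed.replace(tzinfo=timezone.utc)

incident_date = "2024-03-15"           # assumed: the raw notes give no date
start = to_utc(incident_date, "2pm")
end = to_utc(incident_date, "2:45pm")  # notes say "2:45"; pm is inferred

duration_minutes = int((end - start).total_seconds() // 60)

print(start.isoformat())   # 2024-03-15T14:00:00+00:00
print(end.isoformat())     # 2024-03-15T14:45:00+00:00
print(duration_minutes)    # 45
```

Any timeline entry that cannot be derived from the notes in this way is better left as a placeholder for the team to confirm.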

When to use this skill

Immediately after resolving a production incident or outage.
When you need a structured, blameless postmortem document quickly.
To ensure consistent incident documentation practices across a team or organization.
When working with fragmented incident details from chat logs, notes, or verbal accounts.

When not to use this skill

For trivial issues that do not warrant a formal incident postmortem.
When the analysis must be carried out entirely by a human expert, with no AI assistance.
For drafting highly sensitive compliance or legal documents without direct human oversight and editing.
If the primary goal is real-time incident response or troubleshooting, rather than post-incident documentation.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/afrexai-postmortem/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/1kalin/afrexai-postmortem/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/afrexai-postmortem/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How Incident Postmortem Generator Compares

| Feature / Agent | Incident Postmortem Generator | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Easy | N/A |

SKILL.md Source

# Incident Postmortem Generator

Generate blameless incident postmortems from raw notes, Slack threads, or bullet points.

## What It Does
Takes messy incident details and produces a structured postmortem document following Google/Atlassian SRE best practices.

## Usage
Provide incident details in any format — timeline bullets, Slack copy-paste, verbal notes — and the agent will produce:

1. **Executive Summary** — What happened, impact, duration, severity
2. **Timeline** — Minute-by-minute from detection to resolution
3. **Root Cause Analysis** — 5 Whys format, no finger-pointing
4. **Impact Assessment** — Users affected, revenue impact, SLA breach
5. **Action Items** — Prioritized fixes with owners and deadlines
6. **Lessons Learned** — What worked, what didn't, what was lucky
7. **Prevention Measures** — Systemic changes to prevent recurrence
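
A generated draft can be checked against the seven sections listed above before it goes to review. The following Python sketch is one possible way to do that. The function name `missing_sections` is hypothetical and not part of the skill, and it assumes each section appears as a level-two markdown heading, as in the example output.

```python
REQUIRED_SECTIONS = (
    "Executive Summary",
    "Timeline",
    "Root Cause Analysis",
    "Impact Assessment",
    "Action Items",
    "Lessons Learned",
    "Prevention Measures",
)

def missing_sections(markdown: str) -> list[str]:
    """Return the required section titles that have no '## ' heading in the draft."""
    headings = [
        line.lstrip("#").strip()
        for line in markdown.splitlines()
        if line.startswith("## ")
    ]
    # Prefix match, so "Root Cause Analysis (5 Whys)" still counts.
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(heading.startswith(section) for heading in headings)
    ]

draft = "# Incident Postmortem: Checkout Service Outage\n\n## Executive Summary\n...\n\n## Timeline\n...\n"
print(missing_sections(draft))
# ['Root Cause Analysis', 'Impact Assessment', 'Action Items', 'Lessons Learned', 'Prevention Measures']
```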

## Instructions

When the user provides incident details:

1. Ask clarifying questions ONLY if critical info is missing (severity, duration, and resolution are the minimum)
2. Generate the full postmortem in markdown
3. Flag any gaps the team should fill in before publishing
4. Suggest 3-5 specific, actionable prevention measures ranked by effort/impact
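
The first step amounts to a check for the three minimum fields. A minimal Python sketch of that check follows. It assumes the notes have already been parsed into a dictionary, and the key names and the function `missing_critical_fields` are illustrative rather than part of the skill.

```python
CRITICAL_FIELDS = ("severity", "duration", "resolution")

def missing_critical_fields(details: dict) -> list[str]:
    """Return the critical fields that are absent or empty in the parsed notes."""
    return [field for field in CRITICAL_FIELDS if not details.get(field)]

# Hypothetical parse of the example input, which never states a severity.
details = {
    "duration": "45 minutes",
    "resolution": "rolled back the deploy",
    "users_affected": 500,
}

print(missing_critical_fields(details))  # ['severity']
```

An empty result means the agent can go straight to generating the postmortem; otherwise only the listed fields need a clarifying question.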

### Formatting Rules
- Use ISO 8601 timestamps in the timeline
- Bold severity level (SEV1-SEV4)
- Action items must have: description, owner placeholder, deadline placeholder, priority (P0-P3)
- Keep language blameless — "the deploy process" not "Bob deployed"
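
The action item rule can be illustrated with a small formatter. This Python sketch is hypothetical: the function `format_action_item` is not part of the skill, and the line layout is copied from the example output rather than mandated by the rules above.

```python
VALID_PRIORITIES = {"P0", "P1", "P2", "P3"}

def format_action_item(
    description: str,
    priority: str,
    owner: str = "[Placeholder]",
    deadline: str = "[Placeholder]",
) -> str:
    """Render one action item with description, owner, deadline, and priority."""
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"priority must be one of {sorted(VALID_PRIORITIES)}, got {priority!r}")
    return f"- {description} **Owner:** {owner}, **Deadline:** {deadline}, **Priority:** {priority}"

print(format_action_item("Implement automated canary deployments for critical services.", "P1"))
# - Implement automated canary deployments for critical services. **Owner:** [Placeholder], **Deadline:** [Placeholder], **Priority:** P1
```

Defaulting owner and deadline to placeholders keeps the draft blameless and leaves assignment to the team.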

### Severity Guide
- **SEV1**: Revenue-impacting, all users affected, >1hr
- **SEV2**: Major feature down, >30% users, >30min
- **SEV3**: Degraded performance, <30% users
- **SEV4**: Minor issue, workaround available

## Example Input
```
prod went down at 2pm, bad deploy, rolled back at 2:45, ~500 users couldn't checkout, lost maybe $12k revenue
```

## Example Output Structure
```markdown
# Incident Postmortem: Checkout Service Outage
**Date:** 2026-02-22 | **Severity:** SEV1 | **Duration:** 45 minutes
**Author:** [Team Lead] | **Status:** Draft

## Executive Summary
...
```
