postmortem-writer

Creates comprehensive post-incident documents with timeline, root cause analysis, contributing factors, action items, and ownership. Follows SRE best practices for blameless postmortems. Use for "postmortem", "incident review", "RCA", or "post-incident".

Best use case

postmortem-writer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using postmortem-writer should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/postmortem-writer/SKILL.md --create-dirs "https://raw.githubusercontent.com/aAAaqwq/AGI-Super-Team/main/skills/postmortem-writer/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/postmortem-writer/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How postmortem-writer Compares

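| Feature / Agent         | postmortem-writer | Standard Approach |
| ----------------------- | ----------------- | ----------------- |
| Platform Support        | Not specified     | Limited / Varies  |
| Context Awareness       | High              | Baseline          |
| Installation Complexity | Unknown           | N/A               |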

Frequently Asked Questions

What does this skill do?

Creates comprehensive post-incident documents with timeline, root cause analysis, contributing factors, action items, and ownership. Follows SRE best practices for blameless postmortems. Use for "postmortem", "incident review", "RCA", or "post-incident".

Where can I find the source code?

The source code is on GitHub in the aAAaqwq/AGI-Super-Team repository, at skills/postmortem-writer/SKILL.md.

SKILL.md Source

# Postmortem Writer

Document incidents for learning and improvement.

## Postmortem Template

```markdown
# Postmortem: API Outage - Database Connection Pool Exhausted

**Date:** 2024-01-15
**Authors:** Jane Doe (On-call), John Smith (DBA)
**Status:** Complete
**Severity:** P1 (Critical)

## Summary

On January 15, 2024, our API experienced a complete outage for 25 minutes (14:32 - 14:57 UTC) affecting 100% of users. The root cause was database connection pool exhaustion triggered by a connection leak introduced in deployment v2.3.4.

**Impact:**

- Duration: 25 minutes
- Users affected: ~50,000
- Requests failed: ~125,000
- Revenue impact: ~$15,000

## Timeline (All times UTC)

| Time  | Event                                            |
| ----- | ------------------------------------------------ |
| 14:15 | v2.3.4 deployed to production                    |
| 14:32 | First CloudWatch alarm: HighErrorRate            |
| 14:33 | PagerDuty alert sent to on-call (Jane)           |
| 14:35 | Jane acknowledges, begins investigation          |
| 14:38 | Identified: Database connection pool at 100%     |
| 14:40 | Attempted: Kill long-running queries (no effect) |
| 14:43 | Decision: Rollback to v2.3.3                     |
| 14:45 | Rollback initiated                               |
| 14:47 | Rollback complete, connections dropping          |
| 14:50 | Error rate returning to normal                   |
| 14:57 | All systems recovered, incident closed           |
| 15:30 | Postmortem meeting scheduled                     |

## Root Cause

A code change in v2.3.4 introduced a connection leak in the user profile endpoint. The new caching layer was not properly releasing database connections after queries completed.

**Code diff:**
\`\`\`diff
- await prisma.user.findUnique({ where: { id } });
+ const client = await pool.connect();
+ const user = await client.query('SELECT * FROM users WHERE id = $1', [id]);
+ // Missing: client.release() ❌
\`\`\`

## Contributing Factors

1. **Insufficient testing:** Load tests didn't catch the leak

   - Tests only ran for 5 minutes
   - Not enough concurrent connections to exhaust pool

2. **Missing monitoring:** No alerts on connection pool metrics

   - Had alarms for query latency
   - No alarms for active connections count

3. **Inadequate code review:** Review process didn't catch the missing release()

   - PR approved without running locally
   - No checklist for connection management

4. **Deployment process:** No gradual rollout
   - Deployed to 100% of production immediately
   - No canary deployment

## What Went Well

1. ✅ **Fast response:** On-call acknowledged within 3 minutes of the first alarm
2. ✅ **Clear runbook:** DBA runbook had exact steps to follow
3. ✅ **Quick decision:** Rollback decided 8 minutes after acknowledgment
4. ✅ **Communication:** Status page updated every 5 minutes
5. ✅ **Rollback capability:** Automated rollback took about 2 minutes

## What Went Wrong

1. ❌ **Code review missed bug:** Connection leak not caught
2. ❌ **Testing gaps:** Load tests too short to expose the leak
3. ❌ **No canary:** Deployed to all instances at once
4. ❌ **Late detection:** 17 minutes between deploy and alert

## Action Items

| Action                                        | Owner   | Due Date   | Priority | Status         |
| --------------------------------------------- | ------- | ---------- | -------- | -------------- |
| Add connection pool metrics to dashboards     | Jane    | 2024-01-20 | P0       | ✅ Done        |
| Create PR checklist for connection management | John    | 2024-01-22 | P0       | ✅ Done        |
| Extend load tests to 30 minutes minimum       | QA Team | 2024-01-25 | P1       | 🔄 In Progress |
| Implement canary deployment (10% → 100%)      | DevOps  | 2024-02-01 | P1       | 📋 Planned     |
| Add connection leak detection to tests        | Jane    | 2024-01-27 | P1       | 🔄 In Progress |
| Review all DB connection usage patterns       | John    | 2024-02-05 | P2       | 📋 Planned     |
| Improve alert routing (faster escalation)     | DevOps  | 2024-02-10 | P2       | 📋 Planned     |

## Lessons Learned

1. **Code review checklists work:** Need specific items for common issues
2. **Load tests need realistic duration:** 5 minutes is too short to surface leaks
3. **Gradual rollouts catch issues:** 10% canary would have limited impact
4. **Monitoring gaps are dangerous:** Add metrics before you need them
5. **Runbooks save time:** Clear procedures enabled fast response

## Related Incidents

- [2023-11-20] Database CPU spike (similar connection pool issue)
- [2023-08-15] Memory leak in cache layer

## Prevention

To prevent similar incidents:

1. ✅ Add connection management to code review checklist
2. ✅ Monitor connection pool utilization
3. ✅ Extend load test duration
4. ✅ Implement canary deployments
5. ✅ Add automated connection leak detection

## Appendix

### Monitoring Graphs

[Insert graphs of connection pool, error rates, latency during incident]

### Communication Log

[Insert status page updates and customer communication]

### Code Fix

PR #1235: Fix connection leak in user profile endpoint
\`\`\`typescript
const client = await pool.connect();
try {
  const user = await client.query('SELECT * FROM users WHERE id = $1', [id]);
  return user;
} finally {
  client.release(); // ✅ Always release
}
\`\`\`
```
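The template's action items include "Add connection leak detection to tests" and "Create PR checklist for connection management". One way this might be approached is sketched below: a small wrapper that always releases the client, plus a counting test double that reports clients still checked out. The `Client` and `Pool` interfaces, `withClient`, and `CountingPool` are all hypothetical names introduced for illustration; they are not part of the skill or of any specific database library.

```typescript
// Hypothetical minimal interfaces for illustration; a real pool exposes more.
interface Client {
  query(sql: string, params: unknown[]): Promise<unknown>;
  release(): void;
}

interface Pool {
  connect(): Promise<Client>;
}

// Runs `work` with a checked-out client and always releases it,
// even when `work` throws.
async function withClient<T>(
  pool: Pool,
  work: (client: Client) => Promise<T>,
): Promise<T> {
  const client = await pool.connect();
  try {
    return await work(client);
  } finally {
    client.release();
  }
}

// Test double that counts clients checked out but not yet released.
class CountingPool implements Pool {
  checkedOut = 0;

  async connect(): Promise<Client> {
    this.checkedOut += 1;
    let released = false;
    return {
      // Returns an empty result set; no real database is involved.
      query: async () => ({ rows: [] }),
      release: () => {
        // Guard against double release so the counter cannot go negative.
        if (!released) {
          released = true;
          this.checkedOut -= 1;
        }
      },
    };
  }
}
```

In a test, a handler would run against a `CountingPool`, and the test would assert that `checkedOut` is back to 0 afterwards. A nonzero count points to a code path that skipped `release()`.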

## Postmortem Best Practices

```markdown
# Blameless Postmortem Guidelines

## Do ✅

- Focus on systems and processes, not people
- Use a timeline with exact timestamps
- Identify contributing factors, not just root cause
- Create actionable items with owners and dates
- Document what went well (positive reinforcement)
- Share widely for organizational learning

## Don't ❌

- Blame individuals or teams
- Hide or minimize the incident
- Skip the postmortem (even for small incidents)
- Create action items without owners
- Forget to follow up on action items
- Make it a blame session

## Template Sections

1. **Summary** (2-3 sentences)
2. **Impact** (numbers: users, revenue, duration)
3. **Timeline** (chronological events)
4. **Root Cause** (technical explanation)
5. **Contributing Factors** (broader context)
6. **What Went Well** (positive reinforcement)
7. **What Went Wrong** (improvement areas)
8. **Action Items** (concrete, owned, dated)
9. **Lessons Learned** (key takeaways)
```
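Exact timestamps make the impact and response figures easy to derive rather than estimate. As a rough sketch, the durations quoted in the example postmortem can be computed from its timeline. The helper names `toMinutes` and `minutesBetween` are hypothetical, and the code assumes all timestamps fall on the same UTC day.

```typescript
// Converts an "HH:MM" timestamp into minutes since midnight.
function toMinutes(hhmm: string): number {
  const [hours, minutes] = hhmm.split(":").map(Number);
  return hours * 60 + minutes;
}

// Minutes elapsed between two same-day "HH:MM" timestamps.
function minutesBetween(start: string, end: string): number {
  return toMinutes(end) - toMinutes(start);
}

// Timestamps copied from the example timeline (all UTC, same day).
const timeline = {
  deployed: "14:15",
  firstAlarm: "14:32",
  acknowledged: "14:35",
  rollbackDecision: "14:43",
  recovered: "14:57",
};

const metrics = {
  // 17 minutes: the "late detection" gap between deploy and first alarm
  deployToAlarm: minutesBetween(timeline.deployed, timeline.firstAlarm),
  // 3 minutes: first alarm to on-call acknowledgment
  alarmToAck: minutesBetween(timeline.firstAlarm, timeline.acknowledged),
  // 8 minutes: acknowledgment to rollback decision
  ackToDecision: minutesBetween(timeline.acknowledged, timeline.rollbackDecision),
  // 25 minutes: total outage duration
  outageDuration: minutesBetween(timeline.firstAlarm, timeline.recovered),
};

console.log(metrics);
```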

## Output Checklist

- [ ] Timeline created
- [ ] Root cause identified
- [ ] Contributing factors documented
- [ ] Action items with owners
- [ ] Lessons learned captured
- [ ] Postmortem meeting held
- [ ] Document shared widely
- [ ] Follow-up scheduled
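
Parts of this checklist could be verified mechanically before a postmortem is shared. The sketch below is one possible approach, not part of the skill itself: it checks a Markdown draft for the template's section headings and flags action items that lack an owner or a `YYYY-MM-DD` due date. `REQUIRED_SECTIONS`, `missingSections`, `ActionItem`, and `incompleteActionItems` are hypothetical names. "Impact" is left out of the heading check because the example template places it inside the Summary section.

```typescript
// Headings taken from the "Template Sections" list.
const REQUIRED_SECTIONS = [
  "Summary",
  "Timeline",
  "Root Cause",
  "Contributing Factors",
  "What Went Well",
  "What Went Wrong",
  "Action Items",
  "Lessons Learned",
];

// Returns the required sections that have no matching "## " heading.
// Prefix matching lets "## Timeline (All times UTC)" count as "Timeline".
function missingSections(markdown: string): string[] {
  const headings = markdown
    .split("\n")
    .filter((line) => line.startsWith("## "))
    .map((line) => line.slice(3).trim());
  return REQUIRED_SECTIONS.filter(
    (section) => !headings.some((heading) => heading.startsWith(section)),
  );
}

// Hypothetical shape for one row of the Action Items table.
interface ActionItem {
  action: string;
  owner: string;
  dueDate: string;
}

// Returns action items with an empty owner or a due date not in YYYY-MM-DD form.
function incompleteActionItems(items: ActionItem[]): ActionItem[] {
  const isoDate = /^\d{4}-\d{2}-\d{2}$/;
  return items.filter(
    (item) => item.owner.trim() === "" || !isoDate.test(item.dueDate),
  );
}

const draft = "## Summary\n## Timeline (All times UTC)\n## Root Cause";
console.log(missingSections(draft));
```

For the three-heading draft above, `missingSections` reports the five sections that are still absent, from "Contributing Factors" through "Lessons Learned".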
