operating-production-services
SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/operating-production-services/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
SKILL.md Source
# Operating Production Services
Production reliability patterns: measure what matters, learn from failures, improve systematically.
## Quick Reference
| Need | Go To |
|------|-------|
| Define reliability targets | [SLOs & Error Budgets](#slos--error-budgets) |
| Write incident report | [Postmortem Templates](#postmortem-templates) |
| Set up SLO alerting | [references/slo-alerting.md](references/slo-alerting.md) |
---
## SLOs & Error Budgets
### The Hierarchy
```
SLA (Contract) → SLO (Target) → SLI (Measurement)
```
### Common SLIs
```promql
# Availability: non-5xx requests / total requests over a 28-day window
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

# Latency: requests served within 0.5 s / total requests over the same window
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```
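The same two ratios can be computed from raw counts, which is useful for sanity-checking a dashboard value. This is a minimal Python sketch: the function names, the sample counts, and the convention that zero traffic counts as a fully met SLI are assumptions, not part of the skill.

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Good requests / total requests, where 'good' means not a 5xx response."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return (total_requests - server_errors) / total_requests


def latency_sli(requests_within_threshold: int, total_requests: int) -> float:
    """Requests served within the latency threshold / total requests."""
    if total_requests == 0:
        return 1.0  # assumption: same zero-traffic convention as above
    return requests_within_threshold / total_requests


# Hypothetical counts for one 28-day window
print(f"availability: {availability_sli(2_000_000, 1_400):.4%}")
print(f"latency:      {latency_sli(1_960_000, 2_000_000):.4%}")
```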
### SLO Targets Reality Check
| SLO % | Allowed Downtime/Month (30 days) | Allowed Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |
**Don't aim for 100%.** Each additional nine cuts the allowed downtime tenfold, and the cost of achieving it typically rises much faster than the benefit.
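The downtime figures follow directly from the SLO percentage and the length of the period. A short sketch that reproduces them, assuming a 30-day month and a 365-day year (the helper name `allowed_downtime` is hypothetical):

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Downtime permitted by an availability SLO over the given period."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return period * (1 - slo_percent / 100)


MONTH = timedelta(days=30)   # assumption: 30-day month
YEAR = timedelta(days=365)   # assumption: non-leap year

for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}%  month={allowed_downtime(slo, MONTH)}  year={allowed_downtime(slo, YEAR)}")
```

Note that a 28-day SLI window gives slightly smaller budgets than the 30-day month used here.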
### Error Budget
```
Error Budget = 1 - SLO Target
```
**Example:** 99.9% SLO = 0.1% error budget ≈ 43 minutes of downtime in a 30-day month
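For request-based SLOs the budget is usually tracked in events rather than minutes. A minimal sketch of how the remaining budget could be computed, assuming you already have total and failed event counts for the SLO window (function names and sample numbers are illustrative):

```python
def error_budget(slo_target: float) -> float:
    """Allowed failure fraction, e.g. 0.999 -> 0.001."""
    return 1.0 - slo_target


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_bad = error_budget(slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("no error budget: need slo_target < 1 and total_events > 0")
    return 1.0 - bad_events / allowed_bad


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO
print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget remains")
```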
**Policy:**
| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% (exhausted) | Feature freeze, fix reliability |
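The policy table can be encoded as a small lookup so that tooling (a deploy gate, a chat bot) gives a consistent answer. This is a sketch only: the table does not say which row the exact boundaries 10% and 50% belong to, so the choice below is an assumption.

```python
def budget_policy(remaining: float) -> str:
    """Map the fraction of error budget remaining (0.0-1.0) to a policy action."""
    if remaining <= 0.0:
        return "Feature freeze, fix reliability"
    if remaining < 0.10:
        return "Freeze non-critical changes"
    if remaining <= 0.50:  # assumption: exactly 10% and 50% fall in this band
        return "Postpone risky changes"
    return "Normal velocity"


for value in (0.75, 0.30, 0.05, 0.0):
    print(f"{value:.0%} remaining -> {budget_policy(value)}")
```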
See [references/slo-alerting.md](references/slo-alerting.md) for Prometheus recording rules and multi-window burn rate alerts.
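As background for that reference, burn rate is the observed error ratio divided by the error budget: a burn rate of 1 spends the budget exactly over the SLO window. The sketch below illustrates the multi-window idea in Python. It is not the content of `references/slo-alerting.md`. The default threshold of 14.4 is the commonly cited value for a 1-hour long window paired with a 5-minute short window (2% of a 30-day budget spent in one hour), and is an assumption here.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast: the long window shows the problem
    is significant, the short window shows it is still happening."""
    return (burn_rate(long_window_error_ratio, slo_target) > threshold
            and burn_rate(short_window_error_ratio, slo_target) > threshold)


# Hypothetical error ratios for a 99.9% SLO
print(should_page(0.02, 0.03, 0.999))    # both windows burning fast
print(should_page(0.02, 0.0005, 0.999))  # short window has already recovered
```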
---
## Postmortem Templates
### The Blameless Principle
| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
### When to Write Postmortems
- SEV1/SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
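If incidents are tracked in a tool, these criteria can be checked automatically. A minimal sketch, assuming a hypothetical `Incident` record whose field names are invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: int                           # 1 for SEV1, 2 for SEV2, ...
    customer_facing_outage_minutes: float = 0.0
    data_loss: bool = False
    security_incident: bool = False
    severe_near_miss: bool = False
    novel_failure_mode: bool = False


def postmortem_reasons(incident: Incident) -> list[str]:
    """Return every criterion the incident meets; an empty list means optional."""
    reasons = []
    if incident.severity <= 2:
        reasons.append("SEV1/SEV2 incident")
    if incident.customer_facing_outage_minutes > 15:
        reasons.append("customer-facing outage > 15 minutes")
    if incident.data_loss or incident.security_incident:
        reasons.append("data loss or security incident")
    if incident.severe_near_miss:
        reasons.append("near-miss that could have been severe")
    if incident.novel_failure_mode:
        reasons.append("novel failure mode")
    return reasons


print(postmortem_reasons(Incident(severity=3, customer_facing_outage_minutes=22)))
```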
### Standard Template
```markdown
# Postmortem: [Incident Title]
**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX
## Executive Summary
One paragraph: what happened, impact, root cause, resolution.
## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |
## Root Cause Analysis
### 5 Whys
1. Why did the service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]
## Impact
- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X
## Action Items
| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |
```
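The **Duration** field and the gaps between timeline rows are easy to get wrong by hand, especially when an incident crosses midnight. A small sketch for deriving them from UTC timestamps. The timestamp format and the sample timeline are assumptions.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # assumption: timeline entries recorded as UTC date + HH:MM


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two timeline entries."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


# Hypothetical timeline
timeline = {
    "First alert fired": "2024-01-01 23:50",
    "On-call acknowledged": "2024-01-01 23:54",
    "Service recovered": "2024-01-02 00:35",
}

first_alert = timeline["First alert fired"]
for event, stamp in timeline.items():
    print(f"{event}: +{minutes_between(first_alert, stamp):.0f} min")
```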
### Quick Template (Minor Incidents)
```markdown
# Quick Postmortem: [Title]
**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3
## What Happened
One-sentence description.
## Timeline
- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution
## Root Cause
One sentence.
## Fix
- Immediate: [What was done]
- Long-term: [Ticket XXX-123]
```
---
## Postmortem Meeting Guide
### Structure (60 min)
1. **Opening (5 min)** - Remind: "We're here to learn, not blame"
2. **Timeline (15 min)** - Walk through events chronologically
3. **Analysis (20 min)** - What failed? Why? What allowed it?
4. **Action Items (15 min)** - Prioritize, assign owners, set dates
5. **Closing (5 min)** - Summarize learnings, confirm owners
### Facilitation Tips
- Redirect blame to systems: "What made this mistake possible?"
- Time-box tangents
- Document dissenting views
- Encourage quiet participants
---
## Anti-Patterns
| Don't | Do Instead |
|-------|------------|
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
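The "owner + date + ticket" rule is simple to enforce mechanically before a postmortem is closed. A minimal sketch, assuming action items are available as records; the `ActionItem` type and its field names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[str] = None      # e.g. "2024-02-01"
    ticket: Optional[str] = None   # e.g. "XXX-123"


def missing_fields(item: ActionItem) -> list[str]:
    """Names of the required fields that are empty; an empty list means not orphaned."""
    return [name for name in ("owner", "due", "ticket") if not getattr(item, name)]


items = [
    ActionItem("Add alert on queue depth", owner="@name", due="2024-02-01", ticket="XXX-125"),
    ActionItem("Review retry policy", owner="@name"),
]
for item in items:
    gaps = missing_fields(item)
    print(f"{item.action}: {'ok' if not gaps else 'missing ' + ', '.join(gaps)}")
```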
---
## Verification
Run: `python scripts/verify.py`
## References
- [references/slo-alerting.md](references/slo-alerting.md) - Prometheus rules, burn rate alerts, Grafana dashboards