operating-production-services
SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).
Best use case
operating-production-services is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).
Teams using operating-production-services should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/operating-production-services/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How operating-production-services Compares
| Feature / Agent | operating-production-services | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Operating Production Services
Production reliability patterns: measure what matters, learn from failures, improve systematically.
## Quick Reference
| Need | Go To |
|------|-------|
| Define reliability targets | [SLOs & Error Budgets](#slos--error-budgets) |
| Write incident report | [Postmortem Templates](#postmortem-templates) |
| Set up SLO alerting | [references/slo-alerting.md](references/slo-alerting.md) |
---
## SLOs & Error Budgets
### The Hierarchy
```
SLA (Contract) → SLO (Target) → SLI (Measurement)
```
### Common SLIs
```promql
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```
### SLO Targets Reality Check
| SLO % | Downtime/Month | Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |
**Don't aim for 100%.** Each nine costs exponentially more.
### Error Budget
```
Error Budget = 1 - SLO Target
```
**Example:** 99.9% SLO = 0.1% error budget = 43 minutes/month
**Policy:**
| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |
See [references/slo-alerting.md](references/slo-alerting.md) for Prometheus recording rules and multi-window burn rate alerts.
---
## Postmortem Templates
### The Blameless Principle
| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
### When to Write Postmortems
- SEV1/SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
### Standard Template
```markdown
# Postmortem: [Incident Title]
**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX
## Executive Summary
One paragraph: what happened, impact, root cause, resolution.
## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |
## Root Cause Analysis
### 5 Whys
1. Why did service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]
## Impact
- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X
## Action Items
| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |
```
### Quick Template (Minor Incidents)
```markdown
# Quick Postmortem: [Title]
**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3
## What Happened
One sentence description.
## Timeline
- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution
## Root Cause
One sentence.
## Fix
- Immediate: [What was done]
- Long-term: [Ticket XXX-123]
```
---
## Postmortem Meeting Guide
### Structure (60 min)
1. **Opening (5 min)** - Remind: "We're here to learn, not blame"
2. **Timeline (15 min)** - Walk through events chronologically
3. **Analysis (20 min)** - What failed? Why? What allowed it?
4. **Action Items (15 min)** - Prioritize, assign owners, set dates
5. **Closing (5 min)** - Summarize learnings, confirm owners
### Facilitation Tips
- Redirect blame to systems: "What made this mistake possible?"
- Time-box tangents
- Document dissenting views
- Encourage quiet participants
---
## Anti-Patterns
| Don't | Do Instead |
|-------|------------|
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
---
## Verification
Run: `python scripts/verify.py`
## References
- [references/slo-alerting.md](references/slo-alerting.md) - Prometheus rules, burn rate alerts, Grafana dashboardsRelated Skills
genkit-production-expert
Build production Firebase Genkit applications including RAG systems, multi-step flows, and tool calling for Node.js/Python/Go. Deploy to Firebase Functions or Cloud Run with AI monitoring. Use when asked to "create genkit flow" or "implement RAG". Trigger with relevant phrases based on skill purpose.
generating-grpc-services
Generate gRPC service definitions, stubs, and implementations from Protocol Buffers. Use when creating high-performance gRPC services. Trigger with phrases like "generate gRPC service", "create gRPC API", or "build gRPC server".
gtm-operating-cadence
Design meeting rhythms, metric reporting, quarterly planning, and decision-making velocity for scaling companies. Use when decisions are slow, planning is broken, the company is growing but alignment is worse, or leadership meetings consume all time without producing decisions.
dataverse-python-production-code
Generate production-ready Python code using Dataverse SDK with error handling, optimization, and best practices
../../../marketing-skill/content-production/SKILL.md
No description provided.
deploying-to-production
Automate creating a GitHub repository and deploying a web project to Vercel. Use when the user asks to deploy a website/app to production, publish a project, or set up GitHub + Vercel deployment.
production-code-audit
Autonomously deep-scan entire codebase line-by-line, understand architecture and patterns, then systematically transform it to production-grade, corporate-level professional quality with optimizations
linux-production-shell-scripts
This skill should be used when the user asks to "create bash scripts", "automate Linux tasks", "monitor system resources", "backup files", "manage users", or "write production shell scripts". It provides ready-to-use shell script templates for system administration.
production-readiness
Comprehensive pre-deployment validation ensuring code is production-ready. Runs complete audit pipeline, performance benchmarks, security scan, documentation check, and generates deployment checklist.
operating-k8s-local
Operates local Kubernetes clusters with Minikube for development and testing. Use when setting up local K8s, deploying applications locally, or debugging K8s issues. Covers Minikube, kubectl essentials, local image loading, and networking.
prototype-to-production
Convert design prototypes (HTML, CSS, Figma exports) into production-ready components. Analyzes prototype structure, extracts design tokens, identifies reusable patterns, and generates typed React components. Adapts to existing project tech stack with React + TypeScript as default.
design-to-production
Guided workflow for implementing HTML design prototypes as production React components with glassmorphism styling and quality standards enforcement. Use when converting design prototypes to production code.