operating-production-services

SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).

Best use case

operating-production-services is best used when you need a repeatable AI agent workflow instead of a one-off prompt. It applies SRE patterns for production service reliability (SLOs, error budgets, postmortems, and incident response) when you are defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. It is not meant for initial service development; use scaffolding skills instead.

Users should expect a more consistent workflow output, faster repeated execution, and less time spent rewriting prompts from scratch.

Practical example

Example input

Use the "operating-production-services" skill to help with this workflow task. Context: SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response.
Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing
on-call practices. NOT for initial service development (use scaffolding skills instead).

Example output

A structured workflow result with clearer steps, more consistent formatting, and an output that is easier to reuse in the next run.

When to use this skill

  • Use this skill when you want a reusable workflow rather than writing the same prompt again and again.

When not to use this skill

  • Do not use this when you only need a one-off answer and do not need a reusable workflow.
  • Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/operating-production-services/SKILL.md --create-dirs "https://raw.githubusercontent.com/aiskillstore/marketplace/main/skills/asmayaseen/operating-production-services/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/operating-production-services/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How operating-production-services Compares

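| Feature / Agent | operating-production-services | Standard Approach |
|-----------------|-------------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |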

Frequently Asked Questions

What does this skill do?

SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Operating Production Services

Production reliability patterns: measure what matters, learn from failures, improve systematically.

## Quick Reference

| Need | Go To |
|------|-------|
| Define reliability targets | [SLOs & Error Budgets](#slos--error-budgets) |
| Write incident report | [Postmortem Templates](#postmortem-templates) |
| Set up SLO alerting | [references/slo-alerting.md](references/slo-alerting.md) |

---

## SLOs & Error Budgets

### The Hierarchy

```
SLA (Contract) → SLO (Target) → SLI (Measurement)
```

### Common SLIs

```promql
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```
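
Both queries above reduce to the same shape: good events divided by total events. A minimal Python sketch of that ratio follows. The function name `ratio_sli` and the request counts are hypothetical, and treating a window with no traffic as fully meeting the SLI is an assumption rather than something the skill specifies.

```python
def ratio_sli(good_events: float, total_events: float) -> float:
    """Return good/total as a fraction between 0 and 1."""
    if good_events < 0 or total_events < 0 or good_events > total_events:
        raise ValueError("expected 0 <= good_events <= total_events")
    if total_events == 0:
        # Assumption: no traffic in the window counts as meeting the SLI.
        return 1.0
    return good_events / total_events


# Hypothetical counts over one measurement window.
availability = ratio_sli(good_events=999_200, total_events=1_000_000)  # non-5xx responses
latency = ratio_sli(good_events=985_000, total_events=1_000_000)       # responses under 0.5 s

print(f"availability SLI: {availability:.2%}")
print(f"latency SLI:      {latency:.2%}")
```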

### SLO Targets Reality Check

| SLO % | Downtime/Month | Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |

**Don't aim for 100%.** Each nine costs exponentially more.
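
The downtime figures in the table can be reproduced with a short calculation. The sketch below assumes a 30-day month and a 365-day year, which is what the table's numbers imply; `allowed_downtime` is a hypothetical helper name.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Return the full-outage time a given SLO tolerates within one window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    error_budget = 1 - slo_percent / 100  # fraction of the window allowed to fail
    return window * error_budget


MONTH = timedelta(days=30)   # assumption: 30-day month
YEAR = timedelta(days=365)   # assumption: 365-day year

for slo in (99.0, 99.9, 99.95, 99.99):
    monthly_minutes = allowed_downtime(slo, MONTH).total_seconds() / 60
    yearly_hours = allowed_downtime(slo, YEAR).total_seconds() / 3600
    print(f"{slo}% -> {monthly_minutes:.1f} min/month, {yearly_hours:.2f} h/year")
```

The window length matters: over the 28-day window used in the SLI queries, a 99.9% SLO allows roughly 40 minutes rather than 43.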

### Error Budget

```
Error Budget = 1 - SLO Target
```

**Example:** A 99.9% SLO leaves a 0.1% error budget, which is about 43 minutes of full downtime in a 30-day month.

**Policy:**

| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |

See [references/slo-alerting.md](references/slo-alerting.md) for Prometheus recording rules and multi-window burn rate alerts.
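
The policy table can be read as a lookup from remaining budget to action. The sketch below is one possible encoding, not the skill's own tooling: the function names and request counts are hypothetical, the budget is measured in requests rather than minutes, and an overspent (negative) budget is treated the same as 0%.

```python
def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total == 0:
        return 1.0  # assumption: no traffic means no budget spent
    allowed_bad = (1 - slo_target) * total  # Error Budget = 1 - SLO Target
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


def policy_action(remaining: float) -> str:
    """Map remaining budget (as a fraction) to the action in the policy table."""
    if remaining <= 0:
        return "Feature freeze, fix reliability"
    if remaining < 0.10:
        return "Freeze non-critical changes"
    if remaining <= 0.50:
        return "Postpone risky changes"
    return "Normal velocity"


# Hypothetical window: 1,000,000 requests, 600 failures, 99.9% SLO.
remaining = budget_remaining(slo_target=0.999, good=999_400, total=1_000_000)
print(f"{remaining:.0%} of budget left -> {policy_action(remaining)}")
```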

---

## Postmortem Templates

### The Blameless Principle

| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |

### When to Write Postmortems

- SEV1/SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
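
These criteria can be expressed as a simple check, for example in incident tooling. The sketch below is illustrative only: the `Incident` fields are hypothetical names, and severity is assumed to be an integer where 1 means SEV1.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: int                          # assumption: 1 = SEV1, 2 = SEV2, ...
    customer_outage_minutes: float = 0.0   # 0 if not customer-facing
    data_loss: bool = False
    security_incident: bool = False
    severe_near_miss: bool = False
    novel_failure_mode: bool = False


def postmortem_reasons(incident: Incident) -> list[str]:
    """Return every criterion from the list above that this incident meets."""
    reasons = []
    if incident.severity <= 2:
        reasons.append("SEV1/SEV2 incident")
    if incident.customer_outage_minutes > 15:
        reasons.append("Customer-facing outage > 15 minutes")
    if incident.data_loss or incident.security_incident:
        reasons.append("Data loss or security incident")
    if incident.severe_near_miss:
        reasons.append("Near-miss that could have been severe")
    if incident.novel_failure_mode:
        reasons.append("Novel failure mode")
    return reasons


incident = Incident(severity=3, customer_outage_minutes=22)
reasons = postmortem_reasons(incident)
print("Postmortem required:" if reasons else "No postmortem required", reasons)
```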

### Standard Template

```markdown
# Postmortem: [Incident Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX

## Executive Summary
One paragraph: what happened, impact, root cause, resolution.

## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |

## Root Cause Analysis

### 5 Whys
1. Why did service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]

## Impact
- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X

## Action Items
| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |
```
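
Each row of the template's Action Items table carries an owner, a due date, and a ticket. A small check like the following could flag rows that are missing one of them before the postmortem is published. It is a sketch under assumptions: the `ActionItem` class and the sample values are hypothetical and not part of the skill.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    priority: str
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None
    ticket: Optional[str] = None


def missing_fields(item: ActionItem) -> list[str]:
    """Return the names of the required fields that are empty on this item."""
    return [name for name in ("owner", "due", "ticket") if not getattr(item, name)]


# Hypothetical action items from a postmortem draft.
items = [
    ActionItem("P0", "Roll back bad config", "@alice", date(2025, 1, 15), "XXX-123"),
    ActionItem("P1", "Add config validation to CI", owner="@bob"),
]

for item in items:
    gaps = missing_fields(item)
    status = "ok" if not gaps else "missing " + ", ".join(gaps)
    print(f"[{item.priority}] {item.action}: {status}")
```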

### Quick Template (Minor Incidents)

```markdown
# Quick Postmortem: [Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3

## What Happened
One sentence description.

## Timeline
- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution

## Root Cause
One sentence.

## Fix
- Immediate: [What was done]
- Long-term: [Ticket XXX-123]
```
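
The three timeline entries in the quick template are enough to derive the incident's detection and resolution times. A minimal sketch follows, assuming all timestamps share one timezone; the function name and sample times are hypothetical.

```python
from datetime import datetime, timedelta


def timeline_durations(
    trigger: datetime, detection: datetime, resolution: datetime
) -> dict[str, timedelta]:
    """Derive detection and resolution times from the three timeline entries."""
    if not trigger <= detection <= resolution:
        raise ValueError("expected trigger <= detection <= resolution")
    return {
        "time_to_detect": detection - trigger,
        "time_to_resolve": resolution - detection,
        "total_duration": resolution - trigger,
    }


# Hypothetical incident timestamps.
durations = timeline_durations(
    trigger=datetime(2025, 1, 10, 14, 2),
    detection=datetime(2025, 1, 10, 14, 9),
    resolution=datetime(2025, 1, 10, 14, 31),
)
for name, value in durations.items():
    print(f"{name}: {value.total_seconds() / 60:.0f} min")
```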

---

## Postmortem Meeting Guide

### Structure (60 min)

1. **Opening (5 min)** - Remind: "We're here to learn, not blame"
2. **Timeline (15 min)** - Walk through events chronologically
3. **Analysis (20 min)** - What failed? Why? What allowed it?
4. **Action Items (15 min)** - Prioritize, assign owners, set dates
5. **Closing (5 min)** - Summarize learnings, confirm owners

### Facilitation Tips

- Redirect blame to systems: "What made this mistake possible?"
- Time-box tangents
- Document dissenting views
- Encourage quiet participants

---

## Anti-Patterns

| Don't | Do Instead |
|-------|------------|
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |

---

## Verification

Run: `python scripts/verify.py`

## References

- [references/slo-alerting.md](references/slo-alerting.md) - Prometheus rules, burn rate alerts, Grafana dashboards
