runbook

Generate operational runbooks for services, procedures, or incident response with step-by-step procedures, troubleshooting guides, and escalation paths

16 stars

by diegosouzapw

Best use case

runbook is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using runbook should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/runbook/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/devops/runbook/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/runbook/SKILL.md inside your project.
3. Restart your AI agent so that it auto-discovers the skill.

How runbook Compares

Feature / Agent	runbook	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Generate operational runbooks for services, procedures, or incident response with step-by-step procedures, troubleshooting guides, and escalation paths

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Runbook

Generate operational runbooks for services, procedures, or incident response. Investigates the codebase and infrastructure to produce accurate, actionable procedures.

## When to Use

- Creating operational documentation for a service
- Documenting deployment, scaling, or maintenance procedures
- Building incident response playbooks
- Standardizing operational procedures across teams

## Input

- **Topic**: Service name, operation type, or incident scenario
- **Scope**: deployment, scaling, failover, maintenance, troubleshooting
- **Optional**: Specific scenarios to cover

## Investigation Strategy

Launch parallel investigation tracks to gather comprehensive information:

### Track 1: Codebase Exploration

- Identify service entry points and configuration
- Find health check endpoints
- Map dependencies (databases, caches, external services)
- Locate logging and metrics instrumentation
- Find existing scripts or automation
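
A first pass over the codebase for this track could be sketched with plain `grep`, as below. The helper names `find_health_endpoints` and `find_dependency_hints`, the route names, and the connection-string schemes are illustrative assumptions, not part of the skill; adjust the patterns to the conventions of the repository under investigation.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and patterns are assumptions for illustration.

# Print file:line:text for lines that mention common health-check routes.
find_health_endpoints() {
  local repo_dir="${1:-.}"
  grep -rnE --exclude-dir=.git --exclude-dir=node_modules \
    -e '/(healthz?|readyz?|livez?)' "$repo_dir" || true
}

# Print file:line:match for connection-string schemes that hint at dependencies.
find_dependency_hints() {
  local repo_dir="${1:-.}"
  grep -rnoiE --exclude-dir=.git --exclude-dir=node_modules \
    -e '(postgres(ql)?|redis|mysql|mongodb|amqp)://' "$repo_dir" || true
}
```

Running `find_health_endpoints .` and `find_dependency_hints .` from the repository root gives a list of candidate locations to inspect by hand; the matches are hints, not confirmed endpoints or dependencies.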

### Track 2: Infrastructure Analysis

- Review deployment manifests (Kubernetes, Terraform, etc.)
- Identify scaling configuration
- Map service dependencies
- Find monitoring and alerting setup
- Review backup and recovery procedures
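
For Kubernetes manifests, the scaling review in this track might start from something like the following sketch. It assumes plain YAML manifests with a top-level `kind:` line; the helper names `list_manifests_by_kind` and `show_hpa_bounds` are hypothetical, and templated manifests (Helm, Kustomize overlays) would need to be rendered first.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; assumes plain YAML manifests.

# List manifest files under a directory whose top-level kind matches.
list_manifests_by_kind() {
  local infra_dir="${1:-.}" kind="$2"
  grep -rlE --include='*.yaml' --include='*.yml' \
    -e "^kind:[[:space:]]*${kind}[[:space:]]*\$" "$infra_dir" || true
}

# Print each HorizontalPodAutoscaler manifest followed by its replica bounds.
show_hpa_bounds() {
  local infra_dir="${1:-.}"
  list_manifests_by_kind "$infra_dir" HorizontalPodAutoscaler |
    while IFS= read -r manifest; do
      printf '%s\n' "$manifest"
      grep -E '^[[:space:]]*(minReplicas|maxReplicas):' "$manifest" || true
    done
}
```

For example, `show_hpa_bounds k8s/` lists each autoscaler manifest with its `minReplicas` and `maxReplicas` lines, which can then be copied into the runbook's scaling section.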

### Track 3: External Research

- Find operational best practices for the service type
- Research common failure modes
- Identify industry-standard procedures

## Output

Generate the runbook document using the template at `references/templates/runbook.md`.

The runbook should include:
- Service overview and architecture
- Dependencies with failure impact
- Step-by-step procedures with actual commands
- Troubleshooting guides for common issues
- Escalation paths and contacts

## Behavior

1. Parse topic to identify service and operation scope
2. Launch parallel investigation tracks
3. Extract configuration, endpoints, and dependencies from codebase
4. Identify common operations and failure modes
5. Generate step-by-step procedures with actual commands
6. Document troubleshooting steps and escalation paths

## Constraints

- **Accuracy**: All commands must be verified against actual codebase/infrastructure
- **Actionable**: Every procedure must have concrete, executable steps
- **Complete**: Include prerequisites, verification, and rollback for each procedure
- **Maintainable**: Note dependencies that may change and require updates
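
The **Complete** constraint can be checked mechanically once a runbook exists. The sketch below assumes a layout the skill does not prescribe: procedures are `### ` headings under a `## Procedures` heading, and each part is recognizable by the keywords "prerequisite", "verif", and "rollback". The function name `check_procedures` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: the heading layout and keywords are assumed conventions.

# Report procedures that lack a prerequisites, verification, or rollback part.
# Exits non-zero if any procedure is incomplete.
check_procedures() {
  awk '
    function report() {
      if (name == "") return
      missing = ""
      if (!pre) missing = missing " prerequisites"
      if (!ver) missing = missing " verification"
      if (!rol) missing = missing " rollback"
      if (missing != "") { printf "%s: missing%s\n", name, missing; bad = 1 }
    }
    # A new top-level section closes the current procedure.
    /^## / { report(); name = ""; in_proc = ($0 == "## Procedures"); next }
    # A new procedure heading starts a fresh set of flags.
    /^### / && in_proc { report(); name = substr($0, 5); pre = ver = rol = 0; next }
    name != "" {
      line = tolower($0)
      if (line ~ /prerequisite/)       pre = 1
      if (line ~ /verif/)              ver = 1
      if (line ~ /rollback|roll back/) rol = 1
    }
    END { report(); exit bad }
  ' "$1"
}
```

Calling `check_procedures payment-service-runbook.md` prints one line per incomplete procedure and returns a non-zero status, so it could gate a review step before the runbook is published.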

## Example

````
Input: "Generate runbook for the payment-service"

Investigation:
- Found deployment at k8s/payment-service/
- Found health endpoints: /health, /ready
- Dependencies: PostgreSQL (critical), Redis (cache), Stripe API
- Scaling: HPA configured, min 3, max 10 replicas
- Alerts: Prometheus rules in monitoring/

Generated Runbook: payment-service-runbook.md

## Overview
- Service: payment-service
- Owner: payments-team
- Criticality: P1

## Dependencies
| Dependency | Type | Criticality | Failure Impact |
|------------|------|-------------|----------------|
| PostgreSQL | Database | Critical | Full outage |
| Redis | Cache | High | Degraded latency |
| Stripe API | External | Critical | Payment failures |

## Procedures

### Deployment
1. Verify no active transactions
   ```bash
   # Run in one pod of the deployment; no -it, because the output is piped
   kubectl exec deploy/payment-service -- curl -s localhost:8080/metrics | grep active_transactions
   ```
2. Apply new deployment
   ```bash
   kubectl apply -f k8s/payment-service/deployment.yaml
   ```
3. Monitor rollout
   ```bash
   kubectl rollout status deployment/payment-service
   ```

### Scaling
```bash
# The HPA (min 3, max 10) may override a manually set replica count
kubectl scale deployment payment-service --replicas=5
```

## Troubleshooting

### High Latency
**Symptoms**: p99 latency > 500ms
**Diagnosis**:
```bash
kubectl top pods -l app=payment-service
kubectl logs -l app=payment-service --tail=100 | grep -i slow
```
**Resolution**: Check Redis connection, scale if CPU > 80%
````

Begin by identifying the service or operation to document and launching investigation tracks.
