slo-workflow

SLI/SLO/SLA and error budget workflow: define service level indicators, set objectives, calculate error budgets, implement burn rate alerting, and use error budgets to gate risky deployments. Covers Prometheus, Datadog, and Google SRE methodology.

8 stars

Best use case

slo-workflow is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

SLI/SLO/SLA and error budget workflow: define service level indicators, set objectives, calculate error budgets, implement burn rate alerting, and use error budgets to gate risky deployments. Covers Prometheus, Datadog, and Google SRE methodology.

Teams using slo-workflow should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/slo-workflow/SKILL.md --create-dirs "https://raw.githubusercontent.com/marvinrichter/clarc/main/skills/slo-workflow/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/slo-workflow/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How slo-workflow Compares

Feature / Agentslo-workflowStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

SLI/SLO/SLA and error budget workflow: define service level indicators, set objectives, calculate error budgets, implement burn rate alerting, and use error budgets to gate risky deployments. Covers Prometheus, Datadog, and Google SRE methodology.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# SLO Workflow

> **Scope**: Reliability engineering — defining what "good" looks like and alerting when you're heading toward bad.
> For implementing the monitoring infrastructure, see [observability](../observability/SKILL.md).
> For incident response when SLOs are breached, see [incident-response](../incident-response/SKILL.md).

## When to Activate

- Defining reliability targets for a new service
- Setting up error budget tracking and burn rate alerts
- Deciding whether to pause a risky deployment (error budget gating)
- Creating an SLA with a customer or business unit
- Reviewing whether current alerting reflects user pain

---

## Concepts

| Term | Definition | Example |
|------|-----------|---------|
| **SLI** | Service Level Indicator — the measured metric | 99.2% of requests completed in < 500ms over last 28 days |
| **SLO** | Service Level Objective — the internal reliability target | 99.5% availability |
| **SLA** | Service Level Agreement — the customer-facing contractual commitment | 99.9% uptime, compensated if breached |
| **Error Budget** | The allowed amount of unreliability: `1 - SLO` | 0.5% = 3.65 hours of downtime per month |

**Rule**: SLA < SLO < actual performance. Buffer between SLO and SLA absorbs measurement lag.

---

## Step 1: Choose Your SLIs

SLIs should measure **user pain**, not infrastructure health. Prefer request-level metrics over host metrics.

### SLI taxonomy (Google SRE)

| Service type | Recommended SLIs |
|-------------|-----------------|
| Request/response (API) | Availability (% success), Latency (p95/p99) |
| Data pipeline | Freshness (time since last update), Correctness |
| Storage | Durability (data loss rate), Throughput |
| Streaming | Throughput (events/sec), Consumer lag |

### Availability SLI

```promql
# Prometheus: fraction of successful requests
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

### Latency SLI (histogram)

```promql
# Fraction of requests completing in < 300ms
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) < 0.3
```

---

## Step 2: Set the SLO

### SLO selection heuristics

- Start with observed performance minus a small buffer: if you're currently at 99.8%, set SLO at 99.5%
- Match customer expectations: B2B enterprise needs stricter SLOs than B2C
- Consider the cost of reliability: 99.99% costs ~10x more than 99.9%
- Set latency SLOs at p95, not p99 initially (p99 is expensive to achieve)

### SLO values and their meaning

| SLO | Max downtime/month | Max downtime/year |
|-----|-------------------|------------------|
| 99% | 7.2 hours | 3.65 days |
| 99.5% | 3.6 hours | 1.83 days |
| 99.9% | 43.8 minutes | 8.76 hours |
| 99.95% | 21.9 minutes | 4.38 hours |
| 99.99% | 4.4 minutes | 52.6 minutes |

---

## Step 3: Calculate Error Budget

```python
def error_budget(slo_percent, window_days=28):
    """Calculate error budget for a given SLO and window."""
    error_rate = 1 - (slo_percent / 100)
    total_minutes = window_days * 24 * 60
    budget_minutes = total_minutes * error_rate
    budget_requests_per_million = error_rate * 1_000_000
    return {
        'budget_minutes': budget_minutes,
        'budget_hours': budget_minutes / 60,
        'budget_rpm': budget_requests_per_million,
    }

# SLO 99.5%, 28-day window
budget = error_budget(99.5)
# → {'budget_minutes': 201.6, 'budget_hours': 3.36, 'budget_rpm': 5000}
```

### Error budget consumption tracking

```promql
# Availability error budget consumed in 28 days (%)
(
  1 - (
    sum(increase(http_requests_total{status!~"5.."}[28d]))
    /
    sum(increase(http_requests_total[28d]))
  )
) / 0.005  # divide by (1 - SLO) = error budget
```

---

## Step 4: Burn Rate Alerting

Paging on SLO breach (threshold alerts) is too slow. Use **burn rate** to alert early.

### Burn rate concept

A burn rate of 1 = consuming error budget at exactly the rate that would exhaust it by the end of the window.
A burn rate of 14.4 over 1 hour = will exhaust the entire 28-day budget in 2 hours.

### Multi-window, multi-burn-rate alerts (Google SRE Book recommendation)

| Alert | Short window | Long window | Burn rate | Severity |
|-------|-------------|-------------|-----------|---------|
| Page | 5 min | 1 hour | 14.4x | Critical |
| Page | 30 min | 6 hours | 6x | Critical |
| Ticket | 6 hours | 3 days | 3x | Warning |
| Notify | 3 days | — | 1x | Info |

```yaml
# Prometheus alerting rules (99.5% SLO, 28-day window)
groups:
  - name: slo_burn_rate
    rules:
      - alert: ErrorBudgetCritical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > 14.4 * 0.005  # 14.4x burn rate * (1 - SLO)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > 14.4 * 0.005
        for: 0m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at 14.4x — exhausted in 2 hours"

      - alert: ErrorBudgetWarning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          ) > 6 * 0.005
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > 6 * 0.005
        for: 0m
        labels:
          severity: ticket
        annotations:
          summary: "Error budget burning at 6x — exhausted in ~2 days"
```

---

## Step 5: Error Budget Policy

Define what happens at each budget threshold:

| Budget remaining | Policy |
|-----------------|--------|
| > 50% | Normal velocity: ship features, run experiments |
| 25–50% | Review risky changes before shipping |
| 10–25% | Freeze non-critical deployments; focus on reliability |
| < 10% | Freeze all deployments; all-hands reliability sprint |
| 0% | SLA breach risk; escalate to VP/CTO |

### Deployment gating (CI/CD integration)

```bash
#!/bin/bash
# check-error-budget.sh — fail CI if budget < 10%
BUDGET_PCT=$(curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=error_budget_remaining_pct" \
  | jq -r '.data.result[0].value[1]')

if (( $(echo "$BUDGET_PCT < 10" | bc -l) )); then
  echo "ERROR: Error budget at ${BUDGET_PCT}% — deployment blocked"
  exit 1
fi
echo "Error budget at ${BUDGET_PCT}% — deployment allowed"
```

---

## SLO Document Template

```markdown
## SLO: [Service Name] — [SLI Type]

**Owner**: [team]
**SLI**: [exact metric definition]
**SLO**: [X]% over a 28-day rolling window
**SLA** (if applicable): [Y]% — breach triggers [compensation/escalation]

### Error budget
- 28-day budget: [N minutes / N requests per million]
- Budget policy: [link to policy doc]

### Alert thresholds
| Alert | Condition | Severity | Response |
|-------|-----------|----------|---------|
| Page | 14.4x burn (5m + 1h windows) | Critical | On-call response < 15min |
| Ticket | 6x burn (30m + 6h windows) | Warning | Next business day |

### Measurement
- Query: [Prometheus/Datadog query]
- Dashboard: [link]
- Report cadence: Weekly in engineering all-hands

### History
| Month | SLO met? | Budget consumed | Incidents |
|-------|----------|----------------|-----------|
```

---

## Full SLO Lifecycle — End-to-End Walkthrough

This example traces a new API service from zero to fully alerting SLOs.

**1. Define SLI** — measure user-facing success rate in Prometheus:
```promql
# availability-sli.promql
sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
/ sum(rate(http_requests_total{service="payment-api"}[5m]))
```

**2. Set SLO** — target 99.5% availability over a 28-day rolling window.
Budget: `(1 - 0.995) * 28 * 24 * 60 = 201.6 minutes` of acceptable downtime.

**3. Create burn rate alert** — page on-call if budget burns at 14.4× (exhausted in 2 h):
```yaml
# prometheus/rules/payment-api-slo.yml
- alert: PaymentAPIErrorBudgetCritical
  expr: |
    (sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
     / sum(rate(http_requests_total{service="payment-api"}[5m])))
    > 14.4 * 0.005
  labels:
    severity: page
  annotations:
    summary: "payment-api burning error budget at 14.4× — deploy freeze"
```

**4. Enforce budget policy** — gate CI deploys when budget < 10%:
```bash
# .github/workflows/deploy.yml (excerpt)
- name: Check error budget
  run: bash scripts/check-error-budget.sh payment-api 10
  # → exits 1 and blocks deploy if budget < 10%
```

**5. Trigger postmortem** — when budget hits 0%, auto-open a Jira ticket:
```bash
BUDGET=$(curl -s "$PROMETHEUS/api/v1/query?query=error_budget_remaining_pct{service='payment-api'}" \
  | jq -r '.data.result[0].value[1]')
if (( $(echo "$BUDGET <= 0" | bc -l) )); then
  gh issue create --repo org/postmortems \
    --title "SLO breach: payment-api $(date +%Y-%m-%d)" \
    --body "Error budget exhausted. See Grafana dashboard: $DASHBOARD_URL"
fi
```

---

## Related

- [observability](../observability/SKILL.md) — metrics infrastructure
- [incident-response](../incident-response/SKILL.md) — when SLOs breach
- `/slo` command — define and document SLOs for a service

Related Skills

tdd-workflow-advanced

8
from marvinrichter/clarc

TDD anti-patterns — writing code before tests, testing implementation details instead of behavior, using waitForTimeout as a sync strategy, chaining tests that share state, mocking the system under test instead of its dependencies.

debugging-workflow

8
from marvinrichter/clarc

Systematic debugging methodology — reproduction, isolation, binary search, profiling, distributed tracing, memory leaks, and race conditions. A structured approach that prevents random guessing and finds root causes faster.

analytics-workflow

8
from marvinrichter/clarc

Data & product analytics workflow: event taxonomy design, instrumentation (Segment, PostHog, Mixpanel, Amplitude, GA4), analytics pipelines, funnel and retention analysis, and dashboard design. Complements observability (infrastructure metrics) with product/user behavior tracking.

tdd-workflow

8
from marvinrichter/clarc

Use this skill when writing new features, fixing bugs, or refactoring code. Enforces test-driven development with 80%+ coverage including unit, integration, and E2E tests.

zero-trust-patterns

8
from marvinrichter/clarc

Zero-Trust security patterns — mTLS between microservices (Istio/SPIFFE), SPIRE workload identity, OPA/Envoy authorization, NetworkPolicy default-deny-all, short-lived credentials, service mesh security, and Kubernetes RBAC hardening.

wireframing

8
from marvinrichter/clarc

Wireframing and prototyping workflow: fidelity levels (lo-fi sketch → mid-fi wireframe → hi-fi prototype), tool selection (Figma, Excalidraw, Balsamiq), user flow diagrams, wireframe annotation standards, information architecture (IA) mapping, and the handoff from wireframe to visual design. For developers who need to communicate UI structure before writing code.

webrtc-patterns

8
from marvinrichter/clarc

WebRTC patterns — peer connection setup, ICE/STUN/TURN configuration, signaling server design, SFU vs mesh topology, screen sharing, media track management, and reconnect/ICE restart handling.

webhook-patterns

8
from marvinrichter/clarc

Webhook patterns for receiving, verifying (HMAC), and idempotently processing third-party events. Covers Stripe, GitHub, and generic webhook patterns, delivery guarantees, retry handling, and testing.

web-performance

8
from marvinrichter/clarc

Web performance optimization: Core Web Vitals (LCP, CLS, INP), Lighthouse CI with budget configuration, bundle analysis (webpack-bundle-analyzer, vite-bundle-visualizer), hydration performance, network waterfall reading, image optimization (WebP/AVIF, srcset), and font performance.

wasm-performance

8
from marvinrichter/clarc

WebAssembly performance: wasm-opt binary optimization, size reduction (panic=abort, LTO, strip), profiling WASM in Chrome DevTools, memory management (linear memory, avoiding GC pressure), SIMD, and multi-threading with SharedArrayBuffer.

wasm-patterns

8
from marvinrichter/clarc

WebAssembly patterns: wasm-pack, wasm-bindgen (JS↔Wasm interop), WASI, Component Model, wasm-opt, Rust-to-WASM compilation, JS integration (web workers, streaming instantiation), and production deployment (CDN, Content-Type headers).

visual-testing

8
from marvinrichter/clarc

Visual Regression Testing: tool comparison (Chromatic/Percy/Playwright screenshots/BackstopJS), pixel-diff vs AI-based comparison, baseline management, flakiness strategies (masks, tolerances, waitForLoadState), CI integration with GitHub Actions, and Storybook integration.