slo-implementation
Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or m...
Best use case
slo-implementation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or m...
Teams using slo-implementation should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/slo-implementation/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How slo-implementation Compares
| Feature / Agent | slo-implementation | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or m...
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# SLO Implementation
Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
## Do not use this skill when
- The task is unrelated to slo implementation
- You need a different domain or tool outside this scope
## Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.
## Purpose
Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.
## Use this skill when
- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals
## SLI/SLO/SLA Hierarchy
```
SLA (Service Level Agreement)
↓ Contract with customers
SLO (Service Level Objective)
↓ Internal reliability target
SLI (Service Level Indicator)
↓ Actual measurement
```
## Defining SLIs
### Common SLI Types
#### 1. Availability SLI
```promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
```
#### 2. Latency SLI
```promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```
#### 3. Durability SLI
```
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
```
**Reference:** See `references/slo-definitions.md`
## Setting SLO Targets
### Availability SLO Examples
| SLO % | Downtime/Month | Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43.2 minutes | 8.76 hours |
| 99.95%| 21.6 minutes | 4.38 hours |
| 99.99%| 4.32 minutes | 52.56 minutes |
### Choose Appropriate SLOs
**Consider:**
- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks
**Example SLOs:**
```yaml
slos:
- name: api_availability
target: 99.9
window: 28d
sli: |
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
- name: api_latency_p95
target: 99
window: 28d
sli: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```
## Error Budget Calculation
### Error Budget Formula
```
Error Budget = 1 - SLO Target
```
**Example:**
- SLO: 99.9% availability
- Error Budget: 0.1% = 43.2 minutes/month
- Current Error: 0.05% = 21.6 minutes/month
- Remaining Budget: 50%
### Error Budget Policy
```yaml
error_budget_policy:
- remaining_budget: 100%
action: Normal development velocity
- remaining_budget: 50%
action: Consider postponing risky changes
- remaining_budget: 10%
action: Freeze non-critical changes
- remaining_budget: 0%
action: Feature freeze, focus on reliability
```
**Reference:** See `references/error-budget.md`
## SLO Implementation
### Prometheus Recording Rules
```yaml
# SLI Recording Rules
groups:
- name: sli_rules
interval: 30s
rules:
# Availability SLI
- record: sli:http_availability:ratio
expr: |
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
# Latency SLI (requests < 500ms)
- record: sli:http_latency:ratio
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
- name: slo_rules
interval: 5m
rules:
# SLO compliance (1 = meeting SLO, 0 = violating)
- record: slo:http_availability:compliance
expr: sli:http_availability:ratio >= bool 0.999
- record: slo:http_latency:compliance
expr: sli:http_latency:ratio >= bool 0.99
# Error budget remaining (percentage)
- record: slo:http_availability:error_budget_remaining
expr: |
(sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100
# Error budget burn rate
- record: slo:http_availability:burn_rate_5m
expr: |
(1 - (
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
)) / (1 - 0.999)
```
### SLO Alerting Rules
```yaml
groups:
- name: slo_alerts
interval: 1m
rules:
# Fast burn: 14.4x rate, 1 hour window
# Consumes 2% error budget in 1 hour
- alert: SLOErrorBudgetBurnFast
expr: |
slo:http_availability:burn_rate_1h > 14.4
and
slo:http_availability:burn_rate_5m > 14.4
for: 2m
labels:
severity: critical
annotations:
summary: "Fast error budget burn detected"
description: "Error budget burning at {{ $value }}x rate"
# Slow burn: 6x rate, 6 hour window
# Consumes 5% error budget in 6 hours
- alert: SLOErrorBudgetBurnSlow
expr: |
slo:http_availability:burn_rate_6h > 6
and
slo:http_availability:burn_rate_30m > 6
for: 15m
labels:
severity: warning
annotations:
summary: "Slow error budget burn detected"
description: "Error budget burning at {{ $value }}x rate"
# Error budget exhausted
- alert: SLOErrorBudgetExhausted
expr: slo:http_availability:error_budget_remaining < 0
for: 5m
labels:
severity: critical
annotations:
summary: "SLO error budget exhausted"
description: "Error budget remaining: {{ $value }}%"
```
## SLO Dashboard
**Grafana Dashboard Structure:**
```
┌────────────────────────────────────┐
│ SLO Compliance (Current) │
│ ✓ 99.95% (Target: 99.9%) │
├────────────────────────────────────┤
│ Error Budget Remaining: 65% │
│ ████████░░ 65% │
├────────────────────────────────────┤
│ SLI Trend (28 days) │
│ [Time series graph] │
├────────────────────────────────────┤
│ Burn Rate Analysis │
│ [Burn rate by time window] │
└────────────────────────────────────┘
```
**Example Queries:**
```promql
# Current SLO compliance
sli:http_availability:ratio * 100
# Error budget remaining
slo:http_availability:error_budget_remaining
# Days until error budget exhausted (at current burn rate)
(slo:http_availability:error_budget_remaining / 100)
*
28
/
(1 - sli:http_availability:ratio) * (1 - 0.999)
```
## Multi-Window Burn Rate Alerts
```yaml
# Combination of short and long windows reduces false positives
rules:
- alert: SLOBurnRateHigh
expr: |
(
slo:http_availability:burn_rate_1h > 14.4
and
slo:http_availability:burn_rate_5m > 14.4
)
or
(
slo:http_availability:burn_rate_6h > 6
and
slo:http_availability:burn_rate_30m > 6
)
labels:
severity: critical
```
## SLO Review Process
### Weekly Review
- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact
### Monthly Review
- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments
### Quarterly Review
- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements
## Best Practices
1. **Start with user-facing services**
2. **Use multiple SLIs** (availability, latency, etc.)
3. **Set achievable SLOs** (don't aim for 100%)
4. **Implement multi-window alerts** to reduce noise
5. **Track error budget** consistently
6. **Review SLOs regularly**
7. **Document SLO decisions**
8. **Align with business goals**
9. **Automate SLO reporting**
10. **Use SLOs for prioritization**
## Reference Files
- `assets/slo-template.md` - SLO definition template
- `references/slo-definitions.md` - SLO definition patterns
- `references/error-budget.md` - Error budget calculations
## Related Skills
- `prometheus-configuration` - For metric collection
- `grafana-dashboards` - For SLO visualizationRelated Skills
rag-implementation
RAG (Retrieval-Augmented Generation) implementation workflow covering embedding selection, vector database setup, chunking strategies, and retrieval optimization.
hybrid-search-implementation
Combine vector and keyword search for improved retrieval. Use when implementing RAG systems, building search engines, or when neither approach alone provides sufficient recall.
cqrs-implementation
Implement Command Query Responsibility Segregation for scalable architectures. Use when separating read and write models, optimizing query performance, or building event-sourced systems.
auth-implementation-patterns
Master authentication and authorization patterns including JWT, OAuth2, session management, and RBAC to build secure, scalable access control systems. Use when implementing auth systems, securing A...
zustand-store-ts
Create Zustand stores with TypeScript, subscribeWithSelector middleware, and proper state/action separation. Use when building React state management, creating global stores, or implementing reacti...
zoom-automation
Automate Zoom meeting creation, management, recordings, webinars, and participant tracking via Rube MCP (Composio). Always search tools first for current schemas.
zoho-crm-automation
Automate Zoho CRM tasks via Rube MCP (Composio): create/update records, search contacts, manage leads, and convert leads. Always search tools first for current schemas.
zendesk-automation
Automate Zendesk tasks via Rube MCP (Composio): tickets, users, organizations, replies. Always search tools first for current schemas.
zapier-make-patterns
No-code automation democratizes workflow building. Zapier and Make (formerly Integromat) let non-developers automate business processes without writing code. But no-code doesn't mean no-complexity ...
youtube-summarizer
Extract transcripts from YouTube videos and generate comprehensive, detailed summaries using intelligent analysis frameworks
youtube-automation
Automate YouTube tasks via Rube MCP (Composio): upload videos, manage playlists, search content, get analytics, and handle comments. Always search tools first for current schemas.
xss-html-injection
This skill should be used when the user asks to "test for XSS vulnerabilities", "perform cross-site scripting attacks", "identify HTML injection flaws", "exploit client-side injection...