tracking-service-reliability

Define and track SLAs, SLIs, and SLOs for service reliability including availability, latency, and error rates. Use when establishing reliability targets or monitoring service health. Trigger with phrases like "define SLOs", "track SLI metrics", or "calculate error budget".

1,868 stars

by jeremylongshore


Best use case

tracking-service-reliability is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using tracking-service-reliability can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/tracking-service-reliability/SKILL.md --create-dirs "https://raw.githubusercontent.com/jeremylongshore/claude-code-plugins-plus-skills/main/plugins/performance/sla-sli-tracker/skills/tracking-service-reliability/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/tracking-service-reliability/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How tracking-service-reliability Compares

| Feature / Agent | tracking-service-reliability | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

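What does this skill do?

It defines and tracks SLAs, SLIs, and SLOs for service reliability, including availability, latency, and error rates.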

Where can I find the source code?

The source code is in the jeremylongshore/claude-code-plugins-plus-skills repository on GitHub.

Related Guides

Best AI Skills for Claude

Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.

ChatGPT vs Claude for Agent Skills

Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.

SKILL.md Source

# SLA/SLI Tracker

Define and track SLAs, SLIs, and SLOs for service reliability including availability targets, latency budgets, error rate thresholds, and error budget burn rates.

## Overview

This skill provides a structured approach to defining and tracking SLAs, SLIs, and SLOs, which are essential for ensuring service reliability. It automates the process of setting performance targets and monitoring actual performance, enabling proactive identification and resolution of potential issues.

## How It Works

1. **SLI Definition**: The skill guides the user to define Service Level Indicators (SLIs) such as availability, latency, error rate, and throughput.
2. **SLO Target Setting**: The skill assists in setting Service Level Objectives (SLOs) by establishing target values for the defined SLIs (e.g., 99.9% availability).
3. **SLA Establishment**: The skill helps in formalizing Service Level Agreements (SLAs), which are customer-facing commitments based on the defined SLOs.
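
The relationship between the three terms can be sketched as a small data model. This is only an illustration, not the skill's actual schema: the class names, the fields, and the rule that an external commitment should be no stricter than the internal objective are all assumptions made for the example, and the check only makes sense for "higher is better" indicators such as availability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLI:
    """A measured indicator, e.g. the fraction of successful requests."""
    name: str
    unit: str


@dataclass(frozen=True)
class SLO:
    """An internal target for one SLI over a rolling window."""
    sli: SLI
    target: float      # e.g. 0.999 for 99.9%
    window_days: int


@dataclass(frozen=True)
class SLA:
    """A customer-facing commitment derived from an SLO."""
    slo: SLO
    committed_target: float

    def has_safety_margin(self) -> bool:
        # Assumption: the commitment is no stricter than the internal objective
        # (valid for "higher is better" SLIs only).
        return self.committed_target <= self.slo.target


availability = SLI(name="availability", unit="ratio")
slo = SLO(sli=availability, target=0.999, window_days=30)
sla = SLA(slo=slo, committed_target=0.995)
print(sla.has_safety_margin())
```

Keeping the SLA looser than the SLO leaves room to react internally before a customer-facing commitment is breached.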

## When to Use This Skill

This skill activates when you need to:
- Define SLAs, SLIs, and SLOs for a service.
- Track service performance against defined objectives.
- Calculate error budgets based on SLOs.
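
An error budget follows directly from an SLO target: it is the share of the window in which the service is allowed to miss the objective. The sketch below assumes a time-based availability SLO; the function name and the 30-day window are illustrative choices, not part of the skill.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


budget = error_budget_minutes(0.999, 30)
print(f"{budget:.1f} minutes")  # 43.2 minutes
```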

## Examples

### Example 1: Defining SLOs for a New Service

User request: "Create SLOs for our new payment processing service."

The skill will:
1. Prompt the user to define SLIs (e.g., latency, error rate).
2. Assist in setting target values for each SLI (e.g., p99 latency < 100ms, error rate < 0.01%).
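
Targets like the ones in this example can be checked against raw measurements roughly as follows. This is a minimal sketch: the nearest-rank percentile method, the function names, and the sample data are assumptions for illustration, and a production setup would more likely read pre-aggregated values from a monitoring system.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_slos(
    latencies_ms: list[float],
    errors: int,
    total: int,
    p99_limit_ms: float = 100.0,      # "p99 latency < 100ms"
    error_rate_limit: float = 0.0001,  # "error rate < 0.01%"
) -> dict:
    """Compare measured p99 latency and error rate against their targets."""
    p99 = percentile(latencies_ms, 99)
    error_rate = errors / total
    return {
        "p99_ms": p99,
        "error_rate": error_rate,
        "latency_ok": p99 < p99_limit_ms,
        "errors_ok": error_rate < error_rate_limit,
    }


# Hypothetical sample: 100 latency measurements and 5 failures in 100,000 requests.
sample_latencies = [20.0] * 98 + [80.0, 250.0]
result = evaluate_slos(sample_latencies, errors=5, total=100_000)
print(result)
```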

### Example 2: Tracking Availability

User request: "Track the availability SLI for the database service."

The skill will:
1. Guide the user in setting up the tracking of the availability SLI.
2. Visualize availability performance against the defined SLO.
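
One simple way to track an availability SLI is to compute it per day from request counts and mark the days that fall below the objective. The sketch below assumes a request-based definition of availability (successful requests divided by total requests); the dates and counts are made up for the example.

```python
def availability_by_day(
    counts: dict[str, tuple[int, int]], slo_target: float
) -> list[tuple[str, float, bool]]:
    """Return (day, availability, met_slo) for each day of (successful, total) counts."""
    rows = []
    for day, (successful, total) in counts.items():
        if total <= 0:
            raise ValueError(f"no requests recorded for {day}")
        ratio = successful / total
        rows.append((day, ratio, ratio >= slo_target))
    return rows


# Hypothetical daily request counts for a database service.
sample_counts = {
    "2024-01-01": (99_950, 100_000),
    "2024-01-02": (99_890, 100_000),
    "2024-01-03": (99_990, 100_000),
}

rows = availability_by_day(sample_counts, slo_target=0.999)
for day, ratio, met in rows:
    marker = "ok" if met else "BELOW SLO"
    print(f"{day}  {ratio:.4%}  {marker}")
```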

## Best Practices

- **Granularity**: Define SLIs that are specific and measurable.
- **Realism**: Set SLOs that are challenging but achievable.
- **Alignment**: Ensure SLAs align with the defined SLOs and business requirements.

## Integration

This skill can be integrated with monitoring tools to automatically collect SLI data and track performance against SLOs. It can also be used in conjunction with alerting systems to trigger notifications when SLO violations occur.
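
A common way to connect SLOs to alerting is to page on error budget burn rate rather than on raw error counts. The sketch below is one possible rule, not the skill's own alerting logic: it requires both a long and a short window to exceed the threshold, and the default of 14.4 is a commonly cited starting point (it corresponds to spending 2% of a 30-day budget in one hour). The function names and inputs are assumptions.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed default; tune per service
) -> bool:
    """Page only if both windows burn fast, so resolved incidents stop alerting quickly."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# 2% errors over the last hour and 3% over the last five minutes, against a 99.9% SLO.
print(should_page(0.02, 0.03, slo_target=0.999))
```

Requiring the short window as well keeps the alert from firing long after the error rate has already recovered.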

## Prerequisites

- SLI definitions stored in `${CLAUDE_SKILL_DIR}/slos/sli-definitions.yaml`
- Access to monitoring and metrics systems
- Historical performance data for baseline
- Business requirements for service reliability

## Instructions

1. Define Service Level Indicators (availability, latency, error rate, throughput)
2. Set Service Level Objectives with target values (e.g., 99.9% availability)
3. Formalize Service Level Agreements with customer commitments
4. Configure automated SLI data collection
5. Calculate error budgets based on SLOs
6. Track performance and alert on SLO violations

## Output

- SLI/SLO/SLA definition documents
- Real-time SLI metric dashboards
- Error budget calculations and burn rate
- SLO compliance reports
- Alerting configurations for violations
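
An SLO compliance report can be as simple as one line per objective. The sketch below is a hypothetical format, not the skill's actual output; it mainly shows that the comparison direction depends on the indicator, since availability should stay above its target while latency should stay below it.

```python
from dataclasses import dataclass


@dataclass
class SLOResult:
    name: str
    target: float
    measured: float
    higher_is_better: bool  # True for availability, False for latency or error rate

    @property
    def compliant(self) -> bool:
        if self.higher_is_better:
            return self.measured >= self.target
        return self.measured <= self.target


def render_report(results: list[SLOResult]) -> str:
    """Render one fixed-width line per SLO with a PASS/FAIL status."""
    lines = []
    for r in results:
        status = "PASS" if r.compliant else "FAIL"
        lines.append(
            f"{r.name:<20} target={r.target:<8g} measured={r.measured:<8g} {status}"
        )
    return "\n".join(lines)


# Hypothetical measurements for one reporting period.
results = [
    SLOResult("availability", target=0.999, measured=0.9994, higher_is_better=True),
    SLOResult("p99_latency_ms", target=100, measured=132, higher_is_better=False),
]
report = render_report(results)
print(report)
```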

## Error Handling

If SLI/SLO tracking fails:
- Verify SLI definition completeness
- Check metric collection infrastructure
- Validate data accuracy and granularity
- Ensure alerting system connectivity
- Review error budget calculation logic
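
The first check, definition completeness, lends itself to a small validation pass. The sketch below assumes the definitions have already been parsed into dictionaries; the required field names are hypothetical, since the actual schema of `sli-definitions.yaml` is not specified here.

```python
# Hypothetical required fields; adjust to the real definition schema.
REQUIRED_FIELDS = ("name", "metric", "target", "window_days")


def missing_fields(definition: dict) -> list[str]:
    """Return the required fields that are absent or empty in one SLI definition."""
    return [
        field for field in REQUIRED_FIELDS
        if definition.get(field) in (None, "")
    ]


def incomplete_definitions(definitions: list[dict]) -> dict[str, list[str]]:
    """Map each incomplete definition (by name, or by position if unnamed) to its missing fields."""
    problems = {}
    for index, definition in enumerate(definitions):
        missing = missing_fields(definition)
        if missing:
            label = definition.get("name") or f"<definition #{index}>"
            problems[label] = missing
    return problems


# Hypothetical parsed definitions.
definitions = [
    {"name": "availability", "metric": "http_success_ratio", "target": 0.999, "window_days": 30},
    {"name": "latency", "metric": "http_p99_ms", "target": 100},
    {"metric": "error_ratio", "target": 0.0001, "window_days": 30},
]
problems = incomplete_definitions(definitions)
print(problems)
```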

## Resources

- Google SRE book on SLIs and SLOs
- Error budget implementation guides
- Service reliability engineering practices
- SLO definition templates and examples

Related Skills

tracking-regression-tests

1868

from jeremylongshore/claude-code-plugins-plus-skills

Track and manage regression test suites across releases. Use when performing specialized testing. Trigger with phrases like "track regressions", "manage regression suite", or "validate against baseline".

windsurf-reliability-patterns

1868

from jeremylongshore/claude-code-plugins-plus-skills

Implement reliable Cascade workflows with checkpoints, rollback, and incremental editing. Use when building fault-tolerant AI coding workflows, preventing Cascade from breaking builds, or establishing safe practices for multi-file AI edits. Trigger with phrases like "windsurf reliability", "cascade safety", "windsurf rollback", "cascade checkpoint", "safe cascade workflow".

vercel-reliability-patterns

1868

from jeremylongshore/claude-code-plugins-plus-skills

Implement reliability patterns for Vercel deployments including circuit breakers, retry logic, and graceful degradation. Use when building fault-tolerant serverless functions, implementing retry strategies, or adding resilience to production Vercel services. Trigger with phrases like "vercel reliability", "vercel circuit breaker", "vercel resilience", "vercel fallback", "vercel graceful degradation".

supabase-reliability-patterns

1868

from jeremylongshore/claude-code-plugins-plus-skills

Build resilient Supabase integrations: circuit breakers wrapping createClient calls, offline queue with IndexedDB, graceful degradation with cached fallbacks, health check endpoints, retry with exponential backoff and jitter, and dual-write patterns for critical data. Use when building fault-tolerant apps, handling Supabase outages gracefully, implementing offline-first patterns, or adding retry logic to SDK calls. Trigger with phrases like "supabase circuit breaker", "supabase offline", "supabase retry", "supabase health check", "supabase fallback", "supabase resilience", "supabase dual write", "supabase outage".

snowflake-reliability-patterns

1868

from jeremylongshore/claude-code-plugins-plus-skills

Implement Snowflake reliability patterns: replication, failover, Time Travel recovery, and application-level resilience for Snowflake integrations. Use when building fault-tolerant pipelines, configuring disaster recovery, or adding resilience to production Snowflake services. Trigger with phrases like "snowflake reliability", "snowflake failover", "snowflake replication", "snowflake disaster recovery", "snowflake Time Travel".

shopify-reliability-patterns

1868

from jeremylongshore/claude-code-plugins-plus-skills

Implement reliability patterns for Shopify apps including circuit breakers for API outages, webhook retry handling, and graceful degradation. Trigger with phrases like "shopify reliability", "shopify circuit breaker", "shopify resilience", "shopify fallback", "shopify retry webhook".

sentry-reliability-patterns

1868

from jeremylongshore/claude-code-plugins-plus-skills

Build reliable Sentry integrations with graceful degradation, circuit breakers, and offline queuing. Use when implementing fault-tolerant error tracking, handling SDK initialization failures, building retry logic for Sentry transports, or ensuring apps survive Sentry outages. Trigger with "sentry reliability", "sentry circuit breaker", "sentry offline queue", "sentry graceful degradation", "sentry failover", or "resilient sentry setup".

salesforce-reliability-patterns

1868

from jeremylongshore/claude-code-plugins-plus-skills

Implement Salesforce reliability patterns including circuit breakers, idempotent upserts, and fallback caching. Use when building fault-tolerant Salesforce integrations, implementing retry strategies, or adding resilience to production Salesforce services. Trigger with phrases like "salesforce reliability", "salesforce circuit breaker", "salesforce idempotent", "salesforce resilience", "salesforce fallback", "salesforce retry".

retellai-reliability-patterns

1868

from jeremylongshore/claude-code-plugins-plus-skills

Retell AI reliability patterns — AI voice agent and phone call automation. Use when working with Retell AI for voice agents, phone calls, or telephony. Trigger with phrases like "retell reliability patterns", "retellai-reliability-patterns", "voice agent".

replit-reliability-patterns

1868

from jeremylongshore/claude-code-plugins-plus-skills

Implement reliability patterns for Replit: cold start handling, graceful shutdown, persistent state, and keep-alive. Use when building fault-tolerant Replit apps, handling container restarts, or adding resilience to production Replit deployments. Trigger with phrases like "replit reliability", "replit container restart", "replit data persistence", "replit always on", "replit graceful shutdown".

perplexity-reliability-patterns

1868

from jeremylongshore/claude-code-plugins-plus-skills

Implement reliability patterns for Perplexity Sonar API: circuit breaker, model fallback, streaming timeout, and citation validation. Trigger with phrases like "perplexity reliability", "perplexity circuit breaker", "perplexity fallback", "perplexity resilience", "perplexity timeout".

notion-reliability-patterns

1868

from jeremylongshore/claude-code-plugins-plus-skills

Graceful degradation when Notion is down: offline cache, retry with exponential backoff, circuit breaker, health checks, and fallback content. Use when building fault-tolerant Notion integrations for production. Trigger with phrases like "notion reliability", "notion circuit breaker", "notion offline fallback", "notion health check", "notion graceful degradation".