reliability-engineering

Use when designing or reviewing production reliability for APIs, SaaS platforms, background jobs, distributed workflows, mobile backends, or AI-enabled systems. Covers timeout and retry policy, degradation, queue safety, incident readiness, and recovery-aware design.

Best use case

reliability-engineering is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using reliability-engineering should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/reliability-engineering/SKILL.md --create-dirs "https://raw.githubusercontent.com/peterbamuhigire/skills-web-dev/main/skills/devops-cloud/reliability-engineering/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/reliability-engineering/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How reliability-engineering Compares

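| Feature / Agent | reliability-engineering | Standard Approach |
|-----------------|-------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |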

Frequently Asked Questions

What does this skill do?

It guides the design or review of production reliability for APIs, SaaS platforms, background jobs, distributed workflows, mobile backends, and AI-enabled systems. It covers timeout and retry policy, degradation, queue safety, incident readiness, and recovery-aware design.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Reliability Engineering
Acknowledgement: Shared by Peter Bamuhigire, techguypeter.com, +256 784 464178.

<!-- dual-compat-start -->
## Use When

- Use when designing or reviewing production reliability for APIs, SaaS platforms, background jobs, distributed workflows, mobile backends, or AI-enabled systems. Covers timeout and retry policy, degradation, queue safety, incident readiness, and recovery-aware design.
- The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.

## Do Not Use When

- The task is unrelated to `reliability-engineering` or would be better handled by a more specific companion skill.
- The request only needs a trivial answer and none of this skill's constraints or references materially help.

## Required Inputs

- Gather relevant project context, constraints, and the concrete problem to solve; load `references` only as needed.
- Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.

## Workflow

- Read this `SKILL.md` first, then load only the referenced deep-dive files that are necessary for the task.
- Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
- Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.

## Quality Standards

- Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
- Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
- Prefer deterministic, reviewable steps over vague advice or tool-specific magic.

## Anti-Patterns

- Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
- Loading every reference file by default instead of using progressive disclosure.

## Outputs

- A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
- Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
- References used, companion skills, or follow-up actions when they materially improve execution.

## Evidence Produced

| Category | Artifact | Format | Example |
|----------|----------|--------|---------|
| Operability | Runbook | Markdown doc per `skill-composition-standards/references/runbook-template.md` | `docs/runbooks/payment-failures.md` |
| Operability | Rollback plan | Markdown doc per `skill-composition-standards/references/rollback-plan-template.md` | `docs/releases/2026-04-16-rollback.md` |
| Operability | Failure-mode catalogue | Markdown doc listing known failure modes and mitigations | `docs/reliability/failure-modes-checkout.md` |

## References

- Use the `references/` directory for deep detail after reading the core workflow below.
- AI incidents: AI-specific failures (hallucination spike, prompt drift, model regression, retrieval drift, jailbreak, cost runaway, agent-action, tool-vendor outage, eval drift) need AI-shaped detection, mitigation, evidence capture, and postmortems. Do not extend this skill's generic runbook to cover them. See the AI incident stack: `ai-incident-detection-and-triage`, `ai-incident-response-runbook`, `ai-incident-evidence-capture`, `ai-incident-customer-comms`, `ai-incident-postmortem`, `ai-rca-taxonomy`, `ai-incident-recovery-and-rollback`, `ai-incident-drill-and-game-day`.
<!-- dual-compat-end -->
Use this skill when correctness under ideal conditions is not enough. The goal is to keep important workflows safe, available enough, diagnosable, and recoverable under load, dependency failure, stale state, and operator error.

## Load Order

1. Load `world-class-engineering`.
2. Load this skill when the system has external dependencies, background processing, scale risk, or meaningful uptime expectations.
3. Pair it with `observability-monitoring`, `deployment-release-engineering`, and `distributed-systems-patterns` when services or queues are involved.

## Reliability Workflow

### 1. Classify Criticality

For each important workflow, define:

- user and business impact if it fails
- maximum acceptable downtime or degradation
- data-loss tolerance
- financial, compliance, or trust consequences
- recovery time expectation
- acceptable operator effort or toil

Not every path needs the same reliability level.

### 2. Map Failure Modes

Explicitly list:

- dependency timeout or outage
- partial write or partial side effect
- duplicate delivery or replay
- stale reads or cache inconsistency
- concurrency conflict
- operator or configuration error
- overload, backpressure, or queue growth
- release-induced regression

If a failure mode is plausible and unhandled, the design is incomplete.
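
One way to make the "plausible and unhandled" rule checkable is to keep the failure-mode list as data and flag entries with no mitigation. The sketch below is illustrative only: the `FailureMode` record, its fields, and the sample entries are hypothetical and not part of this skill.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureMode:
    """Hypothetical catalogue entry; field names are assumptions."""
    name: str
    plausible: bool
    mitigation: Optional[str] = None  # None means nothing handles it yet


def unhandled(modes):
    """Return names of plausible failure modes that have no mitigation."""
    return [m.name for m in modes if m.plausible and not m.mitigation]


# Sample entries, invented for illustration.
catalogue = [
    FailureMode("dependency timeout", True, "timeout budget plus fallback"),
    FailureMode("duplicate delivery", True, "idempotency key"),
    FailureMode("partial write", True),
    FailureMode("clock skew", False),
]

print(unhandled(catalogue))  # ['partial write']
```

A non-empty result means the design is still incomplete for those modes.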

### 3. Design Protection Mechanisms

Choose deliberate policies for:

- timeout budgets
- retries and backoff
- idempotency and deduplication
- circuit breaking or load shedding
- queues, dead-letter handling, and replay
- graceful degradation or fallback behavior
- concurrency limits and admission control
- reconciliation jobs for eventually consistent workflows
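
As one illustration of the circuit-breaking policy listed above, the following is a minimal, single-threaded sketch. The class name, the thresholds, and the injectable `clock` are assumptions made for the example; a production breaker would also need thread safety and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: let this call through as a trial (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Rejecting calls while the circuit is open sheds load from a struggling dependency and gives callers a fast, explicit failure they can degrade on.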

### 4. Design Recovery

For every critical flow, define:

- how to detect failure
- who owns the first response
- whether to retry, compensate, reconcile, or roll back
- what can be replayed safely
- what manual tooling or runbook is needed
- how recent deployments or config changes will be ruled in or out quickly

### 5. Verify Reliability

Before production claims, produce evidence for:

- timeout and retry behavior
- degraded-state behavior
- queue recovery or replay
- duplicate-request safety
- alert and runbook usefulness
- overload or backpressure behavior
- staged recovery drills or game-day exercises for the highest-cost failures

## Reliability Standards

### Retries and Timeouts

- Retries without idempotency are usually a bug.
- Timeouts must be shorter than user patience and upstream collapse thresholds.
- Use bounded retries with jitter for transient failures.
- Do not retry validation failures, authorization failures, or deterministic business rejections.
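
The rules above can be sketched as a small retry helper. This is an illustration under stated assumptions: the exception classes, parameter names, and defaults are hypothetical, and `fn` is assumed to be idempotent so that repeating it is safe.

```python
import random
import time


class NonRetryableError(Exception):
    """Validation, authorization, or deterministic business rejection."""


class TransientError(Exception):
    """Timeout, connection reset, or a similar failure worth retrying."""


def call_with_retries(fn, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn, retrying only TransientError, at most max_attempts times in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except NonRetryableError:
            raise  # retrying cannot change the outcome
        except TransientError:
            if attempt == max_attempts:
                raise  # bounded: give up and surface the failure
            # Full jitter: sleep a random time below the capped exponential backoff.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting. Jitter spreads retries out so that many clients recovering at once do not hit the dependency in lockstep.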

### Queues and Jobs

- Every job needs an idempotency strategy or deduplication key.
- Poison messages need dead-letter or quarantine behavior.
- Replay must be safe, observable, and permissioned.
- Long-running jobs need progress or heartbeat signals.
- Queues need saturation and age monitoring, not only failure counts.
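
A minimal in-memory sketch of the first two rules, deduplication and dead-lettering, might look like the following. The `JobProcessor` class, the `dedup_key` field, and the return values are hypothetical; a real system would keep this state in a durable store with an atomic check-and-set rather than in process memory.

```python
class JobProcessor:
    """Hypothetical processor: skips duplicates and quarantines poison jobs."""

    def __init__(self, handler, max_attempts=3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.done = set()       # dedup keys of completed jobs
        self.attempts = {}      # dedup key -> failed attempt count
        self.dead_letter = []   # quarantined jobs kept for inspection and replay

    def process(self, job):
        key = job["dedup_key"]
        if key in self.done:
            return "duplicate"  # redelivery or replay: side effect already applied
        if self.attempts.get(key, 0) >= self.max_attempts:
            return "dead-lettered"  # already quarantined, do not run again
        try:
            self.handler(job)
        except Exception as exc:
            self.attempts[key] = self.attempts.get(key, 0) + 1
            if self.attempts[key] >= self.max_attempts:
                self.dead_letter.append({"job": job, "error": repr(exc)})
                return "dead-lettered"
            return "retry"
        self.done.add(key)
        return "done"
```

Recording the key only after the handler succeeds means a crash between the two steps can still repeat the side effect, so the handler itself should also be idempotent where possible.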

### Degradation

- Define what the user sees when a dependency is slow or unavailable.
- Prefer reduced capability over total failure where business risk allows.
- Fail closed for privileged or security-sensitive paths.
- Fail open only with deliberate justification and bounded blast radius.
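
The difference between degrading and failing closed can be shown with a small wrapper. This is a sketch, not a prescribed API: the function name, the `security_sensitive` flag, and the result shape are assumptions for illustration.

```python
def with_degradation(primary, fallback=None, security_sensitive=False):
    """Run primary; on failure, use fallback unless the path must fail closed."""
    try:
        return {"value": primary(), "degraded": False}
    except Exception:
        if security_sensitive or fallback is None:
            raise  # fail closed: never substitute a guess on a sensitive path
        # Reduced capability: return the fallback and mark the result as degraded.
        return {"value": fallback(), "degraded": True}
```

For example, a recommendations panel could pass `fallback=lambda: []` and render an empty list, while a permission check would set `security_sensitive=True` so that a dependency outage surfaces as an error instead of granting access. The `degraded` flag lets the caller tell the user what they are seeing.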

### Incident Readiness

- Alerts need an owner and a first action.
- Correlate incidents to release version, tenant, actor, and dependency.
- Keep recovery tools safe for operators under stress.
- Write runbooks for high-cost incidents before the incident happens.
- Rehearse at least the top failure scenarios often enough that the response is not theoretical.
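
The first rule, that every alert needs an owner and a first action, is easy to check mechanically if alert definitions are kept as data. The sketch below assumes a hypothetical dictionary format with `name`, `owner`, and `first_action` keys; real alerting tools have their own schemas.

```python
def incomplete_alerts(alerts):
    """Return names of alerts missing an owner or a first action."""
    required = ("owner", "first_action")
    return [a["name"] for a in alerts if not all(a.get(k) for k in required)]


# Sample definitions, invented for illustration.
alerts = [
    {"name": "checkout-error-rate", "owner": "payments-oncall",
     "first_action": "check latest release and dependency status"},
    {"name": "queue-age-high", "owner": "platform-oncall", "first_action": ""},
    {"name": "disk-usage", "first_action": "expand volume"},
]

print(incomplete_alerts(alerts))  # ['queue-age-high', 'disk-usage']
```

A check like this could run in review or CI so that an alert without an owner or first action never reaches production.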

## Deliverables

For meaningful reliability work, produce:

- criticality table
- failure-mode table
- timeout and retry policy
- degradation and fallback notes
- queue and replay strategy
- incident ownership and recovery outline
- reliability verification or exercise plan

## Review Checklist

- [ ] Critical workflows have explicit reliability targets or expectations.
- [ ] Retries, timeouts, and idempotency rules are coherent.
- [ ] Duplicate, replay, and partial-failure cases are handled safely.
- [ ] Degradation behavior is defined for dependency failures.
- [ ] Recovery paths and owners are explicit.
- [ ] Reliability claims are backed by tests, simulations, or staged evidence.

## References

- [references/reliability-patterns.md](references/reliability-patterns.md): Design rules for timeouts, retries, queues, and degradation.
- [references/incident-readiness.md](references/incident-readiness.md): Incident preparation and recovery prompts.
- [references/reliability-verification.md](references/reliability-verification.md): Reliability drills, overload checks, and evidence expectations.

Related Skills

world-class-engineering

8
from peterbamuhigire/skills-web-dev

Use when designing, building, reviewing, or upgrading production software systems that must be secure, performant, maintainable, scalable, and user-centered. Apply before writing specs, code, architecture, APIs, databases, mobile apps, SaaS platforms, or ERP systems.

gis-platform-engineering

8
from peterbamuhigire/skills-web-dev

Use when implementing GIS maps, spatial data services, maps integrations, geocoding, spatial APIs, or PostGIS-backed geospatial platforms. Load absorbed GIS mapping, maps integration, and PostGIS backend references as needed.

deployment-release-engineering

8
from peterbamuhigire/skills-web-dev

Use when designing or reviewing deployment pipelines, rollout strategies, release gates, rollback plans, migration-safe releases, and post-deploy verification for production systems. Covers build promotion, environment strategy, release evidence, and operational safety.

postgresql-engineering

8
from peterbamuhigire/skills-web-dev

Use when designing, implementing, or reviewing PostgreSQL application data models, SQL, indexes, constraints, extensions, server-side routines, and production query patterns. Load the absorbed PostgreSQL reference files for fundamentals, advanced SQL, schema patterns, and server programming.

mysql-engineering

8
from peterbamuhigire/skills-web-dev

Use when designing, implementing, or reviewing MySQL application schemas, SQL, indexes, constraints, stored routines, and production query patterns. Load absorbed MySQL best-practice, data-modeling, and advanced-SQL reference files as needed.

database-reliability

8
from peterbamuhigire/skills-web-dev

Database reliability engineering: SLI/SLO design and error-budget policy for the data tier, blameless postmortems, escalation tiers and on-call hand-off, game days for MySQL/PostgreSQL, operational runbooks, change management, capacity planning, and backup verification. Use when setting up production database SRE practice, defining database SLOs/error budgets, running database postmortems, or hardening on-call for MySQL/PostgreSQL.

database-design-engineering

8
from peterbamuhigire/skills-web-dev

Use when designing or reviewing relational or document-backed data architecture for SaaS platforms, ERP systems, APIs, analytics stores, or mobile sync. Covers domain modeling, tenancy, indexing, migrations, integrity, retention, and performance tradeoffs.

ai-prompt-engineering

8
from peterbamuhigire/skills-web-dev

Use when writing, refining, or structuring prompts for AI-powered app features — system prompts, user prompt templates, few-shot examples, chain-of-thought, prompt versioning, and defensive prompting.

web-app-security-audit

8
from peterbamuhigire/skills-web-dev

Use when auditing a PHP/JavaScript/HTML web application for security vulnerabilities. Covers configuration, authentication, authorization, input validation, XSS, API security, HTTP headers, and dependency scanning. Produces a severity-rated audit...

vibe-security-skill

8
from peterbamuhigire/skills-web-dev

Use when designing or reviewing security for a web application, API, or multi-tenant SaaS — produces threat model, abuse case list, auth/authz matrix, and secret handling plan; covers OWASP Top 10 2025 and the AI-code-generation blind spots. Neighbours — api-design-first owns auth model fields, deployment-release-engineering owns secret rotation choreography, ai-security and llm-security own model-specific threats.

network-security

8
from peterbamuhigire/skills-web-dev

Use when designing, hardening, or auditing network-layer security for self-managed Debian/Ubuntu SaaS infrastructure — firewalls (nftables/UFW), WAF (ModSecurity + OWASP CRS), VPN (WireGuard, OpenVPN, IPsec), TLS/PKI ops, IDS/IPS (Suricata, Fail2ban), zero-trust, SSH hardening, DDoS mitigation, DNS security. Complements web-app-security-audit (app layer) and cicd-devsecops (secrets/CI).

linux-security-hardening

8
from peterbamuhigire/skills-web-dev

Use when hardening a Debian/Ubuntu server — user/group/sudo hardening, file permission audits, PAM password policy + MFA, AppArmor mandatory access control, auditd system call logging, kernel sysctl hardening, file integrity monitoring (AIDE), rootkit detection (rkhunter/chkrootkit), unattended security patching, GRUB + UEFI + LUKS boot security, and CIS benchmark compliance.