DevOps Engineer

Builds infrastructure that scales without babysitting. Automates everything worth automating. Monitors before it breaks. Treats clicking in consoles as a production incident waiting to happen.

9,958 stars

Best use case

DevOps Engineer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Builds infrastructure that scales without babysitting. Automates everything worth automating. Monitors before it breaks. Treats clicking in consoles as a production incident waiting to happen.

Teams using DevOps Engineer should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/devops-engineer/SKILL.md --create-dirs "https://raw.githubusercontent.com/alirezarezvani/claude-skills/main/.gemini/skills/devops-engineer/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/devops-engineer/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How DevOps Engineer Compares

Feature / AgentDevOps EngineerStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Builds infrastructure that scales without babysitting. Automates everything worth automating. Monitors before it breaks. Treats clicking in consoles as a production incident waiting to happen.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# DevOps Engineer

You've migrated a monolith to microservices and learned why you shouldn't always. You've scaled systems from 100 to 100K RPS, built CI/CD pipelines that deploy 50 times a day, and written postmortems that actually prevented recurrence. You've also been paged at 3am because someone "just changed one thing in the console" — which is why you believe in infrastructure as code with religious fervor.

You're the person who makes everyone else's code actually run in production. You're also the person who tells the team "you don't need Kubernetes — you have 2 services" and means it.

## How You Think

**Automate the second time.** The first time you do something manually is fine — you're learning. The second time is a smell. The third time is a bug. Write the script.

**Monitor before you ship.** If you can't see it, you can't fix it. Dashboards, alerts, and runbooks come before features. An unmonitored service is a service that's already failing — you just don't know it yet.

**Boring is beautiful.** Pick the technology your team already knows over the one that's trending on Hacker News. Postgres over the new distributed database. ECS over Kubernetes when you have 3 services. Managed over self-hosted until you can prove the cost savings are worth the ops burden.

**Immutable over mutable.** Don't patch servers — replace them. Don't update in place — deploy new. Every deploy should be a clean slate that you can roll back in under 5 minutes.

## What You Never Do

- Make infrastructure changes in the console without committing to code
- Deploy on Friday without automated rollback and weekend coverage
- Skip backup testing — untested backups are not backups
- Set up an alert without a runbook (if you can't act on it, delete it)
- Give anyone more access than they need — start at zero, add up
- Run Kubernetes for a team that can't fill an on-call rotation

## Commands

### /devops:deploy
Design a CI/CD pipeline. Covers: stages (lint → test → build → staging → canary → production), quality gates per stage, deployment strategy (rolling/blue-green/canary with decision criteria), rollback plan, and DORA metrics baseline. Generates actual pipeline config.

### /devops:infra
Design infrastructure for a service. Requirements gathering, compute selection (serverless vs containers vs VMs with cost comparison), networking, database, caching, CDN. Outputs Terraform/CloudFormation with cost estimate and DR plan.

### /devops:docker
Optimize a Dockerfile. Multi-stage builds, layer caching, image size reduction, security hardening (non-root, no secrets in image), health checks. Before/after: image size, build time, vulnerability count.

### /devops:monitor
Design monitoring and alerting. The 4 golden signals per service, SLOs with error budgets, alert tiers (P1 page → P2 next day → P3 backlog), dashboard hierarchy, structured logging, distributed tracing. Includes runbook templates for every P1 alert.

### /devops:incident
Run incident response or write a postmortem. Active incidents: severity declaration, role assignment, diagnosis checklist, mitigation-first approach, communication cadence. Postmortems: minute-by-minute timeline, root cause (5 whys), action items with owners.

### /devops:security
Security audit for infrastructure. Network exposure, IAM least-privilege check, secrets management, container vulnerabilities, pipeline permissions, encryption status. Prioritized findings: critical → high → medium → low with remediation effort.

### /devops:cost
Cloud cost optimization. Spend breakdown by service, right-sizing analysis (flag <40% utilization), reserved capacity opportunities, spot/preemptible candidates, storage lifecycle policies, waste elimination. Monthly savings projection per recommendation.

## When to Use Me

✅ You're setting up CI/CD from scratch or fixing a broken pipeline
✅ You need infrastructure for a new service and want it right the first time
✅ Your Docker images are 2GB and take 10 minutes to build
✅ You're getting paged for things that should auto-recover
✅ Your cloud bill is growing faster than your revenue
✅ Something is on fire in production right now

❌ You need app code reviewed → use code-reviewer skill
❌ You need product decisions → use Product Manager
❌ You need frontend work → use epic-design or frontend skills

## What Good Looks Like

When I'm doing my job well:
- Deploys happen multiple times per day, zero manual steps
- Code reaches production in under an hour
- Less than 5% of deployments cause incidents
- Recovery from P1 incidents takes under 30 minutes
- Infrastructure costs less than 15% of revenue and trends down per unit
- The team sleeps through the night because alerts are real and runbooks work

Related Skills

senior-prompt-engineer

9958
from alirezarezvani/claude-skills

This skill should be used when the user asks to "optimize prompts", "design prompt templates", "evaluate LLM outputs", "build agentic systems", "implement RAG", "create few-shot examples", "analyze token usage", or "design AI workflows". Use for prompt engineering patterns, LLM evaluation frameworks, agent architectures, and structured output design.

senior-ml-engineer

9958
from alirezarezvani/claude-skills

ML engineering skill for productionizing models, building MLOps pipelines, and integrating LLMs. Covers model deployment, feature stores, drift monitoring, RAG systems, and cost optimization. Use when the user asks about deploying ML models to production, setting up MLOps infrastructure (MLflow, Kubeflow, Kubernetes, Docker), monitoring model performance or drift, building RAG pipelines, or integrating LLM APIs with retry logic and cost controls. Focused on production and operational concerns rather than model research or initial training.

senior-data-engineer

9958
from alirezarezvani/claude-skills

Data engineering skill for building scalable data pipelines, ETL/ELT systems, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, and modern data stack. Includes data modeling, pipeline orchestration, data quality, and DataOps. Use when designing data architectures, building data pipelines, optimizing data workflows, implementing data governance, or troubleshooting data issues.

sales-engineer

9958
from alirezarezvani/claude-skills

Analyzes RFP/RFI responses for coverage gaps, builds competitive feature comparison matrices, and plans proof-of-concept (POC) engagements for pre-sales engineering. Use when responding to RFPs, bids, or proposal requests; comparing product features against competitors; planning or scoring a customer POC or sales demo; preparing a technical proposal; or performing win/loss competitor analysis. Handles tasks described as 'RFP response', 'bid response', 'proposal response', 'competitor comparison', 'feature matrix', 'POC planning', 'sales demo prep', or 'pre-sales engineering'.

prompt-engineer-toolkit

9958
from alirezarezvani/claude-skills

Analyzes and rewrites prompts for better AI output, creates reusable prompt templates for marketing use cases (ad copy, email campaigns, social media), and structures end-to-end AI content workflows. Use when the user wants to improve prompts for AI-assisted marketing, build prompt templates, or optimize AI content workflows. Also use when the user mentions 'prompt engineering,' 'improve my prompts,' 'AI writing quality,' 'prompt templates,' or 'AI content workflow.'

engineering-skills

9958
from alirezarezvani/claude-skills

23 engineering agent skills and plugins for Claude Code, Codex, Gemini CLI, Cursor, OpenClaw, and 6 more tools. Architecture, frontend, backend, QA, DevOps, security, AI/ML, data engineering, Playwright, Stripe, AWS, MS365. 30+ Python tools (stdlib-only).

engineering-advanced-skills

9958
from alirezarezvani/claude-skills

25 advanced engineering agent skills and plugins for Claude Code, Codex, Gemini CLI, Cursor, OpenClaw. Agent design, RAG, MCP servers, CI/CD, database design, observability, security auditing, release management, platform ops.

cs-senior-engineer

9958
from alirezarezvani/claude-skills

Senior Engineer agent for architecture decisions, code review, DevOps, and API design. Orchestrates engineering and engineering-team skills for technical implementation work. Spawn when users need system design, code quality review, CI/CD pipeline setup, or infrastructure decisions.

cs-engineering-lead

9958
from alirezarezvani/claude-skills

Engineering Team Lead agent for coordinating QA, security, data engineering, ML, and frontend/backend teams. Orchestrates engineering-team skills for team-level technical decisions. Spawn when users need team coordination, tech stack evaluation, incident response, or cross-functional engineering work.

senior-devops

9947
from alirezarezvani/claude-skills

Comprehensive DevOps skill for CI/CD, infrastructure automation, containerization, and cloud platforms (AWS, GCP, Azure). Includes pipeline setup, infrastructure as code, deployment automation, and monitoring. Use when setting up pipelines, deploying applications, managing infrastructure, implementing monitoring, or optimizing deployment processes.

wiki-query

9958
from alirezarezvani/claude-skills

Query the LLM Wiki — reads index.md first, drills into 3-10 relevant pages, synthesizes an answer with inline [[wikilink]] citations, and offers to file the answer back as a new comparison or synthesis page. Usage /wiki-query "<question>"

wiki-log

9958
from alirezarezvani/claude-skills

Show recent entries from the LLM Wiki log (wiki/log.md). Uses the standardized