devops-infrastructure
Guides Docker, CI/CD pipelines, deployment strategies, infrastructure as code, and observability setup. Use when writing Dockerfiles, configuring GitHub Actions, planning deployments, setting up monitoring, or when asked about containers, pipelines, Terraform, or production infrastructure.
Best use case
devops-infrastructure is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using devops-infrastructure should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it at .claude/skills/devops-infrastructure/SKILL.md inside your project
- Restart your AI agent — it will auto-discover the skill
Frequently Asked Questions
What does this skill do?
Guides Docker, CI/CD pipelines, deployment strategies, infrastructure as code, and observability setup. Use when writing Dockerfiles, configuring GitHub Actions, planning deployments, setting up monitoring, or when asked about containers, pipelines, Terraform, or production infrastructure.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# DevOps & Infrastructure
### When to Load
- **Trigger**: Docker, CI/CD pipelines, deployment configuration, monitoring, infrastructure as code
- **Skip**: Application logic only with no infrastructure or deployment concerns
## DevOps Workflow
Copy this checklist and track progress:
```
DevOps Setup Progress:
- [ ] Step 1: Containerize application (Dockerfile)
- [ ] Step 2: Set up CI/CD pipeline
- [ ] Step 3: Define deployment strategy
- [ ] Step 4: Configure monitoring & alerting
- [ ] Step 5: Set up environment management
- [ ] Step 6: Document runbooks
- [ ] Step 7: Validate against anti-patterns checklist
```
## Docker Best Practices
### Multi-Stage Build
```dockerfile
# WRONG: Single stage, bloated image
FROM node:20
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
CMD ["node", "dist/index.js"]
# Result: 1.2GB image with devDependencies and source code
# CORRECT: Multi-stage build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup -g 1001 appgroup && adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder /app/package*.json ./
# Install production dependencies only; devDependencies stay in the builder stage
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
USER appuser
EXPOSE 3000
CMD ["node", "dist/index.js"]
# Result: ~150MB image, no devDependencies, non-root user
```
### Python Multi-Stage
```dockerfile
FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install uv
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY . .
FROM python:3.12-slim AS runner
WORKDIR /app
RUN useradd -r -s /bin/false appuser
COPY --from=builder /app/.venv /app/.venv
COPY --from=builder /app/src ./src
ENV PATH="/app/.venv/bin:$PATH"
USER appuser
CMD ["python", "-m", "src.main"]
```
### Layer Caching
```dockerfile
# WRONG: Cache busted on every code change
COPY . .
RUN npm ci
# CORRECT: Dependencies cached separately
COPY package*.json ./
# Cached unless package*.json changes
RUN npm ci
# Only source code changes bust this final layer
COPY . .
```
### .dockerignore
```
node_modules
.git
.env
*.md
.vscode
coverage
dist
__pycache__
.pytest_cache
*.pyc
```
### Security
```dockerfile
# Always pin exact versions (never node:latest)
FROM node:20.11.0-alpine
# Don't run as root
USER appuser
# Read-only filesystem where possible
# docker run --read-only --tmpfs /tmp myapp
# Scan images
# docker scout cves myimage:latest
# trivy image myimage:latest
```
## CI/CD Pipeline Design
### GitHub Actions Structure
```yaml
name: CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm"
      - run: npm ci
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: lint
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_PASSWORD: test # the official image refuses to start without one
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm"
      - run: npm ci
      - run: npm test

  build:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          push: ${{ github.event_name == 'push' }}
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - run: echo "Deploy to production"
```
### Caching Strategies
```yaml
# Node modules
- uses: actions/setup-node@v4
  with:
    cache: "npm"

# Python with uv
- name: Cache uv
  uses: actions/cache@v4
  with:
    path: ~/.cache/uv
    key: uv-${{ runner.os }}-${{ hashFiles('uv.lock') }}

# Docker layer caching
- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max
```
## Deployment Strategies
### Blue-Green Deployment
```
1. Run two identical environments: Blue (live) and Green (idle)
2. Deploy new version to Green
3. Run smoke tests on Green
4. Switch load balancer to Green
5. Green is now live, Blue is idle
6. Rollback: switch back to Blue
Pros: Instant rollback, zero downtime
Cons: 2x infrastructure cost during deploy
```
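The cutover in steps 4-6 is a single atomic pointer flip, which is what makes rollback instant. A minimal TypeScript sketch of the idea (the environment names, URLs, and `smokeTestPassed` flag are illustrative, not tied to any particular load balancer):

```typescript
// Minimal blue-green switch: two environments, one "live" pointer.
type Env = "blue" | "green";

interface Deployment {
  live: Env;
  urls: Record<Env, string>; // hypothetical upstream URLs
}

// The idle environment is always the deploy target.
function idleEnv(d: Deployment): Env {
  return d.live === "blue" ? "green" : "blue";
}

// Flip traffic only after smoke tests pass; rollback is the same flip.
function switchTraffic(d: Deployment, smokeTestPassed: boolean): Deployment {
  if (!smokeTestPassed) return d; // keep serving from the current live env
  return { ...d, live: idleEnv(d) };
}

const before: Deployment = {
  live: "blue",
  urls: { blue: "http://blue.internal:3000", green: "http://green.internal:3000" },
};
const after = switchTraffic(before, true); // green is now live; blue sits idle
```

In a real setup the "pointer" is a load balancer target group or DNS weight, but the invariant is the same: the flip is one small, reversible operation.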
### Canary Deployment
```
1. Deploy new version to small subset (5% of traffic)
2. Monitor error rates and latency
3. Gradually increase: 5% -> 25% -> 50% -> 100%
4. Rollback: route all traffic back to old version
Pros: Limited blast radius, real-world testing
Cons: More complex routing, longer rollout
```
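One common way to implement the gradual ramp is deterministic percentage bucketing: hash each user into a fixed bucket so they stay in the same cohort as the rollout widens. A sketch (the hash function is illustrative, not production-grade):

```typescript
// Map a user ID to a stable bucket in 0-99.
function bucket(userId: string): number {
  let h = 0;
  for (const ch of userId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % 100;
}

// A user is in the canary cohort when their bucket falls below the
// current rollout percentage (5 -> 25 -> 50 -> 100). Because buckets are
// stable, widening the percentage only ever adds users to the cohort.
function routeToCanary(userId: string, rolloutPercent: number): boolean {
  return bucket(userId) < rolloutPercent;
}
```

The stability property matters: a user who saw the new version at 5% keeps seeing it at 25%, which avoids flip-flopping sessions during the ramp.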
### Rolling Deployment
```
1. Replace instances one at a time
2. Each new instance passes health checks before next starts
3. Continue until all instances updated
Pros: No extra infrastructure, gradual rollout
Cons: Mixed versions during deploy, slower rollback
```
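The health-check gate in step 2 is the core of the strategy: the rollout halts the moment a new instance fails, leaving the remaining fleet on the old version. A simplified sketch (`Instance` and `healthCheck` stand in for real orchestrator APIs):

```typescript
type Instance = { id: number; version: string; healthy: boolean };

// Replace instances one at a time; stop early if a new instance is unhealthy.
function rollingUpdate(
  instances: Instance[],
  newVersion: string,
  healthCheck: (i: Instance) => boolean
): { instances: Instance[]; completed: boolean } {
  const result = instances.map((i) => ({ ...i })); // don't mutate the input fleet
  for (const inst of result) {
    inst.version = newVersion;
    inst.healthy = healthCheck(inst);
    if (!inst.healthy) return { instances: result, completed: false }; // halt the rollout
  }
  return { instances: result, completed: true };
}
```

This is also why rollback is slower than blue-green: undoing a partial rollout means walking the same loop in reverse rather than flipping one pointer.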
### Feature Flags
```typescript
// Simple feature flag implementation
const features = {
  NEW_CHECKOUT: process.env.FF_NEW_CHECKOUT === "true",
  DARK_MODE: process.env.FF_DARK_MODE === "true",
};

function getCheckoutFlow(user: User) {
  if (features.NEW_CHECKOUT && user.betaGroup) {
    return newCheckoutFlow(user);
  }
  return legacyCheckoutFlow(user);
}

// Use a proper service for production: LaunchDarkly, Unleash, Flagsmith
```
## Infrastructure as Code
### Terraform Basics
```hcl
# main.tf
terraform {
  required_version = ">= 1.5"

  backend "s3" {
    bucket = "myapp-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = var.instance_type

  tags = {
    Name        = "web-${var.environment}"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# variables.tf
variable "environment" {
  type    = string
  default = "dev"
}

variable "instance_type" {
  type    = string
  default = "t3.micro"
}

variable "ami_id" {
  type = string
}
```
### Terraform Rules
```
1. Always use remote state (S3, GCS, Terraform Cloud)
2. Lock state files to prevent concurrent modifications
3. Use variables and modules for reusability
4. Tag all resources with environment and ManagedBy
5. Run `terraform plan` before `terraform apply`
6. Never edit infrastructure manually (all changes via code)
7. Use workspaces or separate state files per environment
```
## Monitoring & Observability
### The Three Pillars
```
METRICS: Numeric measurements over time
- Request rate, error rate, latency (RED method)
- CPU, memory, disk, network (USE method)
- Business metrics (signups, purchases)
Tools: Prometheus, Datadog, CloudWatch

LOGS: Discrete events with context
- Structured JSON format
- Correlation IDs across services
- Log levels: DEBUG, INFO, WARN, ERROR
Tools: ELK Stack, Loki, CloudWatch Logs

TRACES: Request flow across services
- Distributed tracing with span context
- Latency breakdown per service
- Dependency mapping
Tools: Jaeger, Zipkin, Datadog APM
```
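The LOGS pillar above calls for structured JSON with correlation IDs; a minimal sketch of what one log line can look like (field names are illustrative; `randomUUID` is Node's built-in from the `crypto` module):

```typescript
import { randomUUID } from "crypto";

type Level = "DEBUG" | "INFO" | "WARN" | "ERROR";

// Emit one structured log line as a JSON string. The correlationId is
// generated at the edge and passed along on every downstream call so
// one request can be followed across services.
function logLine(
  level: Level,
  message: string,
  correlationId: string,
  extra: Record<string, unknown> = {}
): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    correlationId,
    ...extra,
  });
}

const cid = randomUUID();
const line = logLine("INFO", "order created", cid, { orderId: 42 });
```

Because every line is valid JSON, tools like Loki or CloudWatch Logs Insights can filter on `level` and join on `correlationId` without fragile regex parsing.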
### Health Check Endpoint
```typescript
// Express health check
app.get("/health", async (req, res) => {
const checks = {
uptime: process.uptime(),
timestamp: Date.now(),
database: "unknown",
redis: "unknown",
};
try {
await db.query("SELECT 1");
checks.database = "healthy";
} catch (e) {
checks.database = "unhealthy";
}
try {
await redis.ping();
checks.redis = "healthy";
} catch (e) {
checks.redis = "unhealthy";
}
const isHealthy = checks.database === "healthy";
res.status(isHealthy ? 200 : 503).json(checks);
});
```
### Alerting Rules
```
Good alerts:
- Error rate > 1% for 5 minutes (actionable)
- P99 latency > 2s for 10 minutes (meaningful)
- Disk usage > 80% (preventive)

Bad alerts:
- CPU spike for 30 seconds (too noisy)
- Any single 500 error (too sensitive)
- "Something might be wrong" (not actionable)

Alert fatigue is real. Every alert should require human action.
```
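The difference between "error rate > 1%" and "any single 500" can be sketched as a sliding-window check (window size and threshold values are illustrative):

```typescript
// Decide whether to alert based on the error RATE over the last
// `windowSize` requests, not on any individual failure.
// `outcomes` holds one boolean per request: true = the request errored.
function shouldAlert(
  outcomes: boolean[],
  windowSize: number,
  threshold: number
): boolean {
  const window = outcomes.slice(-windowSize);
  if (window.length < windowSize) return false; // not enough data yet
  const errors = window.filter((failed) => failed).length;
  return errors / window.length > threshold;
}
```

Real systems express the same idea in their query language, e.g. a Prometheus `rate()` over 5 minutes, but the principle is identical: one error never pages anyone; a sustained rate does.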
## Environment Management
### Dev/Staging/Prod Parity
```yaml
# docker-compose.yml for local development
services:
  app:
    build: .
    env_file: .env
    ports: ["3000:3000"]
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: myapp
      POSTGRES_PASSWORD: localdev # required by the official postgres image
    healthcheck:
      test: ["CMD-SHELL", "pg_isready"]
      interval: 5s
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

volumes:
  pgdata:
```
### Environment Variables
```
# .env.example (committed to git, no real values)
DATABASE_URL=postgresql://user:placeholder@localhost:5432/myapp
REDIS_URL=redis://localhost:6379
LOG_LEVEL=debug
API_KEY=your-key-here

# .env (never committed, listed in .gitignore)
# Contains real values for local development
```
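Validating these variables once at startup makes a missing value fail fast instead of surfacing as a confusing error at the first query. A sketch (the key names mirror .env.example above; the helper name is hypothetical):

```typescript
// Fail-fast config loading: required keys must be present, optional
// keys get sensible defaults. Call this once at process startup.
function loadConfig(env: Record<string, string | undefined>) {
  const required = ["DATABASE_URL", "REDIS_URL"] as const;
  for (const key of required) {
    if (!env[key]) throw new Error(`Missing required env var: ${key}`);
  }
  return {
    databaseUrl: env.DATABASE_URL!,
    redisUrl: env.REDIS_URL!,
    logLevel: env.LOG_LEVEL ?? "info", // optional, defaulted
  };
}

// Typical usage at the top of the entry point:
// const config = loadConfig(process.env);
```

Libraries such as zod or envalid give the same guarantee with richer type coercion, but even this hand-rolled version turns a vague runtime failure into a one-line startup error.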
## Common Anti-Patterns Summary
```
AVOID                             DO INSTEAD
-------------------------------------------------------------------
FROM node:latest                  Pin exact versions (node:20.11.0-alpine)
Running as root in container      Create and use non-root user
No .dockerignore                  Exclude .git, node_modules, .env
Single CI job does everything     Separate lint, test, build, deploy stages
Manual deployment                 Automated pipeline with approvals
No health checks                  Liveness + readiness probes
Alerts on every error             Alert on error RATE thresholds
Same config in all environments   Per-environment configuration
No rollback plan                  Test rollback before every deploy
Logs as unstructured strings      Structured JSON logs with correlation IDs
```

Related Skills
collecting-infrastructure-metrics
This skill enables Claude to collect comprehensive infrastructure performance metrics across compute, storage, network, containers, load balancers, and databases. It is triggered when the user requests "collect infrastructure metrics", "monitor server performance", "set up performance dashboards", or needs to analyze system resource utilization. The skill configures metrics collection, sets up aggregation, and helps create infrastructure dashboards for health monitoring and capacity tracking. It supports configuration for Prometheus, Datadog, and CloudWatch.
detecting-infrastructure-drift
This skill enables Claude to detect infrastructure drift from a desired state. It uses the `drift-detect` command to identify discrepancies between the current infrastructure configuration and the intended configuration, as defined in infrastructure-as-code tools like Terraform. Use this skill when the user asks to check for infrastructure drift, identify configuration changes, or ensure that the current infrastructure matches the desired state. It is particularly useful in DevOps workflows for maintaining infrastructure consistency and preventing configuration errors. Trigger this skill when the user mentions "drift detection," "infrastructure changes," "configuration drift," or requests a "drift report."
generating-infrastructure-as-code
This skill enables Claude to generate Infrastructure as Code (IaC) configurations. It uses the infrastructure-as-code-generator plugin to create production-ready IaC for Terraform, CloudFormation, Pulumi, ARM Templates, and CDK. Use this skill when the user requests IaC configurations for cloud infrastructure, specifying the platform (e.g., Terraform, CloudFormation) and cloud provider (e.g., AWS, Azure, GCP), or when the user needs help automating infrastructure deployment. Trigger terms include: "generate IaC", "create Terraform", "CloudFormation template", "Pulumi program", "infrastructure code".
checking-infrastructure-compliance
Use this skill when you need to work with compliance checking. It provides compliance monitoring and validation with comprehensive guidance and automation. Trigger with phrases like "check compliance", "validate policies", or "audit compliance".
import-infrastructure-as-code
Import existing Azure resources into Terraform using Azure CLI discovery and Azure Verified Modules (AVM). Use when asked to reverse-engineer live Azure infrastructure, generate Infrastructure as Code from existing subscriptions/resource groups/resource IDs, map dependencies, derive exact import addresses from downloaded module source, prevent configuration drift, and produce AVM-based Terraform files ready for validation and planning across any Azure resource type.
devops-rollout-plan
Generate comprehensive rollout plans with preflight checks, step-by-step deployment, verification signals, rollback procedures, and communication plans for infrastructure and application changes
azure-devops-cli
Manage Azure DevOps resources via CLI including projects, repos, pipelines, builds, pull requests, work items, artifacts, and service endpoints. Use when working with Azure DevOps, az commands, devops automation, CI/CD, or when user mentions Azure DevOps CLI.
terraform-infrastructure
Terraform infrastructure as code workflow for provisioning cloud resources, creating reusable modules, and managing infrastructure at scale.
devops-troubleshooter
Expert DevOps troubleshooter specializing in rapid incident response, advanced debugging, and modern observability. Masters log analysis, distributed tracing, Kubernetes debugging, performance optimization, and root cause analysis. Handles production outages, system reliability, and preventive monitoring. Use PROACTIVELY for debugging, incident response, or system troubleshooting.
cloud-devops
Cloud infrastructure and DevOps workflow covering AWS, Azure, GCP, Kubernetes, Terraform, CI/CD, monitoring, and cloud-native development.
devops-iac-engineer
Implements infrastructure as code using Terraform, Kubernetes, and cloud platforms. Designs scalable architectures, CI/CD pipelines, and observability solutions. Provides security-first DevOps practices and site reliability engineering guidance.
infrastructure-reporting
Generate comprehensive network infrastructure reports including health status, performance analysis, security audits, and capacity planning recommendations.