nw-infrastructure-and-observability
Infrastructure as Code patterns (Terraform, Kubernetes), observability design (SLOs, metrics, alerting, dashboards), and pipeline security stages. Load when designing infrastructure, observability, or security scanning.
Best use case
nw-infrastructure-and-observability is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using nw-infrastructure-and-observability should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/nw-infrastructure-and-observability/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How nw-infrastructure-and-observability Compares
| Feature / Agent | nw-infrastructure-and-observability | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Infrastructure as Code and Observability
## Terraform Patterns
### Module Structure
`main.tf` (resource definitions) | `variables.tf` (input declarations) | `outputs.tf` (output declarations) | `versions.tf` (provider/terraform version constraints) | `README.md` (module docs).
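As a rough illustration of how this layout could be checked in CI, the sketch below reports which of the files listed above are missing from a module directory. The function name `missing_module_files` is hypothetical and is not part of Terraform or of this skill.

```python
from pathlib import Path

# File names taken from the module structure listed above.
EXPECTED_FILES = ("main.tf", "variables.tf", "outputs.tf", "versions.tf", "README.md")


def missing_module_files(module_dir: Path) -> list[str]:
    """Return the expected files absent from module_dir, in the order listed above."""
    return [name for name in EXPECTED_FILES if not (module_dir / name).is_file()]
```

A pipeline step might fail the build when the returned list is non-empty.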
### State Management
Remote backend: S3/GCS/Azure Blob with state locking. State locking: DynamoDB/Cloud Storage/Azure Blob lease. Workspace strategy: one workspace per environment (dev/staging/prod).
### Security
Never commit secrets -- use secret managers | Encrypt state at rest | Use OIDC for CI/CD auth | Least privilege IAM roles.
### IaC Principles (Kief Morris)
Reproducibility (same input, same output) | Idempotency (safe to run multiple times) | Immutability (replace, do not modify) | Version control (track all changes).
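One way to picture idempotency and immutability together is a toy plan/apply loop: applying the desired state once should leave nothing to do on the next run, and a changed resource is replaced rather than edited in place. This is a minimal sketch over plain dictionaries, not how Terraform is implemented; `plan` and `apply` are hypothetical names.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired and actual state and list the resource names to act on."""
    shared = desired.keys() & actual.keys()
    return {
        "create": sorted(desired.keys() - actual.keys()),
        # Immutability: a changed resource is replaced, never modified in place.
        "replace": sorted(name for name in shared if desired[name] != actual[name]),
        "delete": sorted(actual.keys() - desired.keys()),
    }


def apply(desired: dict[str, dict]) -> dict[str, dict]:
    """Return the new actual state: a fresh copy of every desired resource."""
    return {name: dict(spec) for name, spec in desired.items()}
```

Running `plan` again after `apply` yields three empty lists, which is the property "safe to run multiple times" describes.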
### IaC Patterns
- **Stack pattern**: Complete infrastructure as single unit
- **Library pattern**: Reusable infrastructure modules
- **Pipeline pattern**: Infrastructure changes through CI/CD
## Kubernetes Patterns
### Core Concepts
Pods | Deployments | Services | Ingress | ConfigMaps | Secrets | PersistentVolumes | RBAC | NetworkPolicies | Pod Security Admission (replaces PodSecurityPolicies, removed in Kubernetes 1.25) | Operators | Custom Resources | Controllers.
### Production Patterns
Multi-tenancy with namespaces | Resource quotas and limits | Pod disruption budgets | Horizontal and vertical autoscaling.
### Deployment Template
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .name }}
  labels:
    app: {{ .name }}
    version: {{ .version }}
spec:
  replicas: {{ .replicas }}
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: {{ .name }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: {{ .name }}
    spec:
      containers:
        - name: {{ .name }}
          image: {{ .image }}:{{ .tag }}
          resources:
            requests:
              memory: {{ .memoryRequest }}
              cpu: {{ .cpuRequest }}
            limits:
              memory: {{ .memoryLimit }}
              cpu: {{ .cpuLimit }}
          # Restart the container when /health stops responding
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          # Keep the pod out of Service endpoints until /ready succeeds
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
```
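To see what `maxSurge: 25%` and `maxUnavailable: 0` mean for a concrete replica count, the following sketch computes the pod-count bounds during a rollout. It relies on the documented Kubernetes rounding behaviour (a percentage `maxSurge` rounds up, a percentage `maxUnavailable` rounds down); the function name is hypothetical.

```python
def rolling_update_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int) -> tuple[int, int]:
    """Return (max total pods, min available pods) during a rolling update."""
    surge = (replicas * max_surge_pct + 99) // 100       # ceiling division
    unavailable = (replicas * max_unavailable_pct) // 100  # floor division
    return replicas + surge, replicas - unavailable


# With 10 replicas and the template's settings, up to 13 pods may exist at once
# and the number of available pods never drops below 10.
print(rolling_update_bounds(10, 25, 0))
```

Setting `maxUnavailable` to 0 trades a slower rollout and some spare capacity for no loss of serving pods.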
### HPA Template
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ .name }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ .name }}
  minReplicas: {{ .minReplicas }}
  maxReplicas: {{ .maxReplicas }}
  metrics:
    # Utilization is measured against the container resource requests
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
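The scaling decision behind this template can be approximated with the formula from the Kubernetes documentation: `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, taking the largest proposal across metrics and ignoring ratios within a tolerance (0.1 by default). The sketch below is a simplification that leaves out stabilization windows, missing metrics, and unready pods; the function and parameter names are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, utilization: dict[str, float],
                     targets: dict[str, float], min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for utilization targets."""
    proposals = []
    for metric, target in targets.items():
        ratio = utilization[metric] / target
        if abs(ratio - 1.0) <= tolerance:
            proposals.append(current_replicas)  # close enough: no change
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    # The largest proposal wins, then it is clamped to the configured range.
    return max(min_replicas, min(max_replicas, max(proposals)))
```

For example, 4 replicas at 105% CPU against the 70% target and 80% memory against the 80% target would scale to 6, because the CPU proposal is the larger of the two.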
## Observability Design
### SLO Design
**Availability SLO**: `successful_requests / total_requests * 100`
- 99.9% = 8.76h downtime/year | 99.95% = 4.38h | 99.99% = 52.6min
- Error budget = 100% - SLO target
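The downtime figures above follow directly from the error budget. A minimal sketch of the arithmetic, assuming a 365-day year (8,760 hours); the function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year


def availability_pct(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: successful_requests / total_requests * 100."""
    return successful_requests / total_requests * 100


def error_budget_pct(slo_pct: float) -> float:
    """Error budget = 100% - SLO target."""
    return 100.0 - slo_pct


def allowed_downtime_hours_per_year(slo_pct: float) -> float:
    """Hours of full downtime per year that the error budget permits."""
    return HOURS_PER_YEAR * error_budget_pct(slo_pct) / 100


for slo in (99.9, 99.95, 99.99):
    print(f"{slo}% -> {allowed_downtime_hours_per_year(slo):.2f} h/year")
```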
**Latency SLO**: `requests_under_threshold / total_requests * 100`
- 99% of requests < 200ms | 99.9% of requests < 1000ms
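A latency SLO of this form can be evaluated from raw request durations. The sketch below is one possible shape, assuming latencies are available in milliseconds; `latency_sli_pct` and `meets_latency_slo` are hypothetical helpers.

```python
def latency_sli_pct(latencies_ms: list[float], threshold_ms: float) -> float:
    """Latency SLI: requests_under_threshold / total_requests * 100."""
    if not latencies_ms:
        raise ValueError("no requests in the measurement window")
    under = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return under / len(latencies_ms) * 100


def meets_latency_slo(latencies_ms: list[float], threshold_ms: float, target_pct: float) -> bool:
    """True when the share of requests under the threshold reaches the target."""
    return latency_sli_pct(latencies_ms, threshold_ms) >= target_pct
```

Both objectives above would be checked separately, for example `meets_latency_slo(samples, 200, 99)` and `meets_latency_slo(samples, 1000, 99.9)`.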
### Metrics Methods
**RED Method** (request-driven services): Rate (requests/sec) | Errors (error rate %) | Duration (latency p50, p90, p99).
**USE Method** (resources -- CPU, memory, disk): Utilization (% used) | Saturation (queue depth, waiting requests) | Errors (error counts).
**Four Golden Signals** (Google SRE): Latency | Traffic | Errors | Saturation.
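As an illustration of the RED method, the sketch below derives rate, error rate, and latency percentiles from a window of request records. It assumes errors are HTTP 5xx responses and uses the nearest-rank percentile definition; production systems usually compute these from histograms in the metrics backend instead. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(sorted_values: list[float], pct: int) -> float:
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, -(-pct * len(sorted_values) // 100))  # ceiling division
    return sorted_values[rank - 1]


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Rate, Errors, Duration for one non-empty measurement window."""
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate_pct": errors / len(requests) * 100,
        "p50_ms": percentile(durations, 50),
        "p90_ms": percentile(durations, 90),
        "p99_ms": percentile(durations, 99),
    }
```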
### SLO-Based Alerting
- Fast burn: >14.4x burn rate for 1 hour -> page
- Slow burn: >6x burn rate for 6 hours -> ticket
- Budget nearly exhausted: >50% consumed -> warning
Alert structure: alertname | severity | service | SLO name | current value | threshold | runbook URL | dashboard URL.
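Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below applies the thresholds listed above; the 30-day (720-hour) window is an assumption not stated in the source, and the function names are hypothetical.

```python
from typing import Optional

SLO_WINDOW_HOURS = 30 * 24  # assumption: 30-day SLO window


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1 - slo_target)


def budget_consumed_fraction(rate: float, hours: float) -> float:
    """Share of the window's error budget spent by `rate` sustained for `hours`."""
    return rate * hours / SLO_WINDOW_HOURS


def classify(burn_1h: float, burn_6h: float, budget_consumed: float) -> Optional[str]:
    """Map the three conditions above to an alert action, most urgent first."""
    if burn_1h > 14.4:
        return "page"
    if burn_6h > 6:
        return "ticket"
    if budget_consumed > 0.5:
        return "warning"
    return None
```

Under the 30-day assumption, a 14.4x burn sustained for one hour spends about 2% of the budget, and a 6x burn sustained for six hours spends about 5%.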
### Dashboard Design (per service)
Request rate (RPS) | Error rate (%) | Latency distribution (p50, p90, p99) | SLO status and error budget | Resource utilization (CPU, memory) | Dependency health.
### Three Pillars of Observability
- **Logs**: Event records with structured context. Use structured logging with correlation IDs.
- **Metrics**: Numeric measurements over time. Use RED/USE/Golden Signals.
- **Traces**: Request flow across services. Use distributed tracing with sampling.
Principles (Charity Majors): high cardinality is essential | debug in production | understand unknown unknowns.
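Structured logging with correlation IDs could look like the following sketch, which uses only the Python standard library. The field names (`correlation_id`, `level`, `message`, `logger`) and the `build_logger` helper are assumptions for illustration, not a prescribed schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via `extra={"correlation_id": ...}`; None when absent.
            "correlation_id": getattr(record, "correlation_id", None),
        })


def build_logger(name: str, stream=sys.stderr) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A request handler would generate or propagate one ID per request and pass it on every call, for example `logger.info("payment authorized", extra={"correlation_id": request_id})`, so that logs from different services can be joined on that field.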
## Pipeline Security
### Security Stages
**Pre-commit**: Secrets scanning (pre-commit hooks) | linting. Tools: pre-commit | gitleaks | detect-secrets.
**Commit stage**: SAST | dependency scanning (SCA) | license compliance | secrets scanning. Tools: Semgrep/CodeQL/Bandit/SonarQube (SAST) | Dependabot/Snyk/Trivy (SCA) | Gitleaks/TruffleHog (secrets).
**Build stage**: Container image scanning | SBOM generation | image signing. Tools: Trivy/Grype/Clair (scanning) | Syft/CycloneDX (SBOM) | Cosign/Notary (signing).
**Pre-production**: DAST | API security testing | infrastructure security scanning. Tools: OWASP ZAP/Nuclei (DAST) | Checkov/tfsec/Terrascan (infrastructure).
**Runtime**: Runtime security monitoring | network policy enforcement | admission control. Tools: Falco/Sysdig (runtime) | OPA Gatekeeper/Kyverno (admission).
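To make the secrets-scanning stage concrete, here is a toy scanner that flags two well-known patterns. It is only a sketch of the idea; the tools named above ship far larger rule sets plus entropy checks and should be used in real pipelines. The pattern set and the `scan_text` name are illustrative assumptions.

```python
import re

# Two widely recognised shapes; real scanners cover many more.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line number, rule name) for every line matching a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings
```

A pre-commit hook built on this idea would exit non-zero whenever `scan_text` returns any finding for the staged content.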
### Secrets Management
Principles: never commit secrets | use short-lived credentials | rotate regularly | audit access.
- External secrets: fetch from vault at runtime (HashiCorp Vault | AWS Secrets Manager | GCP Secret Manager)
- SOPS: encrypt secrets in git with GPG/KMS (for GitOps workflows)
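The "fetch at runtime" and "short-lived credentials" principles can be combined in a small cache that re-fetches a secret once its time-to-live expires. In this sketch the `fetch` callable stands in for a call to a secret manager client; no real vault API is shown, and `ShortLivedSecret` is a hypothetical name.

```python
import time
from typing import Callable, Optional


class ShortLivedSecret:
    """Cache a secret in memory and refresh it after ttl_seconds."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # placeholder for a secret manager lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached secret, fetching a fresh one if it has expired."""
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

Keeping the TTL shorter than the credential's real lifetime means a rotated secret is picked up without redeploying the service.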
### Supply Chain Security
- SBOM: Software Bill of Materials in SPDX or CycloneDX format, generated during build
- SLSA levels: L1 (documented build) | L2 (version control + build service) | L3 (isolated builds + signed provenance) | L4 (two-party review + hermetic builds)