nw-infrastructure-and-observability

Infrastructure as Code patterns (Terraform, Kubernetes), observability design (SLOs, metrics, alerting, dashboards), and pipeline security stages. Load when designing infrastructure, observability, or security scanning.

322 stars

bynWave-ai

View on GitHub Installation ↓

Best use case

nw-infrastructure-and-observability is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using nw-infrastructure-and-observability should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/nw-infrastructure-and-observability/SKILL.md --create-dirs "https://raw.githubusercontent.com/nWave-ai/nWave/main/nWave/skills/nw-infrastructure-and-observability/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/nw-infrastructure-and-observability/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How nw-infrastructure-and-observability Compares

Feature / Agent	nw-infrastructure-and-observability	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Infrastructure as Code and Observability

## Terraform Patterns

### Module Structure
`main.tf` (resource definitions) | `variables.tf` (input declarations) | `outputs.tf` (output declarations) | `versions.tf` (provider/terraform version constraints) | `README.md` (module docs).

### State Management
Remote backend: S3/GCS/Azure Blob with state locking. State locking: DynamoDB/Cloud Storage/Azure Blob lease. Workspace strategy: one workspace per environment (dev/staging/prod).

### Security
Never commit secrets -- use secret managers | Encrypt state at rest | Use OIDC for CI/CD auth | Least privilege IAM roles.

### IaC Principles (Kief Morris)
Reproducibility (same input, same output) | Idempotency (safe to run multiple times) | Immutability (replace, do not modify) | Version control (track all changes).

### IaC Patterns
- **Stack pattern**: Complete infrastructure as single unit
- **Library pattern**: Reusable infrastructure modules
- **Pipeline pattern**: Infrastructure changes through CI/CD

## Kubernetes Patterns

### Core Concepts
Pods | Deployments | Services | Ingress | ConfigMaps | Secrets | PersistentVolumes | RBAC | NetworkPolicies | PodSecurityPolicies | Operators | Custom Resources | Controllers.

### Production Patterns
Multi-tenancy with namespaces | Resource quotas and limits | Pod disruption budgets | Horizontal and vertical autoscaling.

### Deployment Template
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .name }}
  labels:
    app: {{ .name }}
    version: {{ .version }}
spec:
  replicas: {{ .replicas }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    spec:
      containers:
      - name: {{ .name }}
        image: {{ .image }}:{{ .tag }}
        resources:
          requests:
            memory: {{ .memoryRequest }}
            cpu: {{ .cpuRequest }}
          limits:
            memory: {{ .memoryLimit }}
            cpu: {{ .cpuLimit }}
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
```

### HPA Template
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ .name }}
  minReplicas: {{ .minReplicas }}
  maxReplicas: {{ .maxReplicas }}
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```

## Observability Design

### SLO Design
**Availability SLO**: `successful_requests / total_requests * 100`
- 99.9% = 8.76h downtime/year | 99.95% = 4.38h | 99.99% = 52.6min
- Error budget = 100% - SLO target

**Latency SLO**: `requests_under_threshold / total_requests * 100`
- 99% of requests < 200ms | 99.9% of requests < 1000ms

### Metrics Methods

**RED Method** (request-driven services): Rate (requests/sec) | Errors (error rate %) | Duration (latency p50, p90, p99).

**USE Method** (resources -- CPU, memory, disk): Utilization (% used) | Saturation (queue depth, waiting requests) | Errors (error counts).

**Four Golden Signals** (Google SRE): Latency | Traffic | Errors | Saturation.

### SLO-Based Alerting
- Fast burn: >14.4x burn rate for 1 hour -> page
- Slow burn: >6x burn rate for 6 hours -> ticket
- Budget nearly exhausted: >50% consumed -> warning

Alert structure: alertname | severity | service | SLO name | current value | threshold | runbook URL | dashboard URL.

### Dashboard Design (per service)
Request rate (RPS) | Error rate (%) | Latency distribution (p50, p90, p99) | SLO status and error budget | Resource utilization (CPU, memory) | Dependency health.

### Three Pillars of Observability (Charity Majors)
- **Logs**: Event records with structured context. Use structured logging with correlation IDs.
- **Metrics**: Numeric measurements over time. Use RED/USE/Golden Signals.
- **Traces**: Request flow across services. Use distributed tracing with sampling.

Principles: high cardinality is essential | debug in production | understand unknown unknowns.

## Pipeline Security

### Security Stages

**Pre-commit**: Secrets scanning (pre-commit hooks) | linting. Tools: pre-commit | gitleaks | detect-secrets.

**Commit stage**: SAST | dependency scanning (SCA) | license compliance | secrets scanning. Tools: Semgrep/CodeQL/Bandit/SonarQube (SAST) | Dependabot/Snyk/Trivy (SCA) | Gitleaks/TruffleHog (secrets).

**Build stage**: Container image scanning | SBOM generation | image signing. Tools: Trivy/Grype/Clair (scanning) | Syft/CycloneDX (SBOM) | Cosign/Notary (signing).

**Pre-production**: DAST | API security testing | infrastructure security scanning. Tools: OWASP ZAP/Nuclei (DAST) | Checkov/tfsec/Terrascan (infrastructure).

**Runtime**: Runtime security monitoring | network policy enforcement | admission control. Tools: Falco/Sysdig (runtime) | OPA Gatekeeper/Kyverno (admission).

### Secrets Management
Principles: never commit secrets | use short-lived credentials | rotate regularly | audit access.
- External secrets: fetch from vault at runtime (HashiCorp Vault | AWS Secrets Manager | GCP Secret Manager)
- SOPS: encrypt secrets in git with GPG/KMS (for GitOps workflows)

### Supply Chain Security
- SBOM: Software Bill of Materials in SPDX or CycloneDX format, generated during build
- SLSA levels: L1 (documented build) | L2 (version control + build service) | L3 (isolated builds + signed provenance) | L4 (two-party review + hermetic builds)

Related Skills

nw-ux-web-patterns

322

from nWave-ai/nWave

Web UI design patterns for product owners. Load when designing web application interfaces, writing web-specific acceptance criteria, or evaluating responsive designs.

nw-ux-tui-patterns

322

from nWave-ai/nWave

Terminal UI and CLI design patterns for product owners. Load when designing command-line tools, interactive terminal applications, or writing CLI-specific acceptance criteria.

nw-ux-principles

322

from nWave-ai/nWave

Core UX principles for product owners. Load when evaluating interface designs, writing acceptance criteria with UX requirements, or reviewing wireframes and mockups.

nw-ux-emotional-design

322

from nWave-ai/nWave

Emotional design and delight patterns for product owners. Load when designing onboarding flows, empty states, first-run experiences, or evaluating the emotional quality of an interface.

nw-ux-desktop-patterns

322

from nWave-ai/nWave

Desktop application UI patterns for product owners. Load when designing native or cross-platform desktop applications, writing desktop-specific acceptance criteria, or evaluating panel layouts and keyboard workflows.

nw-user-story-mapping

322

from nWave-ai/nWave

User story mapping for backlog management and outcome-based prioritization. Load during Phase 2.5 (User Story Mapping) to produce story-map.md and prioritization.md.

nw-tr-review-criteria

322

from nWave-ai/nWave

Review dimensions and scoring for root cause analysis quality assessment

nw-tlaplus-verification

322

from nWave-ai/nWave

TLA+ formal verification for design correctness and PBT pipeline integration

nw-test-refactoring-catalog

322

from nWave-ai/nWave

Detailed refactoring mechanics with step-by-step procedures, and test code smell catalog with detection patterns and before/after examples

nw-test-organization-conventions

322

from nWave-ai/nWave

Test directory structure patterns by architecture style, language conventions, naming rules, and fixture placement. Decision tree for selecting test organization strategy.

nw-test-design-mandates

322

from nWave-ai/nWave

Four design mandates for acceptance tests - hexagonal boundary enforcement, business language abstraction, user journey completeness, walking skeleton strategy, and pure function extraction

nw-tdd-review-enforcement

322

from nWave-ai/nWave

Test design mandate enforcement, test budget validation, 5-phase TDD validation, and external validity checks for the software crafter reviewer