platform-engineering

Platform Engineering: Internal Developer Platforms (IDP), CNCF Platform definition, Team Topologies, IDP components (Service Catalog, Self-Service Infra, Golden Paths, Developer Portal), platform maturity model, make-vs-buy (Backstage vs Port vs Cortex), adoption strategy, DORA correlation.

by marvinrichter

Best use case

platform-engineering is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using platform-engineering can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/platform-engineering/SKILL.md --create-dirs "https://raw.githubusercontent.com/marvinrichter/clarc/main/skills/platform-engineering/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/platform-engineering/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How platform-engineering Compares

Feature / Agent	platform-engineering	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

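What does this skill do?

It gives the agent a reference for building Internal Developer Platforms (IDPs), covering strategy through implementation.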

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Platform Engineering

Reference for building Internal Developer Platforms (IDPs) — from strategy to implementation.

## When to Activate

- Defining an IDP strategy for an engineering organization
- Deciding between Backstage and SaaS alternatives
- Designing a Golden Path for a standard service type
- Measuring platform adoption and impact on DORA metrics
- Planning platform team structure and operating model
- Assessing the current platform maturity level and identifying gaps
- Reducing developer toil by introducing self-service infrastructure provisioning
- Building or auditing a service catalog to track ownership and dependencies across teams

---

## What Is Platform Engineering?

> "Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations."
> — Luca Galante, platformengineering.org

**Platform as a Product:**
- Internal teams (stream-aligned teams) are the customers
- Platform team has a roadmap, measures NPS, responds to feedback
- Voluntary adoption wins over mandated adoption
- Success metric: developer satisfaction + DORA improvement

**Platform Engineering vs. DevOps:**

| Aspect | DevOps | Platform Engineering |
|--------|--------|---------------------|
| Focus | Culture + collaboration | Tooling + self-service |
| Scope | Team practices | Cross-team infrastructure |
| Measurement | Process metrics | Developer Experience (DevEx) |
| Output | Cultural shift | Paved roads (Golden Paths) |

---

## Team Topologies

From Skelton & Pais — four team types:

```
┌──────────────────────────────────────────────────────────┐
│  Stream-Aligned Teams                                     │
│  (Product teams — build and run features)                 │
│    ┌──────────┐  ┌──────────┐  ┌──────────┐             │
│    │ Team A   │  │ Team B   │  │ Team C   │             │
│    └──────────┘  └──────────┘  └──────────┘             │
├──────────────────────────────────────────────────────────┤
│  Platform Team                                            │
│  (Reduce cognitive load via self-service + Golden Paths)  │
│    ┌──────────────────────────────────────────────┐      │
│    │ Developer Portal (Backstage) + Infra + CI/CD  │      │
│    └──────────────────────────────────────────────┘      │
├──────────────────────────────────────────────────────────┤
│  Enabling Team             │  Complicated-Subsystem Team  │
│  (Coaching, upskilling)    │  (ML platform, data mesh)    │
└──────────────────────────────────────────────────────────┘
```

**Key principle:** The platform team exists to reduce the cognitive load on stream-aligned teams. If teams must deeply understand the platform to use it, it's not a platform; it's a dependency.

---

## IDP Components (CNCF Platforms Whitepaper)

### 1. Service Catalog

Central inventory of all services, APIs, libraries, and teams.

**Backstage catalog-info.yaml:**
```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-service
  description: Handles order creation, payment, and fulfillment
  annotations:
    github.com/project-slug: myorg/order-service
    pagerduty.com/integration-key: abc123
    sonarqube.org/project-key: order-service
  tags:
    - java
    - kafka
    - postgres
spec:
  type: service
  lifecycle: production
  owner: group:order-team
  system: ecommerce
  dependsOn:
    - resource:orders-db
    - resource:payments-queue
  providesApis:
    - order-api
  consumesApis:
    - payment-api
    - inventory-api
```

**What the catalog enables:**
- Dependency graph (who breaks if this changes?)
- Ownership matrix (who owns this? who's on-call?)
- Tech radar (what versions/libs are in use across org?)
- Runbook links, alerts, documentation — all in one place
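
The dependency-graph question ("who breaks if this changes?") amounts to a reverse lookup over `dependsOn` and `consumesApis`. A minimal Python sketch, assuming catalog entities have already been parsed into dicts shaped like the `catalog-info.yaml` above. The `checkout-service` entity and the helper names are hypothetical, and a real Backstage instance would answer this through its catalog rather than in-memory dicts.

```python
from collections import defaultdict

# Hypothetical in-memory entities, shaped like parsed catalog-info.yaml files.
ENTITIES = [
    {
        "metadata": {"name": "order-service"},
        "spec": {
            "owner": "group:order-team",
            "dependsOn": ["resource:orders-db", "resource:payments-queue"],
            "consumesApis": ["payment-api", "inventory-api"],
        },
    },
    {
        "metadata": {"name": "checkout-service"},  # invented for illustration
        "spec": {
            "owner": "group:checkout-team",
            "dependsOn": ["resource:orders-db"],
            "consumesApis": ["order-api"],
        },
    },
]


def build_reverse_index(entities):
    """Map each dependency or consumed API to the components that rely on it."""
    index = defaultdict(set)
    for entity in entities:
        name = entity["metadata"]["name"]
        spec = entity.get("spec", {})
        for target in spec.get("dependsOn", []) + spec.get("consumesApis", []):
            index[target].add(name)
    return index


def dependents_of(target, entities):
    """Return the sorted names of components affected if `target` changes."""
    return sorted(build_reverse_index(entities).get(target, set()))


print(dependents_of("resource:orders-db", ENTITIES))
```

Rebuilding the index on every call keeps the sketch short; a portal plugin would more likely cache it and refresh when the catalog changes.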

### 2. Self-Service Infrastructure

Developers create infrastructure via templates — no Ops ticket required.

```yaml
# Backstage Scaffolder template: provision a new database
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: provision-postgres
  title: Provision PostgreSQL Database
spec:
  type: resource  # `type` is a required spec field; this value is illustrative
  parameters:
    - title: Database Configuration
      required: [name, environment, size]
      properties:
        name:
          type: string
          description: Database name (will create myorg-{name}-db)
        environment:
          type: string
          enum: [dev, staging, production]
        size:
          type: string
          enum: [small, medium, large]
          description: "small: 10GB, medium: 100GB, large: 1TB"
  steps:
    - id: trigger-terraform
      name: Trigger Terraform
      action: github:actions:dispatch
      input:
        repoUrl: github.com?repo=infra&owner=myorg
        workflowId: provision-database.yml
        branchOrTagName: main
        # The target workflow must declare workflow_dispatch inputs with these names
        workflowInputs:
          db_name: ${{ parameters.name }}
          env: ${{ parameters.environment }}
          size: ${{ parameters.size }}  # forward the selected size tier
```
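
On the receiving end, the dispatched workflow has to turn the template's form fields into concrete settings. A hedged Python sketch of that translation, assuming the naming scheme (`myorg-{name}-db`) and the size tiers given in the template's descriptions. The `build_db_request` function and its name-validation rule are hypothetical, and the Terraform invocation itself is out of scope.

```python
import re

# Size tiers as described in the template's `size` parameter (1 TB taken as 1000 GB).
SIZE_TO_GB = {"small": 10, "medium": 100, "large": 1000}
ENVIRONMENTS = {"dev", "staging", "production"}

# Assumed rule: starts with a lowercase letter, then lowercase letters, digits, or hyphens.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9-]{0,30}")


def build_db_request(name: str, environment: str, size: str) -> dict:
    """Validate form input and return the settings a provisioning job would need."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid database name: {name!r}")
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if size not in SIZE_TO_GB:
        raise ValueError(f"unknown size: {size!r}")
    return {
        "db_name": f"myorg-{name}-db",
        "env": environment,
        "storage_gb": SIZE_TO_GB[size],
    }


print(build_db_request("orders", "staging", "medium"))
```

Validating again on the workflow side is deliberate: the form's `enum` constraints only protect requests that come through the portal.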

### 3. Golden Paths (Paved Roads)

Pre-built, opinionated templates for the most common service types:

```
Golden Path: Node.js REST API
  ├── Repository template (Backstage Scaffolder)
  ├── Dockerfile (optimized, multi-stage)
  ├── GitHub Actions CI/CD (test → build → deploy)
  ├── Kubernetes manifests (Deployment, Service, HPA)
  ├── Observability (OpenTelemetry pre-wired)
  ├── catalog-info.yaml (pre-filled)
  └── README with onboarding guide

Target time from idea → running service: ~10 minutes (vs. ~2 weeks without a Golden Path; illustrative figures)
```
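
One way to keep existing services aligned with a Golden Path is to check repositories for the artifacts it is supposed to generate. A rough Python sketch, assuming the file locations below; the tree above names the artifacts but not their paths, so every path here and the `missing_artifacts` helper are hypothetical.

```python
# Assumed file locations; adjust to wherever the template actually writes them.
REQUIRED_ARTIFACTS = {
    "Dockerfile": "Dockerfile",
    "CI/CD workflow": ".github/workflows/ci.yml",
    "Kubernetes manifests": "k8s/deployment.yaml",
    "catalog entry": "catalog-info.yaml",
    "README": "README.md",
}


def missing_artifacts(repo_files: set[str]) -> list[str]:
    """Return the labels of Golden Path artifacts absent from a repository."""
    return sorted(
        label for label, path in REQUIRED_ARTIFACTS.items() if path not in repo_files
    )


files = {"Dockerfile", "README.md", "catalog-info.yaml", "src/index.ts"}
print(missing_artifacts(files))
```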

### 4. Developer Portal

Single entry point for all developer tools and documentation:

| Section | What's there |
|---------|-------------|
| Service Catalog | All services, APIs, teams |
| Templates | Golden Paths, database provisioning |
| Docs | TechDocs, architecture decisions |
| CI/CD | GitHub Actions status per service |
| Incidents | PagerDuty active incidents |
| Cost | AWS cost per team/service |

### 5. Observability Platform

Standardized logs/metrics/traces:
- All services use the same telemetry library (for example Powertools for AWS Lambda or OpenTelemetry), chosen once per organization
- Single Grafana instance — teams get dashboards from their catalog entry
- Alerts owned by teams, not Ops

---

## Platform Maturity Model

| Level | Description | Indicators |
|-------|-------------|------------|
| **1 — Reactive** | No platform team, Ops does everything manually | Tickets for every deployment, weeks to provision DB |
| **2 — Managed** | Shared infra, but still manual processes | Same tools, some automation, but requires Ops help |
| **3 — Self-Service** | Teams deploy without Ops tickets | Golden Paths exist, 80%+ self-service |
| **4 — Ecosystem** | Platform itself is extensible by teams | Teams contribute plugins, templates, feedback loop |

**Quick assessment:**
```
Q1: How long to create a new database in production? (hours to days = Level 1-2)
Q2: How long to onboard a new engineer to their first commit? (days to weeks = Level 1)
Q3: Can teams deploy without opening an Ops ticket? (no = Level 1-2)
Q4: Do teams know who owns a service that's causing issues? (no = Level 1-2)
```
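
The four questions can be turned into a rough scoring helper. This is a sketch only: the thresholds (one hour for a database, two days for onboarding) and the `assess` function are assumptions chosen for illustration, not part of the maturity model, and the result only separates "likely Level 1-2" from "Level 3 or above".

```python
from dataclasses import dataclass


@dataclass
class Answers:
    db_provision_hours: float    # Q1: time to create a production database
    onboarding_days: float       # Q2: time until a new engineer's first commit
    deploy_without_ticket: bool  # Q3
    ownership_known: bool        # Q4


def assess(answers: Answers) -> str:
    """Return a coarse maturity band based on the quick-assessment questions."""
    gaps = []
    if answers.db_provision_hours >= 1:  # assumed threshold
        gaps.append("database provisioning is slow")
    if answers.onboarding_days >= 2:     # assumed threshold
        gaps.append("onboarding is slow")
    if not answers.deploy_without_ticket:
        gaps.append("deployments need an Ops ticket")
    if not answers.ownership_known:
        gaps.append("service ownership is unclear")
    if gaps:
        return "Level 1-2 (gaps: " + "; ".join(gaps) + ")"
    return "Level 3 or above"


print(assess(Answers(48, 10, False, True)))
```

Listing the gaps, rather than returning only a level, gives the platform team a starting backlog.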

---

## Make vs. Buy

| Tool | Type | Strengths | Weaknesses | Cost (approximate; verify current pricing) |
|------|------|-----------|------------|------|
| **Backstage** | OSS (self-hosted) | Fully customizable, huge ecosystem, CNCF project | High maintenance, requires dedicated team | Infrastructure + team time |
| **Port** | SaaS | Fast setup, good UX, flexible data model | Cost at scale, vendor lock-in | ~$10-20/dev/mo |
| **Cortex** | SaaS | Strong scorecards/standards enforcement | Less flexible catalog | ~$15-25/dev/mo |
| **OpsLevel** | SaaS | Good maturity tracking | Smaller ecosystem | ~$15/dev/mo |
| **Roadie** | Hosted Backstage | Backstage UX without maintenance burden | Still expensive | ~$25/dev/mo |

**Decision framework:**

```
< 20 engineers + fast time-to-value needed → Port or Cortex (SaaS)
20-100 engineers + Kubernetes-heavy + custom needs → Backstage (self-hosted)
> 100 engineers + large existing k8s infra → Backstage or Roadie
Compliance-heavy (HIPAA, SOC 2) → Self-hosted Backstage
```
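
The framework above can be expressed as a small function. A minimal sketch under two assumptions not stated in the rules: the compliance rule takes precedence over headcount, and the qualitative conditions (time-to-value, Kubernetes usage, custom needs) are left out, so only team size and compliance drive the result. The `recommend_portal` name is hypothetical.

```python
def recommend_portal(engineers: int, compliance_heavy: bool = False) -> str:
    """Apply the decision framework, checking compliance before headcount."""
    if engineers < 0:
        raise ValueError("engineers must be non-negative")
    if compliance_heavy:
        return "Self-hosted Backstage"
    if engineers < 20:
        return "Port or Cortex (SaaS)"
    if engineers <= 100:
        return "Backstage (self-hosted)"
    return "Backstage or Roadie"


for size in (12, 60, 400):
    print(size, "->", recommend_portal(size))
print(recommend_portal(12, compliance_heavy=True))
```

Treat the output as a starting point for discussion; the thresholds are rules of thumb rather than hard limits.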

---

## DORA Correlation

Platform engineering can improve each DORA metric through a specific mechanism:

| DORA Metric | Platform Improvement |
|-------------|---------------------|
| **Deployment Frequency** | Self-service CI/CD templates → teams deploy more often |
| **Lead Time** | Golden Paths remove setup friction → faster first deploy |
| **Change Failure Rate** | Standardized configs/tests → fewer config mistakes |
| **MTTR** | Unified observability + ownership in catalog → faster diagnosis |

> "Teams using IDPs deploy 2.1× more frequently and have 40% shorter lead times."
> — Puppet State of DevOps 2023
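
To check whether a platform change moved these metrics, each one first has to be computed from delivery data. A minimal Python sketch, assuming a hypothetical `Deployment` record with a commit time, a deploy time, a failure flag, and a restore time. Real setups would pull these fields from CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deployments


def dora_summary(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics over a window of `period_days` days."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }


t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True,
               restored_at=t0 + timedelta(hours=5)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
print(dora_summary(sample, period_days=7))
```

Running the same summary for the periods before and after a Golden Path rollout gives a simple comparison, though it cannot by itself show that the platform caused the change.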

---

## Adoption Strategy

The #1 platform failure mode: build it, mandate it, watch teams route around it.

**What works:**
1. **Start with pain** — find the top 3 complaints from stream-aligned teams
2. **Solve one thing extremely well** — service catalog beats trying to boil the ocean
3. **Make it easier than the alternative** — self-service must genuinely save time
4. **Voluntary adoption first** — mandate only after proving value
5. **Measure NPS quarterly** — platform team is a product team
6. **Embedded advocates** — one champion per stream-aligned team
7. **Contribute path** — teams can contribute templates/plugins
8. **Transparent roadmap** — teams see what's coming and can influence it
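
For the quarterly NPS measurement in item 5, the standard calculation is the percentage of promoters (scores 9-10) minus the percentage of detractors (scores 0-6) on a 0-10 survey question. A short Python sketch; the survey responses are made up for illustration.

```python
def net_promoter_score(scores: list[int]) -> float:
    """Return NPS in the range -100..100 from 0-10 survey responses."""
    if not scores:
        raise ValueError("at least one response is required")
    if any(not 0 <= s <= 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


# Hypothetical responses from one quarter's platform survey.
responses = [10, 9, 8, 7, 6, 3, 10, 9]
print(net_promoter_score(responses))
```

Tracking the score per stream-aligned team, and not only in aggregate, makes it easier to see which teams the platform is failing.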

**Anti-patterns:**
- Mandating adoption before proving value
- Building golden paths without consulting stream-aligned teams
- Platform team as approval bottleneck (vs. enabler)
- Ignoring feedback ("we know better")

## Reference

- `backstage-patterns` — catalog YAML, Scaffolder templates, plugins, TechDocs
- `engineering-metrics` — DORA metrics for measuring platform impact
- `dora-implementation` — technical setup for DORA tracking
