engineering-metrics

Engineering effectiveness metrics: DORA Four Keys (Deployment Frequency, Lead Time, Change Failure Rate, MTTR), SPACE Framework (Satisfaction, Performance, Activity, Communication, Efficiency), Goodhart's Law pitfalls, Velocity vs. Outcomes, Developer Experience measurement.


by marvinrichter


Best use case

engineering-metrics is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using engineering-metrics should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/engineering-metrics/SKILL.md --create-dirs "https://raw.githubusercontent.com/marvinrichter/clarc/main/skills/engineering-metrics/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/engineering-metrics/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How engineering-metrics Compares

Feature / Agent	engineering-metrics	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A


SKILL.md Source

# Engineering Metrics Skill

"If you can't measure it, you can't improve it" — but measuring the wrong things destroys teams. This skill covers the metrics frameworks that correlate with actual engineering effectiveness, and how to avoid turning metrics into gaming incentives.

## When to Activate

- Setting up an engineering effectiveness program
- Running a quarterly engineering health review
- Reporting engineering team performance to leadership
- Diagnosing why a team feels slow despite high activity
- Designing a Developer Experience (DevEx) initiative
- Establishing a DORA baseline for a newly formed engineering team
- Evaluating whether to track velocity, cycle time, or throughput for a specific team context
- Identifying leading indicators that predict deployment frequency or change failure rate before problems surface

---

## DORA Four Keys

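From the DORA (DevOps Research and Assessment) program, now part of Google Cloud, and its State of DevOps reports (2019 onward): the four metrics that best predict software delivery performance and organizational outcomes. The tier thresholds shift between report years, so treat the tables below as reference points rather than fixed cut-offs.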

### 1. Deployment Frequency

**What:** How often does the team successfully deploy to production?

| Performance | Frequency |
|-------------|-----------|
| Elite | Multiple times per day |
| High | Once per day to once per week |
| Medium | Once per week to once per month |
| Low | Less than once per month |

**Why it matters:** High deployment frequency → smaller batches → lower risk → faster feedback.

**How to measure:**
```bash
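# GitHub: count deployments created for the production environment.
# This counts every deployment record; check deployment statuses to keep only successful ones.
gh api --paginate "repos/{owner}/{repo}/deployments?environment=production&per_page=100" \
  --jq '.[].id' | wc -l

# Or: count merges into main over the last 30 days as a proxy
git log --after="30 days ago" --merges --first-parent --oneline main | wc -l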
```
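
Once a deployment count is available, mapping it onto the tier table is simple arithmetic. The sketch below is one possible reading of that table: the function name, the 30-day window, and the treatment of boundary values (exactly once per week counts as High, exactly once per month as Medium) are assumptions, not DORA definitions.

```python
def deployment_frequency_tier(deploy_count, window_days=30):
    """Map a deployment count over a window onto the tier table.

    Integer comparisons avoid floating-point boundary surprises.
    Boundary handling is an assumption of this sketch.
    """
    if deploy_count > window_days:          # more than one per day
        return "Elite"
    if deploy_count * 7 >= window_days:     # at least one per week
        return "High"
    if deploy_count * 30 >= window_days:    # at least one per month
        return "Medium"
    return "Low"


if __name__ == "__main__":
    for count in (90, 8, 2, 0):
        print(count, deployment_frequency_tier(count))
```

Feed it the number printed by either command above, with `window_days` set to the period that command covered.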

**Common improvement paths:**
- Trunk-based development (no long-lived branches)
- Feature flags (decouple deploy from release)
- Automated deployment pipeline (remove manual gates)

### 2. Lead Time for Changes

**What:** Time from first code commit to successful production deployment.

| Performance | Lead Time |
|-------------|-----------|
| Elite | < 1 hour |
| High | 1 day to 1 week |
| Medium | 1 week to 1 month |
| Low | 1 to 6 months |

**How to measure:**
```bash
# Approximate: mean hours from PR creation to merge (latest 100 merged PRs).
# This covers only the "PR open → merge" phase, not the full commit-to-deploy lead time.
gh pr list --state merged --limit 100 --json createdAt,mergedAt \
  --jq 'map(((.mergedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 3600) | add / length'

# Better: time from first commit in branch to deployment
# (requires deployment tracking in your CI/CD tool)
```
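
If your CI/CD tool can export a first-commit timestamp and a deployment timestamp per change, the "better" measurement reduces to a few lines. This is a minimal sketch under that assumption; the function name and input shape are hypothetical, and it reports the median rather than the mean used in the command above, because a single long-running change can dominate a mean.

```python
from datetime import datetime
from statistics import median


def median_lead_time_hours(changes):
    """changes: iterable of (first_commit_iso, deployed_iso) string pairs.

    Timestamps are assumed to be ISO 8601 with an explicit UTC offset,
    for example "2024-01-01T10:00:00+00:00".
    """
    durations = []
    for committed, deployed in changes:
        start = datetime.fromisoformat(committed)
        end = datetime.fromisoformat(deployed)
        durations.append((end - start).total_seconds() / 3600)
    return median(durations)


if __name__ == "__main__":
    sample = [
        ("2024-01-01T10:00:00+00:00", "2024-01-01T12:00:00+00:00"),
        ("2024-01-02T09:00:00+00:00", "2024-01-02T13:00:00+00:00"),
        ("2024-01-03T08:00:00+00:00", "2024-01-04T14:00:00+00:00"),
    ]
    print(median_lead_time_hours(sample))
```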

**Bottlenecks by phase:**

| Phase | Common Bottleneck | Fix |
|-------|------------------|-----|
| Coding → PR | Large PRs | Break into smaller PRs |
| PR open → merge | Slow reviews | SLA for reviews (e.g., <24h), PR size limit |
| Merge → deploy | Long CI pipeline | Parallelize tests, optimize Docker builds |
| Deploy → stable | Slow rollout | Automated canary, faster health checks |

### 3. Change Failure Rate

**What:** Percentage of deployments that result in a degraded service, requiring hotfix or rollback.

| Performance | Rate |
|-------------|------|
| Elite | 0–15% |
| High | 16–30% |
| Medium | 16–30% (same range as High; score depends on MTTR) |
| Low | 46–60% |

**How to measure:**
```bash
# Manual: incidents created within N hours of a deployment
# Automated: correlate deployment timestamps with PagerDuty/OpsGenie incident creation

# Simplified proxy: count revert/rollback/hotfix commits on main in the last 30 days
git log --oneline --since="30 days ago" main | grep -ciE "revert|rollback|hotfix"
# Divide by total deployments in the same period
```
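
The automated correlation mentioned in the comments above could look roughly like this. It is a sketch, assuming you have already exported deployment and incident-creation timestamps as `datetime` objects; the function name and the 4-hour attribution window are illustrative choices, not values from DORA.

```python
from datetime import datetime, timedelta


def change_failure_rate(deploys, incidents, window_hours=4):
    """Fraction of deployments followed by an incident within window_hours.

    deploys, incidents: lists of datetime objects.
    Caveat: if two deployments fall inside the same window, one incident
    is attributed to both of them.
    """
    if not deploys:
        return 0.0
    window = timedelta(hours=window_hours)
    failed = sum(
        1 for deployed in deploys
        if any(deployed <= created <= deployed + window for created in incidents)
    )
    return failed / len(deploys)


if __name__ == "__main__":
    deploys = [datetime(2024, 1, day, 9, 0) for day in (1, 2, 3, 4)]
    incidents = [datetime(2024, 1, 2, 10, 30)]
    print(change_failure_rate(deploys, incidents))
```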

### 4. Mean Time to Restore (MTTR)

**What:** How long to recover from a service degradation.

| Performance | MTTR |
|-------------|------|
| Elite | < 1 hour |
| High | < 1 day |
| Medium | 1 day to 1 week |
| Low | > 1 week |

**How to measure:** Incident duration from your on-call system (PagerDuty, OpsGenie, Grafana OnCall).
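
As a rough illustration, the calculation on exported incident data is a plain average of durations. The function name and input shape are assumptions; real on-call exports will need their own parsing.

```python
from datetime import datetime


def mean_time_to_restore_hours(incidents):
    """incidents: list of (started, resolved) datetime pairs."""
    durations = [
        (resolved - started).total_seconds() / 3600
        for started, resolved in incidents
    ]
    return sum(durations) / len(durations)


if __name__ == "__main__":
    sample = [
        (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
        (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
    ]
    print(mean_time_to_restore_hours(sample))
```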

**Improvement paths:**
- Runbooks for every alert
- Faster incident detection (alerting SLOs)
- Incident commander role (clear ownership)
- Post-mortems → preventive action items

### DORA Team Classification

| Level | Deploy Freq | Lead Time | CFR | MTTR |
|-------|------------|-----------|-----|------|
| Elite | Multiple/day | < 1h | 0–15% | < 1h |
| High | Daily/weekly | 1d–1wk | 16–30% | < 1d |
| Medium | Weekly/monthly | 1wk–1mo | 16–30% | 1d–1wk |
| Low | Monthly/less | 1–6mo | 46–60% | > 1wk |

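As a working rule, a team's level is the *lowest* of its four single-metric ratings (weakest link). Because High and Medium share the same CFR range, CFR alone cannot separate those two levels; the other three metrics decide.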
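
The weakest-link rule can be sketched as a one-line minimum over an ordered list of levels. The metric keys and function name here are illustrative only.

```python
LEVELS = ["Low", "Medium", "High", "Elite"]  # ordered worst to best


def team_level(ratings):
    """ratings: dict mapping metric name to its single-metric level."""
    return min(ratings.values(), key=LEVELS.index)


if __name__ == "__main__":
    print(team_level({
        "deploy_freq": "Elite",
        "lead_time": "High",
        "cfr": "Elite",
        "mttr": "Medium",
    }))
```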

---

## SPACE Framework

From Forsgren et al. (2021), a collaboration between researchers at GitHub, Microsoft Research, and the University of Victoria. Five dimensions of developer productivity:

| Dimension | What It Measures | Example Metrics |
|-----------|-----------------|-----------------|
| **S**atisfaction | Wellbeing, engagement, retention | eNPS, survey scores, attrition rate |
| **P**erformance | Outcomes achieved | Feature delivery, quality (defect rate), reliability |
| **A**ctivity | Work artifacts produced | PRs merged, commits, code reviews completed |
| **C**ommunication | Knowledge flow, collaboration | Cross-team PRs, documentation coverage, review turnaround |
| **E**fficiency | Flow state, low friction | Interruption rate, build time, onboarding time |

**Critical SPACE insight**: Never measure only Activity. A team can maximize commits while delivering zero business value.

Healthy signal: **S + P improving** while A stays constant = efficiency gain.
Warning signal: **A increasing** while S declining = burnout, unsustainable pace.
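
One way to make the two signals checkable is to compare period-over-period changes in each dimension. The sketch below assumes you already have a relative change per dimension (for example, +0.10 for a 10% improvement); the function name and the 5% tolerance for "stays constant" are arbitrary assumptions.

```python
def space_signal(delta_s, delta_p, delta_a, tolerance=0.05):
    """Classify a period-over-period change in Satisfaction, Performance, Activity.

    Deltas are relative changes (0.10 means +10%). The tolerance that
    separates "constant" from "changed" is an assumption of this sketch.
    """
    if delta_a > tolerance and delta_s < -tolerance:
        return "warning: activity up while satisfaction declines"
    if delta_s > tolerance and delta_p > tolerance and abs(delta_a) <= tolerance:
        return "healthy: satisfaction and performance up at constant activity"
    return "inconclusive"


if __name__ == "__main__":
    print(space_signal(0.10, 0.08, 0.00))
    print(space_signal(-0.10, 0.00, 0.20))
```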

---

## Goodhart's Law and Gaming

**Goodhart's Law**: "When a measure becomes a target, it ceases to be a good measure."

### Common DORA Gaming Patterns

| Metric | How Teams Game It | Consequence |
|--------|------------------|-------------|
| Deployment Frequency | Deploy config-only changes, trivial PRs | High frequency, no value delivery |
| Lead Time | Mark PRs as created late, skip code review | Fast on paper, poor quality |
| Change Failure Rate | Don't declare incidents, "it was a feature" | Hidden failures, no learning |
| MTTR | Close incidents prematurely, reopen later | Looks fast, actually slow |

### Anti-gaming Principles

1. **Never rank individuals by DORA** — only teams
2. **Never use DORA for performance reviews** — it will be gamed
3. **Show trends, not absolutes** — "improving" matters more than "Elite"
4. **Combine DORA with qualitative signals** — survey + metrics
5. **Let teams own their metrics** — not management measuring teams

---

## Velocity and Story Points

**What velocity is good for**: Sprint planning (capacity estimation), not performance measurement.

**What velocity is bad for**:
- Comparing teams (different story point calibrations)
- Measuring productivity (output ≠ outcome)
- Predicting business value (20 points of non-critical work ≠ 20 points of revenue-critical work)

**Better alternatives to velocity for effectiveness**:
- **Cycle time** (time from "in progress" to "done"): objective, no estimation bias
- **Throughput** (items completed/week): counts finished work, not estimated work
- **Flow efficiency** (active time / total time): % of time item is actually being worked on
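
The three alternatives can be computed from the same work-item export. This is a minimal sketch, assuming each item records its cycle time and the days it was actively worked on; the `WorkItem` type and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    cycle_days: float   # elapsed days from "in progress" to "done"
    active_days: float  # days the item was actually being worked on


def flow_metrics(items, weeks):
    """Return average cycle time, weekly throughput, and flow efficiency."""
    total_cycle = sum(item.cycle_days for item in items)
    total_active = sum(item.active_days for item in items)
    return {
        "avg_cycle_days": total_cycle / len(items),
        "throughput_per_week": len(items) / weeks,
        "flow_efficiency": total_active / total_cycle,
    }


if __name__ == "__main__":
    done = [WorkItem(10, 2), WorkItem(6, 2), WorkItem(2, 2), WorkItem(2, 2)]
    print(flow_metrics(done, weeks=2))
```

In this sample, four items finished in two weeks with 8 active days out of 20 elapsed days, so flow efficiency is 40%: most of the elapsed time was waiting, not work.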

---

## Developer Experience (DevEx)

From "DevEx: What Actually Drives Productivity" (Noda et al., 2023), three core factors:

1. **Flow State**: Ability to stay focused without interruptions
2. **Feedback Loops**: Quickly knowing if work is correct (fast CI, fast review)
3. **Cognitive Load**: How hard it is to understand and change the system (documentation, complexity, tooling)

**Quick proxy survey questions (1–7 scale):**
```
1. I can get into a flow state during my work (rarely 1 → often 7)
2. I feel confident that changes I make work correctly before deployment (1 → 7)
3. I understand how my work contributes to company goals (1 → 7)
4. Our development tools support my work effectively (1 → 7)
5. I feel energized by my work rather than drained (1 → 7)
```

Score < 4 on any item = action required.
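
Scoring the survey can be sketched as follows. The source does not say whether the threshold applies to individual answers or to a team aggregate; this example assumes the team mean per question, and the function name is hypothetical.

```python
from statistics import mean


def items_needing_action(responses, threshold=4):
    """Return 1-based question numbers whose mean score is below threshold.

    responses: one list per respondent, holding a 1-7 score per question
    in question order. Using the team mean is an assumption of this sketch.
    """
    per_item = [mean(scores) for scores in zip(*responses)]
    return [number for number, avg in enumerate(per_item, start=1) if avg < threshold]


if __name__ == "__main__":
    answers = [
        [6, 5, 3, 6, 5],
        [5, 6, 4, 7, 4],
        [7, 5, 2, 6, 6],
    ]
    print(items_needing_action(answers))
```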

---

## Leading vs. Lagging Indicators

DORA metrics are **lagging** — they tell you what already happened.

| Leading Indicator | Predicts |
|------------------|---------|
| PR size (lines of code) | Lead time (large PRs → slower review) |
| CI duration | Lead time (slow CI → slow delivery) |
| PR review turnaround | Lead time |
| Test coverage | Change failure rate |
| Incident runbook coverage | MTTR |
| Onboarding time | Team efficiency long-term |

**Track leading indicators weekly** to catch problems before they show in DORA.
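
A weekly check might compare the latest value of each indicator with its recent baseline. The sketch below assumes indicators where higher is worse (PR size, CI duration, review turnaround); indicators where higher is better, such as test coverage, would need the comparison inverted. The function name and the 20% threshold are illustrative assumptions.

```python
from statistics import mean


def worsening_indicators(weekly, threshold=0.2):
    """Flag indicators whose latest week exceeds the prior average by threshold.

    weekly: dict mapping indicator name to weekly values, oldest first.
    Assumes higher values are worse for every indicator passed in.
    """
    flagged = []
    for name, values in weekly.items():
        baseline = mean(values[:-1])
        if baseline > 0 and (values[-1] - baseline) / baseline > threshold:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    history = {
        "pr_size_loc": [200, 220, 180, 300],
        "ci_minutes": [12, 12, 12, 13],
    }
    print(worsening_indicators(history))
```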

---

## Reference Commands

- `/dora-baseline` — measure current DORA baseline for your team
- `/devex-survey` — design and run a developer experience survey
- `/engineering-review` — monthly engineering health review workflow
- `dora-implementation` skill — technical setup for extracting DORA data from GitHub/GitLab
