ml-system-design-interview

Coaches end-to-end ML system design interviews covering inference pipelines, recommendation systems, RAG, feature stores, and monitoring. Use for L6+ design rounds, ML architecture whiteboarding, system design practice, serving tradeoff analysis. Activate on "ML system design", "ML interview", "recommendation system design", "RAG architecture", "feature store design", "model serving". NOT for coding interviews, behavioral questions, ML theory quizzes, or paper implementations.

by curiositech

Best use case

ml-system-design-interview is best used when you need a repeatable AI agent workflow rather than a one-off prompt.

Teams using ml-system-design-interview can expect more consistent output, faster repeated runs, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/ml-system-design-interview/SKILL.md --create-dirs "https://raw.githubusercontent.com/curiositech/some_claude_skills/main/.claude/skills/ml-system-design-interview/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/ml-system-design-interview/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How ml-system-design-interview Compares

Feature / Agent	ml-system-design-interview	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

Where can I find the source code?

The source is in the curiositech/some_claude_skills repository on GitHub, under .claude/skills/ml-system-design-interview/.

SKILL.md Source

# ML System Design Interview

End-to-end ML pipeline design coaching for staff+ engineers. Covers the full arc from problem definition through production monitoring -- the scope expected at L6+ interviews at top-tier ML organizations.

This skill assumes 15+ years of ML/CV/AI/NLP experience. It does not teach fundamentals. It structures the knowledge you already have into the format interviewers reward.

---

## When to Use

**Use for**:
- Practicing 45-minute ML system design rounds
- Structuring whiteboard presentations for recommendation, ranking, RAG, fraud, perception systems
- Analyzing serving architecture tradeoffs (batch vs online vs streaming)
- Identifying L6+ differentiation signals (problem ownership, org constraints, data flywheels)
- Reviewing and critiquing ML system design answers

**NOT for**:
- Coding interviews (use `senior-coding-interview`)
- Behavioral / leadership questions (use `interview-loop-strategist`)
- ML theory or math derivations
- Implementing models or writing training code
- Paper reading or research review

---

## The 7-Stage Design Framework

Every ML system design answer follows this arc. The stages are sequential but you will loop back as constraints emerge. The Mermaid diagram below is your whiteboard skeleton.

```mermaid
flowchart TD
    R[1. Requirements\n- Business goal\n- Users and scale\n- Latency/throughput SLA\n- Constraints] --> M[2. Metrics\n- Offline: precision, recall, NDCG\n- Online: CTR, conversion, revenue\n- Guardrails: latency p99, fairness]
    M --> D[3. Data\n- Sources and collection\n- Labeling strategy\n- Pipeline: ETL, validation\n- Freshness and staleness]
    D --> F[4. Features\n- Engineering and transforms\n- Feature store architecture\n- Online vs offline features\n- Freshness requirements]
    F --> Mo[5. Model\n- Architecture selection\n- Training pipeline\n- Iteration strategy\n- Baseline and ablation]
    Mo --> S[6. Serving\n- Batch vs online vs streaming\n- Caching and precomputation\n- Scaling and cost\n- Canary and shadow mode]
    S --> Mon[7. Monitoring\n- Data drift detection\n- Model degradation alerts\n- A/B testing framework\n- Rollback strategy\n- Feedback loops]
    Mon -.->|Feedback loop| D
    Mon -.->|Retrain trigger| Mo
```

### Stage Details

**Stage 1 -- Requirements (5 minutes)**
Ask clarifying questions before designing anything. Establish: Who is the user? What is the business metric? What is the latency SLA? What scale (QPS, data volume)? What are hard constraints (cost, privacy, regulation)? An L6+ candidate owns the problem definition -- do not wait for the interviewer to hand you requirements.

**Stage 2 -- Metrics (3 minutes)**
Define offline metrics that you can measure before deployment AND online metrics that matter to the business. Explain the gap: "NDCG improvement offline does not always translate to CTR lift online because of position bias and novelty effects." Define guardrail metrics: latency p99, fairness across user segments, cost per prediction.
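
To make the offline side concrete, here is a minimal sketch of NDCG@k. It assumes graded relevance labels and a linear gain (some teams use `2^rel - 1` instead); the sample labels are hypothetical.

```python
import math
from typing import Sequence

def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG of the served order divided by DCG of the ideal (sorted) order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels, listed in the order the ranker served them.
served_order = [3, 2, 0, 1]
print(round(ndcg_at_k(served_order, k=4), 3))
```

A perfectly ordered list scores 1.0. The metric only sees the logged labels, which is one reason it can diverge from online CTR.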

**Stage 3 -- Data (7 minutes)**
Where does training data come from? How is it labeled (human, weak supervision, implicit signals)? What is the class balance? How fresh does data need to be? What is the data pipeline (batch ETL vs streaming)? What data quality checks exist? This stage separates L6+ candidates from L5 -- less experienced candidates assume clean labeled data.
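
Two of these questions, class balance and freshness, can be answered with very small checks. The sketch below is illustrative only: the function names, the one-day freshness budget, and the sample rows are assumptions, not part of any particular pipeline.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Dict, Hashable, Sequence

def class_balance(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Fraction of examples per label; exposes imbalance before modeling."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

def stale_fraction(event_times: Sequence[datetime], now: datetime,
                   max_age: timedelta) -> float:
    """Share of rows older than the freshness budget."""
    if not event_times:
        return 0.0
    return sum(1 for t in event_times if now - t > max_age) / len(event_times)

# Hypothetical sample: 1 positive in 4 rows, and one row that is two days old.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
labels = [0, 0, 0, 1]
times = [now - timedelta(hours=h) for h in (1, 2, 3, 48)]
print(class_balance(labels), stale_fraction(times, now, timedelta(days=1)))
```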

**Stage 4 -- Features (5 minutes)**
What features does the model need? Which are precomputed (offline) vs computed at request time (online)? Feature store architecture: online store (low-latency lookups) vs offline store (batch training). Feature freshness: user features update daily, item features update hourly, contextual features are real-time.
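
The online/offline split and the freshness budgets can be sketched with a toy in-memory store. Everything here is hypothetical (`OnlineStore`, `build_feature_vector`, the feature names); a real online store would be a low-latency key-value service, and contextual features would come straight from the request.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Freshness budgets in seconds, following the text: user daily, item hourly.
FRESHNESS_SLA_S = {"user": 24 * 3600, "item": 3600}

@dataclass
class OnlineStore:
    """Toy in-memory stand-in for a low-latency online feature store."""
    _rows: Dict[Tuple[str, str], Tuple[dict, float]] = field(default_factory=dict)

    def put(self, group: str, key: str, values: dict, written_at: float) -> None:
        self._rows[(group, key)] = (values, written_at)

    def get(self, group: str, key: str, now: float) -> Tuple[dict, bool]:
        """Return (values, is_fresh); a missing row returns ({}, False)."""
        row = self._rows.get((group, key))
        if row is None:
            return {}, False
        values, written_at = row
        return values, (now - written_at) <= FRESHNESS_SLA_S[group]

def build_feature_vector(store: OnlineStore, user_id: str, item_id: str,
                         context: dict, now: float) -> Tuple[dict, bool]:
    """Join precomputed user/item features with request-time context."""
    user, user_fresh = store.get("user", user_id, now)
    item, item_fresh = store.get("item", item_id, now)
    return {**user, **item, **context}, user_fresh and item_fresh

store = OnlineStore()
store.put("user", "u1", {"clicks_7d": 12}, written_at=0.0)
store.put("item", "i1", {"ctr_1h": 0.05}, written_at=0.0)
print(build_feature_vector(store, "u1", "i1", {"hour_of_day": 9}, now=1800.0))
```

Returning a freshness flag alongside the features lets the serving layer decide whether to fall back, log a staleness metric, or serve anyway.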

**Stage 5 -- Model (8 minutes)**
Start with a simple baseline (logistic regression, XGBoost) and explain why. Then propose the production architecture (two-tower, transformer, etc.) and justify the upgrade. Discuss training pipeline: how often, how much data, how to handle distribution shift. Iteration strategy: what experiments to run first.

**Stage 6 -- Serving (8 minutes)**
This is where system design and ML intersect. Discuss: inference latency requirements, batch precomputation vs online inference, GPU/CPU tradeoffs, model serving framework, caching strategy, cost optimization (quantization, distillation, spot instances). Draw the serving architecture.
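
One way to illustrate the caching strategy is a time-to-live cache in front of the model. This is a sketch under stated assumptions: `TTLPredictionCache` and `slow_model` are hypothetical names, the cache is single-process and unbounded, and a production system would more likely use a shared cache with eviction.

```python
import time
from typing import Callable, Dict, Hashable, Tuple

class TTLPredictionCache:
    """Caches predictions per key; recomputes once an entry is older than ttl_s."""

    def __init__(self, predict_fn: Callable[[Hashable], float], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._predict_fn = predict_fn
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, float]] = {}
        self.hits = 0
        self.misses = 0

    def predict(self, key: Hashable) -> float:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl_s:
            self.hits += 1
            return entry[0]
        # Miss or expired entry: call the model and remember when we did.
        self.misses += 1
        value = self._predict_fn(key)
        self._entries[key] = (value, now)
        return value

def slow_model(user_id: str) -> float:
    """Hypothetical stand-in for an expensive model call."""
    return len(user_id) / 10.0

cache = TTLPredictionCache(slow_model, ttl_s=60.0)
cache.predict("user-42")
cache.predict("user-42")
print(cache.hits, cache.misses)
```

The TTL is the knob that trades freshness for cost, which ties the caching discussion back to the freshness requirements from the features stage.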

**Stage 7 -- Monitoring (5 minutes)**
What happens after deployment? Data drift detection (PSI, KL divergence). Model degradation alerts (metric decay over time). A/B testing framework (sample size, duration, novelty effects). Safe rollout and rollback strategy (shadow mode, canary percentage). Feedback loops that improve the model over time.
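
As a concrete drift check, the population stability index (PSI) compares the binned distribution of a feature at training time against production. The sketch below assumes the fractions are already binned with identical bin edges; the bin values are hypothetical, and the thresholds people quote (roughly 0.1 to investigate, 0.25 for a significant shift) are rules of thumb rather than a standard.

```python
import math
from typing import Sequence

def psi(expected_fracs: Sequence[float], actual_fracs: Sequence[float],
        eps: float = 1e-6) -> float:
    """Population stability index over pre-binned fractions.

    eps floors empty bins so the logarithm stays defined.
    """
    if len(expected_fracs) != len(actual_fracs):
        raise ValueError("both distributions must use the same bins")
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

training_bins = [0.25, 0.25, 0.25, 0.25]    # hypothetical training-time fractions
production_bins = [0.10, 0.20, 0.30, 0.40]  # hypothetical production fractions
print(round(psi(training_bins, production_bins), 3))
```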

---

## 45-Minute Time Budget

| Phase | Minutes | What to Cover |
|-------|---------|---------------|
| Requirements + Clarification | 5 | Business goal, users, scale, SLA, constraints |
| Metrics | 3 | Offline, online, guardrails, metric alignment |
| Data | 7 | Sources, labeling, pipeline, quality, freshness |
| Features | 5 | Engineering, store architecture, online/offline split |
| Model | 8 | Baseline, production arch, training, iteration |
| Serving | 8 | Latency, architecture, cost, deployment strategy |
| Monitoring | 5 | Drift, alerts, A/B testing, rollback, feedback |
| Q&A Buffer | 4 | Interviewer deep-dives, defend tradeoffs |

If the interviewer cuts in with questions, adapt -- but cover all 7 stages even briefly. Skipping monitoring is the most common L5 mistake.
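
The budget arithmetic is easy to sanity-check. This small sketch copies the minutes from the table and derives the start and end minute of each phase; `schedule` is a hypothetical helper name.

```python
from itertools import accumulate
from typing import Dict, List, Tuple

# Minutes per phase, copied from the table above.
BUDGET_MIN: Dict[str, int] = {
    "Requirements": 5, "Metrics": 3, "Data": 7, "Features": 5,
    "Model": 8, "Serving": 8, "Monitoring": 5, "Q&A buffer": 4,
}

def schedule(budget: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Return (phase, start_minute, end_minute) in presentation order."""
    ends = list(accumulate(budget.values()))
    starts = [0] + ends[:-1]
    return list(zip(budget, starts, ends))

for phase, start, end in schedule(BUDGET_MIN):
    print(f"{start:>2}-{end:<2} {phase}")
```

The seven design stages take 41 minutes, which leaves the 4-minute buffer to reach 45.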

---

## Canonical Problem Set

| Problem | Key Challenges | Must-Discuss |
|---------|---------------|--------------|
| Recommendation System | Cold start, position bias, multi-objective optimization | Two-tower retrieval + reranking, exploration-exploitation |
| Search Ranking | Query intent classification, relevance vs engagement, latency at scale | Inverted index + embedding retrieval, L1/L2 ranking cascade |
| Content Moderation | Multi-modal (text+image+video), adversarial evasion, precision-recall tradeoff | Human-in-the-loop, escalation tiers, appeal workflow |
| RAG Pipeline | Retrieval quality, chunk strategy, hallucination detection, evaluation | Embedding model selection, hybrid search, reranking, citation |
| Fraud Detection | Extreme class imbalance, adversarial adaptation, real-time requirement | Feature velocity, graph features, ensemble + rules, feedback delay |
| Autonomous Driving Perception | Sensor fusion, safety-critical latency, long-tail distribution | Multi-task architecture, simulation, OTA updates, regulatory |

---

## Serving Architecture Comparison

| Pattern | Latency | Freshness | Cost | Best For |
|---------|---------|-----------|------|----------|
| Batch prediction | N/A (precomputed) | Hours-stale | Low compute, high storage | Email recommendations, daily reports |
| Online inference | 10-500ms | Real-time | High compute (GPU) | Search ranking, fraud detection |
| Near-real-time | 1-60s | Minutes-fresh | Medium | Feed ranking, content moderation |
| Streaming | Sub-second | Continuous | High (always-on) | Fraud, anomaly detection, bidding |

Detailed serving tradeoffs, framework comparisons, and cost optimization strategies are in `references/serving-tradeoffs.md`.
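
The table can be read as a rough decision rule. The sketch below encodes one such reading; the function name and the exact thresholds (one hour of tolerable staleness, one second of latency) are assumptions chosen to mirror the table rows, not fixed cutoffs.

```python
def choose_serving_pattern(max_staleness_s: float, max_latency_s: float,
                           event_driven: bool = False) -> str:
    """Map freshness and latency tolerances to a serving pattern (rough heuristic)."""
    if max_staleness_s >= 3600:
        return "batch prediction"   # hours-stale results are acceptable
    if event_driven and max_latency_s < 1:
        return "streaming"          # always-on, reacts to each event
    if max_latency_s >= 1:
        return "near-real-time"     # seconds of delay are acceptable
    return "online inference"       # request/response with a tight SLA

print(choose_serving_pattern(max_staleness_s=86400, max_latency_s=0.1))
print(choose_serving_pattern(max_staleness_s=0, max_latency_s=0.2))
```

In an interview the value is in stating the two tolerances out loud and then defending the pattern, not in the specific numbers.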

---

## L6+ Differentiation Signals

What separates a staff+ answer from a senior answer:

**1. Own the Problem Definition**
Do not accept the problem as stated. Ask: "What business metric are we optimizing? Is this a revenue problem or an engagement problem? What is the current solution and why is it insufficient?" L5 candidates accept "build a recommendation system." L6+ candidates ask "what are we recommending, to whom, and what does success look like?"

**2. Discuss Organizational Constraints**
Real systems live inside organizations. Address: team size (can we maintain a custom model or should we use a managed service?), on-call burden, cross-team data dependencies, compliance requirements, migration path from legacy system.

**3. Data Flywheel Strategy**
Show that you think about the virtuous cycle: better model -> more engagement -> more data -> better model. Discuss how to accelerate it: active learning, implicit feedback loops, exploration strategies, cold-start bootstrapping.
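
Exploration is the part of the flywheel that is easiest to show in code. A minimal epsilon-greedy sketch, assuming a hypothetical `pick_item` helper and made-up scores, looks like this; real systems often use bandits or Thompson sampling instead.

```python
import random
from typing import Dict, Tuple

def pick_item(scores: Dict[str, float], epsilon: float,
              rng: random.Random) -> Tuple[str, bool]:
    """Return (item, explored): explore uniformly with probability epsilon."""
    if rng.random() < epsilon:
        # Sorting the keys keeps the choice reproducible for a given seed.
        return rng.choice(sorted(scores)), True
    return max(scores, key=scores.get), False

model_scores = {"item_a": 0.9, "item_b": 0.4, "item_c": 0.1}  # hypothetical
rng = random.Random(7)
picks = [pick_item(model_scores, epsilon=0.1, rng=rng) for _ in range(1000)]
print(sum(explored for _, explored in picks), "of 1000 requests explored")
```

Logging the `explored` flag matters: it marks which impressions were served at random, so that feedback can be used to correct for the model's own selection bias.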

**4. Build vs Buy Decisions**
Not everything should be custom. Argue for managed services where appropriate (embedding APIs, feature stores, serving platforms) and custom solutions where competitive advantage demands it. Show you understand the total cost of ownership.

**5. Multi-Objective Thinking**
Real systems optimize multiple objectives simultaneously: relevance AND diversity, accuracy AND fairness, quality AND latency. Discuss how to handle conflicts: Pareto optimization, constrained optimization, multi-task learning, business-rule post-processing.
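
For the relevance-and-diversity conflict, a simple post-processing option is maximal marginal relevance (MMR), named here as one illustrative technique rather than the method the text prescribes. The sketch assumes relevance scores in [0, 1] and a hypothetical category-based similarity.

```python
from typing import Callable, Dict, List

def mmr_rerank(relevance: Dict[str, float],
               similarity: Callable[[str, str], float],
               k: int, lam: float = 0.7) -> List[str]:
    """Greedily pick items maximizing lam * relevance - (1 - lam) * redundancy."""
    selected: List[str] = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        best, best_score = None, float("-inf")
        for item in sorted(remaining):  # sorted for deterministic tie-breaking
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            score = lam * relevance[item] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = item, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical catalog: two items from category "a", one from category "b".
category = {"a1": "a", "a2": "a", "b1": "b"}
relevance = {"a1": 0.9, "a2": 0.85, "b1": 0.6}
same_category = lambda x, y: 1.0 if category[x] == category[y] else 0.0
print(mmr_rerank(relevance, same_category, k=3))
```

Setting `lam=1.0` recovers the pure relevance ordering, so the single parameter makes the tradeoff explicit and easy to A/B test.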

---

## Whiteboard Strategy

**What to draw and when:**

| Time | Draw This | Purpose |
|------|-----------|---------|
| 0-5 min | Requirements box with bullet points | Anchor the discussion, show structured thinking |
| 5-8 min | Metric table (offline vs online) | Demonstrate you think beyond model accuracy |
| 8-15 min | Data pipeline diagram (sources -> ETL -> store) | Show you understand data engineering |
| 15-20 min | Feature architecture (offline store + online store) | Demonstrate feature store knowledge |
| 20-28 min | Model architecture + serving diagram | The core system design artifact |
| 28-36 min | Full system diagram with latency annotations | Connect everything, show you can ship |
| 36-41 min | Monitoring dashboard sketch + feedback arrows | Close the loop, show production thinking |

Use boxes for components, arrows for data flow, and annotate with latency/throughput numbers. The diagram should be readable by someone who walks in at minute 30.
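
The latency annotations can be checked against the SLA with simple addition. This is a whiteboard-level estimate only: the component names and numbers are hypothetical, and tail percentiles of sequential stages do not add exactly, so treat the result as a rough budget rather than a guarantee.

```python
from typing import Dict

def latency_headroom_ms(component_p99_ms: Dict[str, float], sla_ms: float) -> float:
    """SLA minus the sum of sequential component latencies; negative means over budget."""
    return sla_ms - sum(component_p99_ms.values())

# Hypothetical annotations for a request path drawn on the whiteboard.
request_path = {"feature_lookup": 10, "candidate_retrieval": 30, "ranking_model": 40}
print(latency_headroom_ms(request_path, sla_ms=100))
```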

---

## Anti-Patterns

### Model-First Thinking

**Novice**: Jumps to "I would use a transformer" or "Let me describe the attention mechanism" in the first 2 minutes, before understanding the problem, defining metrics, or discussing data. Spends 70% of time on model architecture and 0% on serving.

**Expert**: Spends the opening of the interview on requirements, metrics, and data (about 15 minutes under the time budget above) before mentioning any model. Names a simple baseline first (logistic regression on handcrafted features), then argues for complexity only when the baseline's limitations are clear. Gives serving as much time as the model and still reserves time for monitoring.

**Detection**: Architecture diagram has a detailed model box but no data pipeline, no feature store, no serving layer, and no monitoring component. Mentions model architecture in the first sentence.

### Ignoring the Data

**Novice**: Assumes clean, labeled data exists at scale. Says "we would train on millions of labeled examples" without discussing where labels come from, how much they cost, what the class distribution looks like, or how stale the data gets.

**Expert**: Asks about data sources, labeling strategy (human vs weak supervision vs implicit signals), class imbalance handling, data freshness SLA, and data quality monitoring. Discusses the cost of labeling and proposes strategies to reduce it (active learning, semi-supervised methods, synthetic data).

**Detection**: No discussion of data collection, labeling costs, class imbalance, data quality checks, or data freshness anywhere in the answer. The word "label" does not appear.

### No Monitoring Story

**Novice**: Design ends at the serving layer. No mention of what happens after the model is deployed. Does not discuss how to detect degradation, how to roll back, or how to improve the model over time.

**Expert**: Discusses data drift detection (population stability index, feature distribution monitoring), model performance decay alerts, A/B testing framework with proper statistical rigor, canary deployment strategy, shadow mode for safe rollouts, and explicit feedback loops that flow data back into retraining.

**Detection**: Architecture diagram has no monitoring component. No feedback arrows from production back to training. No mention of A/B testing, canary deployment, or rollback.
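
Statistical rigor in A/B testing usually starts with a sample size estimate. The sketch below uses the standard normal-approximation formula for comparing two proportions; the baseline CTR, lift, significance level, and power are hypothetical inputs, and `samples_per_arm` is an illustrative name.

```python
import math
from statistics import NormalDist

def samples_per_arm(baseline_rate: float, min_detectable_lift: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per arm to detect an absolute lift with a two-sided z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical: 10% baseline CTR, detect an absolute lift of one percentage point.
print(samples_per_arm(baseline_rate=0.10, min_detectable_lift=0.01))
```

Halving the detectable lift roughly quadruples the required sample, which is the quickest way to explain why small improvements need long experiments.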

---

## Reference Files

Consult these for deep dives -- they are NOT loaded by default:

| File | Consult When |
|------|-------------|
| `references/ml-design-templates.md` | Working through a specific problem (recommendation, search, RAG, fraud, content mod, perception). Contains 6 fully worked designs with Mermaid diagrams. |
| `references/serving-tradeoffs.md` | Deep-diving on serving architecture, framework selection, caching, cost optimization, deployment strategies. Contains framework comparisons and latency targets by use case. |
| `references/evaluation-metrics-guide.md` | Choosing metrics, understanding metric alignment, designing A/B tests, evaluating generative AI. Contains metric decision trees and formulas. |
