nw-sd-framework
4-step system design framework with back-of-envelope estimation, scaling ladder, and common pitfalls
Best use case
nw-sd-framework is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using nw-sd-framework can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/nw-sd-framework/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How nw-sd-framework Compares
| Feature / Agent | nw-sd-framework | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It provides a 4-step system design framework with back-of-envelope estimation, a scaling ladder, and common pitfalls.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# System Design Framework

## The 4-Step Process

Every system design follows this structure. Skipping steps is the top mistake.

### Step 1: Understand the Problem and Establish Design Scope (3-10 min)

Narrow an impossibly broad question into a tractable problem.

**Ask about**: users and scale | most important features | read/write ratio | non-functional requirements (latency, availability, consistency) | existing infrastructure | special constraints (mobile-first, offline, regulatory)

**Produce**: functional requirements (3-5 bullets) | non-functional requirements (scale, latency, availability, consistency model) | capacity estimation (QPS, storage, bandwidth)

**Red flags if skipped**: designing a system nobody asked for | over-engineering for imaginary scale | missing critical constraints (GDPR, real-time)

### Step 2: Propose High-Level Design and Get Buy-In (10-15 min)

Sketch the big picture. Validate before diving deep.

**Do**: draw architecture diagram (clients, servers, databases, caches, queues) | define API contract (REST/GraphQL/gRPC -- key endpoints) | design data model (entities, relationships, access patterns) | walk through 1-2 core use cases end-to-end | get buy-in: "Does this make sense before I go deeper?"

**API patterns**: RESTful for CRUD-heavy | GraphQL for flexible client queries | gRPC for internal service-to-service | WebSocket/SSE for real-time

**Data model**: SQL vs NoSQL based on access patterns, not hype | denormalization trade-offs | partitioning key selection (directly impacts scalability)

### Step 3: Design Deep Dive (10-25 min)

Go deep on 2-3 components.

**Choose**: most technically challenging | most interesting trade-offs | bottleneck components (highest load, most failure-prone)

**Depth means**: specific algorithms (consistent hashing, Bloom filters) | failure modes and handling | scaling strategy per component | data flow with edge cases | monitoring and operational concerns

### Step 4: Wrap Up (3-5 min)

**Cover**: summarize design in 2-3 sentences | identify known bottlenecks | what you'd improve with more time | operational concerns (monitoring, alerting, deployment) | future enhancements

**Avoid**: introducing entirely new components at this stage | second-guessing your design

---

## Back-of-Envelope Estimation

### Powers of 2 Reference

| Power of 2 | Approximate value | Byte unit |
|------------|-------------------|-----------|
| 10 | 1 Thousand | 1 KB |
| 20 | 1 Million | 1 MB |
| 30 | 1 Billion | 1 GB |
| 40 | 1 Trillion | 1 TB |
| 50 | 1 Quadrillion | 1 PB |

### Latency Numbers Every Engineer Should Know

| Operation | Latency |
|-----------|---------|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 7 ns |
| Main memory reference | 100 ns |
| Compress 1KB (Zippy) | 10 us |
| Send 2KB over 1 Gbps | 20 us |
| Read 1 MB from memory | 250 us |
| Datacenter round trip | 500 us |
| Disk seek | 10 ms |
| Read 1 MB from network | 10 ms |
| Read 1 MB from disk | 30 ms |
| CA to Netherlands round trip | 150 ms |

**Key takeaways**: memory fast, disk slow -- cache aggressively | compress before network send | inter-datacenter trips expensive -- minimize cross-region calls

### Common Estimation Patterns

**DAU to QPS**: `QPS = DAU * actions_per_user / 86400` | `Peak QPS = QPS * 2` (or `* 3` for spiky traffic)

**Storage**: `daily = DAU * actions * avg_size` | `yearly = daily * 365` | `5-year = yearly * 5`

**Bandwidth**: `QPS * average_response_size`

**Servers**: `Peak QPS / QPS_per_server`, where `QPS_per_server` is roughly: CPU-bound ~hundreds | IO-bound with cache ~thousands | static content ~tens of thousands

### Estimation Example: Twitter-like Service

```
150M DAU, 2 tweets/day, 10 reads/day, ~10% of tweets with media
Write QPS = 150M * 2 / 86400 ~ 3,500
Read QPS  = 150M * 10 / 86400 ~ 17,000; Peak (3x) ~ 50,000
Storage: 300M tweets * 1KB + 30M media * 500KB ~ 15.3 TB/day
```

---

## Scaling Ladder

Each step solves a specific bottleneck. Never introduce a component without articulating which bottleneck it addresses.

1. **Load balancer** -- distribute traffic across web servers
2. **Database replication** -- master-slave for read scaling
3. **Cache layer** -- reduce database load (Redis/Memcached)
4. **CDN** -- serve static content from edge
5. **Stateless web tier** -- move session state to shared store
6. **Database sharding** -- horizontal partitioning for write scaling
7. **Message queue** -- decouple components, handle spikes
8. **Logging, metrics, monitoring** -- observability at scale
9. **Multiple data centers** -- geographic redundancy and latency reduction

---

## Common Pitfalls

1. **Jumping to solutions** -- designing before understanding requirements
2. **Over-engineering** -- adding components for imaginary scale
3. **Ignoring trade-offs** -- every choice has a cost; name it
4. **SPOF blindness** -- always ask "what if this dies?"
5. **Neglecting data** -- the data model drives everything
6. **Forgetting operations** -- a system you can't monitor is one you can't run
7. **Not doing math** -- gut feelings are wrong; estimates keep you honest
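
The estimation patterns in the Back-of-Envelope Estimation section can be sketched in Python as a quick sanity check. This is a minimal sketch and not part of the skill itself: the function names, the `Workload` fields, the decimal byte sizes (1 KB = 1,000 bytes), and the 10% media share (inferred from the 30M media items in the Twitter-like example) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Workload:
    """Illustrative inputs; every field name here is hypothetical."""
    dau: int                  # daily active users
    writes_per_user: float    # write actions per user per day
    reads_per_user: float     # read actions per user per day
    avg_write_bytes: int      # average size of one stored write
    media_fraction: float     # assumed share of writes that carry media
    avg_media_bytes: int      # average size of one media object


def average_qps(dau: int, actions_per_user: float) -> float:
    """QPS = DAU * actions_per_user / 86400."""
    return dau * actions_per_user / SECONDS_PER_DAY


def peak_qps(avg_qps: float, multiplier: float = 2.0) -> float:
    """Peak QPS = QPS * 2 by default; pass 3.0 for spiky traffic."""
    return avg_qps * multiplier


def daily_storage_bytes(w: Workload) -> float:
    """daily = DAU * actions * avg_size, with media counted as a second term."""
    writes = w.dau * w.writes_per_user
    return writes * w.avg_write_bytes + writes * w.media_fraction * w.avg_media_bytes


def bandwidth_bytes_per_sec(qps: float, avg_response_bytes: int) -> float:
    """Bandwidth = QPS * average_response_size."""
    return qps * avg_response_bytes


def servers_needed(peak: float, qps_per_server: float) -> int:
    """Servers = Peak QPS / QPS_per_server, rounded up to whole machines."""
    return math.ceil(peak / qps_per_server)


if __name__ == "__main__":
    # Numbers from the Twitter-like example; the 10% media share is inferred.
    w = Workload(
        dau=150_000_000,
        writes_per_user=2,
        reads_per_user=10,
        avg_write_bytes=1_000,
        media_fraction=0.10,
        avg_media_bytes=500_000,
    )
    write_qps = average_qps(w.dau, w.writes_per_user)
    read_qps = average_qps(w.dau, w.reads_per_user)
    read_peak = peak_qps(read_qps, multiplier=3.0)

    print(f"write QPS ~ {write_qps:,.0f}")                              # 3,472
    print(f"read QPS  ~ {read_qps:,.0f}")                               # 17,361
    print(f"read peak ~ {read_peak:,.0f}")                              # 52,083
    print(f"storage   ~ {daily_storage_bytes(w) / 1e12:.1f} TB/day")    # 15.3
    # 1,000 QPS per server is an assumed figure for an IO-bound tier with cache.
    print(f"servers   ~ {servers_needed(read_peak, 1_000)}")            # 53
```

The exact results (3,472 write QPS, 52,083 peak read QPS) differ slightly from the rounded figures in the example (~3,500 and ~50,000). That is expected: back-of-envelope estimates aim for the right order of magnitude, not precision.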
Related Skills
nw-quality-framework
Quality gates - 11 commit readiness gates, build/test protocol, validation checkpoints, and quality metrics
nw-post-mortem-framework
Blameless post-mortem structure, incident timeline reconstruction, response evaluation, and organizational learning
nw-outcome-kpi-framework
Outcome KPI definition methodology - synthesizes Who Does What By How Much (Gothelf/Seiden), Running Lean (Maurya), and Measure What Matters (Doerr) into a practical framework for measurable outcome KPIs
nw-divio-framework
DIVIO/Diataxis four-quadrant documentation framework - type definitions, classification decision tree, and signal catalog
nw-ux-web-patterns
Web UI design patterns for product owners. Load when designing web application interfaces, writing web-specific acceptance criteria, or evaluating responsive designs.
nw-ux-tui-patterns
Terminal UI and CLI design patterns for product owners. Load when designing command-line tools, interactive terminal applications, or writing CLI-specific acceptance criteria.
nw-ux-principles
Core UX principles for product owners. Load when evaluating interface designs, writing acceptance criteria with UX requirements, or reviewing wireframes and mockups.
nw-ux-emotional-design
Emotional design and delight patterns for product owners. Load when designing onboarding flows, empty states, first-run experiences, or evaluating the emotional quality of an interface.
nw-ux-desktop-patterns
Desktop application UI patterns for product owners. Load when designing native or cross-platform desktop applications, writing desktop-specific acceptance criteria, or evaluating panel layouts and keyboard workflows.
nw-user-story-mapping
User story mapping for backlog management and outcome-based prioritization. Load during Phase 2.5 (User Story Mapping) to produce story-map.md and prioritization.md.
nw-tr-review-criteria
Review dimensions and scoring for root cause analysis quality assessment
nw-tlaplus-verification
TLA+ formal verification for design correctness and PBT pipeline integration