nw-data-architecture-patterns

Data architecture patterns (warehouse, lake, lakehouse, mesh), ETL/ELT pipelines, streaming architectures, scaling strategies, and schema design patterns

322 stars

bynWave-ai

View on GitHub Installation ↓

Best use case

nw-data-architecture-patterns is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Data architecture patterns (warehouse, lake, lakehouse, mesh), ETL/ELT pipelines, streaming architectures, scaling strategies, and schema design patterns

Teams using nw-data-architecture-patterns should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/nw-data-architecture-patterns/SKILL.md --create-dirs "https://raw.githubusercontent.com/nWave-ai/nWave/main/nWave/skills/nw-data-architecture-patterns/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/nw-data-architecture-patterns/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How nw-data-architecture-patterns Compares

Feature / Agent	nw-data-architecture-patterns	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Data architecture patterns (warehouse, lake, lakehouse, mesh), ETL/ELT pipelines, streaming architectures, scaling strategies, and schema design patterns

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Data Architecture Patterns

## Architecture Selection Decision Tree

Structured only -> **Data Warehouse** | Mixed + SQL analytics -> **Data Lakehouse** | Mixed + ML-primary -> **Data Lake** | Large org + autonomous domains -> **Data Mesh**

## Data Warehouse

Schema: structured, schema-on-write | Data: tables, rows, columns | Governance: centralized | Query: SQL analytics, BI | Architecture: centralized single source of truth

### Schema Patterns

**Star Schema**: Central fact table (measures) surrounded by denormalized dimension tables. Best for BI dashboards, standard reporting.

**Snowflake Schema**: Normalized dimensions (dimensions reference other dimensions). Reduces storage, increases JOIN complexity. Best when storage cost matters more than query speed.

### Kimball vs Inmon

**Kimball (Bottom-Up)**: Build data marts first, integrate later | Star schema, business-process driven | Faster initial delivery | Best for quick wins, department-level analytics

**Inmon (Top-Down)**: Build enterprise DW first, derive data marts | Normalized 3NF enterprise model | Higher upfront effort | Best for large enterprises needing single source of truth

Technology: Snowflake | Amazon Redshift | Google BigQuery | Azure Synapse Analytics

## Data Lake

Schema-on-read, flexible | All formats (structured, semi-structured, unstructured) | Raw data in native format | Query via Athena, Spark SQL, PySpark, Pandas | Risk: "data swamp" without governance

### Organization
Zones: **raw** (landing, original format) -> **curated** (cleaned, validated) -> **processed** (transformed for use cases) -> **archive** (cold storage)

### Anti-Patterns
- No metadata catalog -> undiscoverable data
- No access controls -> security/compliance risk
- No data quality checks -> garbage in/out
- No retention policy -> unbounded cost growth

Technology: S3 + Athena/Glue | Azure Data Lake Storage + Synapse | HDFS + Hive

## Data Lakehouse

Combines warehouse reliability with lake flexibility | Schema enforcement on write with evolution support | ACID transactions on lake storage | Supports both BI/SQL and ML/data science workloads

### Medallion Architecture (Bronze / Silver / Gold)

**Bronze**: Raw data as-is, append-only for auditability, partitioned by ingestion date, schema-on-read
**Silver**: Quality rules (null checks, range validation, referential integrity) | Deduplication on business keys | Schema enforced | SCD applied
**Gold**: Business-level aggregations | Dimensional models (star/snowflake) | Pre-computed metrics/KPIs | Optimized for BI/reporting

Technology: Databricks (Delta Lake) | Apache Iceberg | Apache Hudi

## Data Mesh

### Core Principles (Martin Fowler)
1. **Domain-oriented ownership**: Data owned by domain teams, not central
2. **Data as a product**: Each domain publishes discoverable, trustworthy, self-describing data products
3. **Self-serve data platform**: Infrastructure team provides platform for domain teams
4. **Federated computational governance**: Global standards with domain autonomy

**Use when**: Large org with autonomous domain teams | Central data team is bottleneck | Domain expertise needed | Platform engineering maturity exists
**Avoid when**: Small team (<50 engineers) | Simple data needs | No platform capability | Unclear domain boundaries

## ETL vs ELT Pipeline Design

### ETL (Extract-Transform-Load)
Transform before loading via dedicated engine (Informatica, Talend, SSIS). Best for complex transforms, constrained targets, regulatory requirements. Scaling limited by transform engine.

### ELT (Extract-Load-Transform)
Load raw first, transform using target compute (dbt, Snowflake SQL, BigQuery SQL). Best for cloud DWs with elastic compute, preserving raw data. Scales with target system.

### Pipeline Design Principles
- **Idempotency**: Re-running produces same result (use MERGE/upsert, not INSERT)
- **Incremental processing**: Process only new/changed data (watermarks, CDC)
- **Schema evolution**: Handle added/removed columns gracefully (schema registry)
- **Data quality gates**: Validate between stages (null rates, row counts, value ranges)
- **Observability**: Log metrics (rows processed, duration, errors, freshness)

### Orchestration
Apache Airflow: DAG-based, Python-native, wide adoption | Prefect: modern, dynamic workflows | Dagster: software-defined assets

## Streaming Architecture

### Apache Kafka
Distributed event streaming platform. Concepts: topics, partitions, consumer groups, offsets. At-least-once delivery (exactly-once with transactions). Use as event bus, message broker, stream storage.

### Apache Flink
Stateful stream processing engine. Concepts: DataStreams, windows (tumbling, sliding, session), state management. Exactly-once with checkpointing. Common pattern: Sources -> Kafka (durable event buffer) -> Flink (stateful compute) -> Sinks.

### Architecture Selection
**Streaming**: real-time dashboards, fraud detection, IoT, event-driven | **Batch**: overnight reporting, historical analysis, ML training | **Lambda**: parallel batch + stream (complex, prefer Kappa) | **Kappa**: stream-only, reprocess from Kafka log (simpler)

## Scaling Strategies

### Vertical (Scale Up)
Add CPU/RAM/storage to existing server | Simpler ops, no app changes | Hard limit: largest hardware | Use first for moderate growth

### Horizontal (Scale Out)

**Read Replicas**: Replicate to read-only copies | Route reads to replicas, writes to primary | Trade-off: replication lag (eventual consistency) | Use for read-heavy workloads

**Partitioning (Single Server)**: Range (date, alphabetical) | List (region, category) | Hash (even distribution) | Benefits: query pruning, maintenance (drop old partitions)

**Sharding (Multiple Servers)**: Distribute data across DB instances by shard key | Strategies: range-based, hash-based, directory-based, geographic

**Shard Key Selection** (most impactful decision):
- High cardinality for even distribution
- Even access frequency to avoid hot shards
- Query alignment: most queries target single shard
- Avoid monotonically increasing keys (hot spots)

**Challenges**: Cross-shard queries need scatter-gather | Distributed transactions (2PC) complex/slow | Resharding expensive | App complexity increases

### Scaling Decision Guide
Not exceeding single server -> optimize queries/indexes first | Read-heavy -> add read replicas | Write-heavy + partitionable -> partition then shard | Write-heavy + not partitionable -> write-optimized DBs (Cassandra, DynamoDB)

## Normalization vs Denormalization

**Normalize (3NF)**: OLTP with frequent writes | Data integrity paramount | Storage optimization | Write > read performance
**Denormalize**: OLAP/analytics (star schema) | Read-heavy, predictable queries | Query > write performance | Acceptable redundancy

**Practical approach**: Start normalized for transactional tables | Add denormalized/materialized views for reporting | Denormalize selectively based on measured performance | Document decisions and rationale

Related Skills

nw-ux-web-patterns

322

from nWave-ai/nWave

Web UI design patterns for product owners. Load when designing web application interfaces, writing web-specific acceptance criteria, or evaluating responsive designs.

nw-ux-tui-patterns

322

from nWave-ai/nWave

Terminal UI and CLI design patterns for product owners. Load when designing command-line tools, interactive terminal applications, or writing CLI-specific acceptance criteria.

nw-ux-desktop-patterns

322

from nWave-ai/nWave

Desktop application UI patterns for product owners. Load when designing native or cross-platform desktop applications, writing desktop-specific acceptance criteria, or evaluating panel layouts and keyboard workflows.

nw-sd-patterns

322

from nWave-ai/nWave

Core distributed systems patterns - load balancing, caching, sharding, consistent hashing, message queues, rate limiting, CDN, Bloom filters, ID generation, replication, conflict resolution, CAP theorem

nw-sd-patterns-advanced

322

from nWave-ai/nWave

Advanced distributed patterns - event sourcing, CQRS, saga, stream processing, append-only log, exactly-once delivery, sequencer, double-entry ledger, erasure coding, order book, watermarks

nw-fp-hexagonal-architecture

322

from nWave-ai/nWave

Hexagonal architecture patterns with pure core and side-effect shell for functional codebases

nw-design-patterns

322

from nWave-ai/nWave

7 agentic design patterns with decision tree for choosing the right pattern for each agent type

nw-database-technology-selection

322

from nWave-ai/nWave

Database comparison catalogs, RDBMS vs NoSQL selection criteria, CAP/ACID/BASE theory, OLTP vs OLAP, and technology-specific characteristics

nw-command-design-patterns

322

from nWave-ai/nWave

Best practices for command definition files - size targets, declarative template, anti-patterns, and canonical examples based on research evidence

nw-architecture-patterns

322

from nWave-ai/nWave

Comprehensive architecture patterns, methodologies, quality frameworks, and evaluation methods for solution architects. Load when designing system architecture or selecting patterns.

nw-ux-principles

322

from nWave-ai/nWave

Core UX principles for product owners. Load when evaluating interface designs, writing acceptance criteria with UX requirements, or reviewing wireframes and mockups.

nw-ux-emotional-design

322

from nWave-ai/nWave

Emotional design and delight patterns for product owners. Load when designing onboarding flows, empty states, first-run experiences, or evaluating the emotional quality of an interface.