clickhouse-architect

ClickHouse schema design and optimization. TRIGGERS - ClickHouse schema, compression codecs, MergeTree, ORDER BY tuning, partition key.

29 stars

byterrylica

View on GitHub Installation ↓

Best use case

clickhouse-architect is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

ClickHouse schema design and optimization. TRIGGERS - ClickHouse schema, compression codecs, MergeTree, ORDER BY tuning, partition key.

Teams using clickhouse-architect should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/clickhouse-architect/SKILL.md --create-dirs "https://raw.githubusercontent.com/terrylica/cc-skills/main/plugins/quality-tools/skills/clickhouse-architect/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/clickhouse-architect/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How clickhouse-architect Compares

Feature / Agent	clickhouse-architect	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

ClickHouse schema design and optimization. TRIGGERS - ClickHouse schema, compression codecs, MergeTree, ORDER BY tuning, partition key.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# ClickHouse Architect

<!-- ADR: 2025-12-09-clickhouse-architect-skill -->

Prescriptive schema design, compression selection, and performance optimization for ClickHouse (v24.4+). Covers both ClickHouse Cloud (SharedMergeTree) and self-hosted (ReplicatedMergeTree) deployments.

> **Self-Evolving Skill**: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.

## When to Use This Skill

Use this skill when:

- Designing ClickHouse table schemas with ORDER BY key selection
- Selecting compression codecs for column types
- Configuring partition keys for data lifecycle management
- Adding performance accelerators (projections, indexes, dictionaries)
- Auditing and optimizing existing ClickHouse schemas

## Core Methodology

### Schema Design Workflow

Follow this sequence when designing or reviewing ClickHouse schemas:

1. **Define ORDER BY key** (3-5 columns, lowest cardinality first)
2. **Select compression codecs** per column type
3. **Configure PARTITION BY** for data lifecycle management
4. **Add performance accelerators** (projections, indexes)
5. **Validate with audit queries** (see scripts/)
6. **Document with COMMENT statements** — ClickHouse table and column COMMENTs are the **single source of truth (SSoT)** for what each column means, how it was computed, and what constraints apply. No external doc, skill, or wiki supersedes the COMMENT. See [`references/schema-documentation.md`](./references/schema-documentation.md)

### ORDER BY Key Selection

The ORDER BY clause is the most critical decision in ClickHouse schema design.

**Rules**:

- Limit to 3-5 columns maximum (each additional column has diminishing returns)
- Place lowest cardinality columns first (e.g., `tenant_id` before `timestamp`)
- Include all columns used in WHERE clauses for range queries
- PRIMARY KEY must be a prefix of ORDER BY (or omit to use full ORDER BY)

**Example**:

```sql
-- Correct: Low cardinality first, 4 columns
CREATE TABLE trades (
    exchange LowCardinality(String),
    symbol LowCardinality(String),
    timestamp DateTime64(3),
    trade_id UInt64,
    price Float64,
    quantity Float64
) ENGINE = MergeTree()
ORDER BY (exchange, symbol, timestamp, trade_id);

-- Wrong: High cardinality first (10x slower queries)
ORDER BY (trade_id, timestamp, symbol, exchange);
```

### Compression Codec Quick Reference

| Column Type              | Default Codec              | Read-Heavy Alternative    | Example                                            |
| ------------------------ | -------------------------- | ------------------------- | -------------------------------------------------- |
| DateTime/DateTime64      | `CODEC(DoubleDelta, ZSTD)` | `CODEC(DoubleDelta, LZ4)` | `timestamp DateTime64(3) CODEC(DoubleDelta, ZSTD)` |
| Float prices/gauges      | `CODEC(Gorilla, ZSTD)`     | `CODEC(Gorilla, LZ4)`     | `price Float64 CODEC(Gorilla, ZSTD)`               |
| Integer counters         | `CODEC(T64, ZSTD)`         | —                         | `count UInt64 CODEC(T64, ZSTD)`                    |
| Slowly changing integers | `CODEC(Delta, ZSTD)`       | `CODEC(Delta, LZ4)`       | `version UInt32 CODEC(Delta, ZSTD)`                |
| String (low cardinality) | `LowCardinality(String)`   | —                         | `status LowCardinality(String)`                    |
| General data             | `CODEC(ZSTD(3))`           | `CODEC(LZ4)`              | Default compression level 3                        |

**When to use LZ4 over ZSTD**: LZ4 provides 1.76x faster decompression. Use LZ4 for read-heavy workloads with monotonic sequences (timestamps, counters). Use ZSTD (default) when compression ratio matters or data patterns are unknown.

**Note on codec combinations**:

Delta/DoubleDelta + Gorilla combinations are blocked by default (`allow_suspicious_codecs`) because Gorilla already performs implicit delta compression internally—combining them is **redundant**, not dangerous. A historical corruption bug (PR #45615, Jan 2023) was fixed, but the blocking remains as a best practice guardrail.

Use each codec family independently for its intended data type:

```sql
-- Correct usage
price Float64 CODEC(Gorilla, ZSTD)              -- Floats: use Gorilla
timestamp DateTime64 CODEC(DoubleDelta, ZSTD)   -- Timestamps: use DoubleDelta
timestamp DateTime64 CODEC(DoubleDelta, LZ4)    -- Read-heavy: use LZ4
```

### PARTITION BY Guidelines

PARTITION BY is for **data lifecycle management**, NOT query optimization.

**Rules**:

- Partition by time units (month, week) for TTL and data management
- Keep partition count under 1000 total across all tables
- Each partition should contain 1-300 parts maximum
- Never partition by high-cardinality columns

**Example**:

```sql
-- Correct: Monthly partitions for TTL management
PARTITION BY toYYYYMM(timestamp)

-- Wrong: Daily partitions (too many parts)
PARTITION BY toYYYYMMDD(timestamp)

-- Wrong: High-cardinality partition key
PARTITION BY user_id
```

### Anti-Patterns Checklist (v24.4+)

| Pattern                         | Severity | Modern Status      | Fix                                                               |
| ------------------------------- | -------- | ------------------ | ----------------------------------------------------------------- |
| Too many parts (>300/partition) | Critical | Still critical     | Reduce partition granularity                                      |
| Small batch inserts (<1000)     | Critical | Still critical     | Batch to 10k-100k rows                                            |
| High-cardinality first ORDER BY | Critical | Still critical     | Reorder: lowest cardinality first                                 |
| No memory limits                | High     | Still critical     | Set `max_memory_usage`                                            |
| Denormalization overuse         | High     | Still critical     | Use dictionaries + materialized views                             |
| Large JOINs                     | Medium   | **180x improved**  | Still avoid for ultra-low-latency                                 |
| Mutations (UPDATE/DELETE)       | Medium   | **1700x improved** | Use lightweight UPDATEs (v24.4+); see DELETE Strategy Guide below |

### DELETE Strategy Guide (v13.49.0+ Best Practices)

Choose the right DELETE strategy based on scope. Ranked fastest to slowest:

| Strategy                    | Syntax                                                    | Speed                        | Use When                                                        |
| --------------------------- | --------------------------------------------------------- | ---------------------------- | --------------------------------------------------------------- |
| `DROP PARTITION`            | `ALTER TABLE t DROP PARTITION (key1, key2, keyN)`         | **Instant** (metadata-only)  | Purge entire partition ranges (months, corrupt data, test data) |
| `DELETE IN PARTITION`       | `ALTER TABLE t DELETE IN PARTITION (...) WHERE condition` | **Fast** (scans 1 partition) | Targeted row removal within a known partition                   |
| `ALTER TABLE DELETE`        | `ALTER TABLE t DELETE WHERE condition`                    | **Slow** (scans all parts)   | Fallback when partition is unknown                              |
| `DELETE FROM` (lightweight) | `DELETE FROM t WHERE condition`                           | Variable                     | **ANTI-PATTERN for write pipelines** — see warning below        |

**Anti-pattern: Lightweight `DELETE FROM` before INSERT**

`DELETE FROM` sets `_row_exists=0` masks instead of physically removing rows. These ghost rows:

- Persist until ClickHouse background merge (unpredictable timing)
- Show up in queries without `FINAL` as phantom data
- Cause false anomalies in monitoring/integrity checks
- Were the root cause of months of phantom Stathera gaps in production (opendeviationbar #269)

**Use `DELETE FROM` only for**: ad-hoc data correction where ghost rows don't matter (analytics cleanup, dev/test). **Never use in write pipelines** where INSERT follows DELETE.

**All DELETE mutations should use**: `SETTINGS mutations_sync = 1` to block until completion (prevents INSERT-DELETE race conditions).

**Partition-aware DELETE tip**: If your partition key includes the columns you're filtering on (e.g., `PARTITION BY (symbol, threshold, toYYYYMM(timestamp))`), use `DELETE IN PARTITION` to scope the scan to a single partition instead of scanning all parts.

### Table Engine Selection

| Deployment          | Engine                | Use Case                        |
| ------------------- | --------------------- | ------------------------------- |
| ClickHouse Cloud    | `SharedMergeTree`     | Default for cloud deployments   |
| Self-hosted cluster | `ReplicatedMergeTree` | Multi-node with replication     |
| Self-hosted single  | `MergeTree`           | Single-node development/testing |

**Cloud (SharedMergeTree)**:

```sql
CREATE TABLE trades (...)
ENGINE = SharedMergeTree('/clickhouse/tables/{shard}/trades', '{replica}')
ORDER BY (exchange, symbol, timestamp);
```

**Self-hosted (ReplicatedMergeTree)**:

```sql
CREATE TABLE trades (...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/trades', '{replica}')
ORDER BY (exchange, symbol, timestamp);
```

## Skill Delegation Guide

<!-- ADR: 2025-12-10-clickhouse-skill-delegation -->

This skill is the **hub** for ClickHouse-related tasks. When the user's needs extend beyond schema design, invoke the related skills below.

### Delegation Decision Matrix

| User Need                                       | Invoke Skill                               | Trigger Phrases                                      |
| ----------------------------------------------- | ------------------------------------------ | ---------------------------------------------------- |
| Create database users, manage permissions       | `devops-tools:clickhouse-cloud-management` | "create user", "GRANT", "permissions", "credentials" |
| Configure DBeaver, generate connection JSON     | `devops-tools:clickhouse-pydantic-config`  | "DBeaver", "client config", "connection setup"       |
| Validate schema contracts against live database | `quality-tools:schema-e2e-validation`      | "validate schema", "Earthly E2E", "schema contract"  |

### Typical Workflow Sequence

1. **Schema Design** (THIS SKILL) → Design ORDER BY, compression, partitioning
2. **User Setup** → `clickhouse-cloud-management` (if cloud credentials needed)
3. **Client Config** → `clickhouse-pydantic-config` (generate DBeaver JSON)
4. **Validation** → `schema-e2e-validation` (CI/CD schema contracts)

### Example: Full Stack Request

**User**: "I need to design a trades table for ClickHouse Cloud and set up DBeaver to query it."

**Expected behavior**:

1. Use THIS skill for schema design
2. Invoke `clickhouse-cloud-management` for creating database user
3. Invoke `clickhouse-pydantic-config` for DBeaver configuration

## Performance Accelerators

### Projections

Create alternative sort orders that ClickHouse automatically selects:

```sql
ALTER TABLE trades ADD PROJECTION trades_by_symbol (
    SELECT * ORDER BY symbol, timestamp
);
ALTER TABLE trades MATERIALIZE PROJECTION trades_by_symbol;
```

### Materialized Views

Pre-compute aggregations for dashboard queries:

```sql
CREATE MATERIALIZED VIEW trades_hourly_mv
ENGINE = SummingMergeTree()
ORDER BY (exchange, symbol, hour)
AS SELECT
    exchange,
    symbol,
    toStartOfHour(timestamp) AS hour,
    sum(quantity) AS total_volume,
    count() AS trade_count
FROM trades
GROUP BY exchange, symbol, hour;
```

### Dictionaries

Replace JOINs with O(1) dictionary lookups for **large-scale star schemas**:

**When to use dictionaries (v24.4+)**:

- Fact tables with 100M+ rows joining dimension tables
- Dimension tables 1k-500k rows with monotonic keys
- LEFT ANY JOIN semantics required

**When JOINs are sufficient (v24.4+)**:

- Dimension tables <500 rows (JOIN overhead negligible)
- v24.4+ predicate pushdown provides 8-180x improvements
- Complex JOIN types (FULL, RIGHT, multi-condition)

**Benchmark context**: 6.6x speedup measured on Star Schema Benchmark (1.4B rows).

```sql
CREATE DICTIONARY symbol_info (
    symbol String,
    name String,
    sector String
)
PRIMARY KEY symbol
SOURCE(CLICKHOUSE(TABLE 'symbols'))
LAYOUT(FLAT())  -- Best for <500k entries with monotonic keys
LIFETIME(3600);

-- Use in queries (O(1) lookup)
SELECT
    symbol,
    dictGet('symbol_info', 'name', symbol) AS symbol_name
FROM trades;
```

## Scripts

Execute comprehensive schema audit:

```bash
clickhouse-client --multiquery < scripts/schema-audit.sql
```

The audit script checks:

- Part count per partition (threshold: 300)
- Compression ratios by column
- Query performance patterns
- Replication lag (if applicable)
- Memory usage patterns

## Additional Resources

### Reference Files

| Reference                                                                                  | Content                                        |
| ------------------------------------------------------------------------------------------ | ---------------------------------------------- |
| [`references/schema-design-workflow.md`](./references/schema-design-workflow.md)           | Complete workflow with examples                |
| [`references/compression-codec-selection.md`](./references/compression-codec-selection.md) | Decision tree + benchmarks                     |
| [`references/anti-patterns-and-fixes.md`](./references/anti-patterns-and-fixes.md)         | 13 deadly sins + v24.4+ status                 |
| [`references/audit-and-diagnostics.md`](./references/audit-and-diagnostics.md)             | Query interpretation guide                     |
| [`references/idiomatic-architecture.md`](./references/idiomatic-architecture.md)           | Parameterized views, dictionaries, dedup       |
| [`references/schema-documentation.md`](./references/schema-documentation.md)               | COMMENT patterns + naming for AI understanding |
| [`references/cache-schema-evolution.md`](./references/cache-schema-evolution.md)           | Cache invalidation + schema evolution patterns |

### External Documentation

- [ClickHouse Best Practices](https://clickhouse.com/docs/best-practices)
- [Altinity Knowledge Base](https://kb.altinity.com/)
- [ClickHouse Blog](https://clickhouse.com/blog)

## Python Driver Policy

<!-- ADR: 2025-12-10-clickhouse-python-driver-policy -->

**Use `clickhouse-connect` (official) for all Python integrations.**

```python
# ✅ RECOMMENDED: clickhouse-connect (official, HTTP)
import clickhouse_connect

client = clickhouse_connect.get_client(
    host='localhost',
    port=8123,  # HTTP port
    username='default',
    password=''
)
result = client.query("SELECT * FROM trades LIMIT 1000")
df = client.query_df("SELECT * FROM trades")  # Pandas integration
```

### Why NOT `clickhouse-driver`

| Factor          | clickhouse-connect | clickhouse-driver   |
| --------------- | ------------------ | ------------------- |
| Maintainer      | ClickHouse Inc.    | Solo developer      |
| Weekly commits  | Yes (active)       | Sparse (months)     |
| Open issues     | 41 (addressed)     | 76 (accumulating)   |
| Downloads/week  | 2.7M               | 1.5M                |
| Bus factor risk | Low (company)      | **High (1 person)** |

**Do NOT use `clickhouse-driver`** despite its ~26% speed advantage for large exports. The maintenance risk outweighs performance gains:

- Single maintainer (mymarilyn) with no succession plan
- Issues accumulating without response
- Risk of abandonment breaks production code

**Exception**: Only consider `clickhouse-driver` if you have extreme performance requirements (exporting millions of rows) AND accept the maintenance risk.

## ClickHouse COMMENT = Single Source of Truth

**Every ClickHouse table and column MUST have a COMMENT that fully documents its meaning, computation method, and constraints.** The COMMENT is the SSoT — no external document, skill, or wiki supersedes it.

### Why

- ClickHouse COMMENTs have **no length limit**, support newlines, URLs, and unicode
- **Zero performance impact** on queries (pure metadata, never in data path)
- Visible via `DESCRIBE table`, `SHOW CREATE TABLE`, `system.columns`
- Survives schema migrations (preserved through ALTER operations)

### What to Include in COMMENTs

- **Column purpose** in plain English
- **Computation formula** (if derived/computed)
- **Unit** (seconds, milliseconds, bps, ratio)
- **Valid range** or enum values
- **Anti-patterns** (what NOT to do with this column)
- **GitHub issue link** for provenance
- **Source script** that populates the column

### Example

```sql
ALTER TABLE t COMMENT COLUMN session_label
'STRICT session label. 8 values: sydney_only, tokyo_only, ...
Only set when ENTIRE bar (open→close) falls within one session.
cross_session = bar spans boundary. Use WHERE is_pure_session=1.
GitHub: https://github.com/org/repo/issues/54
Source: scripts/populate-sessions/populate_v3.py';
```

### Anti-Pattern

**NEVER** create a ClickHouse column without a COMMENT. A column without documentation is a column that will be misused.

## Related Skills

| Skill                                      | Purpose                       |
| ------------------------------------------ | ----------------------------- |
| `devops-tools:clickhouse-cloud-management` | User/permission management    |
| `devops-tools:clickhouse-pydantic-config`  | DBeaver connection generation |
| `quality-tools:schema-e2e-validation`      | YAML schema contracts         |
| `quality-tools:multi-agent-e2e-validation` | Database migration validation |

---

## Troubleshooting

| Issue                        | Cause                          | Solution                                         |
| ---------------------------- | ------------------------------ | ------------------------------------------------ |
| Too many parts               | Over-partitioned               | Reduce partition granularity (monthly not daily) |
| Slow queries                 | Wrong ORDER BY order           | Put lowest cardinality columns first             |
| High memory usage            | No memory limits set           | Configure max_memory_usage setting               |
| Codec error on Delta+Gorilla | Suspicious codec combination   | Use each codec family independently              |
| Projection not used          | Optimizer chose different plan | Check EXPLAIN to verify projection selection     |
| Dictionary stale             | Lifetime expired               | Increase LIFETIME or trigger refresh             |
| Replication lag              | Part merges falling behind     | Check merge_tree settings, add resources         |
| INSERT too slow              | Small batch sizes              | Batch to 10k-100k rows per INSERT                |


## Post-Execution Reflection

After this skill completes, check before closing:

1. **Did the command succeed?** — If not, fix the instruction or error table that caused the failure.
2. **Did parameters or output change?** — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
3. **Was a workaround needed?** — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.

Only update if the issue is real and reproducible — not speculative.

Related Skills

skill-architecture

from terrylica/cc-skills

Create new skills, modify existing skills, and understand skill architecture. Use when users want to create a skill from scratch, learn YAML frontmatter standards, validate skill structure, understand progressive disclosure patterns, or choose between structural patterns (workflow, task, reference, capabilities, suite). Also use for troubleshooting skills that don't trigger correctly, optimizing skill descriptions, or learning best practices for writing effective skill instructions.

realtime-audio-architecture

from terrylica/cc-skills

Real-time audio playback patterns for macOS Apple Silicon. TRIGGERS - audio jitter, tts choppy, sounddevice, afplay jitter, audio architecture, playback glitch, GIL contention audio, launchd audio priority, wrong audio device, airpods, bluetooth audio, device switching.

adr-graph-easy-architect

from terrylica/cc-skills

ASCII architecture diagrams for ADRs via graph-easy. TRIGGERS - ADR diagram, architecture diagram, ASCII diagram.

ml-data-pipeline-architecture

from terrylica/cc-skills

Patterns for efficient ML data pipelines using Polars, Arrow, and ClickHouse. TRIGGERS - data pipeline, polars vs pandas, arrow format, clickhouse ml, efficient loading, zero-copy, memory optimization.

clickhouse-pydantic-config

from terrylica/cc-skills

Generate DBeaver config from Pydantic ClickHouse models. TRIGGERS - DBeaver config, ClickHouse connection, database client config.

clickhouse-cloud-management

from terrylica/cc-skills

ClickHouse Cloud user and permission management. TRIGGERS - create ClickHouse user, ClickHouse permissions, ClickHouse Cloud credentials.

voice-quality-audition

from terrylica/cc-skills

Audition Kokoro TTS voices to compare quality and grade. TRIGGERS - audition voices, kokoro voices, voice comparison, tts voice, voice quality, compare voices.

settings-and-tuning

from terrylica/cc-skills

Configure TTS voices, speed, timeouts, queue depth, and bot settings. TRIGGERS - configure tts, change voice, tts speed, queue depth, tts timeout, bot config, tune settings, adjust parameters.

full-stack-bootstrap

from terrylica/cc-skills

One-time bootstrap for Kokoro TTS engine, Telegram bot, and BotFather setup. TRIGGERS - setup tts, install kokoro, botfather, bootstrap tts-tg-sync, configure telegram bot, full stack setup.

diagnostic-issue-resolver

from terrylica/cc-skills

Diagnose and resolve TTS and Telegram bot issues. TRIGGERS - tts not working, bot not responding, kokoro error, audio not playing, lock stuck, telegram bot troubleshoot, diagnose issue.

component-version-upgrade

from terrylica/cc-skills

Upgrade Kokoro model, bot dependencies, or TTS components. TRIGGERS - upgrade kokoro, update model, upgrade bot, update dependencies, version bump, component update.

clean-component-removal

from terrylica/cc-skills

Remove TTS and Telegram sync components cleanly. TRIGGERS - uninstall tts, remove telegram bot, uninstall kokoro, clean tts, teardown, component removal.