using-chdb

Guide for using chdb, an in-process SQL OLAP engine powered by ClickHouse. Covers pandas-compatible DataStore API, 16+ data sources (MySQL, PostgreSQL, S3, ClickHouse, MongoDB, Iceberg, Delta Lake, etc.), 10+ file formats, and cross-source joins. Use when the user wants to analyze data, query files, join multiple data sources, or build data integration pipelines.


by diegosouzapw


Best use case

using-chdb is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using using-chdb can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/using-chdb/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/development/using-chdb/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it at .claude/skills/using-chdb/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How using-chdb Compares

Feature / Agent	using-chdb	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

It guides an AI agent in using chdb, an in-process SQL OLAP engine powered by ClickHouse: the pandas-compatible DataStore API, querying files and databases, and joining multiple data sources.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# chdb — Pandas-Compatible Multi-Source Data Analytics

chdb is an in-process ClickHouse engine for Python. Write familiar pandas code, query 16+ data sources and 10+ file formats, join them freely — no server, no ETL, no data movement.

```bash
pip install chdb
```

## Why chdb

- **Drop-in pandas replacement**: `import datastore as pd` — same API, ClickHouse performance
- **16+ data sources as first-class citizens**: among them local files, S3, GCS, Azure, HDFS, MySQL, PostgreSQL, ClickHouse, MongoDB, SQLite, Redis, Iceberg, Delta Lake, Hudi, and HTTP URLs
- **10+ file formats**: among them Parquet, CSV, TSV, JSON, JSONLines, Arrow, ORC, Avro, and XML, auto-detected by file extension
- **Cross-source joins**: join a MySQL table with an S3 Parquet file and a local CSV in one expression
- **Lazy evaluation**: operations compile to optimized SQL, execute only when results are needed
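
To make the lazy-evaluation point concrete, here is a minimal toy sketch of the general idea: chained operations only record a plan, and SQL text is produced when it is finally requested. `LazyQuery` and its methods are hypothetical names invented for this illustration. They are not part of chdb or DataStore, and the SQL chdb actually generates may look different.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class LazyQuery:
    """Hypothetical toy builder: every method returns a new plan, nothing executes."""

    source: str
    filters: Tuple[str, ...] = ()
    order_by: Optional[str] = None
    limit: Optional[int] = None

    def where(self, condition: str) -> "LazyQuery":
        # Record the filter; no data is touched here.
        return replace(self, filters=self.filters + (condition,))

    def sort(self, column: str, ascending: bool = True) -> "LazyQuery":
        direction = "ASC" if ascending else "DESC"
        return replace(self, order_by=f"{column} {direction}")

    def head(self, n: int) -> "LazyQuery":
        return replace(self, limit=n)

    def to_sql(self) -> str:
        # Only at this point is the recorded plan turned into SQL text.
        sql = f"SELECT * FROM {self.source}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.filters)
        if self.order_by:
            sql += f" ORDER BY {self.order_by}"
        if self.limit is not None:
            sql += f" LIMIT {self.limit}"
        return sql


base = LazyQuery("file('sales.parquet', Parquet)")
plan = (
    base.where("revenue > 1000")
    .where("status = 'active'")
    .sort("revenue", ascending=False)
    .head(10)
)
sql_text = plan.to_sql()
print(sql_text)
```

Because each step returns a new immutable plan, `base` can be reused for several different queries, and an engine is free to optimize the whole plan before running anything.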

## DataStore: Pandas API on Any Data Source

### Connecting to data — always the same pattern

```python
from datastore import DataStore

# Local files (format auto-detected: .parquet, .csv, .json, .arrow, .orc, .avro, .tsv, .xml)
ds = DataStore.from_file("sales.parquet")
ds = DataStore.from_file("logs/*.csv")          # glob patterns

# Cloud storage
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
ds = DataStore.from_s3("s3://private/data.parquet", access_key_id="KEY", secret_access_key="SECRET")
ds = DataStore.from_gcs("gs://bucket/data.parquet", nosign=True)
ds = DataStore.from_azure(connection_string="...", container="data", path="events.parquet")
ds = DataStore.from_hdfs("hdfs://namenode:9000/warehouse/*.parquet")
ds = DataStore.from_url("https://example.com/data.csv")

# Databases
ds = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
ds = DataStore.from_postgresql(host="pg:5432", database="analytics", table="events", user="user", password="pass")
ds = DataStore.from_clickhouse(host="ch:9000", database="logs", table="access_log")
ds = DataStore.from_mongodb(host="mongo:27017", database="app", collection="users", user="user", password="pass")
ds = DataStore.from_sqlite("/data/local.db", "users")

# Data lake formats
ds = DataStore.from_iceberg("s3://warehouse/iceberg/events", access_key_id="KEY", secret_access_key="SECRET")
ds = DataStore.from_delta("s3://warehouse/delta/transactions", access_key_id="KEY", secret_access_key="SECRET")
ds = DataStore.from_hudi("s3://warehouse/hudi/logs", access_key_id="KEY", secret_access_key="SECRET")

# URI shorthand — auto-detect source and format
ds = DataStore.uri("s3://bucket/data.parquet?nosign=true")
ds = DataStore.uri("mysql://root:pass@db:3306/shop/orders")
ds = DataStore.uri("postgresql://user:pass@pg:5432/analytics/events")
ds = DataStore.uri("clickhouse://ch:9440/logs/access_log?user=default")
ds = DataStore.uri("mongodb://user:pass@mongo:27017/app.users")
ds = DataStore.uri("deltalake:///data/delta/events")

# In-memory from dict or DataFrame
ds = DataStore({"name": ["Alice", "Bob"], "age": [25, 30]})
```
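
The URI shorthand packs the same pieces as the keyword arguments (credentials, host, port, database, table, options) into one string. The sketch below uses only Python's standard `urllib.parse` to show how such a URI decomposes. `describe_uri` is a hypothetical helper for illustration, not DataStore's parser, and how DataStore maps the path segments to a database and table is an assumption read off the examples above.

```python
from urllib.parse import parse_qs, urlsplit


def describe_uri(uri: str) -> dict:
    """Split a data-source URI into its generic components (illustration only)."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # Non-empty path segments, e.g. ["shop", "orders"].
        "path": [segment for segment in parts.path.split("/") if segment],
        # Keep the first value of each query-string option.
        "options": {key: values[0] for key, values in parse_qs(parts.query).items()},
    }


mysql_parts = describe_uri("mysql://root:pass@db:3306/shop/orders")
s3_parts = describe_uri("s3://bucket/data.parquet?nosign=true")

print(mysql_parts["host"], mysql_parts["port"], mysql_parts["path"])
print(s3_parts["host"], s3_parts["path"], s3_parts["options"])
```

Credentials embedded in a URI tend to end up in logs and shell history, so the keyword-argument form may be the safer choice for anything beyond local experiments.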

### Once connected, always the same pandas API

No matter where the data lives, the operations are identical:

```python
# Filter
result = ds[ds["age"] > 25]
result = ds[(ds["status"] == "active") & (ds["revenue"] > 1000)]

# Select columns
result = ds[["name", "city", "revenue"]]

# Sort
result = ds.sort_values("revenue", ascending=False)

# GroupBy + aggregation
result = ds.groupby("department")["salary"].mean()
result = ds.groupby(["region", "product"]).agg({"revenue": "sum", "quantity": "mean"})

# Add computed columns
result = ds.assign(profit=ds["revenue"] - ds["cost"], margin=lambda x: x["profit"] / x["revenue"])

# String and datetime accessors
ds["name"].str.upper()
ds["email"].str.contains("@gmail")
ds["order_date"].dt.year
ds["order_date"].dt.month

# Inspection
ds.columns         # column names
ds.shape           # (rows, cols)
ds.head(10)        # first 10 rows
ds.describe()      # statistics
ds.to_sql()        # view the generated SQL behind the scenes
```

### Cross-source joins — the killer feature

Join data across completely different sources with one expression:

```python
from datastore import DataStore

# Three different sources
customers = DataStore.from_mysql(host="db:3306", database="crm", table="customers", user="root", password="pass")
orders = DataStore.from_file("orders.parquet")
reviews = DataStore.from_s3("s3://feedback/reviews.parquet", nosign=True)

# Join them all with pandas syntax
result = (orders
    .join(customers, left_on="customer_id", right_on="id")
    .join(reviews, on="product_id")
    .groupby("country")
    .agg({"amount": "sum", "rating": "mean", "review_id": "count"})
    .sort_values("amount", ascending=False)
)
print(result)
```
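
To see what the join chain above computes, the following sketch reproduces the same logic (two inner joins, a group-by, three aggregates, a descending sort) on tiny made-up tables. It uses Python's built-in `sqlite3` purely as a stand-in engine so that it runs anywhere. It is not chdb, and the sample rows and the `total_amount`, `avg_rating`, and `review_count` aliases are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, country TEXT);
    CREATE TABLE orders (customer_id INTEGER, product_id INTEGER, amount REAL);
    CREATE TABLE reviews (review_id INTEGER, product_id INTEGER, rating REAL);

    INSERT INTO customers VALUES (1, 'BR'), (2, 'US');
    INSERT INTO orders VALUES (1, 10, 100.0), (2, 10, 50.0), (2, 20, 30.0);
    INSERT INTO reviews VALUES (100, 10, 4.0), (101, 20, 2.0);
    """
)

# Same shape as the pandas-style chain: join, join, group, aggregate, sort.
rows = conn.execute(
    """
    SELECT c.country,
           SUM(o.amount)      AS total_amount,
           AVG(r.rating)      AS avg_rating,
           COUNT(r.review_id) AS review_count
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.id
    JOIN reviews   AS r ON o.product_id = r.product_id
    GROUP BY c.country
    ORDER BY total_amount DESC
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
# ('BR', 100.0, 4.0, 1)
# ('US', 80.0, 3.0, 2)
```

One thing the toy data hides: an inner join on `product_id` repeats an order once per matching review, so a product with several reviews would inflate the summed amount. Pre-aggregating reviews per product before joining avoids that.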

### Writing data across sources

```python
source = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
target = DataStore("file", path="output/summary.parquet", format="Parquet")

target.insert_into("category", "total", "count").select_from(
    source.groupby("category").select("category", "sum(amount) AS total", "count() AS count")
).execute()
```

## Raw SQL: Direct ClickHouse Power

For complex analytics or when you prefer SQL:

```python
import chdb

# Query any file
chdb.query("SELECT * FROM file('data.parquet', Parquet) WHERE price > 100 LIMIT 10")

# Query databases directly
chdb.query("SELECT * FROM mysql('db:3306', 'shop', 'orders', 'root', 'pass') WHERE status = 'shipped'")
chdb.query("SELECT * FROM postgresql('pg:5432', 'analytics', 'events', 'user', 'pass') ORDER BY ts DESC LIMIT 100")

# Cross-source SQL join
chdb.query("""
    SELECT u.name, o.product, o.amount
    FROM mysql('db:3306', 'crm', 'users', 'root', 'pass') AS u
    JOIN file('orders.parquet', Parquet) AS o ON u.id = o.user_id
    WHERE o.amount > 100
    ORDER BY o.amount DESC
""")

# Data lake formats
chdb.query("SELECT * FROM deltaLake('s3://bucket/delta/table', NOSIGN) LIMIT 10")
chdb.query("SELECT * FROM iceberg('s3://bucket/iceberg/table', 'KEY', 'SECRET') LIMIT 10")

# Python dict/DataFrame as SQL table
data = {"name": ["Alice", "Bob"], "score": [95, 87]}
chdb.query("SELECT * FROM Python(data) ORDER BY score DESC")

# Output formats: CSV (default), JSON, DataFrame, Arrow, ArrowTable, Parquet, Pretty
df = chdb.query("SELECT * FROM numbers(10)", "DataFrame")

# Parametrized queries
chdb.query(
    "SELECT toDate({d:String}) + number AS date FROM numbers({n:UInt64})",
    "DataFrame",
    params={"d": "2025-01-01", "n": 30}
)
```
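
The parametrized query above uses `{name:Type}` placeholders. As a small pre-flight sketch, the helper below extracts those placeholders with a regular expression and checks that the `params` dict matches them before a query is sent. `placeholders` and `check_params` are hypothetical helpers, not chdb functions, and the regex is deliberately naive: it would also match brace patterns inside SQL string literals.

```python
import re

# Matches {name:Type}, e.g. {d:String} or {ids:Array(UInt64)}.
PLACEHOLDER = re.compile(r"\{(\w+):([^{}]+)\}")


def placeholders(sql: str) -> dict:
    """Return a {name: declared_type} mapping for every placeholder in the SQL text."""
    return {name: declared_type for name, declared_type in PLACEHOLDER.findall(sql)}


def check_params(sql: str, params: dict) -> None:
    """Raise ValueError if params and placeholders do not line up."""
    expected = set(placeholders(sql))
    missing = expected - set(params)
    unused = set(params) - expected
    if missing or unused:
        raise ValueError(f"missing={sorted(missing)}, unused={sorted(unused)}")


sql = "SELECT toDate({d:String}) + number AS date FROM numbers({n:UInt64})"
found = placeholders(sql)
print(found)  # {'d': 'String', 'n': 'UInt64'}

check_params(sql, {"d": "2025-01-01", "n": 30})  # passes silently

try:
    check_params(sql, {"d": "2025-01-01"})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)  # missing=['n'], unused=[]
```

Passing values through `params` rather than formatting them into the SQL string keeps user input out of the query text.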

## Session: Stateful Pipelines

```python
from chdb import session as chs

sess = chs.Session("./analytics_db")   # persistent; use Session() for in-memory

# Ingest from external sources into local tables
sess.query("""
    CREATE TABLE users ENGINE = MergeTree() ORDER BY id AS
    SELECT * FROM mysql('db:3306', 'crm', 'users', 'root', 'pass')
""")
sess.query("""
    CREATE TABLE events ENGINE = MergeTree() ORDER BY (ts, user_id) AS
    SELECT * FROM s3('s3://logs/events/*.parquet', NOSIGN)
""")

# Analyze locally — fast iterative queries
sess.query("""
    SELECT u.country, e.event_type, count() AS cnt, uniqExact(e.user_id) AS users
    FROM events e JOIN users u ON e.user_id = u.id
    WHERE e.ts >= today() - 7
    GROUP BY u.country, e.event_type
    ORDER BY cnt DESC
""", "Pretty").show()

sess.close()
```
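
In the pipeline above, `sess.close()` is skipped if any query raises. One generic way to guarantee cleanup is `contextlib.closing`, which works with any object that has a `close()` method. The sketch below demonstrates the pattern with `FakeSession`, a hypothetical stand-in, so it runs without chdb. The only assumption about the real session is that it exposes `query()` and `close()`, as shown above. chdb's `Session` may also support `with` directly, so check the official docs before adding a wrapper.

```python
from contextlib import closing


class FakeSession:
    """Hypothetical stand-in exposing the same query()/close() shape as above."""

    def __init__(self) -> None:
        self.closed = False
        self.executed = []

    def query(self, sql: str, fmt: str = "CSV") -> str:
        if self.closed:
            raise RuntimeError("session is closed")
        self.executed.append(sql)
        return f"ran: {sql}"

    def close(self) -> None:
        self.closed = True


sess = FakeSession()
try:
    with closing(sess) as active:
        active.query("SELECT 1")
        # Simulate a failure halfway through the pipeline.
        raise RuntimeError("simulated failure mid-pipeline")
except RuntimeError:
    pass

print(sess.closed)    # True: close() ran even though the block raised
print(sess.executed)  # ['SELECT 1']
```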

## Quick Reference

- Official docs: https://clickhouse.com/docs/chdb
- API signatures and ClickHouse SQL functions: [reference.md](reference.md)
- 15 runnable examples (cross-source joins, data lakes, cloud storage, ETL pipelines): [examples.md](examples.md)
