using-chdb
Guide for using chdb, an in-process SQL OLAP engine powered by ClickHouse. Covers pandas-compatible DataStore API, 16+ data sources (MySQL, PostgreSQL, S3, ClickHouse, MongoDB, Iceberg, Delta Lake, etc.), 10+ file formats, and cross-source joins. Use when the user wants to analyze data, query files, join multiple data sources, or build data integration pipelines.
Best use case
using-chdb is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Guide for using chdb, an in-process SQL OLAP engine powered by ClickHouse. Covers pandas-compatible DataStore API, 16+ data sources (MySQL, PostgreSQL, S3, ClickHouse, MongoDB, Iceberg, Delta Lake, etc.), 10+ file formats, and cross-source joins. Use when the user wants to analyze data, query files, join multiple data sources, or build data integration pipelines.
Teams using using-chdb should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/using-chdb/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How using-chdb Compares
| Feature / Agent | using-chdb | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Guide for using chdb, an in-process SQL OLAP engine powered by ClickHouse. Covers pandas-compatible DataStore API, 16+ data sources (MySQL, PostgreSQL, S3, ClickHouse, MongoDB, Iceberg, Delta Lake, etc.), 10+ file formats, and cross-source joins. Use when the user wants to analyze data, query files, join multiple data sources, or build data integration pipelines.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# chdb — Pandas-Compatible Multi-Source Data Analytics
chdb is an in-process ClickHouse engine for Python. Write familiar pandas code, query 16+ data sources and 10+ file formats, join them freely — no server, no ETL, no data movement.
```bash
pip install chdb
```
## Why chdb
- **Drop-in pandas replacement**: `import datastore as pd` — same API, ClickHouse performance
- **16+ data sources as first-class citizens**: local files, S3, GCS, Azure, HDFS, MySQL, PostgreSQL, ClickHouse, MongoDB, SQLite, Redis, Iceberg, Delta Lake, Hudi, HTTP URLs
- **10+ file formats**: Parquet, CSV, TSV, JSON, JSONLines, Arrow, ORC, Avro, XML — auto-detected by extension
- **Cross-source joins**: join a MySQL table with an S3 Parquet file and a local CSV in one expression
- **Lazy evaluation**: operations compile to optimized SQL, execute only when results are needed
## DataStore: Pandas API on Any Data Source
### Connecting to data — always the same pattern
```python
from datastore import DataStore
# Local files (format auto-detected: .parquet, .csv, .json, .arrow, .orc, .avro, .tsv, .xml)
ds = DataStore.from_file("sales.parquet")
ds = DataStore.from_file("logs/*.csv") # glob patterns
# Cloud storage
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)
ds = DataStore.from_s3("s3://private/data.parquet", access_key_id="KEY", secret_access_key="SECRET")
ds = DataStore.from_gcs("gs://bucket/data.parquet", nosign=True)
ds = DataStore.from_azure(connection_string="...", container="data", path="events.parquet")
ds = DataStore.from_hdfs("hdfs://namenode:9000/warehouse/*.parquet")
ds = DataStore.from_url("https://example.com/data.csv")
# Databases
ds = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
ds = DataStore.from_postgresql(host="pg:5432", database="analytics", table="events", user="user", password="pass")
ds = DataStore.from_clickhouse(host="ch:9000", database="logs", table="access_log")
ds = DataStore.from_mongodb(host="mongo:27017", database="app", collection="users", user="user", password="pass")
ds = DataStore.from_sqlite("/data/local.db", "users")
# Data lake formats
ds = DataStore.from_iceberg("s3://warehouse/iceberg/events", access_key_id="KEY", secret_access_key="SECRET")
ds = DataStore.from_delta("s3://warehouse/delta/transactions", access_key_id="KEY", secret_access_key="SECRET")
ds = DataStore.from_hudi("s3://warehouse/hudi/logs", access_key_id="KEY", secret_access_key="SECRET")
# URI shorthand — auto-detect source and format
ds = DataStore.uri("s3://bucket/data.parquet?nosign=true")
ds = DataStore.uri("mysql://root:pass@db:3306/shop/orders")
ds = DataStore.uri("postgresql://user:pass@pg:5432/analytics/events")
ds = DataStore.uri("clickhouse://ch:9440/logs/access_log?user=default")
ds = DataStore.uri("mongodb://user:pass@mongo:27017/app.users")
ds = DataStore.uri("deltalake:///data/delta/events")
# In-memory from dict or DataFrame
ds = DataStore({"name": ["Alice", "Bob"], "age": [25, 30]})
```
### Once connected, always the same pandas API
No matter where the data lives, the operations are identical:
```python
# Filter
result = ds[ds["age"] > 25]
result = ds[(ds["status"] == "active") & (ds["revenue"] > 1000)]
# Select columns
result = ds[["name", "city", "revenue"]]
# Sort
result = ds.sort_values("revenue", ascending=False)
# GroupBy + aggregation
result = ds.groupby("department")["salary"].mean()
result = ds.groupby(["region", "product"]).agg({"revenue": "sum", "quantity": "mean"})
# Add computed columns
result = ds.assign(profit=ds["revenue"] - ds["cost"], margin=lambda x: x["profit"] / x["revenue"])
# String and datetime accessors
ds["name"].str.upper()
ds["email"].str.contains("@gmail")
ds["order_date"].dt.year
ds["order_date"].dt.month
# Inspection
ds.columns # column names
ds.shape # (rows, cols)
ds.head(10) # first 10 rows
ds.describe() # statistics
ds.to_sql() # view the generated SQL behind the scenes
```
### Cross-source joins — the killer feature
Join data across completely different sources with one expression:
```python
from datastore import DataStore
# Three different sources
customers = DataStore.from_mysql(host="db:3306", database="crm", table="customers", user="root", password="pass")
orders = DataStore.from_file("orders.parquet")
reviews = DataStore.from_s3("s3://feedback/reviews.parquet", nosign=True)
# Join them all with pandas syntax
result = (orders
.join(customers, left_on="customer_id", right_on="id")
.join(reviews, on="product_id")
.groupby("country")
.agg({"amount": "sum", "rating": "mean", "review_id": "count"})
.sort_values("amount", ascending=False)
)
print(result)
```
### Writing data across sources
```python
source = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
target = DataStore("file", path="output/summary.parquet", format="Parquet")
target.insert_into("category", "total", "count").select_from(
source.groupby("category").select("category", "sum(amount) AS total", "count() AS count")
).execute()
```
## Raw SQL: Direct ClickHouse Power
For complex analytics or when you prefer SQL:
```python
import chdb
# Query any file
chdb.query("SELECT * FROM file('data.parquet', Parquet) WHERE price > 100 LIMIT 10")
# Query databases directly
chdb.query("SELECT * FROM mysql('db:3306', 'shop', 'orders', 'root', 'pass') WHERE status = 'shipped'")
chdb.query("SELECT * FROM postgresql('pg:5432', 'analytics', 'events', 'user', 'pass') ORDER BY ts DESC LIMIT 100")
# Cross-source SQL join
chdb.query("""
SELECT u.name, o.product, o.amount
FROM mysql('db:3306', 'crm', 'users', 'root', 'pass') AS u
JOIN file('orders.parquet', Parquet) AS o ON u.id = o.user_id
WHERE o.amount > 100
ORDER BY o.amount DESC
""")
# Data lake formats
chdb.query("SELECT * FROM deltaLake('s3://bucket/delta/table', NOSIGN) LIMIT 10")
chdb.query("SELECT * FROM iceberg('s3://bucket/iceberg/table', 'KEY', 'SECRET') LIMIT 10")
# Python dict/DataFrame as SQL table
data = {"name": ["Alice", "Bob"], "score": [95, 87]}
chdb.query("SELECT * FROM Python(data) ORDER BY score DESC")
# Output formats: CSV (default), JSON, DataFrame, Arrow, ArrowTable, Parquet, Pretty
df = chdb.query("SELECT * FROM numbers(10)", "DataFrame")
# Parametrized queries
chdb.query(
"SELECT toDate({d:String}) + number AS date FROM numbers({n:UInt64})",
"DataFrame",
params={"d": "2025-01-01", "n": 30}
)
```
## Session: Stateful Pipelines
```python
from chdb import session as chs
sess = chs.Session("./analytics_db") # persistent; use Session() for in-memory
# Ingest from external sources into local tables
sess.query("""
CREATE TABLE users ENGINE = MergeTree() ORDER BY id AS
SELECT * FROM mysql('db:3306', 'crm', 'users', 'root', 'pass')
""")
sess.query("""
CREATE TABLE events ENGINE = MergeTree() ORDER BY (ts, user_id) AS
SELECT * FROM s3('s3://logs/events/*.parquet', NOSIGN)
""")
# Analyze locally — fast iterative queries
sess.query("""
SELECT u.country, e.event_type, count() AS cnt, uniqExact(e.user_id) AS users
FROM events e JOIN users u ON e.user_id = u.id
WHERE e.ts >= today() - 7
GROUP BY u.country, e.event_type
ORDER BY cnt DESC
""", "Pretty").show()
sess.close()
```
## Quick Reference
- Official docs: https://clickhouse.com/docs/chdb
- API signatures and ClickHouse SQL functions: [reference.md](reference.md)
- 15 runnable examples (cross-source joins, data lakes, cloud storage, ETL pipelines): [examples.md](examples.md)Related Skills
using-superantigravity
Use when starting any conversation — establishes how to find and use skills, requiring skill check before ANY response including clarifying questions
using-neon
Guides and best practices for working with Neon Serverless Postgres. Covers getting started, local development with Neon, choosing a connection method, Neon features, authentication (@neondatabase/...
using-live-documentation
Use BEFORE implementing, writing, configuring, or setting up ANY feature involving libraries, frameworks, or complex APIs - even before reading existing code. Fetches current documentation to ensure correct usage. Triggers on third-party libraries (such as react-query, FastAPI, Django, pytest), complex standard library modules (such as subprocess, streams, pathlib, logging), and "how to" questions about library usage. Do NOT use for trivial built-ins (such as dict.get, Array.map) or pure algorithms. Load this skill first to receive guidance on finding current documentation when implementing features, exploring code, or answering library-related questions.
using-droidz
Use when starting any conversation - establishes mandatory workflows for finding and using skills in the Droidz/Factory.ai system, including reading skills before usage, following brainstorming before coding, and creating TodoWrite todos for checklists
using-dbt-for-analytics-engineering
Builds and modifies dbt models, writes SQL transformations using ref() and source(), creates tests, and validates results with dbt show. Use when doing any dbt work - building or modifying models, debugging errors, exploring unfamiliar data sources, writing tests, or evaluating impact of changes.
using-context7-for-docs
Use when researching library documentation with Context7 MCP tools for official patterns and best practices
using-skillpack-maintenance
Use when maintaining, enhancing, or modifying existing Claude Code plugins - handles skills, commands, agents, hooks, and reference sheets through systematic domain analysis, structure review, behavioral testing, and quality improvements
using-xcode-cli
Builds and manages iOS/macOS apps using xcodebuild and xcrun simctl CLI tools. Use when working with Xcode projects, running apps in simulators, managing simulator instances, taking screenshots, capturing logs, running tests, or automating builds.
using-git-worktrees
Use when starting feature work that needs isolation from current workspace or before executing implementation plans - creates isolated git worktrees with smart directory selection and safety verifi...
using-openai-platform
OpenAI SDK development with GPT-5 family, Chat Completions, Responses API, embeddings, and tool calling. Use for AI-powered applications, chatbots, agents, and semantic search.
bgo
Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.
vuejs-best-practices
Vue 3 and Nuxt 3 performance optimization and best practices. This skill should be used when writing, reviewing, or refactoring Vue.js code to ensure optimal performance patterns. Triggers on tasks involving Vue components, Nuxt pages, Composition API, Pinia state management, or performance improvements.