openduck-distributed-duckdb
OpenDuck — open-source distributed DuckDB with differential storage, hybrid dual execution, and transparent remote database attach
Best use case
openduck-distributed-duckdb is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
OpenDuck — open-source distributed DuckDB with differential storage, hybrid dual execution, and transparent remote database attach
Teams using openduck-distributed-duckdb should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/openduck-distributed-duckdb/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How openduck-distributed-duckdb Compares
| Feature / Agent | openduck-distributed-duckdb | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
OpenDuck — open-source distributed DuckDB with differential storage, hybrid dual execution, and transparent remote database attach
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# OpenDuck Distributed DuckDB
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection
OpenDuck is an open-source implementation of distributed DuckDB featuring differential storage (append-only immutable layers via Postgres + object store), hybrid dual execution (single queries split across local and remote workers), and transparent remote database attach via `ATTACH 'openduck:mydb'`. It is architecturally inspired by MotherDuck but fully open protocol (gRPC + Arrow IPC).
---
## Architecture Overview
```
DuckDB client (openduck extension)
└─ ATTACH 'openduck:mydb?endpoint=...' AS cloud
└─ gRPC + Arrow IPC
└─ Gateway (Rust)
├─ auth / routing / plan splitting
├─ Worker 1 (embedded DuckDB)
└─ Worker N (embedded DuckDB)
├─ Postgres (metadata)
└─ Object store (sealed layers)
```
**Key concepts:**
- `OpenDuckCatalog` / `OpenDuckTableEntry` — remote tables appear as first-class DuckDB catalog entries
- Hybrid execution — gateway labels operators `LOCAL` or `REMOTE`, inserts `Bridge` operators at boundaries
- Differential storage — immutable sealed layers, snapshot isolation, one write path, many readers
- Protocol — only 2 gRPC RPCs defined in `proto/openduck/v1/execution.proto`
---
## Repository Layout
```
crates/
exec-gateway/ # auth, routing, hybrid plan splitting
exec-worker/ # embedded DuckDB, Arrow IPC streaming
exec-proto/ # protobuf/tonic codegen
openduck-cli/ # unified CLI (serve|gateway|worker)
diff-*/ # differential storage (layers, metadata, FUSE)
extensions/
openduck/ # DuckDB C++ extension (StorageExtension + Catalog)
clients/
python/ # pip-installable openduck wrapper
proto/
openduck/v1/ # execution.proto
```
---
## Installation & Build
### Prerequisites
- Rust toolchain (stable)
- C++ build tools, `vcpkg`, `bison` (macOS: `brew install bison`)
- DuckDB development headers (handled by the extension Makefile)
### 1. Build the Rust backend
```bash
git clone https://github.com/CITGuru/openduck
cd openduck
cargo build --workspace
```
### 2. Build the DuckDB C++ extension
```bash
cd extensions/openduck
make
# Output:
# extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
```
### 3. Install the Python client (optional)
```bash
pip install -e clients/python
```
---
## Running the Server
```bash
# Required env vars
export OPENDUCK_TOKEN=your-secret-token
# Start all-in-one (gateway + worker)
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
# Or run gateway and worker separately
cargo run -p openduck-cli -- gateway --port 7878
cargo run -p openduck-cli -- worker --gateway http://localhost:7878
```
---
## Connecting from Python
### Via openduck wrapper (recommended)
```python
import openduck # auto-detects extension from build tree or OPENDUCK_EXTENSION_PATH
con = openduck.connect("mydb") # uses OPENDUCK_TOKEN env var
con.sql("SELECT 1 AS x").show()
con.sql("SELECT * FROM cloud.users LIMIT 10").show()
```
### Via raw DuckDB SDK
```python
import duckdb
import os
ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
# e.g. extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")
con.execute(
"ATTACH 'openduck:mydb"
"?endpoint=http://localhost:7878"
f"&token={os.environ[\"OPENDUCK_TOKEN\"]}' AS cloud;"
)
# Query remote table
con.sql("SELECT * FROM cloud.users LIMIT 10").show()
# Hybrid query — local table joined with remote table
con.sql("""
SELECT l.product_id, l.name, r.total_sales
FROM local.products l
JOIN cloud.sales r ON l.product_id = r.product_id
WHERE r.total_sales > 1000
""").show()
```
### Environment variables
```bash
export OPENDUCK_TOKEN=your-secret-token
export OPENDUCK_EXTENSION_PATH=extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
export OPENDUCK_ENDPOINT=http://localhost:7878 # default
```
---
## Connecting from the CLI
```bash
duckdb -unsigned -c "
LOAD 'extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension';
ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token=${OPENDUCK_TOKEN}' AS cloud;
SHOW ALL TABLES;
SELECT * FROM cloud.users LIMIT 5;
"
```
---
## Connecting from Rust
```rust
use duckdb::{Connection, Result};
fn main() -> Result<()> {
let conn = Connection::open_in_memory()?;
let ext_path = std::env::var("OPENDUCK_EXTENSION_PATH").unwrap();
let token = std::env::var("OPENDUCK_TOKEN").unwrap();
conn.execute_batch(&format!(r#"
SET allow_unsigned_extensions = true;
LOAD '{ext_path}';
ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token={token}' AS cloud;
"#))?;
let mut stmt = conn.prepare("SELECT * FROM cloud.users LIMIT 10")?;
let rows = stmt.query_map([], |row| {
Ok(row.get::<_, String>(0)?)
})?;
for row in rows {
println!("{}", row?);
}
Ok(())
}
```
---
## Hybrid Execution Pattern
Hybrid execution happens automatically — the gateway splits the logical plan:
```
[LOCAL] HashJoin(l.id = r.id)
[LOCAL] Scan(products) ← runs on your machine
[LOCAL] Bridge(REMOTE→LOCAL)
[REMOTE] Scan(sales) ← runs on remote worker
```
Write queries naturally — the extension handles routing:
```python
# This single query runs across two engines transparently
con.sql("""
SELECT
p.category,
SUM(s.amount) AS revenue
FROM local.products p -- local table
JOIN cloud.sales s -- remote table
ON p.id = s.product_id
GROUP BY p.category
ORDER BY revenue DESC
""").show()
```
---
## Differential Storage
Differential storage is managed server-side. Key properties:
- **Append-only sealed layers** stored in object storage (S3-compatible)
- **Postgres** stores layer metadata and snapshot pointers
- **Snapshot isolation** — readers always see a consistent view
- **One serialized write path** — many concurrent readers
From a client perspective it is fully transparent. DuckDB sees normal table semantics.
---
## ATTACH URL Reference
```
openduck:<database_name>?endpoint=<url>&token=<token>
```
| Parameter | Default | Description |
|------------|----------------------------|------------------------------------|
| `endpoint` | `http://localhost:7878` | Gateway URL |
| `token` | `$OPENDUCK_TOKEN` env var | Auth token matching server config |
Examples:
```sql
-- Local dev
ATTACH 'openduck:mydb?token=dev-token' AS cloud;
-- Remote server, explicit endpoint
ATTACH 'openduck:mydb?endpoint=https://my-server.example.com&token=prod-token' AS cloud;
-- Alias: od: also works
ATTACH 'od:mydb?endpoint=http://localhost:7878&token=dev-token' AS cloud;
```
---
## DuckLake Integration
OpenDuck and DuckLake operate at different layers and complement each other:
```python
import duckdb, os
ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
token = os.environ["OPENDUCK_TOKEN"]
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")
# Attach DuckLake catalog (managed by remote worker backed by DuckLake)
con.execute(f"ATTACH 'openduck:lakehouse?endpoint=http://localhost:7878&token={token}' AS lh;")
# Query DuckLake tables transparently via OpenDuck transport
con.sql("SELECT * FROM lh.events WHERE event_date = today()").show()
# Hybrid: local scratch data joined with remote DuckLake table
con.sql("""
SELECT l.session_id, r.user_email
FROM memory.sessions l
JOIN lh.users r ON l.user_id = r.id
""").show()
```
---
## Protocol Reference
The wire protocol is intentionally minimal. See `proto/openduck/v1/execution.proto`:
- **`ExecuteQuery`** — send SQL, receive a query handle
- **`StreamResults`** — stream Arrow IPC record batches back to client
Any gRPC service implementing these two RPCs is a valid OpenDuck backend. You can replace the Rust gateway with a custom implementation in any language.
---
## Common Patterns
### Check which tables are available remotely
```sql
-- After ATTACH ... AS cloud
SHOW ALL TABLES;
SELECT table_name FROM information_schema.tables WHERE table_schema = 'cloud';
```
### Write to a remote table
```sql
INSERT INTO cloud.events SELECT * FROM read_parquet('local_dump.parquet');
```
### Create a remote table from local data
```sql
CREATE TABLE cloud.new_table AS SELECT * FROM local_csv LIMIT 0;
INSERT INTO cloud.new_table SELECT * FROM local_csv;
```
### Export remote query result to local Parquet
```python
con.sql("SELECT * FROM cloud.large_table WHERE region = 'us-east'") \
.write_parquet("output/us_east.parquet")
```
---
## Troubleshooting
### `Extension is not trusted` / signature error
```python
# Always set allow_unsigned_extensions before loading
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
```
Or in CLI:
```bash
duckdb -unsigned
```
### `LOAD` fails — extension not found
```bash
# Set the env var to the exact built path
export OPENDUCK_EXTENSION_PATH=$(pwd)/extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
ls -la $OPENDUCK_EXTENSION_PATH # confirm it exists
```
### Connection refused to gateway
```bash
# Verify server is running
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
# Default port is 7878 — check firewall / port binding
curl http://localhost:7878/health
```
### Token mismatch / auth failure
```bash
# Server token and client token must match exactly
export OPENDUCK_TOKEN=same-value-on-both-sides
# Server: cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
# Client: ATTACH '...&token=same-value-on-both-sides' AS cloud;
```
### Build fails on macOS — bison version
```bash
brew install bison
export PATH="$(brew --prefix bison)/bin:$PATH"
cd extensions/openduck && make
```
### Extension version mismatch with DuckDB
The extension must be built against the same DuckDB version as the Python package:
```bash
python -c "import duckdb; print(duckdb.__version__)"
# Ensure the extension Makefile targets the same version
# Check extensions/openduck/Makefile for DUCKDB_VERSION
```
---
## OpenDuck vs Alternatives
| Feature | OpenDuck | Arrow Flight SQL | DuckLake |
|---|---|---|---|
| Remote attach UX | `ATTACH 'openduck:db'` | Separate driver | `ATTACH 'ducklake:...'` |
| Hybrid execution | ✅ split plan | ❌ full remote | ❌ |
| DuckDB catalog integration | ✅ native | ❌ | ✅ |
| Protocol RPCs | 2 | ~15 | N/A |
| Differential storage | ✅ | ❌ | via Parquet layers |
| Self-hosted | ✅ | ✅ | ✅ |Related Skills
```markdown
---
zeroboot-vm-sandbox
Sub-millisecond VM sandboxes for AI agents using copy-on-write KVM forking via Zeroboot
yourvpndead-vpn-detection
Android app that detects VPN/proxy servers (VLESS/xray/sing-box) via local SOCKS5 vulnerability, exposing exit IPs and server configs without root
xata-postgres-platform
Expert skill for Xata open-source cloud-native Postgres platform with copy-on-write branching, scale-to-zero, and Kubernetes deployment
x-mentor-skill-nuwa
AI-powered X (Twitter) content strategy skill that distills methodologies from 6 top creators + open-source algorithm data into actionable writing, growth, and monetization guidance.
wx-favorites-report
End-to-end pipeline to extract, decrypt, and visualize WeChat Mac favorites from encrypted SQLite DB into an interactive HTML report.
wterm-web-terminal
Web terminal emulator with Zig/WASM core, DOM rendering, and React/vanilla JS bindings
worldmonitor-intelligence-dashboard
Real-time global intelligence dashboard with AI-powered news aggregation, geopolitical monitoring, and infrastructure tracking
witr-process-inspector
CLI and TUI tool that explains why processes, services, and ports are running by tracing causality chains across supervisors, containers, and shells.
wildworld-dataset
WildWorld large-scale action-conditioned world modeling dataset with 108M+ frames from a photorealistic ARPG game, featuring per-frame annotations, 450+ actions, and explicit state information for generative world modeling research.
whatcable-macos-usb-inspector
macOS menu bar app that identifies USB-C cable capabilities and charging diagnostics using IOKit
wewrite-wechat-ai-publishing
Full-pipeline AI skill for WeChat Official Account articles — hotspot fetching, topic selection, writing, SEO, image generation, formatting, and draft box publishing.