clickhouse-data-handling
Handle data lifecycle in ClickHouse — TTL expiration, data deletion (GDPR), column-level encryption, and audit logging with real ClickHouse SQL. Use when implementing data retention, GDPR deletion requests, or managing sensitive data in ClickHouse. Trigger: "clickhouse data retention", "clickhouse TTL", "clickhouse GDPR", "delete data clickhouse", "clickhouse data lifecycle", "clickhouse PII".
Best use case
clickhouse-data-handling is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using clickhouse-data-handling can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/clickhouse-data-handling/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
How clickhouse-data-handling Compares
| Feature / Agent | clickhouse-data-handling | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It manages the full data lifecycle in ClickHouse: TTL-based expiration, GDPR/CCPA deletion, column-level encryption and masking, and audit logging, all with real ClickHouse SQL you can adapt.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# ClickHouse Data Handling
## Overview
Manage the full data lifecycle in ClickHouse: TTL-based expiration, GDPR/CCPA
deletion, data masking, partition management, and audit trails.
## Prerequisites
- ClickHouse tables with data (see `clickhouse-core-workflow-a`)
- Understanding of your data retention requirements
## Instructions
### Step 1: TTL-Based Data Expiration
```sql
-- Add TTL to expire data automatically
CREATE TABLE analytics.events (
    event_id UUID DEFAULT generateUUIDv4(),
    event_type LowCardinality(String),
    user_id UInt64,
    email String,  -- PII column, given a shorter TTL below
    properties String CODEC(ZSTD(3)),
    created_at DateTime DEFAULT now()
)
ENGINE = MergeTree()
ORDER BY (event_type, created_at)
PARTITION BY toYYYYMM(created_at)
TTL created_at + INTERVAL 90 DAY; -- Auto-delete after 90 days

-- Add TTL to an existing table
ALTER TABLE analytics.events
    MODIFY TTL created_at + INTERVAL 90 DAY;

-- Tiered storage TTL (hot → cold → delete)
ALTER TABLE analytics.events
    MODIFY TTL
        created_at + INTERVAL 7 DAY TO VOLUME 'hot',
        created_at + INTERVAL 30 DAY TO VOLUME 'cold',
        created_at + INTERVAL 365 DAY DELETE;

-- Column-level TTL (reset PII to its default after 30 days, keep the row)
ALTER TABLE analytics.events
    MODIFY COLUMN email String DEFAULT ''
    TTL created_at + INTERVAL 30 DAY;

-- Force TTL cleanup now (normally runs during merges)
OPTIMIZE TABLE analytics.events FINAL;
```
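Two follow-ups are worth knowing. Changing a TTL only affects parts rewritten by future merges; to apply a new rule to existing parts right away, `MATERIALIZE TTL` runs the rewrite as a mutation. A minimal sketch:

```sql
-- Apply a newly added or changed TTL to existing parts (runs as a mutation)
ALTER TABLE analytics.events MATERIALIZE TTL;

-- Confirm which TTL clauses a table carries
SHOW CREATE TABLE analytics.events;
```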
### Step 2: Data Deletion for GDPR/CCPA
```sql
-- Option A: Lightweight DELETE (ClickHouse 23.3+)
-- Marks rows as deleted without rewriting parts immediately
DELETE FROM analytics.events WHERE user_id = 42;

-- Option B: ALTER TABLE DELETE (mutation — rewrites parts in background)
ALTER TABLE analytics.events DELETE WHERE user_id = 42;

-- Check mutation progress
SELECT
    database, table, mutation_id, command,
    is_done, parts_to_do, create_time
FROM system.mutations
WHERE NOT is_done
ORDER BY create_time DESC;

-- Option C: Drop entire partitions (fastest for bulk deletion)
-- First, check what partitions exist
SELECT
    partition, count() AS parts, sum(rows) AS rows,
    min(min_time) AS from_time, max(max_time) AS to_time
FROM system.parts
WHERE database = 'analytics' AND table = 'events' AND active
GROUP BY partition
ORDER BY partition;

ALTER TABLE analytics.events DROP PARTITION '202401';
```
**Important notes on ClickHouse deletions:**
- `DELETE FROM` is lightweight but still creates mutations internally
- Mutations rewrite data parts in the background — not instant
- For GDPR compliance, use `ALTER TABLE DELETE` and verify via `system.mutations`
- Partitioned data is fastest to bulk-delete via `DROP PARTITION`
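For compliance evidence, a simple pattern (a sketch, assuming the `user_id = 42` request above) is to poll `system.mutations` until the delete finishes, then confirm no rows survive:

```sql
-- 1. Wait until no pending mutations remain for the table
SELECT count() AS pending
FROM system.mutations
WHERE database = 'analytics' AND table = 'events' AND NOT is_done;

-- 2. Confirm the subject's rows are gone (expect 0)
SELECT count() AS remaining
FROM analytics.events
WHERE user_id = 42;
```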
### Step 3: Data Masking and Anonymization
```sql
-- Create a view that masks PII for analyst access
CREATE VIEW analytics.events_masked AS
SELECT
    event_id,
    event_type,
    sipHash64(user_id) AS user_id_hash,           -- One-way hash
    JSONExtractString(properties, 'url') AS url,  -- Extract safe fields only
    -- Mask email: show domain only
    concat('***@', substring(email, position(email, '@') + 1)) AS masked_email,
    created_at
FROM analytics.events;

-- Row-level masking with dictionaries
CREATE DICTIONARY analytics.pii_allowlist (
    user_id UInt64,
    can_see_pii UInt8
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(TABLE 'pii_allowlist'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(FLAT());

-- Use the dictionary: reveal email only for allow-listed users
SELECT
    if(dictGet('analytics.pii_allowlist', 'can_see_pii', user_id) = 1,
       email, '***') AS email
FROM analytics.events;
```
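The skill description also promises column-level encryption, which Step 3 does not show. ClickHouse offers two routes: the `encrypt`/`decrypt` functions for application-managed keys, and encryption codecs for transparent at-rest encryption (available when an encryption key is configured server-side). A sketch with a hard-coded key purely for illustration; in practice the key would come from a secrets manager, never SQL:

```sql
-- Application-managed encryption (the hex key and IV here are placeholders)
WITH
    unhex('00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff') AS key, -- 32 bytes for AES-256
    unhex('101112131415161718191a1b') AS iv                                           -- 12-byte GCM IV
SELECT
    encrypt('aes-256-gcm', 'user@example.com', key, iv) AS ciphertext,
    decrypt('aes-256-gcm', ciphertext, key, iv) AS roundtrip;

-- Transparent at-rest encryption via a codec (key set in server config)
ALTER TABLE analytics.events
    MODIFY COLUMN properties String CODEC(ZSTD(3), AES_128_GCM_SIV);
```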
### Step 4: User Data Export (DSAR)
```typescript
import { createClient } from '@clickhouse/client';

// DSAR export: collect all of a user's rows from every relevant table
async function exportUserData(userId: number): Promise<Record<string, unknown[]>> {
  const client = createClient({ url: process.env.CLICKHOUSE_HOST! });
  const tables = ['events', 'sessions', 'purchases'];
  const result: Record<string, unknown[]> = {};
  for (const table of tables) {
    const rs = await client.query({
      query: `SELECT * FROM analytics.${table} WHERE user_id = {uid:UInt64}`,
      query_params: { uid: userId },
      format: 'JSONEachRow',
    });
    result[table] = (await rs.json()) as unknown[];
  }
  await client.close();
  return result;
}

// GDPR: Delete all user data
async function deleteUserData(userId: number): Promise<void> {
  const client = createClient({ url: process.env.CLICKHOUSE_HOST! });
  const tables = ['events', 'sessions', 'purchases'];
  for (const table of tables) {
    await client.command({
      query: `ALTER TABLE analytics.${table} DELETE WHERE user_id = {uid:UInt64}`,
      query_params: { uid: userId },
    });
  }
  // Log the deletion for the compliance audit trail
  await client.insert({
    table: 'analytics.gdpr_audit_log',
    values: [{
      user_id: userId,
      action: 'DELETE_ALL',
      tables_affected: tables.join(','),
      requested_at: new Date().toISOString().replace('T', ' ').slice(0, 19),
    }],
    format: 'JSONEachRow',
  });
  await client.close();
}
```
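The `deleteUserData` snippet assumes an `analytics.gdpr_audit_log` table already exists; the skill never defines it, so here is one possible minimal shape matching the fields inserted above:

```sql
-- One possible definition for the GDPR audit table used by deleteUserData
CREATE TABLE IF NOT EXISTS analytics.gdpr_audit_log (
    user_id UInt64,
    action LowCardinality(String),
    tables_affected String,
    requested_at DateTime
)
ENGINE = MergeTree()
ORDER BY (user_id, requested_at);
```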
### Step 5: Audit Trail Table
```sql
-- Immutable audit log (no deletes, no TTL)
CREATE TABLE analytics.audit_log (
    log_id UUID DEFAULT generateUUIDv4(),
    action LowCardinality(String),  -- 'query', 'delete', 'export', 'schema_change'
    actor String,                   -- User or service name
    target String,                  -- Table or resource
    details String CODEC(ZSTD(3)),  -- JSON details
    ip_address IPv4,
    logged_at DateTime DEFAULT now()
)
ENGINE = MergeTree()
ORDER BY (action, logged_at)
PARTITION BY toYYYYMM(logged_at);
-- No TTL — audit logs must be retained

-- Query the audit trail
SELECT logged_at, actor, action, target, details
FROM analytics.audit_log
WHERE action = 'DELETE_ALL'
ORDER BY logged_at DESC
LIMIT 50;
```
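For completeness, an illustrative write into the audit table (the actor, target, and details values are whatever your services report):

```sql
-- Example audit entry written by an application service
INSERT INTO analytics.audit_log (action, actor, target, details, ip_address)
VALUES ('export', 'dsar-service', 'analytics.events', '{"user_id": 42}', toIPv4('10.0.0.5'));
```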
### Step 6: Retention Monitoring
```sql
-- Data retention overview (a table's TTL clause, if any, appears in engine_full)
-- min_time/max_time are only meaningful for tables partitioned by a time column
SELECT
    t.database,
    t.name AS table,
    formatReadableSize(sum(p.bytes_on_disk)) AS size,
    min(p.min_time) AS oldest_data,
    max(p.max_time) AS newest_data,
    dateDiff('day', min(p.min_time), max(p.max_time)) AS days_span
FROM system.tables AS t
INNER JOIN system.parts AS p ON t.database = p.database AND t.name = p.table
WHERE t.database = 'analytics' AND p.active
GROUP BY t.database, t.name
ORDER BY sum(p.bytes_on_disk) DESC;

-- Find MergeTree tables with no TTL clause
SELECT database, name AS table, engine
FROM system.tables
WHERE database = 'analytics'
  AND engine LIKE '%MergeTree%'
  AND engine_full NOT LIKE '% TTL %';
```
## Data Classification
| Category | Examples | Handling in ClickHouse |
|----------|----------|------------------------|
| PII | Email, name, IP | Column-level TTL, masking views, deletion support |
| Sensitive | API keys, tokens | Never store in ClickHouse — use secret managers |
| Business | Event counts, metrics | Standard TTL, aggregate for long-term retention |
| Audit | Access logs | No TTL, immutable, partitioned by month |
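The "aggregate for long-term retention" row is worth a sketch: raw events expire under TTL while a rollup table keeps cheap, long-lived aggregates. The table and view names below are illustrative:

```sql
-- Long-term aggregates survive after raw events expire via TTL
CREATE TABLE analytics.daily_event_counts (
    event_type LowCardinality(String),
    day Date,
    events UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (event_type, day);

-- Populated on every insert into analytics.events; rows are summed on merge
CREATE MATERIALIZED VIEW analytics.daily_event_counts_mv
TO analytics.daily_event_counts AS
SELECT event_type, toDate(created_at) AS day, count() AS events
FROM analytics.events
GROUP BY event_type, day;
```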
## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Mutation stuck | Large table rewrite | Check `system.mutations`, cancel if needed |
| TTL not expiring | No merges running | `OPTIMIZE TABLE ... FINAL` to force |
| DELETE not working | Old ClickHouse version | Use `ALTER TABLE DELETE` (mutation) |
| Export timeout | Too much user data | Add LIMIT or export in batches |
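The "cancel if needed" remedy maps to `KILL MUTATION`; the mutation_id below is hypothetical and would come from `system.mutations`:

```sql
-- Cancel a stuck mutation (parts already rewritten keep their new state)
KILL MUTATION
WHERE database = 'analytics' AND table = 'events' AND mutation_id = 'mutation_42.txt';
```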
## Resources
- [TTL for Data Management](https://clickhouse.com/docs/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-ttl)
- [DELETE Statement](https://clickhouse.com/docs/sql-reference/statements/delete)
- [Mutations](https://clickhouse.com/docs/guides/developer/mutations)
## Next Steps
For role-based access control, see `clickhouse-enterprise-rbac`.
Related Skills
College Football Data (CFB)
Before writing queries, consult `references/api-reference.md` for endpoints, conference IDs, team IDs, and data shapes.
College Basketball Data (CBB)
Before writing queries, consult `references/api-reference.md` for endpoints, conference IDs, team IDs, and data shapes.
validating-database-integrity
Use when you need to ensure database integrity through comprehensive data validation. This skill validates data types, ranges, formats, referential integrity, and business rules. Trigger with phrases like "validate database data", "implement data validation rules", "enforce data integrity constraints", or "validate data formats".
forecasting-time-series-data
This skill enables Claude to forecast future values based on historical time series data. It analyzes time-dependent data to identify trends, seasonality, and other patterns. Use this skill when the user asks to predict future values of a time series, analyze trends in data over time, or requires insights into time-dependent data. Trigger terms include "forecast," "predict," "time series analysis," "future values," and requests involving temporal data.
generating-test-data
This skill enables Claude to generate realistic test data for software development. It uses the test-data-generator plugin to create users, products, orders, and custom schemas for comprehensive testing. Use this skill when you need to populate databases, simulate user behavior, or create fixtures for automated tests. Trigger phrases include "generate test data", "create fake users", "populate database", "generate product data", "create test orders", or "generate data based on schema". This skill is especially useful for populating testing environments or creating sample data for demonstrations.
test-data-builder
Test Data Builder - Auto-activating skill for Test Automation. Triggers on: "test data builder". Part of the Test Automation skill category.
splitting-datasets
Split datasets into training, validation, and testing sets for ML model development. Use when requesting "split dataset", "train-test split", or "data partitioning".
scanning-database-security
Use when you need to work with security and compliance. This skill provides security scanning and vulnerability detection with comprehensive guidance and automation. Trigger with phrases like "scan for vulnerabilities", "implement security controls", or "audit security".
preprocessing-data-with-automated-pipelines
Automate data cleaning, transformation, and validation for ML tasks. Use when requesting "preprocess data", "clean data", "ETL pipeline", or "data transformation".
optimizing-database-connection-pooling
Use when you need to work with connection management. This skill provides connection pooling and management with comprehensive guidance and automation. Trigger with phrases like "manage connections", "configure pooling", or "optimize connection usage".
modeling-nosql-data
This skill enables Claude to design NoSQL data models. It activates when the user requests assistance with NoSQL database design, including schema creation, data modeling for MongoDB or DynamoDB, or defining document structures. Use this skill when the user mentions "NoSQL data model", "design MongoDB schema", "create DynamoDB table", or similar phrases related to NoSQL database architecture. It assists in understanding NoSQL modeling principles like embedding vs. referencing, access pattern optimization, and sharding key selection.
monitoring-database-transactions
Use when you need to work with monitoring and observability. This skill provides health monitoring and alerting with comprehensive guidance and automation. Trigger with phrases like "monitor system health", "set up alerts", or "track metrics".