init
Initialize warehouse schema discovery. Generates .astro/warehouse.md with all table metadata for instant lookups. Run once per project, refresh when schema changes. Use when user says "/data:init" or asks to set up data discovery.
Best use case
init is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using init should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/init/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How init Compares
| Feature / Agent | init | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Initialize warehouse schema discovery. Generates .astro/warehouse.md with all table metadata for instant lookups. Run once per project, refresh when schema changes. Use when user says "/data:init" or asks to set up data discovery.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Initialize Warehouse Schema
Generate a comprehensive, user-editable schema reference file for the data warehouse.
**Scripts:** `$CLAUDE_PLUGIN_ROOT/skills/analyzing-data/scripts/`
## What This Does
1. Discovers all databases, schemas, tables, and columns from the warehouse
2. **Enriches with codebase context** (dbt models, gusty SQL, schema docs)
3. Records row counts and identifies large tables
4. Generates `.astro/warehouse.md` - a version-controllable, team-shareable reference
5. Enables instant concept→table lookups without warehouse queries
## Process
### Step 1: Read Warehouse Configuration
```bash
cat ~/.astro/ai/config/warehouse.yml
```
Get the list of databases to discover (e.g., `databases: [HQ, ANALYTICS, RAW]`).
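As a rough illustration of this step, the sketch below pulls the database list out of the config text. It assumes the file contains a flow-style `databases: [...]` line like the example above; the real layout of `warehouse.yml` is not shown here, and the `parse_databases` helper and the `my_warehouse` key are hypothetical. A full YAML parser would be more robust.

```python
import re


def parse_databases(config_text: str) -> list:
    """Return names from the first flow-style `databases: [...]` line.

    Assumption: the list is written inline, as in `databases: [HQ, ANALYTICS, RAW]`.
    """
    match = re.search(r"^\s*databases:\s*\[([^\]]*)\]", config_text, flags=re.MULTILINE)
    if match is None:
        return []
    # Split on commas, then drop surrounding whitespace and optional quotes.
    names = [item.strip().strip("'\"") for item in match.group(1).split(",")]
    return [name for name in names if name]


# Hypothetical config content, for illustration only.
sample = "my_warehouse:\n  databases: [HQ, ANALYTICS, RAW]\n"
print(parse_databases(sample))  # ['HQ', 'ANALYTICS', 'RAW']
```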
### Step 2: Search Codebase for Context (Parallel)
**Launch a subagent to find business context in code:**
```
Task(
    subagent_type="Explore",
    prompt="""
    Search for data model documentation in the codebase:

    1. dbt models: **/models/**/*.yml, **/schema.yml
       - Extract table descriptions, column descriptions
       - Note primary keys and tests

    2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter
       - Parse frontmatter for: description, primary_key, tests
       - Note schema mappings

    3. AGENTS.md or CLAUDE.md files with data layer documentation

    Return a mapping of:
    table_name -> {description, primary_key, important_columns, layer}
    """
)
```
### Step 3: Parallel Warehouse Discovery
**Launch one subagent per database** using the Task tool:
```
For each database in configured_databases:
    Task(
        subagent_type="general-purpose",
        prompt="""
        Discover all metadata for database {DATABASE}.

        Use the CLI to run SQL queries:
        # Scripts are relative to ../analyzing-data/
        uv run scripts/cli.py exec "df = run_sql('...')"
        uv run scripts/cli.py exec "print(df)"

        1. Query schemas:
           SELECT SCHEMA_NAME FROM {DATABASE}.INFORMATION_SCHEMA.SCHEMATA

        2. Query tables with row counts:
           SELECT TABLE_SCHEMA, TABLE_NAME, ROW_COUNT, COMMENT
           FROM {DATABASE}.INFORMATION_SCHEMA.TABLES
           ORDER BY TABLE_SCHEMA, TABLE_NAME

        3. For important schemas (MODEL_*, METRICS_*, MART_*), query columns:
           SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
           FROM {DATABASE}.INFORMATION_SCHEMA.COLUMNS
           WHERE TABLE_SCHEMA = 'X'

        Return a structured summary:
        - Database name
        - List of schemas with table counts
        - For each table: name, row_count, key columns
        - Flag any tables with >100M rows as "large"
        """
    )
```
**Run all subagents in parallel** (single message with multiple Task calls).
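The structured summary each subagent returns could be assembled along these lines. This is a minimal sketch, not the skill's actual implementation: `TableInfo`, `summarize`, and the sample rows are hypothetical, and only the >100M-row threshold comes from the prompt above.

```python
from dataclasses import dataclass
from typing import Optional

LARGE_TABLE_ROWS = 100_000_000  # ">100M rows" threshold from the discovery prompt


@dataclass
class TableInfo:
    schema: str
    name: str
    row_count: Optional[int]  # the metadata query may return no count for some objects

    @property
    def is_large(self) -> bool:
        return self.row_count is not None and self.row_count > LARGE_TABLE_ROWS


def summarize(database: str, tables: list) -> dict:
    """Build the per-database summary: table counts per schema plus large-table flags."""
    schemas = {}
    for table in tables:
        schemas[table.schema] = schemas.get(table.schema, 0) + 1
    return {
        "database": database,
        "schemas": schemas,
        "large_tables": [f"{t.schema}.{t.name}" for t in tables if t.is_large],
    }


# Illustrative rows only; EVENTS and the counts are made up.
rows = [
    TableInfo("MODEL_ASTRO", "ORGANIZATIONS", 1_200),
    TableInfo("MODEL_ASTRO", "EVENTS", 250_000_000),
]
print(summarize("HQ", rows))
```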
### Step 4: Discover Categorical Value Families
For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families:
```bash
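# Scripts are relative to ../analyzing-data/
# Replace column_name and table with the column and table being profiled
uv run scripts/cli.py exec "df = run_sql('''
SELECT column_name, COUNT(*) AS occurrences
FROM table
WHERE column_name IS NOT NULL
GROUP BY column_name
ORDER BY occurrences DESC
LIMIT 50
''')"
uv run scripts/cli.py exec "print(df)"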
```
Group related values into families by common prefix/suffix (e.g., `Export*` for ExportCSV, ExportJSON, ExportParquet).
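One way to do the prefix grouping is sketched below. It assumes CamelCase values and uses the leading capitalized word as the family prefix; the `value_families` helper is hypothetical, and suffix-based families or snake_case values would need a different rule.

```python
import re
from collections import defaultdict


def value_families(values: list, min_size: int = 2) -> dict:
    """Group values by their leading CamelCase word, e.g. ExportCSV -> 'Export*'."""
    groups = defaultdict(list)
    for value in values:
        match = re.match(r"[A-Z][a-z]+", value)
        # Only values with something after the prefix can be part of a family.
        if match and match.end() < len(value):
            groups[match.group(0)].append(value)
    return {
        f"{prefix}*": sorted(members)
        for prefix, members in groups.items()
        if len(members) >= min_size
    }


print(value_families(["ExportCSV", "ExportJSON", "ExportParquet", "Login"]))
# {'Export*': ['ExportCSV', 'ExportJSON', 'ExportParquet']}
```

Values that do not share a prefix with at least one other value (here `Login`) are left out of the families and can be listed individually.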
### Step 5: Merge Results
Combine warehouse metadata + codebase context:
1. **Quick Reference table** - concept → table mappings (pre-populated from code if found)
2. **Categorical Columns** - value families for key filter columns
3. **Database sections** - one per database
4. **Schema subsections** - tables grouped by schema
5. **Table details** - columns, row counts, **descriptions from code**, warnings
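The merge itself might look like the sketch below. The matching rule (case-insensitive match on the last component of the table name) and the precedence (a description from code wins over the warehouse `COMMENT`) are assumptions made for illustration; `merge_metadata` and the sample description are hypothetical.

```python
def merge_metadata(warehouse: dict, code_context: dict) -> dict:
    """Enrich warehouse table metadata with descriptions and keys found in code.

    warehouse:    fully qualified table name -> {"row_count": ..., "comment": ...}
    code_context: table name from code       -> {"description": ..., "primary_key": ...}
    """
    by_name = {name.lower(): ctx for name, ctx in code_context.items()}
    merged = {}
    for full_name, meta in warehouse.items():
        short_name = full_name.rsplit(".", 1)[-1].lower()
        ctx = by_name.get(short_name, {})
        entry = dict(meta)
        # Assumed precedence: code description first, warehouse comment as fallback.
        entry["description"] = ctx.get("description") or meta.get("comment")
        entry["primary_key"] = ctx.get("primary_key")
        entry["enriched"] = bool(ctx)
        merged[full_name] = entry
    return merged


warehouse = {"HQ.MODEL_ASTRO.ORGANIZATIONS": {"row_count": 1200, "comment": None}}
code = {"organizations": {"description": "One row per organization", "primary_key": "ORG_ID"}}
print(merge_metadata(warehouse, code))
```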
### Step 6: Generate warehouse.md
Write the file to:
- `.astro/warehouse.md` (default - project-specific, version-controllable)
- `~/.astro/ai/config/warehouse.md` (if `--global` flag)
## Output Format
````markdown
# Warehouse Schema

> Generated by `/data:init` on {DATE}. Edit freely to add business context.

## Quick Reference

| Concept | Table | Key Column | Date Column |
|---------|-------|------------|-------------|
| customers | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_AT |
<!-- Add your concept mappings here -->

## Categorical Columns

When filtering on these columns, explore value families first (values often have variants):

| Table | Column | Value Families |
|-------|--------|----------------|
| {TABLE} | {COLUMN} | `{PREFIX}*` ({VALUE1}, {VALUE2}, ...) |
<!-- Populated by /data:init from actual warehouse data -->

## Data Layer Hierarchy

Query downstream first: `reporting` > `mart_*` > `metric_*` > `model_*` > `IN_*`

| Layer | Prefix | Purpose |
|-------|--------|---------|
| Reporting | `reporting.*` | Dashboard-optimized |
| Mart | `mart_*` | Combined analytics |
| Metric | `metric_*` | KPIs at various grains |
| Model | `model_*` | Cleansed sources of truth |
| Raw | `IN_*` | Source data - avoid |

## {DATABASE} Database

### {SCHEMA} Schema

#### {TABLE_NAME}
{DESCRIPTION from code if found}

| Column | Type | Description |
|--------|------|-------------|
| COL1 | VARCHAR | {from code or inferred} |

- **Rows:** {ROW_COUNT}
- **Key column:** {PRIMARY_KEY from code or inferred}
{IF ROW_COUNT > 100M: - **⚠️ WARNING:** Large table - always add date filters}

## Relationships

```
{Inferred relationships based on column names like *_ID}
```
````
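To make the per-table part of the template concrete, here is a small rendering sketch. The `render_table_section` helper and its arguments are hypothetical; the layout follows the `#### {TABLE_NAME}` section of the output format, and the sample description and row count are made up.

```python
LARGE_TABLE_ROWS = 100_000_000  # same ">100M rows" threshold used for the warning


def render_table_section(name, description, columns, row_count, key_column):
    """Render one table section; columns is a list of (name, type, description)."""
    lines = [
        f"#### {name}",
        description,
        "",
        "| Column | Type | Description |",
        "|--------|------|-------------|",
    ]
    for col_name, col_type, col_desc in columns:
        lines.append(f"| {col_name} | {col_type} | {col_desc} |")
    lines += ["", f"- **Rows:** {row_count}", f"- **Key column:** {key_column}"]
    if row_count > LARGE_TABLE_ROWS:
        lines.append("- **⚠️ WARNING:** Large table - always add date filters")
    return "\n".join(lines)


print(render_table_section(
    "ORGANIZATIONS",
    "One row per organization",  # illustrative description
    [("ORG_ID", "VARCHAR", "Primary key"), ("CREATED_AT", "TIMESTAMP", "Creation time")],
    1200,
    "ORG_ID",
))
```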
## Command Options
| Option | Effect |
|--------|--------|
| `/data:init` | Generate .astro/warehouse.md |
| `/data:init --refresh` | Regenerate, preserving user edits |
| `/data:init --database HQ` | Only discover specific database |
| `/data:init --global` | Write to ~/.astro/ai/config/ instead |
### Step 7: Pre-populate Cache
After generating warehouse.md, populate the concept cache:
```bash
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py concept import -p .astro/warehouse.md
uv run scripts/cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
```
### Step 8: Offer CLAUDE.md Integration (Ask User)
**Ask the user:**
> Would you like to add the Quick Reference table to your CLAUDE.md file?
>
> This ensures the schema mappings are always in context for data queries, improving accuracy from ~25% to ~100% for complex queries.
>
> Options:
> 1. **Yes, add to CLAUDE.md** (Recommended) - Append Quick Reference section
> 2. **No, skip** - Use warehouse.md and cache only
**If user chooses Yes:**
1. Check whether `.claude/CLAUDE.md` or `CLAUDE.md` exists
2. If one exists, append the Quick Reference section to it (avoid duplicates)
3. If neither exists, create `.claude/CLAUDE.md` containing just the Quick Reference
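The three steps above can be sketched as follows. This is an illustration under assumptions: the `add_quick_reference` helper is hypothetical, and duplicate detection simply looks for the section heading used in this step's template (`## Data Warehouse Quick Reference`).

```python
import tempfile
from pathlib import Path

HEADING = "## Data Warehouse Quick Reference"


def add_quick_reference(project_root: Path, section: str) -> Path:
    """Append the section to an existing CLAUDE.md, or create .claude/CLAUDE.md."""
    candidates = [project_root / ".claude" / "CLAUDE.md", project_root / "CLAUDE.md"]
    target = next((path for path in candidates if path.exists()), None)
    if target is None:
        # Neither file exists: create .claude/CLAUDE.md with just the section.
        target = candidates[0]
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(section.rstrip() + "\n", encoding="utf-8")
        return target
    existing = target.read_text(encoding="utf-8")
    if HEADING in existing:
        return target  # section already present; do not duplicate it
    target.write_text(existing.rstrip() + "\n\n" + section.rstrip() + "\n", encoding="utf-8")
    return target


# Demo in a throwaway directory: the second call is a no-op.
with tempfile.TemporaryDirectory() as tmp:
    section = HEADING + "\n\n| Concept | Table | Key Column | Date Column |\n"
    path = add_quick_reference(Path(tmp), section)
    add_quick_reference(Path(tmp), section)
    print(path.read_text(encoding="utf-8").count(HEADING))  # 1
```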
**Quick Reference section to add:**
```markdown
## Data Warehouse Quick Reference
When querying the warehouse, use these table mappings:
| Concept | Table | Key Column | Date Column |
|---------|-------|------------|-------------|
{rows from warehouse.md Quick Reference}
**Large tables (always filter by date):** {list tables with >100M rows}
> Auto-generated by `/data:init`. Run `/data:init --refresh` to update.
```
## After Generation
Tell the user:
```
Generated .astro/warehouse.md
Summary:
- {N} databases, {N} schemas, {N} tables
- {N} tables enriched with code descriptions
- {N} concepts cached for instant lookup
Next steps:
1. Edit .astro/warehouse.md to add business context
2. Commit to version control
3. Run /data:init --refresh when schema changes
```
## Refresh Behavior
When `--refresh` is specified:
1. Read existing warehouse.md
2. Preserve all HTML comments (`<!-- ... -->`)
3. Preserve Quick Reference table entries (user-added)
4. Preserve user-added descriptions
5. Update row counts and add new tables
6. Mark removed tables with `<!-- REMOVED -->` comment
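A simplified sketch of the bookkeeping behind these rules is shown below. The helper names are hypothetical, and placing the `<!-- REMOVED -->` marker at the end of the table's `####` heading is an assumption; the rules above only say that removed tables are marked with that comment.

```python
import re


def html_comments(markdown: str) -> list:
    """Collect HTML comments so they can be carried over to the regenerated file."""
    return re.findall(r"<!--.*?-->", markdown, flags=re.DOTALL)


def diff_tables(existing: set, discovered: set) -> dict:
    """Compare table names in the old file with freshly discovered ones."""
    return {
        "added": sorted(discovered - existing),
        "removed": sorted(existing - discovered),
        "kept": sorted(existing & discovered),
    }


def mark_removed(markdown: str, removed: list) -> str:
    """Append <!-- REMOVED --> to the '#### TABLE' heading of each removed table."""
    out = []
    for line in markdown.splitlines():
        name = line[5:].strip() if line.startswith("#### ") else None
        if name in removed:
            line = f"{line} <!-- REMOVED -->"
        out.append(line)
    return "\n".join(out)


old_doc = "#### ORGANIZATIONS\n<!-- owned by growth team -->\n#### OLD_TABLE"
changes = diff_tables({"ORGANIZATIONS", "OLD_TABLE"}, {"ORGANIZATIONS", "NEW_TABLE"})
print(changes)
print(mark_removed(old_doc, changes["removed"]))
```

Because an already marked heading no longer matches a bare table name, running `mark_removed` twice does not add a second marker.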
## Cache Staleness & Schema Drift
The runtime cache has a **7-day TTL** by default. After 7 days, cached entries expire and will be re-discovered on next use.
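The expiry rule amounts to a timestamp comparison, sketched below. The `is_stale` helper is hypothetical and is not the cache's actual code; treating an entry as stale at exactly 7 days is an assumption, since the text only says entries expire after 7 days.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_TTL = timedelta(days=7)  # default TTL stated above


def is_stale(cached_at: datetime, now: Optional[datetime] = None,
             ttl: timedelta = DEFAULT_TTL) -> bool:
    """Return True once an entry cached at `cached_at` has reached its TTL."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - cached_at >= ttl


cached = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(is_stale(cached, now=datetime(2024, 1, 5, tzinfo=timezone.utc)))  # False
print(is_stale(cached, now=datetime(2024, 1, 9, tzinfo=timezone.utc)))  # True
```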
### When to Refresh
Run `/data:init --refresh` when:
- **Schema changes**: Tables added, renamed, or removed
- **Column changes**: New columns added or types changed
- **After deployments**: If your data pipeline deploys schema migrations
- **Weekly**: As a good practice, even if no known changes
### Signs of Stale Cache
Watch for these indicators:
- Queries fail with "table not found" errors
- Results seem wrong or outdated
- New tables aren't being discovered
### Manual Cache Reset
If you suspect cache issues:
```bash
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py cache status
uv run scripts/cli.py cache clear --stale-only
uv run scripts/cli.py cache clear
```
## Codebase Patterns Recognized
| Pattern | Source | What We Extract |
|---------|--------|-----------------|
| `**/models/**/*.yml` | dbt | table/column descriptions, tests |
| `**/dags/**/*.sql` | gusty | YAML frontmatter (description, primary_key) |
| `AGENTS.md`, `CLAUDE.md` | docs | data layer hierarchy, conventions |
| `**/docs/**/*.md` | docs | business context |
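For illustration, the patterns in this table can be applied with `pathlib` as sketched below. The `find_sources` helper and the demo file names are hypothetical, and treating `AGENTS.md` and `CLAUDE.md` as root-level files is an assumption.

```python
import tempfile
from pathlib import Path

# Glob patterns copied from the table above, grouped by source.
PATTERNS = {
    "dbt": ["**/models/**/*.yml"],
    "gusty": ["**/dags/**/*.sql"],
    "docs": ["AGENTS.md", "CLAUDE.md", "**/docs/**/*.md"],
}


def find_sources(root: Path) -> dict:
    """Return matching files per source, de-duplicated and sorted."""
    found = {}
    for source, patterns in PATTERNS.items():
        matches = set()
        for pattern in patterns:
            matches.update(path for path in root.glob(pattern) if path.is_file())
        found[source] = sorted(matches)
    return found


# Demo on a throwaway tree with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ["dbt/models/staging/schema.yml", "dags/declarative/orders.sql", "AGENTS.md"]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("", encoding="utf-8")
    for source, paths in find_sources(root).items():
        print(source, [p.relative_to(root).as_posix() for p in paths])
```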
## Example Session
```
User: /data:init
Agent:
→ Reading warehouse configuration...
→ Found 1 warehouse with databases: HQ, PRODUCT
→ Searching codebase for data documentation...
Found: AGENTS.md with data layer hierarchy
Found: 45 SQL files with YAML frontmatter in dags/declarative/
→ Launching parallel warehouse discovery...
[Database: HQ] Discovering schemas...
[Database: PRODUCT] Discovering schemas...
→ HQ: Found 29 schemas, 401 tables
→ PRODUCT: Found 1 schema, 0 tables
→ Merging warehouse metadata with code context...
Enriched 45 tables with descriptions from code
→ Generated .astro/warehouse.md
Summary:
- 2 databases
- 30 schemas
- 401 tables
- 45 tables enriched with code descriptions
- 8 large tables flagged (>100M rows)
Next steps:
1. Review .astro/warehouse.md
2. Add concept mappings to Quick Reference
3. Commit to version control
4. Run /data:init --refresh when schema changes
```