analyzing-data

Queries data warehouse and answers business questions about data. Handles questions requiring database/warehouse queries including "who uses X", "how many Y", "show me Z", "find customers", "what is the count", data lookups, metrics, trends, or SQL analysis.

16 stars

bydiegosouzapw

View on GitHub Installation ↓

Best use case

analyzing-data is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using analyzing-data should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/analyzing-data/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/analyzing-data/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/analyzing-data/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How analyzing-data Compares

Feature / Agent	analyzing-data	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Data Analysis

Answer business questions by querying the data warehouse. The kernel starts automatically on first use.

## Prerequisites

**uv must be installed:**
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Scripts are located relative to this skill file.

## MANDATORY FIRST STEP

**Before any other action, check for cached patterns:**

```bash
uv run scripts/cli.py pattern lookup "<user's question>"
```

This is NON-NEGOTIABLE. Patterns contain proven strategies that save time and avoid failed queries.

---

## Workflow

```
Analysis Progress:
- [ ] Step 1: pattern lookup (check for cached strategy)
- [ ] Step 2: concept lookup (check for known tables)
- [ ] Step 3: Search codebase for table definitions (Grep)
- [ ] Step 4: Read SQL file to get table/column names
- [ ] Step 5: Execute query via kernel (run_sql)
- [ ] Step 6: learn_concept (ALWAYS before presenting results)
- [ ] Step 7: learn_pattern (ALWAYS if discovery required)
- [ ] Step 8: record_pattern_outcome (if you used a pattern in Step 1)
- [ ] Step 9: Present findings to user
```

---

## CLI Commands

### Kernel Management

```bash
uv run scripts/cli.py start           # Start kernel with Snowflake
uv run scripts/cli.py exec "..."      # Execute Python code
uv run scripts/cli.py status          # Check kernel status
uv run scripts/cli.py restart         # Restart kernel
uv run scripts/cli.py stop            # Stop kernel
uv run scripts/cli.py install plotly  # Install additional packages
```

### Concept Cache (concept -> table mappings)

```bash
# Look up a concept
uv run scripts/cli.py concept lookup customers

# Learn a new concept
uv run scripts/cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID

# List all concepts
uv run scripts/cli.py concept list

# Import concepts from warehouse.md
uv run scripts/cli.py concept import -p /path/to/warehouse.md
```

### Pattern Cache (query strategies)

```bash
# Look up patterns for a question
uv run scripts/cli.py pattern lookup "who uses operator X"

# Learn a new pattern
uv run scripts/cli.py pattern learn operator_usage \
    -q "who uses X operator" \
    -q "which customers use X" \
    -s "1. Query TASK_RUNS for operator_class" \
    -s "2. Join with ORGS on org_id" \
    -t "HQ.MODEL_ASTRO.TASK_RUNS" \
    -t "HQ.MODEL_ASTRO.ORGANIZATIONS" \
    -g "TASK_RUNS is huge - always filter by date"

# Record pattern outcome
uv run scripts/cli.py pattern record operator_usage --success

# List all patterns
uv run scripts/cli.py pattern list

# Delete a pattern
uv run scripts/cli.py pattern delete operator_usage
```

### Table Schema Cache

```bash
# Look up cached table schema
uv run scripts/cli.py table lookup HQ.MART_CUST.CURRENT_ASTRO_CUSTS

# Cache a table schema
uv run scripts/cli.py table cache DB.SCHEMA.TABLE -c '[{"name":"id","type":"INT"}]'

# List all cached tables
uv run scripts/cli.py table list

# Delete from cache
uv run scripts/cli.py table delete DB.SCHEMA.TABLE
```

### Cache Management

```bash
# View cache statistics
uv run scripts/cli.py cache status

# Clear all caches
uv run scripts/cli.py cache clear

# Clear only stale entries (older than 90 days)
uv run scripts/cli.py cache clear --stale-only
```

---

## Quick Start Example

```bash
# 1. Check for existing patterns
uv run scripts/cli.py pattern lookup "how many customers"

# 2. Check for known concepts
uv run scripts/cli.py concept lookup customers

# 3. Execute query
uv run scripts/cli.py exec "df = run_sql('SELECT COUNT(*) FROM HQ.MART_CUST.CURRENT_ASTRO_CUSTS')"
uv run scripts/cli.py exec "print(df)"

# 4. Cache what we learned
uv run scripts/cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
```

---

## Available Functions in Kernel

Once kernel starts, these are available:

| Function | Description |
|----------|-------------|
| `run_sql(query, limit=100)` | Execute SQL, return Polars DataFrame |
| `run_sql_pandas(query, limit=100)` | Execute SQL, return Pandas DataFrame |
| `pl` | Polars library (imported) |
| `pd` | Pandas library (imported) |

---

## Table Discovery via Codebase

If concept/pattern cache miss, search the codebase:

```
Grep pattern="<concept>" glob="**/*.sql"
```

| Repo Type | Where to Look |
|-----------|---------------|
| **Gusty** | `dags/declarative/04_metric/`, `06_reporting/`, `05_mart/` |
| **dbt** | `models/marts/`, `models/staging/` |

---

## Known Tables Quick Reference

| Concept | Table | Key Column | Date Column |
|---------|-------|------------|-------------|
| customers | HQ.MART_CUST.CURRENT_ASTRO_CUSTS | ACCT_ID | - |
| organizations | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_TS |
| deployments | HQ.MODEL_ASTRO.DEPLOYMENTS | DEPLOYMENT_ID | CREATED_TS |
| task_runs | HQ.MODEL_ASTRO.TASK_RUNS | - | START_TS |
| dag_runs | HQ.MODEL_ASTRO.DAG_RUNS | - | START_TS |
| users | HQ.MODEL_ASTRO.USERS | USER_ID | - |
| accounts | HQ.MODEL_CRM.SF_ACCOUNTS | ACCT_ID | - |

**Large tables (always filter by date):** TASK_RUNS (6B rows), DAG_RUNS (500M rows)

---

## Query Tips

- Use LIMIT during exploration
- Filter early with WHERE clauses
- Prefer pre-aggregated tables (`METRICS_*`, `MART_*`, `AGG_*`)
- For 100M+ row tables: no JOINs or GROUP BY on first query

---

## Reference

- [reference/discovery-warehouse.md](reference/discovery-warehouse.md) - Large table handling, warehouse discovery

Related Skills

large-data-with-dask

from diegosouzapw/awesome-omni-skill

Specific optimization strategies for Python scripts working with larger-than-memory datasets via Dask.

ipdata-co-automation

from diegosouzapw/awesome-omni-skill

Automate Ipdata co tasks via Rube MCP (Composio). Always search tools first for current schemas.

gdpr-data-handling

from diegosouzapw/awesome-omni-skill

Implement GDPR-compliant data handling with consent management, data subject rights, and privacy by design. Use when building systems that process EU personal data, implementing privacy controls, o...

fair-data-model-assessment

from diegosouzapw/awesome-omni-skill

Assess data models against FAIR principles using RDA-FDMM indicators. Use when: (1) Evaluating vendor-delivered data models for FAIR compliance, (2) Reviewing schemas, ontologies, or data dictionaries before integration, (3) Creating FAIR assessment reports for data governance reviews, (4) Preparing data model documentation for enterprise or regulatory standards, (5) Auditing existing data assets for FAIRness gaps. Covers 41 RDA indicators across Findable, Accessible, Interoperable, Reusable dimensions with maturity scoring (0-4 scale).

docker-database

from diegosouzapw/awesome-omni-skill

Configure database containers with security, persistence, and health checks

datarobot-automation

from diegosouzapw/awesome-omni-skill

Automate Datarobot tasks via Rube MCP (Composio). Always search tools first for current schemas.

dataql-analysis

from diegosouzapw/awesome-omni-skill

Analyze data files using SQL queries with DataQL. Use when working with CSV, JSON, Parquet, Excel files or when the user mentions data analysis, filtering, aggregation, or SQL queries on files.

datahub-connector-pr-review

from diegosouzapw/awesome-omni-skill

This skill should be used when the user asks to "review my connector", "check my datahub connector", "review connector code", "audit connector", "review PR", "check code quality", or any request to review/check/audit a DataHub ingestion source. Covers compliance with standards, best practices, testing quality, and merge readiness.

datagma-automation

from diegosouzapw/awesome-omni-skill

Automate Datagma tasks via Rube MCP (Composio). Always search tools first for current schemas.

Database Sync

from diegosouzapw/awesome-omni-skill

Automate database synchronization, replication, migration, and cross-platform data integration

database-skill

from diegosouzapw/awesome-omni-skill

Design and manage relational databases including table creation, migrations, and schema design. Use for database modeling and maintenance.

database-architect

from diegosouzapw/awesome-omni-skill

Database design and optimization specialist. Schema design, query optimization, indexing strategies, data modeling, and migration planning for relational and NoSQL databases.