databricks-reference-architecture

Implement Databricks reference architecture with best-practice project layout. Use when designing new Databricks projects, reviewing architecture, or establishing standards for Databricks applications. Trigger with phrases like "databricks architecture", "databricks best practices", "databricks project structure", "how to organize databricks", "databricks layout".

Best use case

databricks-reference-architecture is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using databricks-reference-architecture should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/databricks-reference-architecture/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/jeremylongshore/claude-code-plugins-plus-skills/databricks-reference-architecture/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/databricks-reference-architecture/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How databricks-reference-architecture Compares

| Feature / Agent | databricks-reference-architecture | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Implement Databricks reference architecture with best-practice project layout. Use when designing new Databricks projects, reviewing architecture, or establishing standards for Databricks applications. Trigger with phrases like "databricks architecture", "databricks best practices", "databricks project structure", "how to organize databricks", "databricks layout".

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Databricks Reference Architecture

## Overview
Production-ready lakehouse architecture with Unity Catalog, Delta Lake, and the medallion pattern. Covers workspace organization, three-level namespace governance, compute strategy, CI/CD with Asset Bundles, and project structure for team collaboration.

## Prerequisites
- Databricks workspace with Unity Catalog enabled
- Understanding of medallion architecture (bronze/silver/gold)
- Databricks CLI configured
- Terraform or Asset Bundles for infrastructure

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                    UNITY CATALOG                                  │
│                                                                   │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌───────────┐  │
│  │  Bronze    │  │  Silver    │  │   Gold     │  │ ML Models │  │
│  │  Catalog   │─▶│  Catalog   │─▶│  Catalog   │  │ (MLflow)  │  │
│  │  (raw)     │  │  (clean)   │  │  (curated) │  │           │  │
│  └────────────┘  └────────────┘  └────────────┘  └───────────┘  │
│       ▲                                    │                      │
│  ┌────────────┐                   ┌────────────────┐             │
│  │ Auto Loader│                   │ Model Serving  │             │
│  │ Ingestion  │                   │ Endpoints      │             │
│  └────────────┘                   └────────────────┘             │
├─────────────────────────────────────────────────────────────────┤
│  Compute: Job Clusters │ SQL Warehouses │ Instance Pools        │
├─────────────────────────────────────────────────────────────────┤
│  Security: Row Filters │ Column Masks │ Secret Scopes │ SCIM   │
├─────────────────────────────────────────────────────────────────┤
│  CI/CD: Asset Bundles │ GitHub Actions │ dev/staging/prod       │
└─────────────────────────────────────────────────────────────────┘
```

## Project Structure

```
databricks-platform/
├── src/
│   ├── ingestion/
│   │   ├── bronze_raw_events.py       # Auto Loader streaming
│   │   ├── bronze_api_data.py         # REST API batch ingestion
│   │   └── bronze_file_uploads.py     # Manual file uploads
│   ├── transformation/
│   │   ├── silver_clean_events.py     # Cleansing + dedup
│   │   ├── silver_schema_enforce.py   # Schema validation
│   │   └── silver_scd2.py            # Slowly changing dimensions
│   ├── aggregation/
│   │   ├── gold_daily_metrics.py      # Business KPIs
│   │   ├── gold_user_features.py      # ML feature engineering
│   │   └── gold_reporting.py          # BI-ready views
│   └── ml/
│       ├── training/
│       │   └── train_churn_model.py
│       └── inference/
│           └── batch_scoring.py
├── tests/
│   ├── conftest.py                    # Spark fixtures
│   ├── unit/                          # Local Spark tests
│   └── integration/                   # Databricks Connect tests
├── resources/
│   ├── etl_jobs.yml                   # ETL job definitions
│   ├── ml_jobs.yml                    # ML pipeline definitions
│   └── maintenance.yml                # OPTIMIZE/VACUUM schedules
├── databricks.yml                     # Asset Bundle root config
├── pyproject.toml
└── requirements.txt
```
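
The `tests/conftest.py` fixture referenced above can be a plain local Spark session, so unit tests run without a cluster. A minimal sketch (fixture name and settings are illustrative, not part of the bundle):

```python
# tests/conftest.py -- local Spark fixture for unit tests (illustrative sketch)
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """Session-scoped local Spark; no Databricks cluster required."""
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```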

## Instructions

### Step 1: Unity Catalog Hierarchy
```sql
-- One catalog per environment (or shared with schema isolation)
CREATE CATALOG IF NOT EXISTS dev_catalog;
CREATE CATALOG IF NOT EXISTS prod_catalog;

-- Medallion schemas per catalog
CREATE SCHEMA IF NOT EXISTS prod_catalog.bronze;
CREATE SCHEMA IF NOT EXISTS prod_catalog.silver;
CREATE SCHEMA IF NOT EXISTS prod_catalog.gold;
CREATE SCHEMA IF NOT EXISTS prod_catalog.ml_features;
CREATE SCHEMA IF NOT EXISTS prod_catalog.ml_models;

-- Permissions: engineers write bronze/silver, analysts read gold
-- (Unity Catalog privileges are USE CATALOG / USE SCHEMA / CREATE TABLE,
--  not the legacy Hive metastore USAGE/CREATE)
GRANT USE CATALOG ON CATALOG prod_catalog TO `data-engineers`;
GRANT USE SCHEMA, CREATE TABLE, MODIFY, SELECT ON SCHEMA prod_catalog.bronze TO `data-engineers`;
GRANT USE SCHEMA, CREATE TABLE, MODIFY, SELECT ON SCHEMA prod_catalog.silver TO `data-engineers`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod_catalog.gold TO `data-engineers`;

GRANT USE CATALOG ON CATALOG prod_catalog TO `data-analysts`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod_catalog.gold TO `data-analysts`;
```
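
To confirm the grants landed as intended, inspect them directly (both statements are standard Unity Catalog SQL):

```sql
-- Effective grants on the curated layer, and analyst grants on the catalog
SHOW GRANTS ON SCHEMA prod_catalog.gold;
SHOW GRANTS `data-analysts` ON CATALOG prod_catalog;
```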

### Step 2: Asset Bundle Configuration
```yaml
# databricks.yml
bundle:
  name: data-platform

workspace:
  host: ${DATABRICKS_HOST}

include:
  - resources/*.yml

variables:
  catalog:
    default: dev_catalog
  alert_email:
    default: dev@company.com

targets:
  dev:
    default: true
    mode: development
    workspace:
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/dev

  staging:
    variables:
      catalog: staging_catalog

  prod:
    mode: production
    variables:
      catalog: prod_catalog
      alert_email: oncall@company.com
    workspace:
      root_path: /Shared/.bundle/${bundle.name}/prod
```
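
With the targets defined, the bundle lifecycle is validate, deploy, run. A typical sequence via the Databricks CLI (`daily_etl` is the job resource key defined in Step 3):

```bash
# Check config, deploy to dev, then trigger the ETL job by its resource key
databricks bundle validate
databricks bundle deploy -t dev
databricks bundle run daily_etl -t dev

# Promote by switching targets; prod picks up the variable overrides
databricks bundle deploy -t prod
```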

### Step 3: Compute Strategy
```yaml
# resources/etl_jobs.yml
resources:
  jobs:
    daily_etl:
      name: "daily-etl-${bundle.target}"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
      max_concurrent_runs: 1

      tasks:
        - task_key: bronze
          notebook_task:
            notebook_path: ../src/ingestion/bronze_raw_events.py
          job_cluster_key: etl

        - task_key: silver
          depends_on: [{task_key: bronze}]
          notebook_task:
            notebook_path: ../src/transformation/silver_clean_events.py
          job_cluster_key: etl

        - task_key: gold
          depends_on: [{task_key: silver}]
          notebook_task:
            notebook_path: ../src/aggregation/gold_daily_metrics.py
          job_cluster_key: etl

      job_clusters:
        - job_cluster_key: etl
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "i3.xlarge"
            autoscale:
              min_workers: 1
              max_workers: 4
            aws_attributes:
              availability: SPOT_WITH_FALLBACK
              first_on_demand: 1
            spark_conf:
              spark.databricks.delta.optimizeWrite.enabled: "true"
              spark.databricks.delta.autoCompact.enabled: "true"
```

### Step 4: Medallion Pipeline Pattern
```python
# src/ingestion/bronze_raw_events.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name

spark = SparkSession.builder.getOrCreate()

# Bronze: Auto Loader for incremental file ingestion
raw = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/bronze/events/schema")
    .option("cloudFiles.inferColumnTypes", "true")
    .load("s3://data-lake/raw/events/")
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
)

(raw.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/bronze/events/data")
    .toTable("prod_catalog.bronze.raw_events"))
```
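
Silver then reads bronze incrementally and upserts with a Delta MERGE, completing the Auto Loader → MERGE → aggregations flow described under Output. A hedged sketch of `src/transformation/silver_clean_events.py`, assuming the silver table already exists and `event_id` is the dedup key (the real key depends on your source):

```python
# src/transformation/silver_clean_events.py (sketch; event_id key is an assumption)
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

def upsert_to_silver(batch_df, batch_id):
    # Dedup within the micro-batch, then MERGE on the business key
    deduped = batch_df.dropDuplicates(["event_id"])
    target = DeltaTable.forName(spark, "prod_catalog.silver.events")
    (target.alias("t")
        .merge(deduped.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .table("prod_catalog.bronze.raw_events")
    .writeStream
    .foreachBatch(upsert_to_silver)
    .option("checkpointLocation", "/checkpoints/silver/events")
    .start())
```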

### Step 5: Table Maintenance Schedule
```yaml
# resources/maintenance.yml
resources:
  jobs:
    weekly_optimize:
      name: "maintenance-optimize-${bundle.target}"
      schedule:
        quartz_cron_expression: "0 0 2 ? * SUN"
        timezone_id: "UTC"
      tasks:
        - task_key: optimize_tables
          notebook_task:
            notebook_path: ../src/maintenance/optimize_tables.py
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "m5.xlarge"
            num_workers: 1
```

```python
# src/maintenance/optimize_tables.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

tables_to_optimize = [
    ("prod_catalog.silver.orders", ["order_date", "region"]),
    ("prod_catalog.silver.events", ["event_date"]),
    ("prod_catalog.gold.daily_metrics", []),
]

for table, z_cols in tables_to_optimize:
    if z_cols:
        spark.sql(f"OPTIMIZE {table} ZORDER BY ({', '.join(z_cols)})")
    else:
        spark.sql(f"OPTIMIZE {table}")
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")
    print(f"Maintained: {table}")
```

## Output
- Unity Catalog hierarchy with env-isolated catalogs and medallion schemas
- Asset Bundle with dev/staging/prod targets and variable overrides
- Medallion pipeline (Auto Loader → MERGE → aggregations)
- RBAC grants separating engineer write from analyst read-only
- Table maintenance schedule (weekly OPTIMIZE + VACUUM)

## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Schema evolution failure | New source columns | Set `cloudFiles.schemaEvolutionMode` (e.g. `addNewColumns`) |
| Permission denied on schema | Missing `USE CATALOG` on parent catalog | `GRANT USE CATALOG ON CATALOG` first |
| Concurrent write conflict | Multiple jobs writing same table | `max_concurrent_runs: 1` in job config |
| Cluster timeout | Long-running tasks | Set `timeout_seconds` per task |
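
For the schema-evolution row, the relevant Auto Loader option is `cloudFiles.schemaEvolutionMode`; for example, `addNewColumns` evolves the stream's schema when new source columns appear rather than failing permanently. In the bronze reader from Step 4, it fits like this:

```python
# Evolve the schema when new source columns appear instead of failing the stream
raw = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/bronze/events/schema")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("s3://data-lake/raw/events/")
)
```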

## Examples

### Validate Data Flow
```sql
SELECT 'bronze' AS layer, COUNT(*) AS rows FROM prod_catalog.bronze.raw_events
UNION ALL SELECT 'silver', COUNT(*) FROM prod_catalog.silver.events
UNION ALL SELECT 'gold', COUNT(*) FROM prod_catalog.gold.daily_metrics;
```
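
### Inspect Ingestion History
Delta's table history confirms the stream is appending as expected:

```sql
-- Recent commits: operation, timestamp, and write metrics per commit
DESCRIBE HISTORY prod_catalog.bronze.raw_events LIMIT 10;
```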

## Resources
- [Unity Catalog Best Practices](https://docs.databricks.com/aws/en/data-governance/unity-catalog/best-practices)
- [Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)
- [Databricks Asset Bundles](https://docs.databricks.com/aws/en/dev-tools/bundles/)
- [Auto Loader](https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/)

Related Skills

exa-reference-architecture

from ComeOnOliver/skillshub

Implement Exa reference architecture for search pipelines, RAG, and content discovery. Use when designing new Exa integrations, reviewing project structure, or establishing architecture standards for neural search applications. Trigger with phrases like "exa architecture", "exa project structure", "exa RAG pipeline", "exa reference design", "exa search pipeline".

exa-architecture-variants

from ComeOnOliver/skillshub

Choose and implement Exa architecture patterns at different scales: direct search, cached search, and RAG pipeline. Use when designing Exa integrations, choosing between simple search and full RAG, or planning architecture for different traffic volumes. Trigger with phrases like "exa architecture", "exa blueprint", "how to structure exa", "exa RAG design", "exa at scale".

evernote-reference-architecture

from ComeOnOliver/skillshub

Reference architecture for Evernote integrations. Use when designing system architecture, planning integrations, or building scalable Evernote applications. Trigger with phrases like "evernote architecture", "design evernote system", "evernote integration pattern", "evernote scale".

elevenlabs-reference-architecture

from ComeOnOliver/skillshub

Implement ElevenLabs reference architecture for production TTS/voice applications. Use when designing new ElevenLabs integrations, reviewing project structure, or building a scalable audio generation service. Trigger: "elevenlabs architecture", "elevenlabs project structure", "how to organize elevenlabs", "TTS service architecture", "elevenlabs design patterns", "voice API architecture".

documenso-reference-architecture

from ComeOnOliver/skillshub

Implement Documenso reference architecture with best-practice project layout. Use when designing new Documenso integrations, reviewing project structure, or establishing architecture standards for document signing applications. Trigger with phrases like "documenso architecture", "documenso best practices", "documenso project structure", "how to organize documenso".

deepgram-reference-architecture

from ComeOnOliver/skillshub

Implement Deepgram reference architecture for scalable transcription systems. Use when designing transcription pipelines, building production architectures, or planning Deepgram integration at scale. Trigger: "deepgram architecture", "transcription pipeline", "deepgram system design", "deepgram at scale", "enterprise deepgram", "deepgram queue".

databricks-webhooks-events

from ComeOnOliver/skillshub

Configure Databricks job notifications, webhooks, and event handling. Use when setting up Slack/Teams notifications, configuring alerts, or integrating Databricks events with external systems. Trigger with phrases like "databricks webhook", "databricks notifications", "databricks alerts", "job failure notification", "databricks slack".

databricks-upgrade-migration

from ComeOnOliver/skillshub

Upgrade Databricks runtime versions and migrate between features. Use when upgrading DBR versions, migrating to Unity Catalog, or updating deprecated APIs and features. Trigger with phrases like "databricks upgrade", "DBR upgrade", "databricks migration", "unity catalog migration", "hive to unity".

databricks-security-basics

from ComeOnOliver/skillshub

Apply Databricks security best practices for secrets and access control. Use when securing API tokens, implementing least privilege access, or auditing Databricks security configuration. Trigger with phrases like "databricks security", "databricks secrets", "secure databricks", "databricks token security", "databricks scopes".

databricks-sdk-patterns

from ComeOnOliver/skillshub

Apply production-ready Databricks SDK patterns for Python and REST API. Use when implementing Databricks integrations, refactoring SDK usage, or establishing team coding standards for Databricks. Trigger with phrases like "databricks SDK patterns", "databricks best practices", "databricks code patterns", "idiomatic databricks".

databricks-rate-limits

from ComeOnOliver/skillshub

Implement Databricks API rate limiting, backoff, and idempotency patterns. Use when handling rate limit errors, implementing retry logic, or optimizing API request throughput for Databricks. Trigger with phrases like "databricks rate limit", "databricks throttling", "databricks 429", "databricks retry", "databricks backoff".

databricks-prod-checklist

from ComeOnOliver/skillshub

Execute Databricks production deployment checklist and rollback procedures. Use when deploying Databricks jobs to production, preparing for launch, or implementing go-live procedures. Trigger with phrases like "databricks production", "deploy databricks", "databricks go-live", "databricks launch checklist".