databricks-lakebase-provisioned

Patterns and best practices for Lakebase Provisioned (Databricks managed PostgreSQL) for OLTP workloads. Use when creating Lakebase instances, connecting applications or Databricks Apps to PostgreSQL, implementing reverse ETL via synced tables, storing agent or chat memory, or configuring OAuth authentication for Lakebase.

by msbaek

Best use case

databricks-lakebase-provisioned is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using databricks-lakebase-provisioned should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/databricks-lakebase-provisioned/SKILL.md --create-dirs "https://raw.githubusercontent.com/msbaek/dotfiles/main/.claude/skills/databricks-lakebase-provisioned/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/databricks-lakebase-provisioned/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How databricks-lakebase-provisioned Compares

Feature / Agent	databricks-lakebase-provisioned	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

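What does this skill do?

It provides patterns and best practices for Lakebase Provisioned (Databricks managed PostgreSQL) for OLTP workloads: creating instances, connecting applications or Databricks Apps to PostgreSQL, implementing reverse ETL via synced tables, storing agent or chat memory, and configuring OAuth authentication.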

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Lakebase Provisioned

Patterns and best practices for using Lakebase Provisioned (Databricks managed PostgreSQL) for OLTP workloads.

## When to Use

Use this skill when:
- Building applications that need a PostgreSQL database for transactional workloads
- Adding persistent state to Databricks Apps
- Implementing reverse ETL from Delta Lake to an operational database
- Storing chat/agent memory for LangChain applications

## Overview

Lakebase Provisioned is Databricks' managed PostgreSQL database service for OLTP (Online Transaction Processing) workloads. It provides a fully managed PostgreSQL-compatible database that integrates with Unity Catalog and supports OAuth token-based authentication.

| Feature | Description |
|---------|-------------|
| **Managed PostgreSQL** | Fully managed instances with automatic provisioning |
| **OAuth Authentication** | Token-based auth via Databricks SDK (1-hour expiry) |
| **Unity Catalog** | Register databases for governance |
| **Reverse ETL** | Sync data from Delta tables to PostgreSQL |
| **Apps Integration** | First-class support in Databricks Apps |

**Available Regions (AWS):** us-east-1, us-east-2, us-west-2, eu-central-1, eu-west-1, ap-south-1, ap-southeast-1, ap-southeast-2

## Quick Start

Create and connect to a Lakebase Provisioned instance:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.database import DatabaseInstance

# Initialize client
w = WorkspaceClient()

# Create a database instance (the SDK call takes a DatabaseInstance object)
instance = w.database.create_database_instance(
    DatabaseInstance(
        name="my-lakebase-instance",
        capacity="CU_1",  # CU_1, CU_2, CU_4, CU_8
        stopped=False,
    )
)
print(f"Instance created: {instance.name}")
print(f"DNS endpoint: {instance.read_write_dns}")
```
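
A newly created instance may not accept connections right away. The following is a minimal sketch of a polling helper, not part of the SDK: `wait_until_available` is a hypothetical name, the `"AVAILABLE"` state string is an assumption to verify against your SDK version, and `get_state` is any callable you supply (for example, one that wraps `w.database.get_database_instance(name=...)` and returns its `state` as a string).

```python
import time
from typing import Callable


def wait_until_available(
    get_state: Callable[[], str],
    ready_state: str = "AVAILABLE",  # assumed state name; verify for your SDK version
    timeout_s: float = 600.0,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns ready_state or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == ready_state:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"instance still in state {state!r} after {timeout_s}s")
        sleep(poll_interval_s)
```

Injecting `get_state` and `sleep` keeps the helper independent of the SDK, so it can be exercised without a workspace.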

## Common Patterns

### Generate OAuth Token

```python
from databricks.sdk import WorkspaceClient
import uuid

w = WorkspaceClient()

# Generate OAuth token for database connection
cred = w.database.generate_database_credential(
    request_id=str(uuid.uuid4()),
    instance_names=["my-lakebase-instance"]
)
token = cred.token  # Use this as password in connection string
```
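
Because tokens expire after about an hour, code that connects repeatedly may want to reuse a token until it is close to expiry instead of generating one per connection. The sketch below shows one way to do that under stated assumptions: `TokenCache` is a hypothetical helper, `fetch` stands in for a function that calls `generate_database_credential` and returns `cred.token`, and the 10-minute refresh margin is an arbitrary choice.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Cache a short-lived token and regenerate it shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_s: float = 3600.0,            # 1-hour lifetime, as stated in this document
        refresh_margin_s: float = 600.0,  # regenerate 10 minutes early (arbitrary)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._margin_s = refresh_margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one if missing or near expiry."""
        age = self._clock() - self._fetched_at
        if self._token is None or age >= self._ttl_s - self._margin_s:
            self._token = self._fetch()
            self._fetched_at = self._clock()
        return self._token
```

Call `cache.get()` wherever a password is needed; the underlying fetch only runs when the cached token is absent or older than 50 minutes.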

### Connect from Notebook

```python
import psycopg
from databricks.sdk import WorkspaceClient
import uuid

# Get instance details
w = WorkspaceClient()
instance = w.database.get_database_instance(name="my-lakebase-instance")

# Generate token
cred = w.database.generate_database_credential(
    request_id=str(uuid.uuid4()),
    instance_names=["my-lakebase-instance"]
)

# Connect using psycopg3
conn_string = f"host={instance.read_write_dns} dbname=postgres user={w.current_user.me().user_name} password={cred.token} sslmode=require"
with psycopg.connect(conn_string) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT version()")
        print(cur.fetchone())
```
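
The connection string above interpolates values directly into a conninfo string, which requires quoting if a value ever contains spaces or special characters. Since `psycopg.connect` also accepts keyword arguments, one alternative is to build a parameter dictionary. This is a sketch only: `lakebase_connect_kwargs` is a hypothetical helper name, and port 5432 is taken from the SQLAlchemy URL used elsewhere in this document.

```python
from typing import Dict


def lakebase_connect_kwargs(
    host: str, user: str, token: str, dbname: str = "postgres"
) -> Dict[str, str]:
    """Build keyword arguments suitable for psycopg.connect(**kwargs)."""
    if not token:
        raise ValueError("token must be a non-empty OAuth token")
    return {
        "host": host,
        "port": "5432",
        "dbname": dbname,
        "user": user,
        "password": token,     # the OAuth token is used as the password
        "sslmode": "require",  # SSL is required for Lakebase connections
    }
```

Usage would look like `psycopg.connect(**lakebase_connect_kwargs(instance.read_write_dns, user_name, cred.token))`.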

### SQLAlchemy with Token Refresh (Production)

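Tokens expire after 1 hour, so long-running applications must refresh them. The snippet defines `_token_refresh_loop` but does not start it; schedule it once the event loop is running, for example with `asyncio.create_task(_token_refresh_loop(instance_name))`: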

```python
import asyncio
import os
import uuid
from sqlalchemy import event
from sqlalchemy.ext.asyncio import AsyncEngine, AsyncSession, create_async_engine
from sqlalchemy.orm import sessionmaker
from databricks.sdk import WorkspaceClient

# Token refresh state
_current_token = None
_token_refresh_task = None
TOKEN_REFRESH_INTERVAL = 50 * 60  # 50 minutes (before 1-hour expiry)

def _generate_token(instance_name: str) -> str:
    """Generate fresh OAuth token."""
    w = WorkspaceClient()
    cred = w.database.generate_database_credential(
        request_id=str(uuid.uuid4()),
        instance_names=[instance_name]
    )
    return cred.token

async def _token_refresh_loop(instance_name: str):
    """Background task to refresh token every 50 minutes."""
    global _current_token
    while True:
        await asyncio.sleep(TOKEN_REFRESH_INTERVAL)
        _current_token = await asyncio.to_thread(_generate_token, instance_name)

def init_database(instance_name: str, database_name: str, username: str) -> AsyncEngine:
    """Initialize database with OAuth token injection."""
    global _current_token

    w = WorkspaceClient()
    instance = w.database.get_database_instance(name=instance_name)

    # Generate initial token
    _current_token = _generate_token(instance_name)

    # Build URL (password injected via do_connect)
    url = f"postgresql+psycopg://{username}@{instance.read_write_dns}:5432/{database_name}"

    engine = create_async_engine(
        url,
        pool_size=5,
        max_overflow=10,
        pool_recycle=3600,
        connect_args={"sslmode": "require"}
    )

    # Inject token on each connection
    @event.listens_for(engine.sync_engine, "do_connect")
    def provide_token(dialect, conn_rec, cargs, cparams):
        cparams["password"] = _current_token

    return engine
```
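
The module-level globals above can also be packaged into a small object with an explicit start and stop, which makes the lifetime of the background task easier to manage. The following is a sketch under assumptions, not the author's implementation: `TokenRefresher` is a hypothetical class, and `fetch` stands in for a blocking function such as `_generate_token` bound to an instance name.

```python
import asyncio
from typing import Callable, Optional


class TokenRefresher:
    """Hold the current token and refresh it periodically in a background task."""

    def __init__(self, fetch: Callable[[], str], interval_s: float = 50 * 60) -> None:
        self._fetch = fetch  # blocking callable that returns a fresh token
        self._interval_s = interval_s
        self._task: Optional[asyncio.Task] = None
        self.token: Optional[str] = None

    async def start(self) -> None:
        """Fetch the first token, then schedule periodic refreshes."""
        self.token = await asyncio.to_thread(self._fetch)
        self._task = asyncio.create_task(self._loop())

    async def _loop(self) -> None:
        while True:
            await asyncio.sleep(self._interval_s)
            try:
                self.token = await asyncio.to_thread(self._fetch)
            except Exception:
                # Keep the previous token; the next iteration retries.
                pass

    async def stop(self) -> None:
        """Cancel the background task and wait for it to finish."""
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self._task = None
```

A `do_connect` listener could then set `cparams["password"] = refresher.token`, with `start()` called at application startup and `stop()` at shutdown.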

### Databricks Apps Integration

For Databricks Apps, use environment variables for configuration:

```python
# Environment variables set by Databricks Apps:
# - LAKEBASE_INSTANCE_NAME: Instance name
# - LAKEBASE_DATABASE_NAME: Database name
# - LAKEBASE_USERNAME: Username (optional, defaults to service principal)
# - LAKEBASE_PG_URL: Connection URL (optional; checked first by is_lakebase_configured)

import os

def is_lakebase_configured() -> bool:
    """Check if Lakebase is configured for this app."""
    return bool(
        os.environ.get("LAKEBASE_PG_URL") or
        (os.environ.get("LAKEBASE_INSTANCE_NAME") and
         os.environ.get("LAKEBASE_DATABASE_NAME"))
    )
```
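
Beyond checking whether the variables exist, an app usually needs their values in one place. The sketch below reads the three variables named in the comment above into a small settings object; `LakebaseSettings` and `load_lakebase_settings` are hypothetical names introduced for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LakebaseSettings:
    instance_name: str
    database_name: str
    username: Optional[str] = None  # None means "use the app's default identity"


def load_lakebase_settings(
    env: Mapping[str, str] = os.environ,
) -> Optional[LakebaseSettings]:
    """Return settings built from the environment, or None if not configured."""
    instance = env.get("LAKEBASE_INSTANCE_NAME")
    database = env.get("LAKEBASE_DATABASE_NAME")
    if not (instance and database):
        return None
    return LakebaseSettings(instance, database, env.get("LAKEBASE_USERNAME") or None)
```

Passing `env` explicitly makes the loader easy to test with a plain dictionary instead of the real process environment.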

Add Lakebase as an app resource via CLI:

```bash
databricks apps add-resource $APP_NAME \
    --resource-type database \
    --resource-name lakebase \
    --database-instance my-lakebase-instance
```

### Register with Unity Catalog

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Register database in Unity Catalog
w.database.register_database_instance(
    name="my-lakebase-instance",
    catalog="my_catalog",
    schema="my_schema"
)
```

### MLflow Model Resources

Declare Lakebase as a model resource for automatic credential provisioning:

```python
import mlflow
from mlflow.models.resources import DatabricksLakebase

resources = [
    DatabricksLakebase(database_instance_name="my-lakebase-instance"),
]

# When logging model
mlflow.langchain.log_model(
    model,
    artifact_path="model",
    resources=resources,
    pip_requirements=["databricks-langchain[memory]"]
)
```

## MCP Tools

The following MCP tools are available for managing Lakebase infrastructure. Use `type="provisioned"` for Lakebase Provisioned.

### manage_lakebase_database - Database Management

| Action | Description | Required Params |
|--------|-------------|-----------------|
| `create_or_update` | Create or update a database | name |
| `get` | Get database details | name |
| `list` | List all databases | (none, optional type filter) |
| `delete` | Delete database and resources | name |

**Example usage:**
```python
# Create a provisioned database
manage_lakebase_database(
    action="create_or_update",
    name="my-lakebase-instance",
    type="provisioned",
    capacity="CU_1"
)

# Get database details
manage_lakebase_database(action="get", name="my-lakebase-instance", type="provisioned")

# List all databases
manage_lakebase_database(action="list")

# Delete with cascade
manage_lakebase_database(action="delete", name="my-lakebase-instance", type="provisioned", force=True)
```

### manage_lakebase_sync - Reverse ETL

| Action | Description | Required Params |
|--------|-------------|-----------------|
| `create_or_update` | Set up reverse ETL from Delta to Lakebase | instance_name, source_table_name, target_table_name |
| `delete` | Remove synced table (and optionally catalog) | table_name |

**Example usage:**
```python
# Set up reverse ETL
manage_lakebase_sync(
    action="create_or_update",
    instance_name="my-lakebase-instance",
    source_table_name="catalog.schema.delta_table",
    target_table_name="lakebase_catalog.schema.postgres_table",
    scheduling_policy="TRIGGERED"  # or SNAPSHOT, CONTINUOUS
)

# Delete synced table
manage_lakebase_sync(action="delete", table_name="lakebase_catalog.schema.postgres_table")
```

### generate_lakebase_credential - OAuth Tokens

Generates an OAuth token (valid for about 1 hour) for PostgreSQL connections. Use it as the password, with `sslmode=require`.

```python
# For provisioned instances
generate_lakebase_credential(instance_names=["my-lakebase-instance"])
```

## Reference Files

- [connection-patterns.md](connection-patterns.md) - Detailed connection patterns for different use cases
- [reverse-etl.md](reverse-etl.md) - Syncing data from Delta Lake to Lakebase

## CLI Quick Reference

```bash
# Create instance
databricks database create-database-instance \
    --name my-lakebase-instance \
    --capacity CU_1

# Get instance details
databricks database get-database-instance --name my-lakebase-instance

# Generate credentials
databricks database generate-database-credential \
    --request-id $(uuidgen) \
    --json '{"instance_names": ["my-lakebase-instance"]}'

# List instances
databricks database list-database-instances

# Stop instance (saves cost)
databricks database stop-database-instance --name my-lakebase-instance

# Start instance
databricks database start-database-instance --name my-lakebase-instance
```

## Common Issues

| Issue | Solution |
|-------|----------|
| **Token expired during long query** | Implement token refresh loop (see SQLAlchemy with Token Refresh section); tokens expire after 1 hour |
| **DNS resolution fails on macOS** | Resolve the hostname with `dig`, then pass the resulting IP address to psycopg as `hostaddr` |
| **Connection refused** | Ensure instance is not stopped; check `instance.state` |
| **Permission denied** | User must be granted access to the Lakebase instance |
| **SSL required error** | Always use `sslmode=require` in connection string |
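
The macOS DNS workaround in the table can be sketched as follows. This is an illustration under assumptions, not a tested recipe for every environment: `first_ip_from_dig` and `resolve_with_dig` are hypothetical helper names, and the code assumes the `dig` binary is on the `PATH`.

```python
import ipaddress
import subprocess
from typing import Optional


def first_ip_from_dig(output: str) -> Optional[str]:
    """Return the first IP address in `dig +short` output, skipping CNAME lines."""
    for line in output.splitlines():
        candidate = line.strip()
        try:
            ipaddress.ip_address(candidate)
        except ValueError:
            continue  # not an IP (for example a CNAME target ending in ".")
        return candidate
    return None


def resolve_with_dig(hostname: str) -> Optional[str]:
    """Run `dig +short <hostname>` and return the first IP address it prints."""
    result = subprocess.run(
        ["dig", "+short", hostname],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return first_ip_from_dig(result.stdout)
```

The returned address can then be passed as `hostaddr` alongside the original `host`, for example `psycopg.connect(host=hostname, hostaddr=ip, ...)`, so the hostname is still sent to the server while name resolution is bypassed.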

## SDK Version Requirements

- **Databricks SDK for Python**: >= 0.61.0 (0.81.0+ recommended for full API support)
- **psycopg**: 3.x (supports `hostaddr` parameter for DNS workaround)
- **SQLAlchemy**: 2.x with `postgresql+psycopg` driver

```python
%pip install -U "databricks-sdk>=0.81.0" "psycopg[binary]>=3.0" sqlalchemy
```

## Notes

- **Capacity values** use compute unit sizing: `CU_1`, `CU_2`, `CU_4`, `CU_8`.
- **Lakebase Autoscaling** is a newer offering with automatic scaling but limited regional availability. This skill focuses on **Lakebase Provisioned** which is more widely available.
- For memory/state in LangChain agents, use `databricks-langchain[memory]` which includes Lakebase support.
- Tokens are short-lived (1 hour) - production apps MUST implement token refresh.

## Related Skills

- **[databricks-app-apx](../databricks-app-apx/SKILL.md)** - full-stack apps that can use Lakebase for persistence
- **[databricks-app-python](../databricks-app-python/SKILL.md)** - Python apps with Lakebase backend
- **[databricks-python-sdk](../databricks-python-sdk/SKILL.md)** - SDK used for instance management and token generation
- **[databricks-bundles](../databricks-bundles/SKILL.md)** - deploying apps with Lakebase resources
- **[databricks-jobs](../databricks-jobs/SKILL.md)** - scheduling reverse ETL sync jobs

Related Skills

databricks-zerobus-ingest

from msbaek/dotfiles

Build Zerobus Ingest clients for near real-time data ingestion into Databricks Delta tables via gRPC. Use when creating producers that write directly to Unity Catalog tables without a message bus, working with the Zerobus Ingest SDK in Python/Java/Go/TypeScript/Rust, generating Protobuf schemas from UC tables, or implementing stream-based ingestion with ACK handling and retry logic.

databricks-vector-search

from msbaek/dotfiles

Patterns for Databricks Vector Search: create endpoints and indexes, query with filters, manage embeddings. Use when building RAG applications, semantic search, or similarity matching. Covers both storage-optimized and standard endpoints.

databricks-unstructured-pdf-generation

from msbaek/dotfiles

Generate PDF documents from HTML and upload to Unity Catalog volumes. Use for creating test PDFs, demo documents, reports, or evaluation datasets.

databricks-unity-catalog

from msbaek/dotfiles

Unity Catalog system tables and volumes. Use when querying system tables (audit, lineage, billing) or working with volume file operations (upload, download, list files in /Volumes/).

databricks-synthetic-data-gen

from msbaek/dotfiles

Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.

databricks-spark-structured-streaming

from msbaek/dotfiles

Comprehensive guide to Spark Structured Streaming for production workloads. Use when building streaming pipelines, working with Kafka ingestion, implementing Real-Time Mode (RTM), configuring triggers (processingTime, availableNow), handling stateful operations with watermarks, optimizing checkpoints, performing stream-stream or stream-static joins, writing to multiple sinks, or tuning streaming cost and performance.

databricks-spark-declarative-pipelines

from msbaek/dotfiles

Creates, configures, and updates Databricks Lakeflow Spark Declarative Pipelines (SDP/LDP) using serverless compute. Handles data ingestion with streaming tables, materialized views, CDC, SCD Type 2, and Auto Loader ingestion patterns. Use when building data pipelines, working with Delta Live Tables, ingesting streaming data, implementing change data capture, or when the user mentions SDP, LDP, DLT, Lakeflow pipelines, streaming tables, or bronze/silver/gold medallion architectures.

databricks-python-sdk

from msbaek/dotfiles

Databricks development guidance including Python SDK, Databricks Connect, CLI, and REST API. Use when working with databricks-sdk, databricks-connect, or Databricks APIs.

databricks-model-serving

from msbaek/dotfiles

Deploy and query Databricks Model Serving endpoints. Use when (1) deploying MLflow models or AI agents to endpoints, (2) creating ChatAgent/ResponsesAgent agents, (3) integrating UC Functions or Vector Search tools, (4) querying deployed endpoints, (5) checking endpoint status. Covers classical ML models, custom pyfunc, and GenAI agents.

databricks-mlflow-evaluation

from msbaek/dotfiles

MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.

databricks-metric-views

from msbaek/dotfiles

Unity Catalog metric views: define, create, query, and manage governed business metrics in YAML. Use when building standardized KPIs, revenue metrics, order analytics, or any reusable business metrics that need consistent definitions across teams and tools.

databricks-lakebase-autoscale

from msbaek/dotfiles

Patterns and best practices for Lakebase Autoscaling (next-gen managed PostgreSQL). Use when creating or managing Lakebase Autoscaling projects, configuring autoscaling compute or scale-to-zero, working with database branching for dev/test workflows, implementing reverse ETL via synced tables, or connecting applications to Lakebase with OAuth credentials.