databricks-observability

Set up comprehensive observability for Databricks with metrics, traces, and alerts. Use when implementing monitoring for Databricks jobs, setting up dashboards, or configuring alerting for pipeline health. Trigger with phrases like "databricks monitoring", "databricks metrics", "databricks observability", "monitor databricks", "databricks alerts", "databricks logging".

1,868 stars

Best use case

databricks-observability is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using databricks-observability should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/databricks-observability/SKILL.md --create-dirs "https://raw.githubusercontent.com/jeremylongshore/claude-code-plugins-plus-skills/main/plugins/saas-packs/databricks-pack/skills/databricks-observability/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/databricks-observability/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How databricks-observability Compares

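| Feature / Agent | databricks-observability | Standard Approach |
|-----------------|--------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |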

Frequently Asked Questions

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Databricks Observability

## Overview
Monitor Databricks jobs, clusters, SQL warehouses, and costs using system tables in the `system` catalog. System tables provide queryable observability data: `system.lakeflow` (job runs), `system.billing` (costs), `system.query` (SQL history), `system.access` (audit logs), and `system.compute` (cluster metrics). These tables are refreshed throughout the day rather than in real time, so they suit trend analysis and periodic alerting, not second-by-second monitoring.

## Prerequisites
- Databricks Premium or Enterprise with Unity Catalog enabled
- Access to `system.billing`, `system.lakeflow`, `system.query`, and `system.access` schemas
- SQL warehouse for running monitoring queries

## Instructions

### Step 1: Job Health Monitoring
```sql
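-- Job success/failure over the last 24 hours.
-- job_run_timeline stores long runs as several rows (period_start_time/period_end_time);
-- result_state is set only on the row that ends the run, so collapse to one row per run first.
WITH runs AS (
    SELECT run_id,
           MIN(period_start_time) AS start_time,
           MAX(period_end_time) AS end_time,
           MAX(result_state) AS result_state
    FROM system.lakeflow.job_run_timeline
    WHERE period_start_time > current_timestamp() - INTERVAL 24 HOURS
    GROUP BY workspace_id, job_id, run_id
)
SELECT
    COUNT(CASE WHEN result_state = 'SUCCEEDED' THEN 1 END) AS succeeded,
    COUNT(CASE WHEN result_state = 'FAILED' THEN 1 END) AS failed,
    COUNT(CASE WHEN result_state = 'TIMED_OUT' THEN 1 END) AS timed_out,
    -- NULLIF avoids a divide-by-zero error when no run finished in the window
    ROUND(100.0 * COUNT(CASE WHEN result_state = 'SUCCEEDED' THEN 1 END) / NULLIF(COUNT(*), 0), 1) AS success_rate_pct,
    ROUND(AVG(TIMESTAMPDIFF(MINUTE, start_time, end_time)), 1) AS avg_duration_min
FROM runs
WHERE result_state IS NOT NULL;  -- finished runs only

-- Failed runs with termination details
SELECT job_id, run_id, run_name, result_state, termination_code,
       period_start_time, period_end_time
FROM system.lakeflow.job_run_timeline
WHERE result_state = 'FAILED'
  AND period_end_time > current_timestamp() - INTERVAL 24 HOURS
ORDER BY period_end_time DESC;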
```
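If the timeline rows are pulled into Python (for example through a SQL connector), the same health summary can be computed client-side. The sketch below is a minimal illustration, assuming each row is a dict with `run_id`, `period_start_time`, `period_end_time`, and `result_state` keys, and that a long run may appear as several rows with `result_state` set only on the last one. The function name `summarize_runs`, the `success_state` parameter, and the sample rows are hypothetical; adjust the success label if your data uses a different one.

```python
from datetime import datetime


def summarize_runs(rows, success_state="SUCCEEDED"):
    """Collapse timeline rows into one entry per run and summarize outcomes."""
    runs = {}
    for row in rows:
        run = runs.setdefault(
            row["run_id"],
            {"start": row["period_start_time"], "end": row["period_end_time"], "state": None},
        )
        run["start"] = min(run["start"], row["period_start_time"])
        run["end"] = max(run["end"], row["period_end_time"])
        # Assumption: only the row that closes the run carries a result_state.
        if row["result_state"] is not None:
            run["state"] = row["result_state"]

    finished = [r for r in runs.values() if r["state"] is not None]
    succeeded = sum(1 for r in finished if r["state"] == success_state)
    minutes = [(r["end"] - r["start"]).total_seconds() / 60 for r in finished]
    return {
        "finished": len(finished),
        "succeeded": succeeded,
        "failed": sum(1 for r in finished if r["state"] == "FAILED"),
        # Guard against division by zero when nothing finished in the window.
        "success_rate_pct": round(100.0 * succeeded / len(finished), 1) if finished else None,
        "avg_duration_min": round(sum(minutes) / len(minutes), 1) if minutes else None,
    }


# Hypothetical sample: run "a" spans two rows, run "b" failed, run "c" is still running.
sample_rows = [
    {"run_id": "a", "period_start_time": datetime(2024, 1, 1, 9, 0),
     "period_end_time": datetime(2024, 1, 1, 10, 0), "result_state": None},
    {"run_id": "a", "period_start_time": datetime(2024, 1, 1, 10, 0),
     "period_end_time": datetime(2024, 1, 1, 10, 30), "result_state": "SUCCEEDED"},
    {"run_id": "b", "period_start_time": datetime(2024, 1, 1, 9, 0),
     "period_end_time": datetime(2024, 1, 1, 9, 30), "result_state": "FAILED"},
    {"run_id": "c", "period_start_time": datetime(2024, 1, 1, 9, 45),
     "period_end_time": datetime(2024, 1, 1, 10, 0), "result_state": None},
]
summary = summarize_runs(sample_rows)
print(summary)
```

Counting rows instead of runs would treat run "a" as two runs and skew both the success rate and the average duration, which is why the rows are grouped by `run_id` first.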

### Step 2: Cluster Utilization and Costs
```sql
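-- DBU consumption and list-price cost by cluster (last 7 days).
-- Cluster names live in system.compute.clusters; join on cluster_id if you need them.
SELECT u.usage_metadata.cluster_id AS cluster_id,
       u.sku_name,
       SUM(u.usage_quantity) AS total_dbus,
       ROUND(SUM(u.usage_quantity * p.pricing.default), 2) AS cost_usd
FROM system.billing.usage u
-- list_prices keeps one row per price period, so match the period as well as the SKU
LEFT JOIN system.billing.list_prices p
    ON u.sku_name = p.sku_name
   AND u.cloud = p.cloud
   AND u.usage_start_time >= p.price_start_time
   AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
WHERE u.usage_date >= current_date() - INTERVAL 7 DAYS
  AND u.usage_metadata.cluster_id IS NOT NULL
GROUP BY u.usage_metadata.cluster_id, u.sku_name
ORDER BY cost_usd DESC
LIMIT 20;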
```
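Cost is DBUs multiplied by the list price that was in effect when the usage occurred. `system.billing.list_prices` keeps historical prices with `price_start_time` and `price_end_time`, so usage has to be matched on time as well as SKU. The sketch below illustrates that matching in plain Python. The `PricePeriod` type, the `price_at` and `usage_cost` helpers, the SKU name, and the prices are all hypothetical illustration data, not real Databricks prices.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PricePeriod:
    sku_name: str
    price: float             # list price per DBU (hypothetical value)
    start: datetime
    end: Optional[datetime]  # None means the price is still current


def price_at(prices, sku_name, when):
    """Return the price for sku_name that was in effect at `when`, or None."""
    for p in prices:
        if p.sku_name == sku_name and p.start <= when and (p.end is None or when < p.end):
            return p.price
    return None


def usage_cost(usage_rows, prices):
    """Sum DBUs * matching price; rows with no matching price contribute 0."""
    total = 0.0
    for sku_name, when, dbus in usage_rows:
        price = price_at(prices, sku_name, when)
        if price is not None:
            total += dbus * price
    return round(total, 2)


prices = [
    PricePeriod("EXAMPLE_JOBS_SKU", 0.10, datetime(2023, 1, 1), datetime(2024, 1, 1)),
    PricePeriod("EXAMPLE_JOBS_SKU", 0.15, datetime(2024, 1, 1), None),
]
usage = [
    ("EXAMPLE_JOBS_SKU", datetime(2023, 12, 31, 23, 0), 10.0),  # old price applies
    ("EXAMPLE_JOBS_SKU", datetime(2024, 1, 2, 8, 0), 10.0),     # new price applies
]
print(usage_cost(usage, prices))
```

Matching on SKU alone would pair each usage row with both price periods and count it twice, which is the kind of inflation a time-bounded join condition avoids.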

### Step 3: SQL Warehouse Performance
```sql
-- Slow queries (>30s) on SQL warehouses.
-- The warehouse ID is nested under the compute struct; warehouse names live in
-- system.compute.warehouses if you need them.
SELECT compute.warehouse_id AS warehouse_id, statement_id, executed_by,
       ROUND(total_duration_ms / 1000, 1) AS duration_sec,
       produced_rows,
       ROUND(read_bytes / 1048576, 1) AS scanned_mb,
       LEFT(statement_text, 200) AS query_preview
FROM system.query.history
WHERE total_duration_ms > 30000
  AND start_time > current_timestamp() - INTERVAL 24 HOURS
ORDER BY total_duration_ms DESC
LIMIT 50;

-- Warehouse queue times (right-sizing indicator)
SELECT compute.warehouse_id AS warehouse_id,
       COUNT(*) AS query_count,
       ROUND(AVG(total_duration_ms) / 1000, 1) AS avg_sec,
       ROUND(MAX(waiting_at_capacity_duration_ms) / 1000, 1) AS max_queue_sec
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL 7 DAYS
  AND compute.warehouse_id IS NOT NULL
GROUP BY compute.warehouse_id;
```

### Step 4: Cost-per-Job Analysis
```sql
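-- Each side is aggregated per run before joining so that repeated rows
-- (timeline slices, price periods, job edits) do not inflate the totals.
WITH runs AS (   -- one row per run; long runs span several timeline rows
    SELECT workspace_id, job_id, run_id,
           TIMESTAMPDIFF(MINUTE, MIN(period_start_time), MAX(period_end_time)) AS duration_min
    FROM system.lakeflow.job_run_timeline
    WHERE period_start_time > current_timestamp() - INTERVAL 7 DAYS
    GROUP BY workspace_id, job_id, run_id
), costs AS (    -- DBUs and list-price cost per run
    SELECT b.workspace_id, b.usage_metadata.job_run_id AS run_id,
           SUM(b.usage_quantity) AS dbus, SUM(b.usage_quantity * p.pricing.default) AS cost_usd
    FROM system.billing.usage b
    LEFT JOIN system.billing.list_prices p
        ON b.sku_name = p.sku_name AND b.cloud = p.cloud
       AND b.usage_start_time >= p.price_start_time
       AND (p.price_end_time IS NULL OR b.usage_start_time < p.price_end_time)
    WHERE b.usage_date >= current_date() - INTERVAL 8 DAYS
    GROUP BY b.workspace_id, b.usage_metadata.job_run_id
), names AS (    -- latest name per job; system.lakeflow.jobs keeps a row per change
    SELECT workspace_id, job_id, MAX_BY(name, change_time) AS name
    FROM system.lakeflow.jobs GROUP BY workspace_id, job_id
)
SELECT n.name AS job_name, COUNT(*) AS run_count,
       ROUND(AVG(r.duration_min), 1) AS avg_min,
       ROUND(SUM(c.dbus), 1) AS total_dbus, ROUND(SUM(c.cost_usd), 2) AS total_cost_usd
FROM runs r
JOIN names n ON r.workspace_id = n.workspace_id AND r.job_id = n.job_id
LEFT JOIN costs c ON r.workspace_id = c.workspace_id AND r.run_id = c.run_id
GROUP BY n.name ORDER BY total_cost_usd DESC LIMIT 15;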
```

### Step 5: SQL Alerts for Automated Notifications
```sql
-- Create as SQL Alert: trigger when failure_count > 3
-- Schedule: every 15 minutes
-- Notification destination: Slack/email

SELECT COUNT(*) AS failure_count
FROM system.lakeflow.job_run_timeline
WHERE result_state = 'FAILED'  -- set only on the row that ends a run
  AND period_end_time > current_timestamp() - INTERVAL 1 HOUR;
```

```python
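from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import AlertOptions

w = WorkspaceClient()

# Create the SQL alert programmatically.
# The alerts API has changed across databricks-sdk releases. This call uses the
# legacy-style signature (name, query_id, options, rearm), exposed as
# `w.alerts_legacy` in recent SDK versions and as `w.alerts` in older ones.
# Check the signature in your installed version before relying on it.
alert = w.alerts_legacy.create(
    name="Hourly Job Failure Alert",
    query_id="<saved-query-id>",
    options=AlertOptions(column="failure_count", op=">", value="3"),
    rearm=900,  # re-alert after 15 min if still triggered
)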
```
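To reason about how a threshold and a re-arm interval interact before wiring up notifications, a small simulation can help. The sketch below is a simplified model of threshold-plus-re-arm behaviour in general, not Databricks' alert engine; `condition_met` and `notifications` are hypothetical helpers, and the evaluation timestamps are made up.

```python
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq, "!=": operator.ne}


def condition_met(row, column, op, value):
    """Evaluate an alert-style condition such as failure_count > 3 on one result row."""
    return OPS[op](float(row[column]), float(value))


def notifications(evaluations, rearm_seconds):
    """Return the timestamps (seconds) at which a notification would be sent.

    evaluations: list of (timestamp_seconds, triggered) pairs in time order.
    Assumed model: notify on the first trigger, then again only after
    rearm_seconds have passed while the condition stays triggered.
    """
    sent = []
    last_sent = None
    for ts, triggered in evaluations:
        if not triggered:
            last_sent = None  # condition cleared; the next trigger notifies immediately
            continue
        if last_sent is None or ts - last_sent >= rearm_seconds:
            sent.append(ts)
            last_sent = ts
    return sent


# Hypothetical run: evaluated every 5 minutes with a 15-minute re-arm interval.
evaluations = [(0, True), (300, True), (600, True), (900, True), (1200, False), (1500, True)]
print(condition_met({"failure_count": 5}, "failure_count", ">", "3"))
print(notifications(evaluations, rearm_seconds=900))
```

In this model a shorter re-arm interval produces more repeat notifications during a long incident, while a longer one risks hiding that the problem is still ongoing.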

### Step 6: Export Metrics to External Systems
```python
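from itertools import islice

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import State

w = WorkspaceClient()

# Export cluster state metrics in Prometheus text format (for Prometheus/Datadog)
for cluster in w.clusters.list():
    if cluster.state == State.RUNNING:
        # num_workers is unset for autoscaling clusters; report 0 in that case
        workers = cluster.num_workers or 0
        print(f'databricks_cluster_workers{{name="{cluster.cluster_name}"}} {workers}')
        print(f'databricks_cluster_running{{name="{cluster.cluster_name}"}} 1')

# Export job success rate for Grafana, based on up to 100 completed runs.
# list_runs() pages through results lazily, so islice stops fetching after 100.
runs = list(islice(w.jobs.list_runs(completed_only=True), 100))
success = sum(
    1 for r in runs
    # The Jobs API reports success as "SUCCESS"
    if r.state and r.state.result_state and r.state.result_state.value == "SUCCESS"
)
if runs:  # avoid dividing by zero when there are no completed runs
    print(f"databricks_job_success_rate {success / len(runs):.2f}")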
```
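Cluster names can contain quotes or backslashes, which would break a hand-built metric line. The sketch below shows one way to format samples in the Prometheus text exposition format with label values escaped (backslash, double quote, and newline). The helper names `escape_label_value`, `format_metric`, and `success_rate` are hypothetical, and the cluster name is made up.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline for a Prometheus label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_metric(name, value, labels=None):
    """Render one sample line, e.g. metric_name{label="x"} 1."""
    if labels:
        rendered = ",".join(
            f'{key}="{escape_label_value(str(val))}"' for key, val in sorted(labels.items())
        )
        return f"{name}{{{rendered}}} {value}"
    return f"{name} {value}"


def success_rate(result_states):
    """Fraction of runs whose state equals "SUCCESS"; None when there are no runs."""
    if not result_states:
        return None
    return sum(1 for s in result_states if s == "SUCCESS") / len(result_states)


# Hypothetical inputs standing in for values read from the Databricks SDK.
line = format_metric("databricks_cluster_workers", 4, {"name": 'etl "nightly"'})
rate = success_rate(["SUCCESS", "FAILED", "SUCCESS", "SUCCESS"])
print(line)
print(format_metric("databricks_job_success_rate", f"{rate:.2f}"))
```

Returning `None` for an empty run list leaves it to the caller to decide whether to skip the metric or emit a sentinel value, rather than failing with a division by zero.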

### Step 7: Audit Log Monitoring
```sql
-- Security: who accessed what in the last 7 days
SELECT event_time, user_identity.email, action_name,
       request_params, response.status_code
FROM system.access.audit
WHERE service_name IN ('unityCatalog', 'jobs', 'clusters')
  AND event_date >= current_date() - 7
  AND action_name NOT IN ('getStatus', 'list')  -- exclude noisy reads
ORDER BY event_time DESC
LIMIT 100;
```

## Output
- Job health dashboard (success rate, duration, failures)
- Cluster cost breakdown by cluster and SKU
- SQL warehouse performance report (slow queries, queue times)
- Per-job cost analysis
- Automated SQL alerts with Slack/email notifications
- External metric export for Prometheus/Grafana

## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| System tables empty | Unity Catalog not enabled | Enable in Account Console > Settings |
| `TABLE_OR_VIEW_NOT_FOUND` | Schema not enabled or not accessible | Ask an admin to enable the system schema and grant `USE SCHEMA` and `SELECT` on it (for example `system.billing`) |
| Billing data delayed | System table refresh lag (up to 24h) | Use for trends and alerts, not real-time |
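| Query history missing | Query ran on compute that `system.query.history` does not cover (it records SQL warehouses and serverless compute), or the record has aged out | Run the workload on a SQL warehouse or check retention |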

## Examples

### Daily Standup Dashboard
```sql
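-- Single query for daily pipeline health.
-- Long runs span several timeline rows, so collapse to one row per run first.
WITH runs AS (
    SELECT run_id,
           MIN(period_start_time) AS start_time,
           MAX(period_end_time) AS end_time,
           MAX(result_state) AS result_state  -- set only on the row that ends the run
    FROM system.lakeflow.job_run_timeline
    WHERE period_start_time > current_timestamp() - INTERVAL 24 HOURS
    GROUP BY workspace_id, job_id, run_id
)
SELECT
    'Last 24h' AS period,
    COUNT(*) AS total_runs,
    COUNT(CASE WHEN result_state = 'SUCCEEDED' THEN 1 END) AS ok,
    COUNT(CASE WHEN result_state = 'FAILED' THEN 1 END) AS failed,
    ROUND(AVG(TIMESTAMPDIFF(MINUTE, start_time, end_time)), 1) AS avg_min
FROM runs;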
```

## Resources
- [System Tables](https://docs.databricks.com/aws/en/admin/system-tables/)
- [Audit Logs](https://docs.databricks.com/aws/en/admin/system-tables/audit-logs)
- [Observability Best Practices](https://docs.databricks.com/aws/en/data-engineering/observability-best-practices)
- [SQL Alerts](https://docs.databricks.com/aws/en/sql/user/alerts/)

Related Skills

windsurf-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Monitor Windsurf AI adoption, feature usage, and team productivity metrics. Use when tracking AI feature usage, measuring ROI, setting up dashboards, or analyzing Cascade effectiveness across your team. Trigger with phrases like "windsurf monitoring", "windsurf metrics", "windsurf analytics", "windsurf usage", "windsurf adoption".

webflow-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Set up observability for Webflow integrations — Prometheus metrics for API calls, OpenTelemetry tracing, structured logging with pino, Grafana dashboards, and alerting for rate limits, errors, and latency. Trigger with phrases like "webflow monitoring", "webflow metrics", "webflow observability", "monitor webflow", "webflow alerts", "webflow tracing".

vercel-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Set up Vercel observability with runtime logs, analytics, log drains, and OpenTelemetry tracing. Use when implementing monitoring for Vercel deployments, setting up log drains, or configuring alerting for function errors and performance. Trigger with phrases like "vercel monitoring", "vercel metrics", "vercel observability", "vercel logs", "vercel alerts", "vercel tracing".

veeva-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Veeva Vault observability for enterprise operations. Use when implementing advanced Veeva Vault patterns. Trigger: "veeva observability".

vastai-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Monitor Vast.ai GPU instance health, utilization, and costs. Use when setting up monitoring dashboards, configuring alerts, or tracking GPU utilization and spending. Trigger with phrases like "vastai monitoring", "vastai metrics", "vastai observability", "monitor vastai", "vastai alerts".

twinmind-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Monitor TwinMind transcription quality, meeting coverage, action item extraction rates, and memory vault health. Use when implementing observability or managing TwinMind meeting AI operations. Trigger with phrases like "twinmind observability".

speak-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Monitor Speak API health, assessment latency, session metrics, and pronunciation score distributions. Use when implementing observability or managing Speak language learning platform operations. Trigger with phrases like "speak observability".

snowflake-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Set up Snowflake observability using ACCOUNT_USAGE views, alerts, and external monitoring. Use when implementing Snowflake monitoring dashboards, setting up query performance tracking, or configuring alerting for warehouse and pipeline health. Trigger with phrases like "snowflake monitoring", "snowflake metrics", "snowflake observability", "snowflake dashboard", "snowflake alerts".

shopify-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Set up observability for Shopify app integrations with query cost tracking, rate limit monitoring, webhook delivery metrics, and structured logging. Trigger with phrases like "shopify monitoring", "shopify metrics", "shopify observability", "monitor shopify API", "shopify alerts", "shopify dashboard".

salesforce-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Set up observability for Salesforce integrations with API limit monitoring, error tracking, and alerting. Use when implementing monitoring for Salesforce operations, tracking API consumption, or configuring alerting for Salesforce integration health. Trigger with phrases like "salesforce monitoring", "salesforce metrics", "salesforce observability", "monitor salesforce", "salesforce alerts", "salesforce API usage dashboard".

retellai-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Retell AI observability — AI voice agent and phone call automation. Use when working with Retell AI for voice agents, phone calls, or telephony. Trigger with phrases like "retell observability", "retellai-observability", "voice agent".

replit-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Monitor Replit deployments with health checks, uptime tracking, resource usage, and alerting. Use when setting up monitoring for Replit apps, building health dashboards, or configuring alerting for deployment health and performance. Trigger with phrases like "replit monitoring", "replit metrics", "replit observability", "monitor replit", "replit alerts", "replit uptime".