databricks-common-errors

Diagnose and fix Databricks common errors and exceptions. Use when encountering Databricks errors, debugging failed jobs, or troubleshooting cluster and notebook issues. Trigger with phrases like "databricks error", "fix databricks", "databricks not working", "debug databricks", "spark error".

1,868 stars

by jeremylongshore


Best use case

databricks-common-errors is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using databricks-common-errors should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/databricks-common-errors/SKILL.md --create-dirs "https://raw.githubusercontent.com/jeremylongshore/claude-code-plugins-plus-skills/main/plugins/saas-packs/databricks-pack/skills/databricks-common-errors/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/databricks-common-errors/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How databricks-common-errors Compares

Feature / Agent	databricks-common-errors	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

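What does this skill do?

It diagnoses and fixes common Databricks errors and exceptions, covering failed jobs and cluster and notebook issues.
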
Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Databricks Common Errors

## Overview
Quick-reference diagnostic guide for the most frequent Databricks errors. Covers cluster failures, Spark OOM, Delta Lake conflicts, permissions, schema mismatches, rate limits, and job run failures with real SDK/SQL solutions.

## Prerequisites
- Databricks CLI configured
- Access to cluster/job logs
- `databricks-sdk` installed for programmatic debugging

## Instructions

### Step 1: Identify the Error Source
```bash
# Get failed run details (legacy CLI syntax; the newer Databricks CLI uses
# `databricks jobs get-run $RUN_ID` instead)
databricks runs get --run-id "$RUN_ID" --output json | jq '{
  state: .state.result_state,
  message: .state.state_message,
  tasks: [.tasks[]? | {key: .task_key, state: .state.result_state, error: .state.state_message}]
}'
```

### Step 2: Match and Fix
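
As a rough aid for this step, the sketch below maps an error message to the section names used in this guide. `ERROR_PATTERNS` and `classify_error` are hypothetical names introduced here, and the patterns are derived only from the sample messages in this guide, so real-world messages may need extra entries.

```python
import re

# Hypothetical lookup table: (regex fragment, section name in this guide).
# Order matters: the first matching pattern wins.
ERROR_PATTERNS = [
    (r"ClusterNotReadyException|not in a RUNNING state", "CLUSTER_NOT_READY"),
    (r"OutOfMemoryError", "SPARK_DRIVER_OOM"),
    (r"Concurrent\w*Exception", "DELTA_CONCURRENT_WRITE"),
    (r"PERMISSION_DENIED|PermissionDeniedException", "PERMISSION_DENIED"),
    (r"InvalidParameterValue|Invalid spark_version", "INVALID_PARAMETER_VALUE"),
    (r"schema mismatch", "SCHEMA_MISMATCH"),
    (r"\b429\b|Too Many Requests", "RATE_LIMIT_EXCEEDED"),
]


def classify_error(message: str) -> str:
    """Return the guide section that matches the message, or "UNKNOWN"."""
    for pattern, category in ERROR_PATTERNS:
        if re.search(pattern, message, flags=re.IGNORECASE):
            return category
    return "UNKNOWN"


print(classify_error("java.lang.OutOfMemoryError: Java heap space"))
```

Feeding it the `state_message` extracted in Step 1 gives a starting point for which section below to read first.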

---

### CLUSTER_NOT_READY / INVALID_STATE
```
ClusterNotReadyException: Cluster 0123-456789-abcde is not in a RUNNING state
```
**Cause:** Cluster is starting, terminating, or in error state.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import State

w = WorkspaceClient()
cluster = w.clusters.get(cluster_id="0123-456789-abcde")

if cluster.state in (State.PENDING, State.RESTARTING):
    w.clusters.ensure_cluster_is_running("0123-456789-abcde")
elif cluster.state == State.TERMINATED:
    w.clusters.start_and_wait(cluster_id="0123-456789-abcde")
elif cluster.state == State.ERROR:
    reason = cluster.termination_reason  # may be None if no reason was recorded
    if reason:
        print(f"Cluster error: {reason.code}: {reason.parameters}")
    # Common: CLOUD_PROVIDER_LAUNCH_FAILURE, INSTANCE_POOL_CLUSTER_FAILURE
```

---

### SPARK_DRIVER_OOM
```
java.lang.OutOfMemoryError: Java heap space
SparkException: Job aborted due to stage failure
```
**Cause:** Driver or executor running out of memory.

```python
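# Fix 1: Adjust the cluster Spark config (usable memory is bounded by the
# node type, so a larger driver/worker instance type may be needed as well)
spark_conf = {
    "spark.driver.memory": "8g",
    "spark.executor.memory": "8g",
    "spark.sql.shuffle.partitions": "400",  # more partitions means less data per shuffle task
}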

# Fix 2: Never collect() large datasets
# BAD:  all_data = df.collect()
# GOOD: df.write.format("delta").saveAsTable("catalog.schema.results")

# Fix 3: Broadcast small tables instead of shuffling
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_lookup_df), "key")
```
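
The `"400"` in Fix 1 is only an example value. One way to pick a number is to size partitions against the amount of shuffled data. The sketch below assumes a target of roughly 128 MiB per partition, which is a common rule of thumb and not something this guide prescribes; `suggest_shuffle_partitions` is a hypothetical helper.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes: int, target_partition_mb: int = 128) -> int:
    """Return a partition count that keeps each shuffle partition near the target size."""
    target_bytes = target_partition_mb * 1024 * 1024
    # Always return at least one partition, even for tiny inputs.
    return max(1, math.ceil(shuffle_bytes / target_bytes))


# 50 GiB of shuffle data at about 128 MiB per partition
print(suggest_shuffle_partitions(50 * 1024**3))  # 400
```

The shuffle size itself can be read from the stage details in the Spark UI; treat the result as a starting point to tune, not a fixed answer.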

---

### DELTA_CONCURRENT_WRITE
```
ConcurrentAppendException: Files were added by a concurrent update
ConcurrentDeleteReadException: A concurrent operation modified files
```
**Cause:** Multiple jobs writing to the same Delta table simultaneously.

```python
from delta.tables import DeltaTable
import time

def merge_with_retry(spark, source_df, target_table, merge_key, max_retries=3):
    """MERGE with retry for concurrent write conflicts."""
    for attempt in range(max_retries):
        try:
            target = DeltaTable.forName(spark, target_table)
            (target.alias("t")
                .merge(source_df.alias("s"), f"t.{merge_key} = s.{merge_key}")
                .whenMatchedUpdateAll()
                .whenNotMatchedInsertAll()
                .execute())
            return
        except Exception as e:
            if "Concurrent" in str(e) and attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise
```
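
When several writers hit the same conflict, a fixed `2 ** attempt` delay can make them retry in lockstep and collide again. A minimal sketch of the same retry idea with random jitter follows; `retry_with_jitter` and its parameters are hypothetical names, and the `sleep` and `rng` arguments exist only so the delay logic can be exercised without waiting.

```python
import random
import time


def retry_with_jitter(operation, is_retryable, max_retries=3, base=1.0, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); on a retryable error, wait a random time up to an exponential bound."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as exc:
            last_attempt = attempt == max_retries - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Full jitter: pick a delay in [0, min(cap, base * 2**attempt)).
            sleep(rng() * min(cap, base * 2 ** attempt))
```

Assuming the MERGE is wrapped in a zero-argument function (here called `run_merge`, a placeholder), usage could look like `retry_with_jitter(run_merge, lambda e: "Concurrent" in str(e))`.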

---

### PERMISSION_DENIED
```
PERMISSION_DENIED: User does not have SELECT on TABLE catalog.schema.table
PermissionDeniedException: User does not have permission MANAGE on cluster
```
**Cause:** Missing Unity Catalog grants or workspace permissions.

```sql
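-- Fix Unity Catalog permissions (requires object ownership or the MANAGE privilege)
-- Unity Catalog uses USE CATALOG / USE SCHEMA; USAGE is the legacy Hive metastore privilege
GRANT USE CATALOG ON CATALOG analytics TO `data-team`;
GRANT USE SCHEMA ON SCHEMA analytics.silver TO `data-team`;
GRANT SELECT ON TABLE analytics.silver.orders TO `data-team`;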

-- Check current grants
SHOW GRANTS ON TABLE analytics.silver.orders;
```
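
Reading a table needs a privilege at each level of the three-part name, which is easy to miss when only the table grant is added. The sketch below generates the three statements from a table name, using Unity Catalog's `USE CATALOG` and `USE SCHEMA` privilege names; `grants_for_select` is a hypothetical helper and it does no escaping beyond wrapping the principal in backticks.

```python
from typing import List


def grants_for_select(table: str, principal: str) -> List[str]:
    """Build the catalog, schema, and table grants needed to SELECT from a table."""
    parts = table.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected catalog.schema.table, got {table!r}")
    catalog, schema, _ = parts
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`;",
        f"GRANT SELECT ON TABLE {table} TO `{principal}`;",
    ]


for statement in grants_for_select("analytics.silver.orders", "data-team"):
    print(statement)
```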

```bash
# Fix workspace object permissions (object type and ID are positional arguments)
databricks permissions update jobs 123 --json '{
  "access_control_list": [{
    "user_name": "user@company.com",
    "permission_level": "CAN_MANAGE_RUN"
  }]
}'
```

---

### INVALID_PARAMETER_VALUE
```
InvalidParameterValue: Instance type xyz not supported in region us-east-1
Invalid spark_version: 13.x.x-scala2.12
```
**Cause:** The node type or Spark version in the cluster config is not available in this workspace's cloud or region.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List valid node types for this workspace
for nt in sorted(w.clusters.list_node_types().node_types, key=lambda x: x.memory_mb)[:10]:
    print(f"{nt.node_type_id}: {nt.memory_mb}MB, {nt.num_cores} cores")

# List valid Spark versions
for v in w.clusters.spark_versions().versions:
    if "LTS" in v.name:
        print(f"{v.key}: {v.name}")
```

---

### SCHEMA_MISMATCH
```
AnalysisException: A schema mismatch detected when writing to the Delta table
```
**Cause:** Source schema doesn't match target table.

```python
from pyspark.sql.functions import col

# Option 1: Enable schema evolution
df.write.format("delta").option("mergeSchema", "true").mode("append").saveAsTable("target")

# Option 2: Identify differences
source_cols = set(df.columns)
target_cols = set(spark.table("target").columns)
print(f"Missing in source: {target_cols - source_cols}")
print(f"Extra in source: {source_cols - target_cols}")

# Option 3: Cast to match target schema
target_schema = spark.table("target").schema
for field in target_schema:
    if field.name in df.columns:
        df = df.withColumn(field.name, col(field.name).cast(field.dataType))
```
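
Option 2 compares column names only, so a column that exists on both sides with different types goes unnoticed. A small sketch that also reports type differences is shown below. It works on plain `(name, type)` pairs, which is the shape PySpark returns from `DataFrame.dtypes`; `diff_schemas` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def diff_schemas(source: List[Tuple[str, str]],
                 target: List[Tuple[str, str]]) -> Dict[str, list]:
    """Compare two lists of (column, type) pairs by name and by type."""
    src, tgt = dict(source), dict(target)
    return {
        "missing_in_source": sorted(tgt.keys() - src.keys()),
        "extra_in_source": sorted(src.keys() - tgt.keys()),
        # Columns present on both sides whose type strings differ.
        "type_mismatches": sorted(
            (name, src[name], tgt[name])
            for name in src.keys() & tgt.keys()
            if src[name] != tgt[name]
        ),
    }


source = [("id", "bigint"), ("amount", "string"), ("note", "string")]
target = [("id", "bigint"), ("amount", "double"), ("created_at", "timestamp")]
print(diff_schemas(source, target))
```

On a cluster this could be called as `diff_schemas(df.dtypes, spark.table("target").dtypes)`.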

---

### JOB_RUN_FAILED
```
RunState: FAILED — Run terminated with error
```
**Cause:** One or more tasks in the run failed. The run-level message is often generic, so inspect each task's output.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import RunResultState

w = WorkspaceClient()
run = w.jobs.get_run(run_id=12345)

print(f"State: {run.state.life_cycle_state}")
print(f"Result: {run.state.result_state}")
print(f"Message: {run.state.state_message}")

# Check each task (run.tasks can be None when the run has no task details)
for task in run.tasks or []:
    if task.state.result_state == RunResultState.FAILED:
        output = w.jobs.get_run_output(task.run_id)
        print(f"Task '{task.task_key}' failed: {output.error}")
        if output.error_trace:
            print(f"Traceback:\n{output.error_trace[:500]}")
```
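
When only a saved JSON dump of the run is available (for example, the output of the Step 1 command written to a file), the same per-task check can be done offline. This is a minimal sketch that assumes the field names used in the Step 1 `jq` filter (`tasks`, `task_key`, `state.result_state`, `state.state_message`); `failed_tasks` is a hypothetical helper and the sample run is made up for illustration.

```python
from typing import Any, Dict, List


def failed_tasks(run: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return task key and message for every task whose result_state is FAILED."""
    failures = []
    for task in run.get("tasks") or []:  # "tasks" may be absent or null
        state = task.get("state") or {}
        if state.get("result_state") == "FAILED":
            failures.append({
                "task_key": task.get("task_key", ""),
                "message": state.get("state_message", ""),
            })
    return failures


# Illustrative sample only, not real job output.
sample_run = {
    "state": {"result_state": "FAILED"},
    "tasks": [
        {"task_key": "ingest", "state": {"result_state": "SUCCESS"}},
        {"task_key": "transform",
         "state": {"result_state": "FAILED", "state_message": "Task failed"}},
    ],
}
print(failed_tasks(sample_run))
```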

---

### HTTP 429 — RATE_LIMIT_EXCEEDED
See `databricks-rate-limits` skill for full retry patterns.

```python
from databricks.sdk.errors import TooManyRequests
import time

def call_with_backoff(operation, max_retries=5):
    for attempt in range(max_retries):
        try:
            return operation()
        except TooManyRequests as e:
            if attempt == max_retries - 1:
                # Out of retries: fail now instead of sleeping once more
                raise RuntimeError("Max retries exceeded") from e
            wait = e.retry_after_secs or (2 ** attempt)
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded")
```
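
For callers that talk to the REST API directly rather than through the SDK, the wait time has to be read from the `Retry-After` response header. In HTTP that header may carry either a number of seconds or an HTTP date. The sketch below handles both forms with the standard library only; `parse_retry_after` is a hypothetical helper, and whether a given Databricks endpoint sends the header at all is an assumption to verify.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value to seconds, or None if it cannot be parsed."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(parse_retry_after("120"))  # 120.0
```

A `None` result is the cue to fall back to exponential backoff, as `call_with_backoff` does when `retry_after_secs` is empty.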

## Output
- Error identified and categorized
- Fix applied from matching error pattern
- Resolution verified

## Error Handling
| Error Code | HTTP | Category | Quick Fix |
|-----------|------|----------|-----------|
| `CLUSTER_NOT_READY` | - | Compute | `ensure_cluster_is_running()` |
| `OutOfMemoryError` | - | Spark | Increase memory, avoid `.collect()` |
| `ConcurrentAppendException` | - | Delta | MERGE with retry, serialize writes |
| `PERMISSION_DENIED` | 403 | Auth | `GRANT` in Unity Catalog |
| `INVALID_PARAMETER_VALUE` | 400 | Config | Check `list_node_types()` |
| `AnalysisException` | - | Schema | `mergeSchema=true` |
| `FAILED` run state | - | Job | Check `get_run_output()` for traceback |
| `Too Many Requests` | 429 | Rate Limit | Exponential backoff with `Retry-After` |

## Examples

### Quick Diagnostic Commands
```bash
databricks clusters get --cluster-id $CID | jq '{state, termination_reason}'
databricks runs list --job-id $JID --limit 5 | jq '.runs[] | {run_id, state: .state.result_state}'
databricks permissions get jobs "$JID"  # object type and ID are positional arguments
```

### Escalation Path
1. Check [Databricks Status](https://status.databricks.com)
2. Collect evidence with `databricks-debug-bundle`
3. Search [Community Forum](https://community.databricks.com)
4. Contact support with workspace ID and request ID from error response

## Resources
- [Troubleshooting Guide](https://docs.databricks.com/aws/en/resources/troubleshooting)
- [Delta Lake Troubleshooting](https://docs.databricks.com/aws/en/delta/best-practices)
- [Resource Limits](https://docs.databricks.com/aws/en/resources/limits)

## Next Steps
For comprehensive debugging, see `databricks-debug-bundle`.
