spark

Apache Spark distributed computing. Use for big data processing.


by G1Joshi


Best use case

spark is best used when you need a repeatable AI agent workflow instead of a one-off prompt.


Teams using spark should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/spark/SKILL.md --create-dirs "https://raw.githubusercontent.com/G1Joshi/Agent-Skills/main/skills/ai-ml/spark/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it at .claude/skills/spark/SKILL.md inside your project.
3. Restart your AI agent so that it can auto-discover the skill.

How spark Compares

Feature / Agent	spark	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Apache Spark distributed computing. Use for big data processing.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Apache Spark

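Spark is one of the most widely used engines for large-scale data processing. Spark 4.0 (released in 2025) builds on **Spark Connect**, which lets thin clients (such as an editor like VS Code on a laptop) connect to large clusters. Spark Connect is an opt-in mode selected through configuration rather than the default.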

## When to Use

- **Data Engineering**: ETL at Petabyte scale.
- **Streaming**: Structured Streaming for real-time analytics.
- **Legacy ML**: `spark.ml` for distributed training on DataFrames (though many teams now train models with XGBoost or PyTorch instead).

## Core Concepts

### Spark Connect

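Decouples the client (your laptop) from the server (the cluster) through a gRPC-based protocol. Because the client only sends query plans, client libraries can exist for languages beyond Python and Scala, such as Go, Rust, or TypeScript (availability and maturity vary by client).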
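
As a minimal sketch of the client side, the snippet below opens a remote session through `SparkSession.builder.remote`, which takes an `sc://host:port` connection string. It assumes PySpark 3.4 or later with the Connect dependencies installed (for example `pip install "pyspark[connect]"`) and a Spark Connect server already running. The helper names `connect_url`, `connect`, and `demo` are hypothetical, and `15002` is the port a Spark Connect server listens on by default.

```python
def connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect connection string of the form sc://host:port."""
    return f"sc://{host}:{port}"


def connect(host: str, port: int = 15002):
    """Open a remote SparkSession against a running Spark Connect server."""
    # Imported lazily so connect_url() stays usable without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(connect_url(host, port)).getOrCreate()


def demo(host: str = "localhost"):
    """Run a trivial query remotely; `host` is a placeholder for your server."""
    spark = connect(host)
    try:
        # The plan is built locally and executed on the cluster.
        spark.range(5).show()
    finally:
        spark.stop()
```

Calling `demo()` should print a five-row table if a server is reachable at the placeholder address. The client process does not start a local JVM, which is why it can run from a lightweight environment.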

### Catalyst Optimizer

Optimizes your SQL/DataFrame queries before execution.
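
One way to see the optimizer at work is to print the query plans with `DataFrame.explain(mode="extended")`, which shows the parsed, analyzed, optimized, and physical plans. The sketch below is illustrative only: the function name `explain_plans`, the view name `ids`, and the query text are assumptions made up for this example, and `spark` is an existing `SparkSession`.

```python
# Hypothetical query: the filter is written against a derived column
# inside a subquery, which gives the optimizer something to simplify.
QUERY = (
    "SELECT id FROM (SELECT id, id % 10 AS bucket FROM ids) AS t "
    "WHERE bucket = 3"
)


def explain_plans(spark, rows: int = 1_000_000):
    """Register a small generated table and print all plan stages for QUERY."""
    # spark.range() yields a single-column DataFrame named `id`.
    spark.range(rows).createOrReplaceTempView("ids")
    query = spark.sql(QUERY)
    # "extended" prints the logical plans as well as the physical plan.
    query.explain(mode="extended")
    return query
```

Comparing the analyzed and optimized logical plans in the output is usually the quickest way to check whether a rewrite such as projection collapsing or filter pushdown was applied to a given query.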

### RDD

The low-level API. Almost never used directly in modern Spark.

## Best Practices (2025)

**Do**:

- **Use PySpark**: It is now a first-class citizen with Python UDF profiling.
- **Use Delta Lake / Iceberg**: Spark works best with modern table formats.
- **Use `pandas_udf`**: For vectorized Python UDFs.
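
To illustrate the `pandas_udf` recommendation, here is a minimal sketch of a Series-to-Series vectorized UDF. It assumes `pandas` and `pyarrow` are installed alongside PySpark. The function names and the column names `temp_c` and `temp_f` are hypothetical.

```python
def to_fahrenheit(celsius):
    """Plain arithmetic; works on a single float or on a whole pandas Series."""
    return celsius * 9.0 / 5.0 + 32.0


def make_fahrenheit_udf():
    """Wrap to_fahrenheit as a vectorized (Series to Series) pandas UDF."""
    # Imported lazily so to_fahrenheit() can be unit-tested without Spark.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def fahrenheit_udf(celsius: pd.Series) -> pd.Series:
        # Called once per Arrow batch, not once per row.
        return to_fahrenheit(celsius)

    return fahrenheit_udf


def add_fahrenheit(df, source: str = "temp_c", target: str = "temp_f"):
    """Return df with a new column computed by the pandas UDF."""
    return df.withColumn(target, make_fahrenheit_udf()(df[source]))
```

Keeping the arithmetic in a plain function makes it easy to test without a cluster. For a conversion this simple a built-in column expression would be faster still, so a pandas UDF is most useful when the logic genuinely needs Python libraries.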

**Don't**:

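- **Don't use `rdd.map`**: In PySpark it is slow because every row is serialized between the JVM and Python workers, and the Python function is opaque to the Catalyst optimizer. Use DataFrame expressions instead.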
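
The contrast can be sketched as follows, assuming a DataFrame with numeric columns named `price` and `qty` (hypothetical names chosen for this example). Both functions compute the same product, but only the second keeps the work inside Spark's own expression engine.

```python
def with_total_rdd(df):
    """Anti-pattern: drop to the RDD API and run a Python lambda per row."""
    # Each Row is shipped to a Python worker, multiplied, and shipped back.
    return df.rdd.map(lambda row: (row["price"] * row["qty"],)).toDF(["total"])


def with_total(df):
    """Preferred: a column expression the optimizer can see and plan."""
    # Adds `total` alongside the existing columns without leaving the JVM.
    return df.withColumn("total", df["price"] * df["qty"])
```

Note that the two are not drop-in equivalents: the RDD version returns only the `total` column, while `with_total` keeps the original columns as well.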

## References

- [Apache Spark](https://spark.apache.org/)