## Best use case
Apache Spark is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
## Overview
Teams using Apache Spark should expect more consistent output, faster repeated execution, and less prompt rewriting.
## When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
## When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
## Installation

### Claude Code / Cursor / Codex

Manual installation:

- Download SKILL.md from GitHub
- Place it in `.claude/skills/apache-spark/SKILL.md` inside your project
- Restart your AI agent so it auto-discovers the skill
## How Apache Spark Compares
| Feature | Apache Spark | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
## Frequently Asked Questions

### What does this skill do?
It gives your AI agent a reusable Apache Spark workflow covering PySpark setup, DataFrame transformations, the SQL interface, and Structured Streaming, as laid out in the SKILL.md below.
### Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
## SKILL.md Source
# Apache Spark
## Overview
Apache Spark is the standard for distributed data processing. It handles batch processing, streaming, SQL, machine learning, and graph processing. PySpark provides a Python API. Spark runs on standalone clusters, YARN, Kubernetes, or managed services (Databricks, EMR, Dataproc).
## Instructions
### Step 1: PySpark Setup
```bash
pip install pyspark
```
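To confirm the installation works before pointing at a cluster, a quick local smoke test can help. This is a minimal sketch; the script name is illustrative and not part of the original skill:

```python
# smoke_test.py (hypothetical file name): verify PySpark runs locally
from pyspark.sql import SparkSession

# local[*] runs Spark in-process using all local cores, no cluster needed
spark = SparkSession.builder.master("local[*]").appName("SmokeTest").getOrCreate()

df = spark.createDataFrame([(1, "ok")], ["id", "status"])
df.show()
print("Spark version:", spark.version)

spark.stop()
```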
### Step 2: DataFrame Operations
```python
# etl/process.py — PySpark data processing
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
    .appName("DataPipeline") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read data
df = spark.read.parquet("s3://bucket/raw/events/")

# Transform
processed = (df
    .filter(F.col("event_type").isin(["purchase", "signup"]))
    .withColumn("date", F.to_date("timestamp"))
    .withColumn("revenue", F.col("amount") * F.col("quantity"))
    .groupBy("date", "event_type")
    .agg(
        F.count("*").alias("event_count"),
        F.sum("revenue").alias("total_revenue"),
        F.countDistinct("user_id").alias("unique_users"),
    )
    .orderBy("date")
)

# Write results
processed.write \
    .mode("overwrite") \
    .partitionBy("date") \
    .parquet("s3://bucket/processed/daily_metrics/")
```
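Before writing a large output, it can be worth checking the optimized plan and a sample of rows. A minimal sketch using the `processed` DataFrame from above:

```python
# Inspect the physical plan; the Parquet scan should list the event_type predicate under PushedFilters
processed.explain(mode="formatted")

# Preview a handful of aggregated rows without writing anything
processed.show(5, truncate=False)
```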
### Step 3: SQL Interface
```python
# Register as SQL table
df.createOrReplaceTempView("events")
result = spark.sql("""
SELECT
date_trunc('month', timestamp) as month,
COUNT(DISTINCT user_id) as monthly_active_users,
SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) as revenue
FROM events
WHERE timestamp >= '2025-01-01'
GROUP BY 1
ORDER BY 1
""")
result.show()
```
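The value returned by `spark.sql` is an ordinary DataFrame, so it can be persisted like any other output. The destination path below is an assumption, not part of the original pipeline:

```python
# Persist the SQL result; the destination path is hypothetical
result.write \
    .mode("overwrite") \
    .parquet("s3://bucket/processed/monthly_metrics/")
```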
### Step 4: Structured Streaming
```python
# Real-time processing from Kafka
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

# Schema for the JSON event payload (adjust the fields to match your topic's messages)
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("timestamp", TimestampType()),
    StructField("amount", DoubleType()),
])

stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .load()

parsed = stream.select(
    F.from_json(F.col("value").cast("string"), schema).alias("data")
).select("data.*")

query = parsed \
    .groupBy(F.window("timestamp", "5 minutes"), "event_type") \
    .count() \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()
```
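The console sink above is only useful for debugging. A minimal sketch of a durable sink, assuming output and checkpoint paths of your choosing, writes the parsed events to Parquet with a checkpoint so the stream can recover after restarts:

```python
# Paths are placeholders; a checkpoint location is required for file sinks
files = parsed.writeStream \
    .format("parquet") \
    .option("path", "s3://bucket/streaming/events/") \
    .option("checkpointLocation", "s3://bucket/checkpoints/events/") \
    .outputMode("append") \
    .start()

files.awaitTermination()
```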
## Guidelines
- Use DataFrames (not RDDs) for most work; they are optimized by the Catalyst query optimizer.
- Partitioning is critical for performance; partition output by date or other commonly filtered, low-cardinality columns (see the sketch after this list).
- For managed Spark, consider Databricks (easiest), AWS EMR, or GCP Dataproc.
- PySpark syntax mirrors Pandas, but execution is distributed: think in columns, not rows.
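To illustrate the partitioning guideline, a minimal sketch that reads back the date-partitioned output from Step 2 and filters on the partition column (the cutoff date is arbitrary):

```python
# Filtering on the partition column lets Spark skip non-matching date= directories
daily = spark.read.parquet("s3://bucket/processed/daily_metrics/")
recent = daily.filter(F.col("date") >= "2025-06-01")  # arbitrary cutoff for illustration
recent.explain()  # PartitionFilters on the scan should show the date predicate
```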
## Related Skills

- **spark-sql-optimizer**: Auto-activating skill for Data Pipelines. Triggers on: spark sql optimizer.
- **spark-job-creator**: Auto-activating skill for Data Pipelines. Triggers on: spark job creator.
- **pyspark-transformer**: Auto-activating skill for Data Pipelines. Triggers on: pyspark transformer.
- **spark-optimization**: Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.
- **generate-sparkle-appcast**: Generate Mos Sparkle appcast.xml from the latest build zip and recent git changes (since a given commit), then sync to docs/ for publishing.
- **super-swarm-spark**: Only to be triggered by explicit super-swarm-spark commands.
- **parallel-task-spark**: Only to be triggered by explicit /parallel-task-spark commands.
- **Apache Kafka**: KafkaJS, the pure JavaScript Apache Kafka client for Node.js, for building event-driven architectures with producers, consumers, consumer groups, exactly-once semantics, SASL authentication, and admin operations, processing millions of events per second for real-time analytics, event sourcing, log aggregation, and microservices communication.
- **Apache Flink**
- **Apache Arrow**: Columnar data format.
- **sendspark-automation**: Automate Sendspark tasks via Rube MCP (Composio). Always search tools first for current schemas.