# Apache Spark

## Overview

Teams using the Apache Spark skill should expect more consistent output, faster repeated execution, and less prompt rewriting.

### Best use case

The Apache Spark skill is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

### When to use this skill

- You want a reusable workflow that can be run more than once with consistent structure.

### When not to use this skill

- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.

## Installation

### Claude Code / Cursor / Codex

```bash
curl -o ~/.claude/skills/apache-spark/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/TerminalSkills/skills/apache-spark/SKILL.md"
```

### Manual Installation

1. Download SKILL.md from GitHub
2. Place it in `.claude/skills/apache-spark/SKILL.md` inside your project
3. Restart your AI agent — it will auto-discover the skill

## How Apache Spark Compares

| Feature / Agent | Apache Spark | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

## Frequently Asked Questions

### What does this skill do?

It gives your AI agent a reusable Apache Spark workflow covering PySpark setup, DataFrame operations, the SQL interface, and Structured Streaming, along with performance guidelines.

### Where can I find the source code?

The source code lives on GitHub in the ComeOnOliver/skillshub repository; the installation command above points at the raw SKILL.md.

## SKILL.md Source

# Apache Spark

## Overview

Apache Spark is the standard for distributed data processing. It handles batch processing, streaming, SQL, machine learning, and graph processing. PySpark provides a Python API. Runs on standalone clusters, YARN, Kubernetes, or managed services (Databricks, EMR, Dataproc).

## Instructions

### Step 1: PySpark Setup

```bash
pip install pyspark
```
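A quick way to confirm the install works is to start a throwaway local session (a minimal sketch; it assumes a compatible Java runtime is on the machine, since PySpark runs on the JVM):

```python
# smoke_test.py — verify the install by starting a local session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SmokeTest") \
    .getOrCreate()

print(spark.version)  # e.g. 3.5.x
spark.createDataFrame([(1, "ok")], ["id", "status"]).show()
spark.stop()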

### Step 2: DataFrame Operations

```python
# etl/process.py — PySpark data processing
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("DataPipeline") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read data
df = spark.read.parquet("s3://bucket/raw/events/")

# Transform
processed = (df
    .filter(F.col("event_type").isin(["purchase", "signup"]))
    .withColumn("date", F.to_date("timestamp"))
    .withColumn("revenue", F.col("amount") * F.col("quantity"))
    .groupBy("date", "event_type")
    .agg(
        F.count("*").alias("event_count"),
        F.sum("revenue").alias("total_revenue"),
        F.countDistinct("user_id").alias("unique_users"),
    )
    .orderBy("date")
)

# Write results
processed.write \
    .mode("overwrite") \
    .partitionBy("date") \
    .parquet("s3://bucket/processed/daily_metrics/")
```
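Before committing the write, it can help to sanity-check the result; a small sketch using the same `processed` DataFrame from above:

```python
# Peek at a few rows and the physical plan before writing
processed.show(5, truncate=False)
processed.explain()  # look for AdaptiveSparkPlan, confirming AQE is active
```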

### Step 3: SQL Interface

```python
# Register as SQL table
df.createOrReplaceTempView("events")

result = spark.sql("""
    SELECT
        date_trunc('month', timestamp) as month,
        COUNT(DISTINCT user_id) as monthly_active_users,
        SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) as revenue
    FROM events
    WHERE timestamp >= '2025-01-01'
    GROUP BY 1
    ORDER BY 1
""")
result.show()
```
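The same aggregation can also be expressed with the DataFrame API instead of SQL; a sketch assuming the `events` columns used above:

```python
# Equivalent of the SQL query using DataFrame functions
monthly = (df
    .filter(F.col("timestamp") >= "2025-01-01")
    .groupBy(F.date_trunc("month", F.col("timestamp")).alias("month"))
    .agg(
        F.countDistinct("user_id").alias("monthly_active_users"),
        F.sum(
            F.when(F.col("event_type") == "purchase", F.col("amount")).otherwise(0)
        ).alias("revenue"),
    )
    .orderBy("month")
)
monthly.show()
```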

### Step 4: Structured Streaming

```python
# Real-time processing from Kafka
# (requires the spark-sql-kafka connector package on the classpath)
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Event schema for the JSON payload; field names here match the
# columns used below, so adjust them to your actual data
schema = StructType([
    StructField("event_type", StringType()),
    StructField("user_id", StringType()),
    StructField("timestamp", TimestampType()),
])

stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .load()

# Kafka values arrive as bytes: decode to string, then parse the JSON payload
parsed = stream.select(
    F.from_json(F.col("value").cast("string"), schema).alias("data")
).select("data.*")

query = parsed \
    .groupBy(F.window("timestamp", "5 minutes"), "event_type") \
    .count() \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()  # block so the streaming query keeps running
```

## Guidelines

- Use DataFrames (not RDDs) for most work — they're optimized by the Catalyst query optimizer.
- Partitioning is critical for performance — partition by date or other low-cardinality columns; high-cardinality keys produce many tiny files (see the sketch after this list).
- For managed Spark, consider Databricks (easiest), AWS EMR, or GCP Dataproc.
- PySpark syntax mirrors Pandas but executes distributed — think in columns, not rows.
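
As an illustration of the partitioning guideline, a minimal sketch (the path and column name are placeholders):

```python
# Repartition by the partition column first so each output partition
# is written as a small number of files rather than many tiny ones
df.repartition("date") \
    .write \
    .mode("overwrite") \
    .partitionBy("date") \
    .parquet("s3://bucket/processed/by_date/")
```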

## Related Skills

All from ComeOnOliver/skillshub:

- **spark-sql-optimizer**: Auto-activating skill for Data Pipelines. Triggers on: spark sql optimizer.
- **spark-job-creator**: Auto-activating skill for Data Pipelines. Triggers on: spark job creator.
- **pyspark-transformer**: Auto-activating skill for Data Pipelines. Triggers on: pyspark transformer.
- **spark-optimization**: Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.
- **generate-sparkle-appcast**: Generate Mos Sparkle appcast.xml from the latest build zip and recent git changes (since a given commit), then sync to docs/ for publishing.
- **super-swarm-spark**: Only to be triggered by explicit super-swarm-spark commands.
- **parallel-task-spark**: Only to be triggered by explicit /parallel-task-spark commands.
- **Apache Kafka**
- **KafkaJS — Apache Kafka Client for Node.js**: Expert guidance for KafkaJS, the pure JavaScript Apache Kafka client for Node.js, covering producers, consumers, consumer groups, exactly-once semantics, SASL authentication, and admin operations for real-time analytics, event sourcing, log aggregation, and microservices communication.
- **Apache Flink**
- **Apache Arrow — Columnar Data Format**
- **sendspark-automation**: Automate Sendspark tasks via Rube MCP (Composio). Always search tools first for current schemas.