## Best use case
Apache Arrow — Columnar Data Format is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
## Overview
Teams using Apache Arrow — Columnar Data Format should expect more consistent output, faster repeated execution, and less prompt rewriting.
## When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
## When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
## Installation

### Manual installation (Claude Code / Cursor / Codex)

- Download `SKILL.md` from GitHub
- Place it at `.claude/skills/apache-arrow/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
## How Apache Arrow — Columnar Data Format Compares
| Feature / Agent | Apache Arrow — Columnar Data Format | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
## Frequently Asked Questions
### What does this skill do?

It guides an AI agent in using Apache Arrow, the cross-language columnar memory format, for high-performance data interchange between systems, zero-copy reads, and efficient columnar processing with PyArrow (Python) and Arrow JS (JavaScript).
### Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
## SKILL.md Source
# Apache Arrow — Columnar Data Format
## Overview
Apache Arrow is the cross-language columnar in-memory format for analytics workloads. This skill helps developers use Arrow for high-performance data interchange between systems, zero-copy reads, and efficient columnar processing in Python (PyArrow) and JavaScript (Arrow JS).
## Instructions
### PyArrow — Python Interface
```python
# src/data/arrow_ops.py — High-performance data operations with PyArrow
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
import pyarrow.csv as pcsv
# Create Arrow tables from Python data
table = pa.table({
"user_id": pa.array([1, 2, 3, 4, 5], type=pa.int64()),
"name": pa.array(["Alice", "Bob", "Charlie", "Diana", "Eve"]),
"revenue": pa.array([150.0, 320.5, 89.0, 1200.0, 45.5], type=pa.float64()),
"signup_date": pa.array([
"2026-01-15", "2026-01-20", "2026-02-01", "2026-02-10", "2026-03-01"
]).cast(pa.date32()),
"is_active": pa.array([True, True, False, True, False]),
})
# Compute operations (vectorized, no Python loops)
high_value = pc.filter(table, pc.greater(table["revenue"], 100))
total_revenue = pc.sum(table["revenue"]).as_py() # 1805.0
avg_revenue = pc.mean(table["revenue"]).as_py() # 361.0
sort_idx = pc.sort_indices(table, sort_keys=[("revenue", "descending")])
sorted_table = table.take(sort_idx)  # sort_indices returns indices; take() applies them
# Read/write Parquet files (the standard format for Arrow data)
pq.write_table(table, "users.parquet", compression="zstd")
loaded = pq.read_table("users.parquet")
# Read with column selection and row filtering (pushdown to file)
subset = pq.read_table(
"users.parquet",
columns=["user_id", "revenue"], # Only read these columns
filters=[("revenue", ">", 100)], # Predicate pushdown
)
# Read CSV with type inference
csv_table = pcsv.read_csv("data.csv", convert_options=pcsv.ConvertOptions(
column_types={"amount": pa.float64(), "count": pa.int32()},
))
# Streaming reads for large files (process in batches)
parquet_file = pq.ParquetFile("large_dataset.parquet")
for batch in parquet_file.iter_batches(batch_size=10_000):
# Process each batch (RecordBatch) without loading the full file
filtered = pc.filter(batch, pc.greater(batch["amount"], 0))
process_batch(filtered)
```
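The streaming read above pairs naturally with a streaming write, so a large file can be transformed without ever holding it fully in memory. A minimal sketch under that assumption, reusing the `amount` filter from the snippet above and a hypothetical output path:

```python
# Stream-transform a large Parquet file batch by batch
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

source = pq.ParquetFile("large_dataset.parquet")
with pq.ParquetWriter("filtered_dataset.parquet", source.schema_arrow) as writer:
    for batch in source.iter_batches(batch_size=10_000):
        filtered = pc.filter(batch, pc.greater(batch["amount"], 0))
        writer.write_table(pa.Table.from_batches([filtered]))  # append this chunk to the output file
```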
### Zero-Copy Interop
```python
# Arrow enables zero-copy conversion between libraries
import pyarrow as pa
import pandas as pd
import polars as pl
# Arrow → Pandas (zero-copy when possible)
arrow_table = pa.table({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
pandas_df = arrow_table.to_pandas() # Near-instant for compatible types
# Pandas → Arrow
arrow_from_pandas = pa.Table.from_pandas(pandas_df)
# Arrow → Polars (zero-copy)
polars_df = pl.from_arrow(arrow_table)
# Polars → Arrow (zero-copy)
arrow_from_polars = polars_df.to_arrow()
# Arrow enables data exchange between:
# Python ↔ R (via reticulate)
# Python ↔ DuckDB (zero-copy)
# Python ↔ Spark (via PySpark)
# JavaScript ↔ WASM modules
```
### Partitioned Datasets
```python
# Work with partitioned datasets on disk or cloud storage
import pyarrow as pa
import pyarrow.dataset as ds
# Read a partitioned Parquet dataset (Hive-style partitioning)
# data/
# year=2025/month=01/part-0.parquet
# year=2025/month=02/part-0.parquet
# year=2026/month=01/part-0.parquet
dataset = ds.dataset(
"s3://my-bucket/events/",
format="parquet",
partitioning=ds.partitioning(
pa.schema([
("year", pa.int32()),
("month", pa.int32()),
]),
flavor="hive",
),
)
# Scan with partition pruning (only reads relevant files)
scanner = dataset.scanner(
columns=["event_type", "user_id", "timestamp"],
filter=(ds.field("year") == 2026) & (ds.field("month") >= 1),
)
table = scanner.to_table()
# Write partitioned dataset
ds.write_dataset(
table,
"output/events/",
format="parquet",
partitioning=ds.partitioning(
pa.schema([("year", pa.int32()), ("month", pa.int32())]),
flavor="hive",
),
existing_data_behavior="overwrite_or_ignore",
)
```
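When the filtered result is still too large for `scanner.to_table()`, the same scan can be consumed incrementally. A minimal sketch, assuming the `dataset` object defined above:

```python
# Stream the scan batch by batch instead of materializing one large table
import pyarrow.dataset as ds

total_rows = 0
for batch in dataset.to_batches(
    columns=["event_type", "user_id"],
    filter=(ds.field("year") == 2026),
):
    total_rows += batch.num_rows  # replace with real per-batch processing
print(f"Scanned {total_rows} matching rows")
```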
### Arrow IPC (Inter-Process Communication)
```python
# Share data between processes without serialization overhead
import pyarrow as pa
import pyarrow.ipc as ipc
# Write Arrow IPC format (for streaming between processes)
table = pa.table({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
# File format (random access)
with pa.OSFile("data.arrow", "wb") as f:
writer = ipc.new_file(f, table.schema)
writer.write_table(table)
writer.close()
# Stream format (append-only, lower overhead)
sink = pa.BufferOutputStream()
writer = ipc.new_stream(sink, table.schema)
writer.write_table(table)
writer.close()
buffer = sink.getvalue()  # pyarrow.Buffer; call .to_pybytes() for raw bytes to send over a network/pipe
# Read back
reader = ipc.open_file("data.arrow")
loaded = reader.read_all()
```
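On the receiving side, the stream buffer parses back into a table with no intermediate format. A minimal sketch, assuming the `buffer` and `table` objects from the snippet above:

```python
# Parse an Arrow IPC stream back into a Table
import pyarrow.ipc as ipc

reader = ipc.open_stream(buffer)  # also accepts bytes or a readable file object
received = reader.read_all()      # reassembles all record batches into a Table
assert received.equals(table)     # the round trip preserves schema and data
```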
### JavaScript (Arrow JS)
```typescript
// src/data/arrow-client.ts — Read Arrow data in the browser
import { Table, tableFromIPC, tableToIPC } from "apache-arrow";
// Fetch Arrow IPC data from an API
async function fetchArrowData(url: string) {
const response = await fetch(url);
const buffer = await response.arrayBuffer();
// Parse Arrow IPC format (zero-copy in WASM-backed implementations)
const table = tableFromIPC(new Uint8Array(buffer));
console.log(`Loaded ${table.numRows} rows, ${table.numCols} columns`);
console.log("Schema:", table.schema.fields.map((f) => `${f.name}: ${f.type}`));
// Access columns
const ids = table.getChild("id");
const values = table.getChild("value");
// Iterate rows
for (const row of table) {
console.log(row.toJSON()); // { id: 1, value: 10.0 }
}
return table;
}
// Send Arrow data to a server
async function sendArrowData(url: string, table: Table) {
const buffer = tableToIPC(table);
await fetch(url, {
method: "POST",
headers: { "Content-Type": "application/vnd.apache.arrow.stream" },
body: buffer,
});
}
```
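For `fetchArrowData` to have something to fetch, a backend needs to serve the IPC stream bytes with the matching content type. A minimal sketch of one possible endpoint; Flask and the `/api/metrics` route are assumptions for illustration, not part of this skill:

```python
# Hypothetical HTTP endpoint that serves a table as an Arrow IPC stream
import pyarrow as pa
import pyarrow.ipc as ipc
from flask import Flask, Response  # assumed web framework for this sketch

app = Flask(__name__)

@app.get("/api/metrics")
def metrics():
    table = pa.table({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return Response(
        sink.getvalue().to_pybytes(),
        mimetype="application/vnd.apache.arrow.stream",
    )
```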
## Installation
```bash
# Python
pip install pyarrow
# JavaScript
npm install apache-arrow
# With DuckDB (Arrow-native)
pip install duckdb # DuckDB uses Arrow internally
```
## Examples
### Example 1: Integrating Apache Arrow into an existing application
**User request:**
```
Add Apache Arrow to my Next.js app so the analytics dashboard can stream columnar data from our API.
```
The agent installs `apache-arrow` on the frontend and `pyarrow` on the data service, adds an API route that serializes query results to the Arrow IPC stream format with the `application/vnd.apache.arrow.stream` content type, and wires the frontend to parse responses with `tableFromIPC`. It handles error cases and sends large results as a stream of record batches rather than one monolithic payload.
### Example 2: Optimizing zero-copy interop performance
**User request:**
```
My Apache Arrow pipeline is slow and uses too much memory. Help me optimize it.
```
The agent reviews the current implementation, identifies issues (reading every column, no predicate pushdown, converting through CSV/JSON instead of Arrow IPC, loading whole files instead of batches), and applies optimizations specific to Arrow's capabilities: column pruning and filter pushdown on Parquet reads, `iter_batches()` for large files, zero-copy conversions to pandas/Polars, and zstd-compressed Parquet for storage.
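A minimal before/after sketch of that kind of change, with hypothetical file and column names:

```python
import pyarrow.parquet as pq

# Before: read every column and row, then filter in pandas
df = pq.read_table("events.parquet").to_pandas()
df = df[df["revenue"] > 100][["user_id", "revenue"]]

# After: column pruning and predicate pushdown happen inside the Parquet reader
table = pq.read_table(
    "events.parquet",
    columns=["user_id", "revenue"],
    filters=[("revenue", ">", 100)],
)
df = table.to_pandas()
```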
## Guidelines
1. **Parquet for storage, Arrow for compute** — Write Parquet to disk/S3; use Arrow in-memory for processing
2. **Column pruning** — Always specify `columns=` when reading Parquet; reading all columns wastes I/O and memory
3. **Predicate pushdown** — Use `filters=` in Parquet reads; the reader skips row groups that don't match
4. **Zero-copy when possible** — Use `to_pandas(self_destruct=True)` for large tables; Arrow can transfer memory ownership
5. **Batch processing for large files** — Use `iter_batches()` instead of reading entire files into memory
6. **IPC for microservices** — Arrow IPC is faster than JSON/CSV for data exchange between services
7. **Partitioned datasets for scale** — Partition by date/category; queries only scan relevant partitions
8. **DuckDB for Arrow queries** — DuckDB can query Arrow tables in place with zero copy: its replacement scans resolve a PyArrow table by its Python variable name, e.g. `duckdb.sql("SELECT ... FROM my_arrow_table")` (see the sketch below)
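A minimal sketch of that pattern, with a hypothetical `users` table:

```python
# Query a PyArrow table with DuckDB without copying the column data
import duckdb
import pyarrow as pa

users = pa.table({
    "user_id": [1, 2, 3],
    "revenue": [150.0, 320.5, 1200.0],
})

# DuckDB's replacement scan resolves `users` to the Python variable above
result = duckdb.sql("SELECT user_id, revenue FROM users WHERE revenue > 200")
arrow_result = result.arrow()  # fetch the result back as an Arrow table
print(arrow_result)
```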
## Related Skills

- **College Football Data (CFB)**: Before writing queries, consult `references/api-reference.md` for endpoints, conference IDs, team IDs, and data shapes.
- **College Basketball Data (CBB)**: Before writing queries, consult `references/api-reference.md` for endpoints, conference IDs, team IDs, and data shapes.
- **validating-database-integrity**: Use when you need to ensure database integrity through comprehensive data validation. Validates data types, ranges, formats, referential integrity, and business rules. Trigger with phrases like "validate database data", "implement data validation rules", "enforce data integrity constraints", or "validate data formats".
- **forecasting-time-series-data**: Forecasts future values from historical time series data, identifying trends, seasonality, and other patterns. Use when asked to predict future values, analyze trends over time, or derive insights from time-dependent data. Trigger terms include "forecast", "predict", "time series analysis", and "future values".
- **generating-test-data**: Generates realistic test data (users, products, orders, and custom schemas) via the test-data-generator plugin. Use when populating databases, simulating user behavior, or creating fixtures for automated tests. Trigger phrases include "generate test data", "create fake users", "populate database", or "generate data based on schema".
- **test-data-builder**: Auto-activating skill for test automation; triggers on "test data builder". Part of the Test Automation skill category.
- **splitting-datasets**: Splits datasets into training, validation, and testing sets for ML model development. Trigger with "split dataset", "train-test split", or "data partitioning".
- **scanning-database-security**: Security scanning and vulnerability detection for databases. Trigger with phrases like "scan for vulnerabilities", "implement security controls", or "audit security".
- **preprocessing-data-with-automated-pipelines**: Automates data cleaning, transformation, and validation for ML tasks. Trigger with "preprocess data", "clean data", "ETL pipeline", or "data transformation".
- **optimizing-database-connection-pooling**: Connection pooling and management guidance. Trigger with phrases like "manage connections", "configure pooling", or "optimize connection usage".
- **modeling-nosql-data**: Designs NoSQL data models for MongoDB and DynamoDB, covering embedding vs. referencing, access-pattern optimization, and shard key selection. Trigger with "NoSQL data model", "design MongoDB schema", or "create DynamoDB table".
- **monitoring-database-transactions**: Health monitoring and alerting guidance. Trigger with phrases like "monitor system health", "set up alerts", or "track metrics".