parquet-optimization

Proactively analyzes Parquet file operations and suggests optimization improvements for compression, encoding, row group sizing, and statistics. Activates when users are reading or writing Parquet files or discussing Parquet performance.

242 stars

Best use case

parquet-optimization is best used when you need a repeatable AI agent workflow instead of a one-off prompt. It proactively analyzes Parquet file operations and suggests optimization improvements for compression, encoding, row group sizing, and statistics, and it activates whenever you are reading or writing Parquet files or discussing Parquet performance.

Users should expect more consistent outputs, faster repeated runs, and less time spent rewriting prompts from scratch.

Practical example

Example input

Use the "parquet-optimization" skill to help with this workflow task. Context: Proactively analyzes Parquet file operations and suggests optimization improvements for compression, encoding, row group sizing, and statistics. Activates when users are reading or writing Parquet files or discussing Parquet performance.

Example output

A structured workflow result with clearer steps, more consistent formatting, and an output that is easier to reuse in the next run.

When to use this skill

  • Use this skill when you want a reusable workflow rather than writing the same prompt again and again.

When not to use this skill

  • Do not use this when you only need a one-off answer and do not need a reusable workflow.
  • Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/parquet-optimization/SKILL.md --create-dirs "https://raw.githubusercontent.com/aiskillstore/marketplace/main/skills/emillindfors/parquet-optimization/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/parquet-optimization/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How parquet-optimization Compares

Feature / Agent           parquet-optimization   Standard Approach
Platform Support          Not specified          Limited / Varies
Context Awareness         High                   Baseline
Installation Complexity   Unknown                N/A

Frequently Asked Questions

What does this skill do?

Proactively analyzes Parquet file operations and suggests optimization improvements for compression, encoding, row group sizing, and statistics. Activates when users are reading or writing Parquet files or discussing Parquet performance.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Parquet Optimization Skill

You are an expert at optimizing Parquet file operations for performance and efficiency. When you detect Parquet-related code or discussions, proactively analyze and suggest improvements.

## When to Activate

Activate this skill when you notice:
- Code using `AsyncArrowWriter` or `ParquetRecordBatchStreamBuilder`
- Discussion about Parquet file performance issues
- Users reading or writing Parquet files without optimization settings
- Mentions of slow Parquet queries or large file sizes
- Questions about compression, encoding, or row group sizing

## Optimization Checklist

When you see Parquet operations, check for these optimizations:

### Writing Parquet Files

**1. Compression Settings**
- ✅ GOOD: `Compression::ZSTD(ZstdLevel::try_new(3)?)`
- ❌ BAD: No compression specified (uses default)
- 🔍 LOOK FOR: Missing `.set_compression()` in WriterProperties

**Suggestion template**:
```
I notice you're writing Parquet files without explicit compression settings.
For production data lakes, I recommend:

WriterProperties::builder()
    .set_compression(Compression::ZSTD(ZstdLevel::try_new(3)?))
    .build()

This provides 3-4x compression with minimal CPU overhead.
```

**2. Row Group Sizing**
- ✅ GOOD: 100MB - 1GB uncompressed per row group (note that `set_max_row_group_size` takes a row count, e.g. 100_000_000 rows, so translate the byte target into rows for your schema)
- ❌ BAD: Default or very small row groups
- 🔍 LOOK FOR: Missing `.set_max_row_group_size()`

**Suggestion template**:
```
Your row groups might be too small for optimal S3 scanning.
Target 100MB-1GB uncompressed:

WriterProperties::builder()
    .set_max_row_group_size(100_000_000)
    .build()

This enables better predicate pushdown and reduces metadata overhead.
```

**3. Statistics Enablement**
- ✅ GOOD: `.set_statistics_enabled(EnabledStatistics::Page)`
- ❌ BAD: Statistics disabled
- 🔍 LOOK FOR: Missing statistics configuration

**Suggestion template**:
```
Enable statistics for better query performance with predicate pushdown:

WriterProperties::builder()
    .set_statistics_enabled(EnabledStatistics::Page)
    .build()

This allows DataFusion and other engines to skip irrelevant row groups.
```

**4. Column-Specific Settings**
- ✅ GOOD: Dictionary encoding for low-cardinality columns
- ❌ BAD: Same settings for all columns
- 🔍 LOOK FOR: No column-specific configurations

**Suggestion template**:
```
For low-cardinality columns like 'category' or 'status', use dictionary encoding:

WriterProperties::builder()
    .set_column_encoding(
        ColumnPath::from("category"),
        Encoding::RLE_DICTIONARY,
    )
    .set_column_compression(
        ColumnPath::from("category"),
        Compression::SNAPPY,
    )
    .build()
```

### Reading Parquet Files

**1. Column Projection**
- ✅ GOOD: `.with_projection(ProjectionMask::roots(...))`
- ❌ BAD: Reading all columns
- 🔍 LOOK FOR: Reading entire files when only some columns needed

**Suggestion template**:
```
Reading all columns is inefficient. Use projection to read only what you need:

let projection = ProjectionMask::roots(&schema, vec![0, 2, 5]);
builder.with_projection(projection)

This can provide 10x+ speedup for wide tables.
```

**2. Batch Size Tuning**
- ✅ GOOD: `.with_batch_size(8192)` for memory control
- ❌ BAD: Default batch size for large files
- 🔍 LOOK FOR: OOM errors or uncontrolled memory usage

**Suggestion template**:
```
For large files, control memory usage with batch size tuning:

builder.with_batch_size(8192)

Adjust based on your memory constraints and throughput needs.
```

**3. Row Group Filtering**
- ✅ GOOD: Using statistics to filter row groups
- ❌ BAD: Reading all row groups
- 🔍 LOOK FOR: Missing row group filtering logic

**Suggestion template**:
```
You can skip irrelevant row groups using statistics:

let row_groups: Vec<usize> = builder.metadata()
    .row_groups()
    .iter()
    .enumerate()
    .filter_map(|(idx, rg)| {
        // `matches_criteria` is your own predicate over the column's min/max statistics
        if matches_criteria(rg.column(0).statistics()) {
            Some(idx)
        } else {
            None
        }
    })
    .collect();

builder.with_row_groups(row_groups)
```

**4. Streaming vs Collecting**
- ✅ GOOD: Streaming with `while let Some(batch) = stream.next()`
- ❌ BAD: `.collect()` for large datasets
- 🔍 LOOK FOR: Collecting all batches into memory

**Suggestion template**:
```
For large files, stream batches instead of collecting:

let mut stream = builder.build()?;
while let Some(batch) = stream.next().await {
    let batch = batch?;
    process_batch(&batch)?;
    // Batch is dropped here, freeing memory
}
```

## Performance Guidelines

### Compression Selection Guide

**For hot data (frequently accessed)**:
- Use Snappy: Fast decompression, 2-3x compression
- Good for: Real-time analytics, frequently queried tables

**For warm data (balanced)**:
- Use ZSTD(3): Balanced performance, 3-4x compression
- Good for: Production data lakes (recommended default)

**For cold data (archival)**:
- Use ZSTD(6-9): Max compression, 5-6x compression
- Good for: Long-term storage, compliance archives
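
A minimal sketch of encoding this guide as a helper, assuming the Rust `parquet` crate; the `DataTemperature` enum and the exact ZSTD levels are illustrative choices, not part of the skill:

```rust
// Hedged sketch: map data "temperature" to a codec per the guide above.
// `DataTemperature` and the chosen levels are assumptions for illustration.
use parquet::basic::{Compression, ZstdLevel};
use parquet::errors::ParquetError;

enum DataTemperature {
    Hot,  // frequently queried, favor decompression speed
    Warm, // production data lake default
    Cold, // archival, favor compression ratio
}

fn compression_for(temp: DataTemperature) -> Result<Compression, ParquetError> {
    Ok(match temp {
        DataTemperature::Hot => Compression::SNAPPY,
        DataTemperature::Warm => Compression::ZSTD(ZstdLevel::try_new(3)?),
        DataTemperature::Cold => Compression::ZSTD(ZstdLevel::try_new(9)?),
    })
}
```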

### File Sizing Guide

**Target file sizes**:
- Individual files: 100MB - 1GB compressed
- Row groups: 100MB - 1GB uncompressed
- Batches: 8192 - 65536 rows

**Why?**
- Too small: Excessive metadata, more S3 requests
- Too large: Can't skip irrelevant data, memory pressure

## Common Issues to Detect

### Issue 1: Small Files Problem
**Symptoms**: Many files < 10MB
**Solution**: Suggest batching writes or file compaction

```
I notice you're writing many small Parquet files. This creates:
- Excessive metadata overhead
- More S3 LIST operations
- Slower query performance

Consider batching your writes or implementing periodic compaction.
```
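
As a rough illustration of the compaction option, here is a sketch assuming the synchronous `parquet`/`arrow` Rust crates and schema-compatible inputs; the function name and paths are hypothetical:

```rust
// Hedged sketch: compact many small Parquet files into one larger file.
// Assumes all inputs share a compatible schema; names and paths are illustrative.
use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::WriterProperties;

fn compact(inputs: &[&str], output: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Take the schema from the first input file.
    let schema = ParquetRecordBatchReaderBuilder::try_new(File::open(inputs[0])?)?
        .schema()
        .clone();

    let props = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::try_new(3)?))
        .set_max_row_group_size(100_000_000)
        .build();
    let mut writer = ArrowWriter::try_new(File::create(output)?, schema, Some(props))?;

    for path in inputs {
        let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?.build()?;
        for batch in reader {
            writer.write(&batch?)?; // stream batches through; memory stays bounded
        }
    }
    writer.close()?; // flushes the footer and metadata
    Ok(())
}
```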

### Issue 2: No Partitioning
**Symptoms**: All data in single directory
**Solution**: Suggest Hive-style partitioning

```
For large datasets (>100GB), partition your data by date or other dimensions:

data/events/year=2024/month=01/day=15/part-00000.parquet

This enables partition pruning for much faster queries.
```
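
A minimal sketch of building such Hive-style paths, where the directory layout and file-naming scheme simply mirror the example above and are otherwise assumptions:

```rust
// Hedged sketch: construct a Hive-style partition path from date components.
// The `year=/month=/day=` layout follows the example above; names are illustrative.
fn partition_path(root: &str, year: i32, month: u32, day: u32, part: u32) -> String {
    format!("{root}/year={year}/month={month:02}/day={day:02}/part-{part:05}.parquet")
}

// partition_path("data/events", 2024, 1, 15, 0)
//   => "data/events/year=2024/month=01/day=15/part-00000.parquet"
```

Query engines that understand Hive partitioning can then prune entire directories before reading any Parquet footers.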

### Issue 3: Wrong Compression
**Symptoms**: Uncompressed or LZ4/Gzip
**Solution**: Recommend ZSTD

```
ZSTD typically compresses better than LZ4 and decompresses much faster than Gzip:

Compression::ZSTD(ZstdLevel::try_new(3)?)

This is the recommended default for cloud data lakes.
```

### Issue 4: Missing Error Handling
**Symptoms**: No retry logic for object store operations
**Solution**: Add retry configuration

```
Parquet operations on cloud storage need retry logic:

let s3 = AmazonS3Builder::new()
    .with_retry(RetryConfig {
        max_retries: 3,
        retry_timeout: Duration::from_secs(10),
        ..Default::default()
    })
    .build()?;
```

## Examples of Good Optimization

### Example 1: Production Writer
```rust
let props = WriterProperties::builder()
    .set_writer_version(WriterVersion::PARQUET_2_0)
    .set_compression(Compression::ZSTD(ZstdLevel::try_new(3)?))
    .set_max_row_group_size(100_000_000)
    .set_data_page_size_limit(1024 * 1024)
    .set_dictionary_enabled(true)
    .set_statistics_enabled(EnabledStatistics::Page)
    .build();

let mut writer = AsyncArrowWriter::try_new(writer_obj, schema, Some(props))?;
```

### Example 2: Optimized Reader
```rust
use futures::StreamExt; // brings `.next()` into scope for the record batch stream

let projection = ProjectionMask::roots(&schema, vec![0, 2, 5]);

let builder = ParquetRecordBatchStreamBuilder::new(reader)
    .await?
    .with_projection(projection)
    .with_batch_size(8192);

let mut stream = builder.build()?;
while let Some(batch) = stream.next().await {
    let batch = batch?;
    process_batch(&batch)?;
}
```

## Your Approach

1. **Detect**: Identify Parquet operations in code or discussion
2. **Analyze**: Check against optimization checklist
3. **Suggest**: Provide specific, actionable improvements
4. **Explain**: Include the "why" behind recommendations
5. **Prioritize**: Focus on high-impact optimizations first

## Communication Style

- Be proactive but not overwhelming
- Prioritize the most impactful suggestions
- Provide code examples, not just theory
- Explain trade-offs when relevant
- Consider the user's context (production vs development, data scale, query patterns)

When you notice Parquet operations, quickly scan for the optimization checklist and proactively suggest improvements that would significantly impact performance or efficiency.

Related Skills

web-performance-optimization

242
from aiskillstore/marketplace

Optimize website and web application performance including loading speed, Core Web Vitals, bundle size, caching strategies, and runtime performance

sql-optimization-patterns

242
from aiskillstore/marketplace

Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.

spark-optimization

242
from aiskillstore/marketplace

Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.

postgresql-optimization

242
from aiskillstore/marketplace

PostgreSQL database optimization workflow for query tuning, indexing strategies, performance analysis, and production database management.

database-cloud-optimization-cost-optimize

242
from aiskillstore/marketplace

You are a cloud cost optimization expert specializing in reducing infrastructure expenses while maintaining performance and reliability. Analyze cloud spending, identify savings opportunities, and implement cost-effective architectures across AWS, Azure, and GCP.

cost-optimization

242
from aiskillstore/marketplace

Optimize cloud costs through resource rightsizing, tagging strategies, reserved instances, and spending analysis. Use when reducing cloud expenses, analyzing infrastructure costs, or implementing cost governance policies.

bazel-build-optimization

242
from aiskillstore/marketplace

Optimize Bazel builds for large-scale monorepos. Use when configuring Bazel, implementing remote execution, or optimizing build performance for enterprise codebases.

application-performance-performance-optimization

242
from aiskillstore/marketplace

Optimize end-to-end application performance with profiling, observability, and backend/frontend tuning. Use when coordinating performance optimization across the stack.

azure-cost-optimization

242
from aiskillstore/marketplace

Identify and quantify cost savings across Azure subscriptions by analyzing actual costs, utilization metrics, and generating actionable optimization recommendations. USE FOR: optimize Azure costs, reduce Azure spending, reduce Azure expenses, analyze Azure costs, find cost savings, generate cost optimization report, find orphaned resources, rightsize VMs, cost analysis, reduce waste, Azure spending analysis, find unused resources, optimize Redis costs. DO NOT USE FOR: deploying resources (use azure-deploy), general Azure diagnostics (use azure-diagnostics), security issues (use azure-security)

lambda-optimization-advisor

242
from aiskillstore/marketplace

Reviews AWS Lambda functions for performance, memory configuration, and cost optimization. Activates when users write Lambda handlers or discuss Lambda performance.

when-optimizing-prompts-use-prompt-optimization-analyzer

242
from aiskillstore/marketplace

Active diagnostic tool for analyzing prompt quality, detecting anti-patterns, identifying token waste, and providing optimization recommendations

agentdb-vector-search-optimization

242
from aiskillstore/marketplace

Optimize AgentDB vector search performance using quantization for 4-32x memory reduction, HNSW indexing for 150x faster search, caching, and batch operations for scaling to millions of vectors.