polars

Polars DataFrame library for high-performance data manipulation. Lazy/eager execution, expressions, I/O (CSV, Parquet, JSON), aggregations, joins, string/datetime ops, pandas interop. Use for Polars DataFrames or reading/writing Parquet files.

Best use case

polars is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using polars should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/polars/SKILL.md --create-dirs "https://raw.githubusercontent.com/DAAF-Contribution-Community/daaf/main/.claude/skills/polars/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/polars/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How polars Compares

| Feature / Agent | polars | Standard Approach |
|-----------------|--------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Polars DataFrame library for high-performance data manipulation. Lazy/eager execution, expressions, I/O (CSV, Parquet, JSON), aggregations, joins, string/datetime ops, pandas interop. Use for Polars DataFrames or reading/writing Parquet files.

Where can I find the source code?

The source is on GitHub in the DAAF-Contribution-Community/daaf repository, at .claude/skills/polars/SKILL.md.

SKILL.md Source

# Polars Skill

Polars DataFrame library for high-performance data manipulation in Python. Covers lazy/eager execution, expressions, I/O (CSV, Parquet, JSON, database), aggregations, joins, string/datetime operations, pandas/NumPy interop, and performance optimization. Use when working with Polars DataFrames, migrating from pandas, reading Parquet files, or optimizing data pipeline performance.

Comprehensive skill for high-performance data manipulation with Polars. Use the decision trees below to find the right guidance, then load the detailed reference files.

## What is Polars?

Polars is a **fast** DataFrame library for Python (and Rust):
- **Fast**: Written in Rust, optimized for modern CPUs with SIMD and parallelism
- **Lazy Evaluation**: Build query plans that get optimized before execution
- **Expressive**: Powerful expression API for complex transformations
- **Memory Efficient**: Columnar format, streaming for larger-than-memory data
- **No Dependencies**: Pure Rust core, no NumPy/Pandas required

## Version Notes

This skill targets **Polars 1.x** (tested with 1.37.1). Key changes from 0.x:
- `apply` renamed to `map_elements` (0.19+)
- `groupby` renamed to `group_by` (0.19+)
- `melt` renamed to `unpivot` (1.0+)
- Streaming engine improvements in 1.x
- `pl.Utf8` is now `pl.String` (0.20+; `pl.Utf8` still works as an alias)

## How to Use This Skill

### Reference File Structure

Each topic in `./references/` contains focused documentation:

| File | Purpose | When to Read |
|------|---------|--------------|
| `quickstart.md` | Installation, concepts, first DataFrame | Starting with Polars |
| `dataframes-series.md` | Creation, selection, filtering, modification | Basic data manipulation |
| `io-data.md` | CSV, Parquet, JSON, database I/O | Loading/saving data |
| `expressions.md` | Expression system, contexts, chaining | Understanding Polars idioms |
| `aggregations-grouping.md` | GroupBy, window functions, statistics | Summarizing data |
| `joins-concat.md` | Joins, concatenation, pivot/unpivot | Combining DataFrames |
| `strings-datetime-categorical.md` | String ops, datetime, categoricals | Type-specific operations |
| `performance.md` | Lazy execution, optimization, anti-patterns | Making code faster |
| `interop.md` | Pandas, NumPy, PyArrow, DuckDB | Working with other tools |
| `gotchas.md` | Common errors, anti-patterns, migration | Debugging issues |

### Reading Order

1. **New to Polars?** Start with `quickstart.md` then `expressions.md`
2. **Coming from Pandas?** Read `quickstart.md`, `expressions.md`, then `interop.md`
3. **Performance issues?** Check `performance.md` first

## Quick Decision Trees

### "I need to get started"

```
Getting started?
├─ Install Polars → ./references/quickstart.md
├─ Create first DataFrame → ./references/quickstart.md
├─ Understand lazy vs eager → ./references/quickstart.md
├─ Learn expression syntax → ./references/expressions.md
└─ Coming from Pandas → ./references/interop.md
```

### "I need to load or save data"

```
Loading/saving data?
├─ Read CSV file → ./references/io-data.md
├─ Read Parquet (recommended) → ./references/io-data.md
├─ Read JSON/NDJSON → ./references/io-data.md
├─ Read from database → ./references/io-data.md
├─ Read multiple files (glob) → ./references/io-data.md
├─ Write to file → ./references/io-data.md
└─ Larger-than-memory data → ./references/performance.md
```

### "I need to filter or select data"

```
Filtering/selecting?
├─ Select columns by name → ./references/dataframes-series.md
├─ Select by pattern/regex → ./references/dataframes-series.md
├─ Select by data type → ./references/dataframes-series.md
├─ Filter rows by condition → ./references/dataframes-series.md
├─ Filter with multiple conditions → ./references/dataframes-series.md
├─ Handle null values → ./references/dataframes-series.md
└─ Add/modify columns → ./references/dataframes-series.md
```

### "I need to aggregate or group data"

```
Aggregating data?
├─ Basic statistics (sum, mean, etc.) → ./references/aggregations-grouping.md
├─ Group by columns → ./references/aggregations-grouping.md
├─ Multiple aggregations → ./references/aggregations-grouping.md
├─ Window functions (over) → ./references/aggregations-grouping.md
├─ Rolling/moving averages → ./references/aggregations-grouping.md
├─ Cumulative operations → ./references/aggregations-grouping.md
└─ Ranking within groups → ./references/aggregations-grouping.md
```

### "I need to combine DataFrames"

```
Combining data?
├─ Join two DataFrames → ./references/joins-concat.md
├─ Left/right/outer join → ./references/joins-concat.md
├─ Anti-join (not in) → ./references/joins-concat.md
├─ Concatenate vertically → ./references/joins-concat.md
├─ Pivot (long to wide) → ./references/joins-concat.md
└─ Unpivot/melt (wide to long) → ./references/joins-concat.md
```

### "I need better performance"

```
Performance issues?
├─ Use lazy evaluation → ./references/performance.md
├─ Avoid row iteration → ./references/performance.md
├─ Reduce memory usage → ./references/performance.md
├─ Process large files → ./references/performance.md
├─ Optimize query plan → ./references/performance.md
└─ Common anti-patterns → ./references/performance.md
```

### "Something isn't working"

```
Having issues?
├─ Type errors → ./references/gotchas.md
├─ Null handling → ./references/gotchas.md
├─ Expression context errors → ./references/gotchas.md
├─ String operations → ./references/strings-datetime-categorical.md
├─ Date parsing issues → ./references/strings-datetime-categorical.md
├─ Performance problems → ./references/gotchas.md
├─ Pandas migration issues → ./references/gotchas.md
├─ Memory errors → ./references/gotchas.md
└─ General troubleshooting → ./references/gotchas.md
```

## File-First Execution in Research Workflows

**Important:** In data research pipelines (see `CLAUDE.md`), Polars transformations are executed through **script files**, not interactively. This ensures auditability and reproducibility.

**The pattern:**
1. Write transformation code to `scripts/stage{N}_{type}/{step}_{task-name}.py`
2. Execute the script via Bash through the wrapper script, which captures output automatically
3. Validation results are automatically embedded in the script as comments
4. If execution fails, create a versioned copy of the script for the fix

Closely read `agent_reference/SCRIPT_EXECUTION_REFERENCE.md` for the mandatory file-first execution protocol covering complete code file writing, output capture, and file versioning rules.

**See:**
- `agent_reference/SCRIPT_EXECUTION_REFERENCE.md` — Script execution protocol and format with validation

The examples below show Polars syntax. In research workflows, wrap them in scripts following the file-first pattern.

---

## Quick Reference

### Essential Import

```python
import polars as pl
import polars.selectors as cs  # For column selection by type
```

### Lazy vs Eager (One-Liner)

```python
# Eager: immediate execution
df = pl.read_csv("data.csv")

# Lazy: deferred, optimized execution (preferred for large data)
lf = pl.scan_csv("data.csv")
df = lf.collect()  # Execute when ready
```

### Core Expression Patterns

```python
# Select columns
df.select("a", "b")
df.select(pl.col("a"), pl.col("b"))
df.select(pl.all().exclude("id"))

# Filter rows
df.filter(pl.col("a") > 10)
df.filter((pl.col("a") > 10) & (pl.col("b") == "x"))

# Add/modify columns
df.with_columns(
    (pl.col("a") * 2).alias("a_doubled"),
    pl.col("b").str.to_uppercase().alias("b_upper")
)

# Conditional column
df.with_columns(
    pl.when(pl.col("a") > 10)
      .then(pl.lit("high"))
      .otherwise(pl.lit("low"))
      .alias("category")
)

# Group and aggregate
df.group_by("category").agg(
    pl.col("value").sum().alias("total"),
    pl.col("value").mean().alias("average"),
    pl.len().alias("count")
)
```

### Essential Functions

| Function | Purpose |
|----------|---------|
| `pl.col("name")` | Reference a column |
| `pl.lit(value)` | Literal value |
| `pl.all()` | All columns |
| `pl.exclude("col")` | All except specified |
| `pl.len()` | Row count |
| `pl.when().then().otherwise()` | Conditional logic |
| `.alias("name")` | Rename result |
| `.cast(pl.Int64)` | Convert type |

### Common Data Types

| Type | Description |
|------|-------------|
| `pl.Int64`, `pl.Int32` | Integers |
| `pl.Float64`, `pl.Float32` | Floats |
| `pl.String` (or `pl.Utf8`) | Strings |
| `pl.Boolean` | True/False |
| `pl.Date`, `pl.Datetime` | Dates and timestamps |
| `pl.Duration` | Time differences |
| `pl.Categorical` | Categorical strings |
| `pl.List` | List of values |
| `pl.Struct` | Named fields |

### Quick Cheatsheet

```python
# I/O
df = pl.read_csv("file.csv")  # also pl.read_parquet, pl.read_json
lf = pl.scan_csv("file.csv")  # Lazy; also pl.scan_parquet, pl.scan_ndjson
df.write_csv("out.csv")  # also df.write_parquet, df.write_json

# Selection
df.select("a", "b")
df.select(cs.numeric())  # By type

# Filtering
df.filter(pl.col("a") > 1)

# Aggregation
df.group_by("key").agg(pl.col("val").sum())

# Joining
df1.join(df2, on="key", how="left")

# Sorting
df.sort("col", descending=True)

# Lazy execution
lf.collect()  # Run query
lf.explain()  # Show plan
```

## Topic Index

| Topic | Reference File |
|-------|---------------|
| Installation | `./references/quickstart.md` |
| DataFrame Creation | `./references/quickstart.md` |
| Lazy vs Eager | `./references/quickstart.md` |
| Column Selection | `./references/dataframes-series.md` |
| Row Filtering | `./references/dataframes-series.md` |
| Adding Columns | `./references/dataframes-series.md` |
| CSV Files | `./references/io-data.md` |
| Parquet Files | `./references/io-data.md` |
| Database Connections | `./references/io-data.md` |
| Expressions | `./references/expressions.md` |
| Method Chaining | `./references/expressions.md` |
| Contexts | `./references/expressions.md` |
| GroupBy | `./references/aggregations-grouping.md` |
| Window Functions | `./references/aggregations-grouping.md` |
| Rolling Windows | `./references/aggregations-grouping.md` |
| Joins | `./references/joins-concat.md` |
| Concatenation | `./references/joins-concat.md` |
| Pivot/Unpivot | `./references/joins-concat.md` |
| String Operations | `./references/strings-datetime-categorical.md` |
| Datetime Handling | `./references/strings-datetime-categorical.md` |
| Categorical Data | `./references/strings-datetime-categorical.md` |
| Query Optimization | `./references/performance.md` |
| Memory Management | `./references/performance.md` |
| Anti-Patterns | `./references/performance.md` |
| Pandas Conversion | `./references/interop.md` |
| NumPy Integration | `./references/interop.md` |
| DuckDB Integration | `./references/interop.md` |
| Type Errors | `./references/gotchas.md` |
| qcut Label Gotcha | `./references/gotchas.md` |
| Null Handling Issues | `./references/gotchas.md` |
| Expression Context Errors | `./references/gotchas.md` |
| Performance Anti-Patterns | `./references/gotchas.md` |
| Migration from Pandas | `./references/gotchas.md` |
| Memory Issues | `./references/gotchas.md` |

## Citation

When this library is used as a primary analytical tool, include in the report's
Software & Tools references:

> Vink, R. et al. Polars: Blazingly fast DataFrames [Computer software]. https://pola.rs/

**Cite when:** Polars is the core data processing engine for the analysis (typically always true in DAAF pipelines).
**Do not cite when:** Only used for trivial file I/O in a script primarily using another tool.

Related Skills

svy

160
from DAAF-Contribution-Community/daaf

Complex survey analysis: strata/PSU/weights, variance estimation (Taylor, BRR, jackknife, bootstrap), survey GLM, domain analysis, calibration. Polars-native. Use for NHANES, CPS, ACS PUMS, BRFSS, DHS. Non-survey regression: statsmodels/pyfixest.

statsmodels

160
from DAAF-Contribution-Community/daaf

Statistical modeling: OLS/WLS/GLS, GLM (logit, probit, Poisson), time series (ARIMA, VAR), mixed effects, diagnostics. Formula API. Use for regressions without fixed effects, GLMs, or time series. For FE/DiD use pyfixest; panel/IV use linearmodels.

stata-python-translation

160
from DAAF-Contribution-Community/daaf

Stata-to-Python translation for data analysis. Maps Stata commands (reghdfe, xtreg, ivregress, margins, esttab, svy:) to Python (polars, pyfixest, statsmodels, svy). Use when user has Stata background or requests Stata-equivalent code comments.

skill-authoring

160
from DAAF-Contribution-Community/daaf

Guide for creating and auditing DAAF skills (SKILL.md). Covers frontmatter, metadata vocabulary, progressive disclosure, decision trees, reference files. Use when creating, reviewing, or debugging skill loading. For agent files, use agent-authoring.

science-communication

160
from DAAF-Contribution-Community/daaf

Translating technical findings for non-technical audiences. Narrative frameworks (Pyramid Principle, SCQA), plain-language translation, executive summaries, policy briefs, causal language. Use when presenting to stakeholders or reviewing deliverables.

r-python-translation

160
from DAAF-Contribution-Community/daaf

R-to-Python translation for data analysis. Maps R packages (tidyverse, ggplot2, fixest, survey, sf, plm) to Python equivalents (polars, plotnine, pyfixest, svy, geopandas). Use when user has R background or requests R-equivalent code comments.

pyfixest

160
from DAAF-Contribution-Community/daaf

Fast high-dimensional fixed effects: OLS, Poisson, IV with multi-way FE; DiD (TWFE, did2s, Sun-Abraham); clustered SEs; etable/coefplot/iplot. Use for FE regressions or DiD. For panel RE/between use linearmodels; for GLM without FE use statsmodels.

plotnine

160
from DAAF-Contribution-Community/daaf

plotnine static visualization (ggplot2 syntax for Python). Geoms, aesthetics, scales, coordinates, facets, themes. Use for static publication-quality figures with grammar-of-graphics syntax. For interactive charts use plotly; for maps use geopandas.

plotly

160
from DAAF-Contribution-Community/daaf

Plotly interactive visualization. Express and Graph Objects: scatter, line, bar, heatmap, 3D, geographic charts; subplots; styling; export. Use when interactivity (hover/zoom) is needed. For static figures use plotnine; for GIS use geopandas.

marimo

160
from DAAF-Contribution-Community/daaf

Reactive Python notebook system. Cell reactivity, UI elements (sliders, dropdowns, tables), SQL cells, plotting, app deployment. Use when assembling Stage 9 notebooks, building data apps, or converting Jupyter to marimo .py format.

linearmodels

160
from DAAF-Contribution-Community/daaf

Panel data, IV/GMM, system regression. PanelOLS (FE/RE), BetweenOLS, Fama-MacBeth, IV2SLS/LIML/GMM, SUR, 3SLS, Driscoll-Kraay SEs. Use for RE/between, system estimation, or GMM. Complements pyfixest (FE + DiD) and statsmodels (GLM + time series).

geopandas

160
from DAAF-Contribution-Community/daaf

Spatial data: GeoDataFrames, spatial joins, CRS/projections, choropleth/interactive maps, spatial autocorrelation, PySAL. Use for geographic data, spatial files (Shapefile, GeoPackage, GeoParquet), or spatial stats. For charts without GIS use plotly.