education-data-query

Downloads education datasets from configured mirror sources (parquet/CSV) with local Polars filtering. Use when writing fetch scripts or retrieving CCD, IPEDS, CRDC, SAIPE data. Load after education-data-explorer — retrieval here, not discovery.

160 stars

byDAAF-Contribution-Community

View on GitHub Installation ↓

Best use case

education-data-query is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using education-data-query should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/education-data-query/SKILL.md --create-dirs "https://raw.githubusercontent.com/DAAF-Contribution-Community/daaf/main/.claude/skills/education-data-query/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/education-data-query/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How education-data-query Compares

Feature / Agent	education-data-query	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Education Data Query

Downloads education datasets from configured mirror sources (parquet or CSV) using priority-ordered fallback, with local Polars filtering. Use when writing Stage 5 fetch scripts, downloading a specific CCD, IPEDS, CRDC, SAIPE, or other education dataset by path, discovering which files are available on a mirror, or retrieving codebook metadata. Load after using education-data-explorer to identify endpoints — this skill handles actual data retrieval, not endpoint discovery.

Download datasets from the Education Data Portal via configured mirror sources (defined in mirrors.yaml). Mirrors are tried in priority order. All filtering is done locally with Polars. The mirror data originates from the Urban Institute Education Data Portal (EDP), which is a curation and standardization layer over original federal data sources — data has been restructured with lowercase variable names, integer-encoded categoricals, and standardized missing value codes (`-1`, `-2`, `-3`).

## What This Skill Does

- Download education datasets from configured mirrors
- Handle multiple file formats (parquet, CSV) based on mirror read_strategy
- Apply year, state, and demographic filters locally with Polars
- Discover available files via each mirror's discovery endpoint

> **Skill Provenance Note:** Each `*-data-source-*` skill includes
> `provenance.skill_last_updated` in its frontmatter. Before fetching data,
> check this date — if it is more than a few months old, the source skill's
> documentation about column definitions, coded values, and quality patterns
> may have drifted from the current data. Consider re-running data-ingest to
> re-verify before relying on stale skill guidance for query construction.

## Reference File Structure

| File | Purpose | When to Read |
|------|---------|--------------|
| `mirrors.yaml` | Mirror URLs, priority, format, timeouts, metadata config | Understanding mirror configuration |
| `fetch-patterns.md` | Code patterns for mirror-based fetching | Writing Stage 5 fetch scripts |
| `datasets-reference.md` | Known dataset file paths by source | Finding the right file path for a dataset |
| `filters-reference.md` | Complete filter variables | Filtering downloaded data locally |
| `query-patterns.md` | Endpoint path structure reference | Understanding URL/path naming conventions |

## Mirror System Overview

Data is fetched by downloading files from mirrors:

```
Fetch Request (dataset, years, filters)
    → Try each mirror in priority order (per mirrors.yaml)
        → Build URL from mirror's url_template + dataset paths
        → Read using mirror's read_strategy (eager_parquet, lazy_csv, etc.)
    → If all mirrors fail: STOP and escalate
    → Save to data/raw/*.parquet
    → CP1 validation (source-agnostic)
```

### Mirror Configuration

Mirrors are defined in `./references/mirrors.yaml` with priority ordering. Each mirror specifies:
- `url_template` — how to build download URLs
- `read_strategy` — how Polars reads the format (eager_parquet, lazy_csv)
- `discovery` — how to check what files are available

See `./references/mirrors.yaml` for the full configuration and instructions on adding new mirrors.

### Mirror File Discovery

Before fetching, you can check what files are available using each mirror's discovery endpoint (defined in mirrors.yaml):

```python
# Generic discovery — works with any mirror that supports it
# See fetch-patterns.md for the full discover_mirror_files() function
from fetch_patterns import discover_mirror_files

# Check primary mirror
files = discover_mirror_files(MIRRORS[0])
if files is not None:
    print(f"Available files: {len(files)}")
```

This eliminates guessing — if the file exists in a mirror, use it; if not, fall through to the next.

## Decision Trees

### "How should I get this data?"

```
What dataset do you need?
├─ Know the exact file path?
│   └─ Use fetch_from_mirrors() with that path → ./references/fetch-patterns.md
├─ Know the source but not the exact filename?
│   └─ Check ./references/datasets-reference.md for known paths
├─ Not sure what's available?
│   └─ Query mirror discovery endpoint to list all files → ./references/fetch-patterns.md
├─ Need a codebook or metadata file?
│   └─ Check codebook column in ./references/datasets-reference.md → get_codebook_url() in ./references/fetch-patterns.md
└─ Dataset not in any mirror?
    └─ STOP and escalate — dataset may need to be added to mirror
```

### "Is my dataset a single file or yearly files?"

```
Check datasets-reference.md:
├─ Type = "Single" → One file with all years
│   └─ Use fetch_from_mirrors() → filter years locally
└─ Type = "Yearly" → One file per year
    └─ Use fetch_yearly_from_mirrors() → concatenate results
```

### "How do I filter results?"

All filtering is done locally with Polars after download:

```python
# By state
df = df.filter(pl.col("fips") == 6)  # California

# By year
df = df.filter(pl.col("year").is_in([2020, 2021, 2022]))

# By school type
df = df.filter(pl.col("charter") == 1)

# Multiple filters
df = df.filter(
    (pl.col("fips") == 6) &
    (pl.col("charter") == 1) &
    (pl.col("school_level") == 3)
)
```

## Dataset Path Structure

All mirrors use the same canonical path. Each mirror appends its own format extension (`.parquet`, `.csv`) via its `url_template` in mirrors.yaml:

```
{source}/{filename}
```

| Component | Description | Examples |
|-----------|-------------|----------|
| `source` | Data source | `ccd`, `ipeds`, `crdc`, `saipe`, `edfacts` |
| `filename` | Dataset file | `schools_ccd_directory`, `districts_saipe` |

Example paths:
- `saipe/districts_saipe` (SAIPE district poverty)
- `ccd/schools_ccd_directory` (CCD school directory)
- `ccd/schools_ccd_enrollment_2022` (CCD enrollment, yearly)

See `./references/datasets-reference.md` for the complete file path listing.

## Format Handling

Format-specific read behavior is driven by each mirror's `read_strategy` field (see `mirrors.yaml`):

### `eager_parquet`
```python
df = pl.read_parquet(url)  # Polars reads HTTP URLs natively
```

### `lazy_csv`
```python
# Always use lazy loading for large files
df = (
    pl.scan_csv(url, infer_schema_length=10000)
    .filter(pl.col("year").is_in(YEARS))
    .filter(pl.col("fips") == STATE_FIPS)
    .collect()
)
```

See `./references/fetch-patterns.md` for complete code patterns.

## Portal Integer Encoding

**CRITICAL:** The Portal uses integer codes, not string labels. This affects filtering and interpretation.

### Demographic Variable Encodings

| Variable | Integer Values | NOT These Strings |
|----------|----------------|-------------------|
| Race | 1-7, 99 (total) | WH, BL, HI, AS, etc. |
| Sex | 1 (Male), 2 (Female), 3 (Another gender, IPEDS 2022+), 4 (Unknown gender, IPEDS 2022+), 9 (Unknown), 99 (Total) | M, F |
| Grade | -1 to 13, 99 (total) | PK, KG, 01, etc. |

### Grade Encoding (SEMANTIC TRAP!)

| Value | Meaning | URL Path Equivalent |
|-------|---------|---------------------|
| -1 | Pre-K (**NOT missing!**) | `grade-pk` |
| 0 | Kindergarten | `grade-k` |
| 1-12 | Grades 1-12 | `grade-1` to `grade-12` |
| 99 | Total | `grade-99` |

```python
# WRONG - filters out Pre-K students!
df = df.filter(pl.col("grade") >= 0)

# RIGHT - Pre-K students have grade = -1
pre_k = df.filter(pl.col("grade") == -1)
total = df.filter(pl.col("grade") == 99)
```

### Variable Names Are Lowercase

Portal variable names are lowercase:
- `enrollment` not `MEMBER`
- `grade` not `GRADE`
- `fips` not `FIPS`

See `./references/filters-reference.md` for complete encoding tables.

## Common FIPS Codes

| Code | State | Code | State | Code | State |
|------|-------|------|-------|------|-------|
| 1 | Alabama | 17 | Illinois | 36 | New York |
| 2 | Alaska | 18 | Indiana | 37 | North Carolina |
| 4 | Arizona | 19 | Iowa | 39 | Ohio |
| 5 | Arkansas | 20 | Kansas | 40 | Oklahoma |
| 6 | California | 21 | Kentucky | 41 | Oregon |
| 8 | Colorado | 22 | Louisiana | 42 | Pennsylvania |
| 9 | Connecticut | 24 | Maryland | 44 | Rhode Island |
| 10 | Delaware | 25 | Massachusetts | 45 | South Carolina |
| 11 | DC | 26 | Michigan | 47 | Tennessee |
| 12 | Florida | 27 | Minnesota | 48 | Texas |
| 13 | Georgia | 29 | Missouri | 49 | Utah |
| 15 | Hawaii | 32 | Nevada | 51 | Virginia |
| 16 | Idaho | 34 | New Jersey | 53 | Washington |

See `./references/filters-reference.md` for complete list.

## Cross-References

- **Discover endpoints:** Load `education-data-explorer` skill to browse available endpoints and variables
- **Interpret data:** Load `education-data-context` skill after fetching for variable meanings and caveats
- **Deep source understanding:** Load `education-data-source-*` skills for comprehensive methodology

### Data Source Skills Quick Reference

| Source | Skill | Key Fetch Considerations |
|--------|-------|--------------------------|
| CCD | `education-data-source-ccd` | Use grade-99 for totals; FRPL affected by CEP |
| CRDC | `education-data-source-crdc` | Biennial only; 2015+ for complete coverage; CSV requires `schema_overrides` for ID cols (see CRDC skill) |
| EDFacts | `education-data-source-edfacts` | Use `_midpt` vars; states not comparable |
| IPEDS | `education-data-source-ipeds` | GRS limited to first-time full-time |
| Scorecard | `education-data-source-scorecard` | High suppression; Title IV recipients only |
| SAIPE | `education-data-source-saipe` | Model estimates; population not enrollment |
| FSA | `education-data-source-fsa` | Federal aid only; 1-3 year lag |
| MEPS | `education-data-source-meps` | Better than FRPL for cross-state |
| PSEO | `education-data-source-pseo` | Experimental; check state coverage |

## Topic Index

| Topic | Location |
|-------|----------|
| Mirror configuration | `./references/mirrors.yaml` |
| Fetch code patterns | `./references/fetch-patterns.md` |
| Dataset file paths | `./references/datasets-reference.md` |
| URL/path naming conventions | `./references/query-patterns.md` |
| Filter variables | `./references/filters-reference.md` |
| Codebook/metadata URLs | `./references/datasets-reference.md` (codebook column), `./references/fetch-patterns.md` (get_codebook_url) |
| FIPS codes | This file, `./references/filters-reference.md` |
| CCD source details | `education-data-source-ccd` skill |
| CRDC source details | `education-data-source-crdc` skill |
| EDFacts source details | `education-data-source-edfacts` skill |
| IPEDS source details | `education-data-source-ipeds` skill |
| Scorecard source details | `education-data-source-scorecard` skill |
| SAIPE source details | `education-data-source-saipe` skill |
| FSA source details | `education-data-source-fsa` skill |
| MEPS source details | `education-data-source-meps` skill |
| NHGIS source details | `education-data-source-nhgis` skill |

Related Skills

election-data-source-countypres

160

from DAAF-Contribution-Community/daaf

County Presidential Returns 2000-2024 (MIT MEDSL). Vote shares, party trends, turnout by county_fips (joins census/education data). Requires HARVARD_DATAVERSE_API_KEY. Critical: mode='TOTAL' drops ~1K counties post-2020 — use 3-pattern reconstruction

education-data-source-scorecard

160

from DAAF-Contribution-Community/daaf

College Scorecard — post-enrollment outcomes linking aid records to IRS/Treasury earnings. Earnings, loan repayment, debt via six Portal sub-datasets. Use when tax-record-based earnings needed. Tracks only Title IV aid recipients, not all students.

education-data-source-saipe

160

from DAAF-Contribution-Community/daaf

SAIPE — annual Census poverty estimates for school districts (Portal; county/state not in Portal). Use for district poverty, Title I context, or trends. ~18-month lag. No race/ethnicity disaggregation at district level — use ACS 5-year for that.

education-data-source-pseo

160

from DAAF-Contribution-Community/daaf

PSEO — Census data linking graduates to employment via LEHD wage records. Earnings percentiles at 1/5/10 years post-graduation by institution, degree, CIP. Use for graduate earnings analysis. Coverage: ~29% of graduates from ~31 states.

education-data-source-nhgis

160

from DAAF-Contribution-Community/daaf

NHGIS — census geography crosswalks via Portal: links schools (ncessch) and colleges (unitid) to tracts, block groups, CBSAs (1990-2020). Census demographics NOT in Portal — access NHGIS directly. Use for linking education data to census geography.

education-data-source-nccs

160

from DAAF-Contribution-Community/daaf

NCCS — Form 990 data for private nonprofit colleges (Portal: IPEDS-matched, 1993-2016). Revenue, expenses, assets, endowment, governance beyond IPEDS. Use when IRS financial depth needed. Portal ends 2016; public institutions excluded (no Form 990).

education-data-source-nacubo

160

from DAAF-Contribution-Community/daaf

NACUBO endowment data (~650 institutions, 2012-2022). Portal: 7 columns only (total endowment, per-FTE, YoY change). Use for endowment size/trends. Full investment/spending needs direct NACUBO access. For all-institution coverage use IPEDS finance.

education-data-source-meps

160

from DAAF-Contribution-Community/daaf

MEPS — Urban Institute modeled school-level poverty (% at 100% FPL), from CCD + SAIPE (public schools, 2009-2022, 2-3yr lag). Use when FRPL is unreliable due to CEP. Consistent cross-state measurement. Public schools only.

education-data-source-ipeds

160

from DAAF-Contribution-Community/daaf

IPEDS — primary federal postsecondary data (~6,500 institutions, 1980-present): enrollment, completions, graduation rates, finance, aid, admissions, HR. For college/university analysis. Grad rates = first-time full-time; finance needs GASB/FASB care.

education-data-source-fsa

160

from DAAF-Contribution-Community/daaf

FSA — Title IV aid at institution level (~5,500 institutions, 1999-2021). Pell Grants, Direct/PLUS loans, campus-based aid, financial responsibility scores, 90/10 metrics. Use for aid distribution, loan volume, or for-profit analysis. By unitid.

education-data-source-edfacts

160

from DAAF-Contribution-Community/daaf

EDFacts — K-12 outcomes: assessment proficiency, ACGR graduation rates, ESSA accountability at school/district level (2009-2020). Within-state trends and subgroup gaps. Complements CCD with outcome data. Cannot compare across states — use NAEP.

education-data-source-eada

160

from DAAF-Contribution-Community/daaf

EADA — college athletics gender equity (~2,000+ institutions, 2002-2021). Participation, coaching, salaries, expenses, revenues, athletic aid by gender. Not Title IX compliance data. No sector column; join IPEDS on unitid for institution type.