education-data-source-nhgis

NHGIS — census geography crosswalks via Portal: links schools (ncessch) and colleges (unitid) to tracts, block groups, CBSAs (1990-2020). Census demographics NOT in Portal — access NHGIS directly. Use for linking education data to census geography.

Best use case

education-data-source-nhgis is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using education-data-source-nhgis should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/education-data-source-nhgis/SKILL.md --create-dirs "https://raw.githubusercontent.com/DAAF-Contribution-Community/daaf/main/.claude/skills/education-data-source-nhgis/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/education-data-source-nhgis/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How education-data-source-nhgis Compares

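| Feature / Agent | education-data-source-nhgis | Standard Approach |
|-----------------|-----------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |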

Frequently Asked Questions

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# NHGIS Data Source Reference

IPUMS NHGIS — census geography crosswalks and demographic data for education research. Via the Education Data Portal: geographic crosswalk tables linking K-12 schools (ncessch) and colleges (unitid) to census tracts, block groups, CBSAs, and regions (census 1990-2020). Census demographic variables (income, poverty, race, educational attainment) are NOT in the Portal — access directly from NHGIS via free IPUMS registration. Use when linking school or institutional data to census geography for contextual analysis.

Census geography and demographic data source for education research. NHGIS provides the foundation for linking schools to community characteristics via census tracts, block groups, and school district boundaries.

> **CRITICAL: Value Encoding**
>
> When accessing NHGIS data through the Education Data Portal (not NHGIS directly), categorical variables use **integer encodings**, not string labels. Always verify the exact codes in the mirror codebook.
>
> | Variable | Integer Code | Meaning |
> |----------|--------------|---------|
> | `census_region` | `1` | Northeast |
> | `census_region` | `2` | Midwest |
> | `census_region` | `3` | South |
> | `census_region` | `4` | West |
> | `cbsa_type` | `1` | Metropolitan |
> | `cbsa_type` | `2` | Micropolitan |
> | `geocode_accuracy` | `4` | Did not geocode |
>
> See `./references/variable-catalog.md` for complete encoding tables.

> **CRITICAL: Portal Data Scope**
>
> The Education Data Portal provides ONLY **geographic crosswalk tables** that link schools and colleges to census geography (tracts, block groups, regions, CBSAs). These contain geographic identifiers and assignment columns — approximately 35-47 columns per file.
>
> The Portal does **NOT** provide census demographic data (population, income, poverty, race, educational attainment, housing, language, etc.). For demographic variables, you must access NHGIS directly via IPUMS (free registration required). See `./references/data-access.md` for direct access methods.
>
> This skill documents both contexts: Portal crosswalk data (with integer encodings above) and direct NHGIS census variables (in `./references/variable-catalog.md`, clearly marked as requiring direct NHGIS access).

## What is NHGIS?

NHGIS (the National Historical Geographic Information System, maintained by IPUMS at the University of Minnesota) provides free access to census geography and demographic data.

- **Collector**: IPUMS, University of Minnesota
- **Coverage**: US census data from 1790-present (decennial census + ACS)
- **Content**: Summary tables, GIS boundary files, time series tables, geographic crosswalks
- **Frequency**: Decennial census (every 10 years) + ACS (annual, 5-year rolling)
- **Available years**: 1790-2020 (decennial), 2005-2023 (ACS 5-year)
- **Primary identifiers**: GISJOIN (NHGIS internal), GEOID (Census Bureau standard)
- **Education relevance**: Links school locations to community demographics via census tracts, block groups, and school district boundaries
- **Available through Education Data Portal**: Geographic crosswalk tables only (school-to-census and college-to-census links for census 1990, 2000, 2010, 2020). Census demographic data requires direct NHGIS access.

## Reference File Structure

| File | Purpose | When to Read |
|------|---------|--------------|
| `geographic-units.md` | Census geography hierarchy (tracts, blocks, districts) | Understanding census geography |
| `school-geography-links.md` | Linking schools to census areas | Connecting school data to demographics |
| `time-series.md` | Historical data, harmonization methods | Longitudinal analysis |
| `variable-catalog.md` | Key demographic variables, codes, special values | Selecting census variables or interpreting encodings |
| `boundary-changes.md` | How boundaries change between censuses | Handling geographic inconsistencies |
| `data-access.md` | Direct NHGIS access methods (registration, Data Finder, ipumspy) | Custom census analysis beyond Portal |

## Decision Trees

### What geographic level should I use?

```
Research question about...
├─ Individual schools
│   ├─ School's immediate neighborhood → Census tract or block group
│   ├─ School attendance zone → SABINS (limited years) or block-to-school crosswalk
│   └─ School district overall → School district boundaries
├─ School districts
│   ├─ District-level demographics → School district geographic level
│   ├─ Within-district variation → Census tracts within district
│   └─ District poverty estimates → SAIPE (via Education Data Portal)
├─ Regional patterns
│   ├─ County-level → County boundaries
│   ├─ Metro area → CBSA (Core Based Statistical Area)
│   └─ State-level → State boundaries
└─ Historical analysis
    ├─ Consistent boundaries needed → Geographically standardized tables
    └─ Original boundaries OK → Nominally integrated tables
```

### How do I link schools to census data?

```
Linking schools to census demographics?
├─ Have school coordinates (lat/lon)
│   ├─ Point-in-polygon → Spatial join to tract/block group boundaries
│   └─ Need tract ID only → Geocoding service or FCC API
├─ Have school NCES ID only
│   ├─ Use NCES EDGE files → School District Geographic Relationship Files
│   └─ Use Education Data Portal → NHGIS source provides tract links
├─ Need school attendance zones
│   ├─ 2009-2012 data → SABINS school areas
│   └─ Current data → Contact school district (no national source)
└─ See ./references/school-geography-links.md for details
```
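When the Portal route in the tree above is used, the block identifier arrives as an Int64, so any leading zero is lost and the value cannot be joined directly to Census Bureau or NHGIS tables. The sketch below shows one way to rebuild string identifiers. It assumes the standard 15-digit block GEOID layout (state 2, county 3, tract 6, block 4) and the NHGIS GISJOIN pattern of a `G` prefix with a `0` appended to the state and county codes. The helper names are hypothetical, and the GISJOIN pattern should be confirmed against a real NHGIS extract before use.

```python
def block_geoid(geoid_block: int) -> str:
    """Restore the 15-digit block GEOID, including a leading zero lost in the Int64."""
    return str(geoid_block).zfill(15)


def tract_geoid(geoid_block: int) -> str:
    """First 11 digits of the block GEOID: state (2) + county (3) + tract (6)."""
    return block_geoid(geoid_block)[:11]


def tract_gisjoin(geoid_block: int) -> str:
    """Assumed NHGIS pattern: 'G' + state + '0' + county + '0' + tract."""
    geoid = tract_geoid(geoid_block)
    state, county, tract = geoid[:2], geoid[2:5], geoid[5:]
    return f"G{state}0{county}0{tract}"


example = 60014021001001  # geoid_block example from the Key Identifiers table

print(block_geoid(example))    # 060014021001001
print(tract_geoid(example))    # 06001402100
print(tract_gisjoin(example))  # G0600010402100
```

The tract GEOID produced here matches the `GEOID` example in the Key Identifiers table, which is a quick sanity check that the zero-padding is doing what is intended.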

### What time period data do I need?

```
Time period?
├─ Single recent year
│   ├─ Tract/block group level → ACS 5-year (most recent)
│   ├─ Larger areas (65K+ pop) → ACS 1-year
│   └─ Full census count → 2020 Decennial Census
├─ Historical comparison
│   ├─ Same boundaries across time → Geographically standardized tables (to 2010)
│   ├─ Original boundaries → Nominally integrated time series
│   └─ Custom standardization → Use geographic crosswalks
├─ Long time series (1970+)
│   └─ See ./references/time-series.md
└─ Pre-1970
    └─ Limited tract coverage; county/state more complete
```
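For the "Custom standardization" branch, a geographic crosswalk is typically applied by multiplying each source-area count by an interpolation weight and summing the results per target area. The following is a minimal sketch of that arithmetic only. The tract labels, weights, and counts are made up, and real NHGIS crosswalk files use their own column names and weight definitions, so check the file documentation before adapting it.

```python
from collections import defaultdict

# Hypothetical crosswalk rows: (source tract, target tract, interpolation weight).
# Weights for each source tract sum to 1, so totals are preserved.
crosswalk = [
    ("A", "X", 1.0),
    ("B", "X", 0.25),
    ("B", "Y", 0.75),
]

# Hypothetical counts reported on the source (older) tract boundaries.
source_counts = {"A": 1000, "B": 400}


def reallocate(counts, rows):
    """Allocate source-area counts to target areas using crosswalk weights."""
    out = defaultdict(float)
    for source, target, weight in rows:
        out[target] += counts.get(source, 0) * weight
    return dict(out)


target_counts = reallocate(source_counts, crosswalk)
print(target_counts)  # {'X': 1100.0, 'Y': 300.0}
```

Weighted allocation suits counts (people, households). Medians, rates, and other non-additive measures cannot be reallocated this way and need to be recomputed from their component counts.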

## Quick Reference: Geographic Levels and Variables

### Geographic Levels

| Level | Typical Size | Education Use | NHGIS Coverage |
|-------|--------------|---------------|----------------|
| Block | ~40 people | Point locations | 1990-2020 |
| Block Group | ~1,500 people | School neighborhoods | 1990-2020 |
| Census Tract | ~4,000 people | Community context | 1910-2020 |
| County Subdivision | Varies | Rural areas | 1980-2020 |
| Place | City/town | Urban context | 1980-2020 |
| School District | Varies | District analysis | 2000-2020 |
| County | ~100,000 people | Regional patterns | 1790-2020 |
| State | Varies | Policy analysis | 1790-2020 |

### Key Identifiers

| ID | Format | Level | Example | Notes |
|----|--------|-------|---------|-------|
| `ncessch` | Int64 | School | `10000201704` | NCES school ID (schools Portal data) |
| `unitid` | Int64 | College | `100654` | IPEDS institution ID (colleges Portal data) |
| `GISJOIN` | String with prefix | Any | `G0600010` | NHGIS internal ID; use for direct NHGIS joins (not in Portal data) |
| `GEOID` | Numeric string | Any | `06001402100` | Census Bureau standard; use for non-NHGIS joins (not in Portal data) |
| `tract` | Int64 | Tract | `402100` | Census tract number (in Portal data) |
| `block_group` | Int64 | Block Group | `1` | Block group within tract (1-9; 0=unassigned) |
| `geoid_block` | Int64 | Block | `60014021001001` | Full block FIPS code (in Portal data — stored as Int64, not String) |
| `cbsa` | Int64 | Metro area | `41860` | Core Based Statistical Area code (2000+ census files only) |

### Key Education Variables

| Topic | Example Variables | Source |
|-------|-------------------|--------|
| Child population | Under 18, 5-17 school-age | Decennial, ACS |
| Race/ethnicity | Hispanic, White, Black, Asian, etc. | Decennial, ACS |
| Poverty | Persons below poverty, SNAP receipt | ACS (sample) |
| Educational attainment | HS diploma, BA+ (adults) | ACS (sample) |
| Language | English proficiency, language at home | ACS (sample) |
| Housing | Owner/renter, median value, crowding | Decennial, ACS |
| Family structure | Single-parent, grandparent households | ACS (sample) |
| Immigration | Foreign-born, recent immigrants | ACS (sample) |

### Data Sources by Type

| Source | Years | Geographic Detail | Content |
|--------|-------|-------------------|---------|
| Decennial Census | 1790-2020 | Block (1990+) | 100% count: age, sex, race, housing |
| ACS 5-Year | 2005-2023 | Block group | Sample: income, education, language |
| ACS 1-Year | 2010-2023 | Areas 65K+ pop | Sample: same as 5-year |
| Time Series | 1790-2020 | Varies | Harmonized across years |
| Geographic Crosswalks | 1990-2020 | Block+ | Interpolation weights |

### Portal Variables (Schools NHGIS)

Key geographic and identifying columns in the schools NHGIS datasets. Census 2020 files have 47 columns; earlier census years have fewer (e.g., 1990 has 35 columns — no CBSA or legislative district fields).

| Variable | Description | Type |
|----------|-------------|------|
| `ncessch` | NCES school ID | Int64 |
| `leaid` | NCES district ID | Int64 |
| `tract` | Census tract number | Int64 |
| `block_group` | Block group number (1-9; 0 = unassigned) | Int64 |
| `geoid_block` | Full block FIPS identifier | Int64 |
| `census_region` | Census Bureau region (1-4, 9) | Int64 |
| `census_division` | Census Bureau division (1-9) | Int64 |
| `cbsa` | CBSA code (2000+ census files only) | Int64 |
| `cbsa_type` | Metropolitan (1) or Micropolitan (2) | Int64 |
| `cbsa_city` | Principal city indicator (0=No, 1=Yes; 2000+ only). See note below. | Int64 |
| `geocode_accuracy` | Geocode confidence (1=High, 2=Medium, 3=Low, 4=Did not geocode, -2=N/A) | Float64 |
| `geocode_accuracy_detailed` | Geocode match type (1-12) | Int64 |
| `class_code` | FIPS place class code | Int64 |
| `lower_chamber_type` | State legislative district lower chamber type (1-8; census 2010 only). See `variable-catalog.md` for code mapping. | Int64 |
| `geo_latitude` / `geo_longitude` | Geocoded coordinates | Float64 |
| `latitude` / `longitude` | CCD-reported coordinates (many nulls in early years) | Float64 |
| `fips` | State FIPS code | Int64 |
| `puma` | Public Use Microdata Area (2000+ census files only) | Int64 |

### Portal Variables (Colleges NHGIS)

Colleges NHGIS datasets have 38 columns (2020 census). Different identifier set from schools.

| Variable | Description | Type |
|----------|-------------|------|
| `unitid` | IPEDS institution ID | Int64 |
| `opeid` | Office of Postsecondary Education ID | String |
| `tract` | Census tract number | Int64 |
| `block_group` | Block group number (1-9) | Int64 |
| `geoid_block` | Full block FIPS identifier | Int64 |
| `census_region` | Census Bureau region (1-4, 9) | Int64 |
| `census_division` | Census Bureau division (1-9) | Int64 |
| `cbsa` | CBSA code | Int64 |
| `cbsa_type` | Metropolitan (1) or Micropolitan (2) | Int64 |
| `cbsa_city` | Principal city indicator (0=No, 1=Yes; 2000+ only) | Int64 |
| `geocode_accuracy` | Geocode match score (Int64 in colleges, Float64 in schools) | Int64 |
| `county_fips` | County FIPS code | Int64 |
| `county_name` | County name | String |
| `state_abbr` | State abbreviation | String |

### Missing Data Codes

| Code | Meaning | When Used |
|------|---------|-----------|
| `-2` | Not applicable | `geocode_accuracy` field in Portal data (code `4` means did not geocode) |
| `-1` | Missing/not reported | General missing data indicator (e.g., `latitude`, `county_code`) |
| `0` | Unassigned | `block_group` (rare, ~4 rows in schools) |
| `null` | Not available | Variable not applicable to this record; many columns heavily null in early years |
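The codes above are easy to mistake for real values in numeric columns. As a minimal sketch, the snippet below keeps only records with a usable geocode and an assigned block group. The rows and `ncessch` values are hypothetical, the code sets are taken from the tables in this document, and they should be verified against the codebook before use.

```python
# Codes treated as missing, per the Missing Data Codes table.
MISSING_CODES = {-1, -2}

# geocode_accuracy values that indicate a successful geocode (1=High, 2=Medium, 3=Low).
GEOCODED = {1, 2, 3}


def clean(value):
    """Map Portal missing-data codes and nulls to None; pass other values through."""
    if value is None or value in MISSING_CODES:
        return None
    return value


# Hypothetical school records; geocode_accuracy is Float64 in the schools files.
rows = [
    {"ncessch": 1, "geocode_accuracy": 1.0, "block_group": 2},
    {"ncessch": 2, "geocode_accuracy": 4.0, "block_group": 1},   # did not geocode
    {"ncessch": 3, "geocode_accuracy": -2.0, "block_group": 0},  # missing code, unassigned
    {"ncessch": 4, "geocode_accuracy": None, "block_group": 3},  # null
]

usable = [
    row for row in rows
    if clean(row["geocode_accuracy"]) in GEOCODED and row["block_group"] != 0
]
print([row["ncessch"] for row in usable])  # [1]
```

Applying `clean` column by column, rather than to a whole table at once, avoids erasing legitimate negative values such as longitudes.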

> **Schema Difference:** Schools NHGIS 2020 files (47 columns) have a different schema than colleges NHGIS 2020 files (38 columns). Schools data includes school-specific identifiers (`ncessch`, `leaid`, `school_name`, mailing/location address fields) while colleges data includes institution-specific identifiers (`unitid`, `opeid`, `inst_name`, `county_name`). Both entity types have block-level geographic precision. Earlier census years have fewer columns (e.g., Schools 1990 has 35 columns — no CBSA or legislative district fields). Do not assume identical column structures when working across entities or census years.
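One way to act on this note is to compare schemas before combining files. The sketch below uses abbreviated, hand-written schema dictionaries based on the variable tables in this document. In practice the schema would be read from the loaded file, and the helper names here are hypothetical.

```python
# Abbreviated schemas copied from the Portal variable tables (not the full column lists).
schools_schema = {
    "ncessch": "Int64",
    "leaid": "Int64",
    "tract": "Int64",
    "geocode_accuracy": "Float64",
}
colleges_schema = {
    "unitid": "Int64",
    "opeid": "String",
    "tract": "Int64",
    "geocode_accuracy": "Int64",
}


def shared_columns(schema_a, schema_b):
    """Columns present in both schemas, in the order they appear in schema_a."""
    return [name for name in schema_a if name in schema_b]


def type_mismatches(schema_a, schema_b):
    """Shared columns whose declared types differ, with both types reported."""
    return {
        name: (schema_a[name], schema_b[name])
        for name in shared_columns(schema_a, schema_b)
        if schema_a[name] != schema_b[name]
    }


print(shared_columns(schools_schema, colleges_schema))   # ['tract', 'geocode_accuracy']
print(type_mismatches(schools_schema, colleges_schema))  # {'geocode_accuracy': ('Float64', 'Int64')}
```

Any column reported by `type_mismatches` should be cast to a common type before the two entity types are stacked or compared.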

## Data Access

Datasets for NHGIS are available via the mirror system. See `datasets-reference.md` for canonical paths, `mirrors.yaml` for mirror configuration, and `fetch-patterns.md` for fetch code patterns.

| Dataset | Type | Years | Path | Codebook |
|---------|------|-------|------|----------|
| Schools Census 1990 | Single | 1986-2023 | `nhgis/schools_nhgis_geog_1990` | `nhgis/codebook_schools_nhgis_census1990` |
| Schools Census 2000 | Single | 1986-2023 | `nhgis/schools_nhgis_geog_2000` | `nhgis/codebook_schools_nhgis_census2000` |
| Schools Census 2010 | Single | 1986-2023 | `nhgis/schools_nhgis_geog_2010` | `nhgis/codebook_schools_nhgis_census2010` |
| Schools Census 2020 | Single | 1986-2023 | `nhgis/schools_nhgis_geog_2020` | `nhgis/codebook_schools_nhgis_census2020` |
| Colleges Census 1990 | Single | 1980-2023 | `nhgis/colleges_nhgis_geog_1990` | `nhgis/codebook_colleges_nhgis_census1990` |
| Colleges Census 2000 | Single | 1980-2023 | `nhgis/colleges_nhgis_geog_2000` | `nhgis/codebook_colleges_nhgis_census2000` |
| Colleges Census 2010 | Single | 1980-2023 | `nhgis/colleges_nhgis_geog_2010` | `nhgis/codebook_colleges_nhgis_census2010` |
| Colleges Census 2020 | Single | 1980-2023 | `nhgis/colleges_nhgis_geog_2020` | `nhgis/codebook_colleges_nhgis_census2020` |

Codebooks are `.xls` files co-located with data in all mirrors. Use `get_codebook_url()` from `fetch-patterns.md` to construct download URLs.

> **Truth Hierarchy:** When interpreting variable values, apply this priority:
> 1. **Actual data file** (what you observe in the parquet/CSV) — this IS the truth
> 2. **Live codebook** (.xls in mirror) — authoritative documentation, may lag
> 3. **This skill documentation** — convenient summary, may drift from codebook
>
> If this documentation contradicts the codebook, trust the codebook. If the codebook contradicts observed data, trust the data and investigate.

### Filtering

```python
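import polars as pl

# Assumes `df` is a polars DataFrame already loaded from one of the
# Portal NHGIS files listed above (see fetch-patterns.md for loading).

# Filter to a specific school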
school_census = df.filter(pl.col("ncessch") == 10000201704)

# Filter to metropolitan areas only (cbsa_type only in 2000+ census files)
metro = df.filter(pl.col("cbsa_type") == 1)

# Filter to a specific census region (South)
south = df.filter(pl.col("census_region") == 3)

# Filter to a specific year
recent = df.filter(pl.col("year") == 2023)
```
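Because the filters above compare against integer codes, it is often useful to attach readable labels afterwards. The following is a minimal sketch in plain Python (no DataFrame library required) using only the code-to-label pairs listed in the Value Encoding table. The sample rows, the second `ncessch` value, and the `decode` helper are hypothetical, and the mappings should be checked against the codebook.

```python
# Code-to-label maps copied from the Value Encoding table in this document.
CENSUS_REGION = {1: "Northeast", 2: "Midwest", 3: "South", 4: "West"}
CBSA_TYPE = {1: "Metropolitan", 2: "Micropolitan"}


def decode(value, labels):
    """Return the label for an integer code, or None if the code is null or unlisted."""
    if value is None:
        return None
    return labels.get(int(value))


# Hypothetical rows standing in for Portal records.
rows = [
    {"ncessch": 10000201704, "census_region": 3, "cbsa_type": 1},
    {"ncessch": 10000201705, "census_region": 9, "cbsa_type": None},
]

decoded = [
    {
        **row,
        "census_region_label": decode(row["census_region"], CENSUS_REGION),
        "cbsa_type_label": decode(row["cbsa_type"], CBSA_TYPE),
    }
    for row in rows
]

for row in decoded:
    print(row["ncessch"], row["census_region_label"], row["cbsa_type_label"])
```

Code `9` falls inside the documented `census_region` range (1-4, 9) but has no label in the summary table, so it decodes to `None` here rather than to a guessed label.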

> **Note**: The Portal provides pre-processed school/college-to-census-geography links. For custom census analysis (tract-level demographics, time series, boundary files), use NHGIS directly via methods in `./references/data-access.md` (requires free IPUMS registration).

## Common Pitfalls

| Pitfall | Issue | Solution |
|---------|-------|----------|
| Boundary changes | Tracts split/merged between censuses break longitudinal analysis | Use crosswalks or geographically standardized tables |
| ACS margins of error | Small-area estimates have high uncertainty | Check MOE; aggregate areas if needed |
| Block data limitations | Only 100% count variables available (no income/poverty) | Use block groups for sample data (ACS) |
| GISJOIN vs GEOID | Different ID formats cause join failures | Use GISJOIN for NHGIS joins, GEOID for Census Bureau joins |
| 2020 Census noise | Differential privacy added noise to small-area counts | Check for negative values; prefer ACS for detailed characteristics |
| Schools vs colleges schema | Different column counts (47 vs 38 for 2020) and identifier sets | Check schema before joining; do not assume identical structures |
| Census year schema drift | Earlier census files have fewer columns (e.g., 1990 lacks CBSA/legislative fields) | Check available columns per census year before relying on them |
| geocode_accuracy type | Float64 in schools, Int64 in colleges | Cast to consistent type before cross-entity comparison |
| Using string codes | Portal data uses integer encodings, not string labels | Always verify codes against codebook (see encoding warning above) |
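The "aggregate areas if needed" advice for ACS margins of error can be made concrete. This is a minimal sketch, assuming the commonly documented Census Bureau approximation in which the MOE of a sum is the square root of the sum of squared MOEs (it ignores covariance between areas). The block-group values are made up, and the function names are hypothetical.

```python
import math


def aggregate(estimates, moes):
    """Sum estimates and combine their MOEs with the root-sum-of-squares approximation."""
    total = sum(estimates)
    total_moe = math.sqrt(sum(moe * moe for moe in moes))
    return total, total_moe


def relative_moe(estimate, moe):
    """MOE as a share of the estimate; smaller means a more reliable estimate."""
    return moe / estimate


# Hypothetical block-group estimates with their margins of error.
estimates = [120, 80]
moes = [30, 40]

total, total_moe = aggregate(estimates, moes)
print(total, total_moe)                 # 200 50.0
print(relative_moe(80, 40))             # 0.5
print(relative_moe(total, total_moe))   # 0.25
```

In this example the weaker block group has an MOE equal to half its estimate, while the combined area drops to a quarter, which is the gain that aggregation is meant to buy.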

## Related Data Sources

| Source | Relationship | When to Use |
|--------|--------------|-------------|
| `education-data-source-ccd` | School identifiers for linking | Join school data to census geography via `ncessch` |
| `education-data-source-saipe` | District-level poverty | Use SAIPE for district poverty; NHGIS for tract/block group poverty |
| `education-data-source-meps` | School-level poverty | MEPS provides school-level poverty estimates; NHGIS provides community context |
| `education-data-source-ipeds` | College identifiers for linking | Join college data to census geography via `unitid` |
| `education-data-explorer` | Parent discovery skill | Finding available endpoints |
| `education-data-query` | Data fetching | Downloading parquet/CSV files |

## Topic Index

| Topic | Reference File |
|-------|---------------|
| Census tract definition | `./references/geographic-units.md` |
| Block group definition | `./references/geographic-units.md` |
| School district boundaries | `./references/geographic-units.md` |
| School-to-tract linking | `./references/school-geography-links.md` |
| SABINS attendance areas | `./references/school-geography-links.md` |
| NCES EDGE files | `./references/school-geography-links.md` |
| Time series tables | `./references/time-series.md` |
| Geographic standardization | `./references/time-series.md` |
| Geographic crosswalks | `./references/time-series.md` |
| Population variables | `./references/variable-catalog.md` |
| Income/poverty variables | `./references/variable-catalog.md` |
| Education variables | `./references/variable-catalog.md` |
| Tract boundary changes | `./references/boundary-changes.md` |
| 2022 Connecticut changes | `./references/boundary-changes.md` |
| TIGER/Line versions | `./references/boundary-changes.md` |
| Direct NHGIS access | `./references/data-access.md` |
| ipumspy Python package | `./references/data-access.md` |
| Data Finder workflow | `./references/data-access.md` |

Related Skills

election-data-source-countypres

160
from DAAF-Contribution-Community/daaf

County Presidential Returns 2000-2024 (MIT MEDSL). Vote shares, party trends, turnout by county_fips (joins census/education data). Requires HARVARD_DATAVERSE_API_KEY. Critical: mode='TOTAL' drops ~1K counties post-2020 — use 3-pattern reconstruction

education-data-source-scorecard

160
from DAAF-Contribution-Community/daaf

College Scorecard — post-enrollment outcomes linking aid records to IRS/Treasury earnings. Earnings, loan repayment, debt via six Portal sub-datasets. Use when tax-record-based earnings needed. Tracks only Title IV aid recipients, not all students.

education-data-source-saipe

160
from DAAF-Contribution-Community/daaf

SAIPE — annual Census poverty estimates for school districts (Portal; county/state not in Portal). Use for district poverty, Title I context, or trends. ~18-month lag. No race/ethnicity disaggregation at district level — use ACS 5-year for that.

education-data-source-pseo

160
from DAAF-Contribution-Community/daaf

PSEO — Census data linking graduates to employment via LEHD wage records. Earnings percentiles at 1/5/10 years post-graduation by institution, degree, CIP. Use for graduate earnings analysis. Coverage: ~29% of graduates from ~31 states.

education-data-source-nccs

160
from DAAF-Contribution-Community/daaf

NCCS — Form 990 data for private nonprofit colleges (Portal: IPEDS-matched, 1993-2016). Revenue, expenses, assets, endowment, governance beyond IPEDS. Use when IRS financial depth needed. Portal ends 2016; public institutions excluded (no Form 990).

education-data-source-nacubo

160
from DAAF-Contribution-Community/daaf

NACUBO endowment data (~650 institutions, 2012-2022). Portal: 7 columns only (total endowment, per-FTE, YoY change). Use for endowment size/trends. Full investment/spending needs direct NACUBO access. For all-institution coverage use IPEDS finance.

education-data-source-meps

160
from DAAF-Contribution-Community/daaf

MEPS — Urban Institute modeled school-level poverty (% at 100% FPL), from CCD + SAIPE (public schools, 2009-2022, 2-3yr lag). Use when FRPL is unreliable due to CEP. Consistent cross-state measurement. Public schools only.

education-data-source-ipeds

160
from DAAF-Contribution-Community/daaf

IPEDS — primary federal postsecondary data (~6,500 institutions, 1980-present): enrollment, completions, graduation rates, finance, aid, admissions, HR. For college/university analysis. Grad rates = first-time full-time; finance needs GASB/FASB care.

education-data-source-fsa

160
from DAAF-Contribution-Community/daaf

FSA — Title IV aid at institution level (~5,500 institutions, 1999-2021). Pell Grants, Direct/PLUS loans, campus-based aid, financial responsibility scores, 90/10 metrics. Use for aid distribution, loan volume, or for-profit analysis. By unitid.

education-data-source-edfacts

160
from DAAF-Contribution-Community/daaf

EDFacts — K-12 outcomes: assessment proficiency, ACGR graduation rates, ESSA accountability at school/district level (2009-2020). Within-state trends and subgroup gaps. Complements CCD with outcome data. Cannot compare across states — use NAEP.

education-data-source-eada

160
from DAAF-Contribution-Community/daaf

EADA — college athletics gender equity (~2,000+ institutions, 2002-2021). Participation, coaching, salaries, expenses, revenues, athletic aid by gender. Not Title IX compliance data. No sector column; join IPEDS on unitid for institution type.

education-data-source-crdc

160
from DAAF-Contribution-Community/daaf

CRDC — biennial OCR survey of all U.S. public schools (2011-2021). Discipline, course access, harassment, restraint/seclusion by race/sex/disability/EL. Use for civil rights and equity analysis. 2020-21 COVID-impacted; 2011-14 sampled, not universe.