stata-data-cleaning

Clean, transform, and validate messy research data using Stata

191 stars

Best use case

stata-data-cleaning is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Clean, transform, and validate messy research data using Stata

Teams using stata-data-cleaning should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/stata-data-cleaning/SKILL.md --create-dirs "https://raw.githubusercontent.com/wentorai/research-plugins/main/skills/analysis/wrangling/stata-data-cleaning/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/stata-data-cleaning/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How stata-data-cleaning Compares

Feature / Agentstata-data-cleaningStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Clean, transform, and validate messy research data using Stata

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Stata Data Cleaning

Clean, transform, and validate messy research datasets in Stata. This skill covers the complete data preparation pipeline from raw survey or administrative data to analysis-ready datasets, with emphasis on documentation, reproducibility, and handling the common data quality issues encountered in social science, economics, and health research.

## Overview

Data cleaning typically consumes 60-80% of research time in empirical studies, yet it is often under-documented and poorly reproducible. Stata provides a powerful set of commands for data manipulation, but knowing which commands to use and in what order requires experience with common data quality issues: inconsistent coding, duplicate observations, string formatting problems, implausible values, and complex missing data patterns.

This skill provides a systematic, step-by-step data cleaning workflow in Stata. Each step produces a log of changes made, enabling full reproducibility and audit trails. The workflow is organized around the principle that raw data should never be modified in place -- instead, cleaning scripts transform raw data into processed datasets while preserving the original.

The approach follows best practices from the World Bank's DIME Analytics team and the J-PAL research transparency guidelines, making it suitable for projects that require rigorous data documentation for peer review, replication packages, or regulatory compliance.

## Initial Data Assessment

### Loading and Inspecting Data

```stata
* ============================================
* Data Cleaning Script: [Project Name]
* Author: [Name]
* Date: [Date]
* Input: raw/survey_data_raw.dta
* Output: processed/survey_data_clean.dta
* ============================================

clear all
set more off
log using "logs/cleaning_log.smcl", replace

* Load raw data
use "raw/survey_data_raw.dta", clear

* Basic inspection
describe
summarize
codebook, compact

* Check dimensions
display "Observations: " _N
display "Variables: " c(k)

* Check for duplicates on ID variable
duplicates report respondent_id
duplicates list respondent_id if duplicates(respondent_id) > 0
```

### Data Quality Report

```stata
* Generate a data quality summary
foreach var of varlist _all {
    quietly {
        count if missing(`var')
        local nmiss = r(N)
        local pctmiss = (`nmiss' / _N) * 100
    }
    if `pctmiss' > 0 {
        display "`var': `nmiss' missing (`pctmiss'%)"
    }
}

* Check value ranges for numeric variables
foreach var of varlist age income years_education {
    summarize `var', detail
    * Flag implausible values
    count if `var' < 0 & !missing(`var')
    count if `var' > 150 & !missing(`var')
}
```

## String Cleaning

### Standardizing Text Variables

```stata
* Trim whitespace
replace name = strtrim(name)
replace name = stritrim(name)  // Remove internal multiple spaces

* Standardize case
replace city = proper(city)        // Title case
replace country = upper(country)   // Upper case
replace email = lower(email)       // Lower case

* Remove special characters
replace phone = ustrregexra(phone, "[^0-9]", "")

* Fix encoding issues
replace name = ustrfix(name)

* Standardize common variations
replace department = "Computer Science" if ///
    inlist(department, "CS", "Comp Sci", "Comp. Sci.", "CompSci")

replace gender = "Female" if inlist(gender, "F", "f", "female", "FEMALE")
replace gender = "Male" if inlist(gender, "M", "m", "male", "MALE")
```

### Parsing Complex Strings

```stata
* Split full name into first and last
gen first_name = word(full_name, 1)
gen last_name = word(full_name, -1)

* Extract year from date string "March 15, 2024"
gen year = real(word(date_string, -1))

* Parse numeric values from strings like "$1,234.56"
gen income_clean = real(subinstr(subinstr(income_str, "$", "", .), ",", "", .))
```

## Missing Data Handling

### Identifying Missing Data Patterns

```stata
* Install missing data analysis tools
ssc install mdesc
ssc install misstable

* Summary of missing data
mdesc

* Missing data patterns
misstable summarize
misstable patterns

* Create missing indicator variables
foreach var of varlist income education occupation {
    gen mi_`var' = missing(`var')
}

* Test whether missing is random (Little's MCAR test approximation)
* Compare means of observed variables by missing status
foreach var of varlist income education {
    ttest age, by(mi_`var')
    ttest gender_numeric, by(mi_`var')
}
```

### Recoding Missing Values

```stata
* Common survey codes for missing
* -99 = refused, -88 = don't know, -77 = not applicable
foreach var of varlist income satisfaction trust_score {
    replace `var' = .r if `var' == -99  // .r = refused
    replace `var' = .d if `var' == -88  // .d = don't know
    replace `var' = .n if `var' == -77  // .n = not applicable
}

* Extended missing values preserve the reason for missingness
* while still being treated as missing in analyses
```

## Variable Construction

### Recoding and Categorization

```stata
* Create age groups
recode age (18/29 = 1 "18-29") (30/44 = 2 "30-44") ///
           (45/59 = 3 "45-59") (60/max = 4 "60+"), gen(age_group)

* Create binary indicator
gen high_income = (income > 75000) if !missing(income)

* Create composite scale (e.g., Likert items)
alpha item1 item2 item3 item4 item5, gen(scale_score) item
* Cronbach's alpha is reported; scale_score is the mean

* Standardize continuous variables
foreach var of varlist income education_years age {
    egen z_`var' = std(`var')
}

* Winsorize extreme values
winsor2 income, cuts(1 99) replace
```

### Date Variables

```stata
* Parse date strings
gen interview_date = date(date_string, "MDY")
format interview_date %td

* Extract components
gen interview_year = year(interview_date)
gen interview_month = month(interview_date)
gen interview_dow = dow(interview_date)  // 0=Sunday

* Calculate durations
gen days_since_treatment = interview_date - treatment_date
gen months_since = (interview_date - treatment_date) / 30.44
```

## Data Validation

### Assertion-Based Validation

```stata
* These assertions halt execution if violated
assert _N == 5000  // Expected sample size
assert !missing(respondent_id)  // No missing IDs
assert age >= 18 & age <= 120 if !missing(age)  // Plausible age range
assert inlist(gender, "Male", "Female", "Other", "") | missing(gender)

* Cross-variable consistency checks
assert education_years >= 0 if !missing(education_years)
assert income >= 0 if !missing(income)
assert end_date >= start_date if !missing(end_date) & !missing(start_date)
```

### Duplicate Detection and Resolution

```stata
* Identify duplicates
duplicates tag respondent_id, gen(dup_flag)
list respondent_id survey_date if dup_flag > 0, sepby(respondent_id)

* Keep most recent observation per respondent
bysort respondent_id (survey_date): keep if _n == _N

* Or keep first observation
bysort respondent_id (survey_date): keep if _n == 1
```

## Saving and Documentation

```stata
* Label all variables
label variable age "Age at time of interview (years)"
label variable income "Annual household income (USD)"
label variable education_years "Total years of formal education"

* Save cleaned dataset
compress  // Reduce file size
save "processed/survey_data_clean.dta", replace

* Export codebook
codebook, compact
describe, short

* Close log
log close
```

## Best Practices

1. **Never modify raw data files**: Always read raw data and write to a separate processed file.
2. **Log everything**: Use `log using` to capture all output for audit trails.
3. **Use assert statements**: Validate assumptions about the data at each stage.
4. **Document decisions**: Comment every recode, drop, or imputation with the rationale.
5. **Version your cleaning scripts**: Use git to track changes to .do files.
6. **Produce a data dictionary**: Label every variable and value label in the final dataset.

## References

- Stata Data Management Reference Manual: https://www.stata.com/manuals/d.pdf
- DIME Analytics Data Management Wiki: https://dimewiki.worldbank.org/Data_Management
- J-PAL Research Resources: https://www.povertyactionlab.org/resource/data-cleaning
- Long, J.S. (2009), The Workflow of Data Analysis Using Stata, Stata Press

Related Skills

json-data-visualizer

191
from wentorai/research-plugins

Guide to JSON Crack for visualizing complex JSON data structures

datagen-research-guide

191
from wentorai/research-plugins

AI-driven multi-agent research assistant for end-to-end studies

data-collection-automation

191
from wentorai/research-plugins

Automate survey deployment, data collection, and pipeline management

database-comparison-guide

191
from wentorai/research-plugins

Compare major academic databases and when to use each for research

wikidata-api-guide

191
from wentorai/research-plugins

Query Wikidata SPARQL for scholarly metadata, authors, and entities

datacite-api

191
from wentorai/research-plugins

Resolve dataset DOIs and query research data metadata via DataCite

crossref-event-data-api

191
from wentorai/research-plugins

Track scholarly mentions across the web via Crossref Event Data

metadata-skills

191
from wentorai/research-plugins

24 metadata & bibliometrics skills. Trigger: DOI resolution, citation metrics, author disambiguation, bibliometrics. Design: metadata APIs and bibliometric analysis tools for scholarly records.

dataverse-api

191
from wentorai/research-plugins

Deposit and discover research datasets via Harvard Dataverse API

ipums-microdata-api

191
from wentorai/research-plugins

Access harmonized census and survey microdata via the IPUMS API

astrophysics-data-guide

191
from wentorai/research-plugins

Astronomical data processing with Astropy, FITS files, and sky surveys

topology-data-analysis

191
from wentorai/research-plugins

Topological data analysis: persistent homology, Mapper, and TDA tools