statsmodels

Statistical modeling: OLS/WLS/GLS, GLM (logit, probit, Poisson), time series (ARIMA, VAR), mixed effects, diagnostics. Formula API. Use for regressions without fixed effects, GLMs, or time series. For FE/DiD use pyfixest; panel/IV use linearmodels.

160 stars

by DAAF-Contribution-Community


Best use case

statsmodels is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using statsmodels should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/statsmodels/SKILL.md --create-dirs "https://raw.githubusercontent.com/DAAF-Contribution-Community/daaf/main/.claude/skills/statsmodels/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/statsmodels/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How statsmodels Compares

Feature / Agent	statsmodels	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

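What does this skill do?

It guides an AI agent through statistical modeling with statsmodels: OLS/WLS/GLS, GLMs (logit, probit, Poisson), time series (ARIMA, VAR), mixed effects, and diagnostics, using the formula API where convenient. It points to pyfixest for FE/DiD and to linearmodels for panel/IV models.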

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# statsmodels Skill

statsmodels is a general-purpose statistical modeling library for Python. It covers OLS/WLS/GLS, GLM (logit, probit, Poisson, negative binomial), discrete choice models, time series (ARIMA, SARIMAX, VAR), mixed effects (MixedLM), robust regression, hypothesis tests, and comprehensive diagnostics, and it supports an R-style formula API. Use it when fitting regressions without fixed effects, running GLMs or logit/probit, analyzing time series, or using formula syntax. For fixed effects or DiD, use pyfixest; for panel/IV/system models, use linearmodels.

This is a comprehensive skill for statistical modeling with statsmodels. Use the decision trees below to find the right guidance, then load the detailed references.

## What is statsmodels?

statsmodels is the general-purpose **statistical modeling** library for Python:
- **Two APIs**: Formula API (`smf.ols("y ~ x1 + x2", data=df)`) for R-style modeling, and array API (`sm.OLS(y, X)`) for programmatic control
- **Broad model coverage**: OLS, WLS, GLS, GLM (all families), logit, probit, multinomial, count models, zero-inflated models, quantile regression, robust regression
- **Time series**: ARIMA, SARIMAX, VAR, exponential smoothing, state space models, unit root tests
- **Diagnostics**: Heteroskedasticity tests, normality tests, specification tests, VIF, influence measures, residual analysis
- **Hypothesis testing**: t-tests, F-tests, Wald tests, likelihood ratio tests, multiple comparison corrections

## How to Use This Skill

### Reference File Structure

| File | Purpose | When to Read |
|------|---------|--------------|
| `quickstart.md` | Installation, formula vs array API, first model | Starting with statsmodels |
| `linear-models.md` | OLS, WLS, GLS, robust regression, quantile regression | Fitting linear models |
| `glm-discrete.md` | GLM families, logit/probit, count models, zero-inflated | Non-linear models, binary/count outcomes |
| `time-series.md` | ARIMA, SARIMAX, VAR, exponential smoothing, unit root tests | Analyzing temporal data |
| `diagnostics.md` | Heteroskedasticity, normality, VIF, influence, residuals | Checking model assumptions |
| `hypothesis-testing.md` | t-tests, F-tests, Wald tests, multiple comparisons | Testing coefficients and comparing models |
| `gotchas.md` | Constant term, convergence, predict pitfalls, pyfixest boundary | Debugging issues |

### Reading Order

1. **New to statsmodels?** Start with `quickstart.md` then `linear-models.md`
2. **Need GLM or logit/probit?** Read `quickstart.md` then `glm-discrete.md`
3. **Time series analysis?** Read `quickstart.md` then `time-series.md`
4. **Checking model assumptions?** Read `diagnostics.md`
5. **Coming from R?** Read `quickstart.md` (formula API mirrors R syntax)

## Related Skills

- **pyfixest**: Use instead of statsmodels when your model needs absorbed fixed effects, IV with FE, or difference-in-differences. pyfixest is faster for FE models; statsmodels is broader for everything else
- **linearmodels**: Use for panel data models (FE, RE, between, first difference, Fama-MacBeth), IV/GMM without FE (2SLS, LIML, GMM), system estimation (SUR, 3SLS), and asset pricing. Built on top of statsmodels; extends it for structured data
- **svy**: Use for survey-weighted regression and estimation with complex survey designs. **Important:** statsmodels WLS is NOT equivalent to survey-weighted regression: WLS handles heteroskedastic errors but does not account for stratification, clustering, or finite population corrections. If your data comes from a complex probability survey (NHANES, ACS PUMS, CPS, ECLS-K, etc.), load the `svy` skill instead
- **data-scientist**: Provides methodology guidance (when to use which model, assumption checking protocol, interpretation). Load alongside statsmodels for the "why"; statsmodels provides the "how"
- **polars**: Data manipulation before modeling. statsmodels accepts pandas DataFrames; convert with `df.to_pandas()` if using Polars
- **plotnine**: Publication-quality visualization of model results and diagnostics

## Quick Decision Trees

### "I need to fit a regression model"

```
What kind of regression?
├─ Linear (continuous outcome)
│   ├─ Basic OLS → ./references/linear-models.md
│   ├─ Weighted least squares → ./references/linear-models.md
│   │   (⚠ WLS ≠ survey-weighted regression — for complex surveys, use `svy` skill)
│   ├─ Correlated errors (GLS) → ./references/linear-models.md
│   ├─ Robust to outliers (M-estimator) → ./references/linear-models.md
│   └─ Quantile regression → ./references/linear-models.md
├─ Binary outcome (0/1)
│   ├─ Logit → ./references/glm-discrete.md
│   └─ Probit → ./references/glm-discrete.md
├─ Count outcome (0, 1, 2, ...)
│   ├─ Poisson → ./references/glm-discrete.md
│   ├─ Negative binomial → ./references/glm-discrete.md
│   └─ Zero-inflated → ./references/glm-discrete.md
├─ Multinomial (3+ categories)
│   └─ Multinomial logit → ./references/glm-discrete.md
├─ GLM (custom family/link)
│   └─ GLM framework → ./references/glm-discrete.md
└─ Need fixed effects?
    └─ Use pyfixest instead (faster FE absorption)
```

### "I need to analyze time series"

```
What time series task?
├─ Forecast a single series
│   ├─ ARIMA / SARIMAX → ./references/time-series.md
│   └─ Exponential smoothing → ./references/time-series.md
├─ Multiple interrelated series
│   └─ VAR / VECM → ./references/time-series.md
├─ Test for stationarity
│   ├─ ADF test → ./references/time-series.md
│   └─ KPSS test → ./references/time-series.md
├─ Examine autocorrelation
│   └─ ACF / PACF → ./references/time-series.md
└─ Structural time series
    └─ Unobserved components → ./references/time-series.md
```

### "I need to check model assumptions"

```
What assumption to check?
├─ Heteroskedasticity → ./references/diagnostics.md
│   ├─ Breusch-Pagan test
│   └─ White test
├─ Normality of residuals → ./references/diagnostics.md
│   ├─ Jarque-Bera test
│   └─ Shapiro-Wilk test
├─ Specification / functional form → ./references/diagnostics.md
│   └─ RESET test
├─ Multicollinearity → ./references/diagnostics.md
│   ├─ VIF
│   └─ Condition number
├─ Influential observations → ./references/diagnostics.md
│   ├─ Cook's distance
│   └─ Leverage / DFFITS
├─ Serial correlation → ./references/diagnostics.md
│   └─ Durbin-Watson / Breusch-Godfrey
└─ All of the above → ./references/diagnostics.md
```
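
The checks in this tree can be run on one fitted OLS model roughly as follows. This is a sketch on synthetic data with hypothetical column names; interpretation thresholds are left to `diagnostics.md`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson, jarque_bera

# Synthetic data with mildly correlated regressors (hypothetical names)
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
df = pd.DataFrame({"x1": x1, "x2": x2})
df["y"] = 1.0 + 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(size=n)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
exog = fit.model.exog  # design matrix, including the intercept column

# Heteroskedasticity: Breusch-Pagan (LM statistic, LM p-value, F statistic, F p-value)
bp_lm, bp_lm_pvalue, bp_f, bp_f_pvalue = het_breuschpagan(fit.resid, exog)

# Normality of residuals: Jarque-Bera
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(fit.resid)

# Serial correlation: Durbin-Watson (values near 2 suggest little autocorrelation)
dw = durbin_watson(fit.resid)

# Multicollinearity: VIF for each regressor, skipping the intercept
vif = {
    name: variance_inflation_factor(exog, i)
    for i, name in enumerate(fit.model.exog_names)
    if name != "Intercept"
}

# Influential observations: Cook's distance per observation
cooks_d, _ = fit.get_influence().cooks_distance

print(f"BP p={bp_lm_pvalue:.3f}  JB p={jb_pvalue:.3f}  DW={dw:.2f}")
print(vif)
print(f"max Cook's distance={cooks_d.max():.4f}")
```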

### "I need to test hypotheses"

```
What kind of test?
├─ Single coefficient significance → ./references/hypothesis-testing.md
├─ Joint significance (F-test) → ./references/hypothesis-testing.md
├─ Linear restrictions (Wald) → ./references/hypothesis-testing.md
├─ Compare nested models (LR test) → ./references/hypothesis-testing.md
├─ Multiple comparisons correction → ./references/hypothesis-testing.md
└─ Chi-squared test → ./references/hypothesis-testing.md
```
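
As an illustration of several branches above, the sketch below runs a single-coefficient test, a joint F-test, a Wald test of a linear restriction, nested-model comparisons, and a Holm correction. The data and variable names are made up for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

# Synthetic data; x3 has no true effect (hypothetical names)
rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({name: rng.normal(size=n) for name in ["x1", "x2", "x3"]})
df["y"] = 1.0 + 2.0 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)

full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
restricted = smf.ols("y ~ x1", data=df).fit()

# Single coefficient: H0 x1 = 0
t_res = full.t_test("x1 = 0")

# Joint significance: H0 x2 = 0 and x3 = 0
f_res = full.f_test("x2 = 0, x3 = 0")

# Linear restriction via a chi-squared Wald test: H0 x1 + x2 = 2.5
wald_res = full.wald_test("x1 + x2 = 2.5", use_f=False)

# Nested-model comparisons against the restricted model
lr_stat, lr_pvalue, lr_df = full.compare_lr_test(restricted)
f_stat, f_pvalue, f_df = full.compare_f_test(restricted)

# Holm correction applied to the coefficient p-values
reject, p_adjusted, _, _ = multipletests(full.pvalues, alpha=0.05, method="holm")

print(t_res)
print(f_res)
print(f"LR stat={lr_stat:.3f} (p={lr_pvalue:.4f}), F stat={f_stat:.3f} (p={f_pvalue:.4f})")
print(dict(zip(full.params.index, p_adjusted)))
```

With the default (non-robust) covariance, the joint F-test on `x2` and `x3` and `compare_f_test` against the model without them test the same restriction, so the two F statistics should match.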

### "Something isn't working"

```
Common issues?
├─ Missing constant / intercept → ./references/gotchas.md
├─ Convergence warnings → ./references/gotchas.md
├─ predict() errors → ./references/gotchas.md
├─ Formula parsing issues → ./references/gotchas.md
├─ summary() formatting → ./references/gotchas.md
├─ statsmodels vs pyfixest → ./references/gotchas.md
└─ General troubleshooting → ./references/gotchas.md
```

## File-First Execution in Research Workflows

**Important:** In data research pipelines (see `CLAUDE.md`), statsmodels analyses are executed through **script files**, not interactively. This ensures auditability and reproducibility.

**The pattern:**
1. Write model code to `scripts/stage8_analysis/{step}_{model-name}.py`
2. Execute via Bash using the wrapper script that captures output automatically
3. Validation results are embedded automatically in the script as comments
4. If the run fails, create a versioned copy for the fix

Closely read `agent_reference/SCRIPT_EXECUTION_REFERENCE.md` for the mandatory file-first execution protocol covering complete code file writing, output capture, and file versioning rules.

**See:**
- `agent_reference/SCRIPT_EXECUTION_REFERENCE.md` — Script execution protocol and format with validation

The examples below show statsmodels syntax. In research workflows, wrap them in scripts following the file-first pattern.
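
For orientation only, a script following this pattern might have the general shape below. The mandatory format is defined in `agent_reference/SCRIPT_EXECUTION_REFERENCE.md`, which is not reproduced here, so the `load_data` helper, the synthetic data, and the printed fields are all assumptions.

```python
"""Sketch of an analysis script, e.g. scripts/stage8_analysis/{step}_{model-name}.py."""
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def load_data() -> pd.DataFrame:
    # Hypothetical stand-in: a real script would read the pipeline's input file
    rng = np.random.default_rng(7)
    df = pd.DataFrame({"x1": rng.normal(size=100)})
    df["y"] = 1.0 + 2.0 * df["x1"] + rng.normal(size=100)
    return df


def main():
    df = load_data()
    fit = smf.ols("y ~ x1", data=df).fit(cov_type="HC1")
    # Print results to stdout so that an output-capture wrapper can record them
    print(fit.summary())
    print(f"n_obs={int(fit.nobs)}")
    return fit


if __name__ == "__main__":
    main()
```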

---

## Quick Reference

### Essential Imports

```python
import statsmodels.api as sm           # Array API
import statsmodels.formula.api as smf  # Formula API (R-style)
```

### Core Operations

| Operation | Code |
|-----------|------|
| OLS (formula) | `smf.ols("y ~ x1 + x2", data=df).fit()` |
| OLS (array) | `sm.OLS(y, sm.add_constant(X)).fit()` |
| Logit | `smf.logit("y ~ x1 + x2", data=df).fit()` |
| Probit | `smf.probit("y ~ x1 + x2", data=df).fit()` |
| Poisson | `smf.poisson("y ~ x1 + x2", data=df).fit()` |
| GLM (custom) | `smf.glm("y ~ x1", data=df, family=sm.families.Binomial()).fit()` |
| WLS | `smf.wls("y ~ x1", data=df, weights=w).fit()` |
| Robust (HC1) | `smf.ols(...).fit(cov_type='HC1')` |
| ARIMA | `sm.tsa.ARIMA(y, order=(p,d,q)).fit()` |
| Summary | `results.summary()` |
| Predict | `results.predict(new_data)` |
| Confidence intervals | `results.conf_int(alpha=0.05)` |
| Marginal effects (discrete models) | `results.get_margeff(at='overall')` |
| VIF | `from statsmodels.stats.outliers_influence import variance_inflation_factor` |
| Breusch-Pagan | `sm.stats.diagnostic.het_breuschpagan(resid, exog)` |
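
Several rows of the table combine naturally in one workflow. The sketch below uses simulated heteroskedastic data; the column names `y`, `d`, `x1`, and `w` are hypothetical, and the weights are taken as known inverse variances purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic heteroskedastic data (hypothetical names)
rng = np.random.default_rng(5)
n = 400
x1 = rng.normal(size=n)
sd = np.sqrt(1.0 + x1**2)
df = pd.DataFrame({
    "x1": x1,
    "y": 1.0 + 2.0 * x1 + sd * rng.normal(size=n),
    "w": 1.0 / sd**2,  # inverse-variance weights, assumed known here
    "d": rng.binomial(1, 1.0 / (1.0 + np.exp(-x1))),
})

# Robust standard errors change the covariance, not the coefficients
ols_fit = smf.ols("y ~ x1", data=df).fit()
robust_fit = smf.ols("y ~ x1", data=df).fit(cov_type="HC1")
ci = robust_fit.conf_int(alpha=0.05)

# WLS with per-observation weights
wls_fit = smf.wls("y ~ x1", data=df, weights=df["w"]).fit()

# Average marginal effects are available on discrete-model results
logit_fit = smf.logit("d ~ x1", data=df).fit(disp=False)
ame = logit_fit.get_margeff(at="overall")

print(ci)
print(wls_fit.params)
print(ame.summary())
```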

### Formula Syntax

```python
# Additive terms
"y ~ x1 + x2 + x3"

# Interaction (with main effects)
"y ~ x1 * x2"           # equivalent to x1 + x2 + x1:x2

# Interaction only (no main effects)
"y ~ x1 : x2"

# Categorical variable
"y ~ C(region)"          # treatment coding (default)
"y ~ C(region, Treatment(reference='West'))"  # explicit reference

# Suppress intercept
"y ~ x1 + x2 - 1"

# Polynomial
"y ~ x1 + I(x1**2)"     # I() protects Python operators
```
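
One way to see what each formula expands to is to inspect the fitted parameter names. This is a small sketch on random data, with hypothetical column names and a made-up `region` variable.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Random data; only the resulting term names matter here
rng = np.random.default_rng(8)
n = 120
df = pd.DataFrame({
    "y": rng.normal(size=n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "region": np.tile(["East", "North", "West"], n // 3),
})

# x1 * x2 expands to both main effects plus the interaction
inter_names = list(smf.ols("y ~ x1 * x2", data=df).fit().params.index)
print(inter_names)  # ['Intercept', 'x1', 'x2', 'x1:x2']

# Explicit reference level: 'West' is absorbed into the intercept
cat_names = list(
    smf.ols("y ~ C(region, Treatment(reference='West'))", data=df).fit().params.index
)
print(cat_names)

# "- 1" removes the intercept
noint_names = list(smf.ols("y ~ x1 + x2 - 1", data=df).fit().params.index)
print(noint_names)  # ['x1', 'x2']

# I() makes ** act as arithmetic, adding a squared term
poly_names = list(smf.ols("y ~ x1 + I(x1**2)", data=df).fit().params.index)
print(poly_names)
```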

## Topic Index

| Topic | Reference File |
|-------|---------------|
| Installation | `./references/quickstart.md` |
| Formula vs array API | `./references/quickstart.md` |
| Reading summary output | `./references/quickstart.md` |
| Comparison to pyfixest | `./references/quickstart.md` |
| OLS regression | `./references/linear-models.md` |
| Weighted least squares | `./references/linear-models.md` |
| GLS | `./references/linear-models.md` |
| Robust regression (RLM) | `./references/linear-models.md` |
| Quantile regression | `./references/linear-models.md` |
| Interactions and polynomials | `./references/linear-models.md` |
| GLM framework | `./references/glm-discrete.md` |
| Logit / probit | `./references/glm-discrete.md` |
| Multinomial logit | `./references/glm-discrete.md` |
| Poisson / negative binomial | `./references/glm-discrete.md` |
| Zero-inflated models | `./references/glm-discrete.md` |
| Marginal effects | `./references/glm-discrete.md` |
| Exposure / offset | `./references/glm-discrete.md` |
| ARIMA / SARIMAX | `./references/time-series.md` |
| VAR / VECM | `./references/time-series.md` |
| Exponential smoothing | `./references/time-series.md` |
| Unit root tests | `./references/time-series.md` |
| ACF / PACF | `./references/time-series.md` |
| Forecasting | `./references/time-series.md` |
| State space models | `./references/time-series.md` |
| Heteroskedasticity tests | `./references/diagnostics.md` |
| Normality tests | `./references/diagnostics.md` |
| Specification tests (RESET) | `./references/diagnostics.md` |
| VIF / multicollinearity | `./references/diagnostics.md` |
| Influence measures | `./references/diagnostics.md` |
| Residual analysis | `./references/diagnostics.md` |
| Durbin-Watson | `./references/diagnostics.md` |
| t-tests and F-tests | `./references/hypothesis-testing.md` |
| Wald tests | `./references/hypothesis-testing.md` |
| Likelihood ratio tests | `./references/hypothesis-testing.md` |
| Multiple comparison corrections | `./references/hypothesis-testing.md` |
| Comparing nested models | `./references/hypothesis-testing.md` |
| Serial correlation tests | `./references/diagnostics.md` |
| Diagnostic checklist | `./references/diagnostics.md` |
| Chi-squared tests | `./references/hypothesis-testing.md` |
| Joint significance tests | `./references/hypothesis-testing.md` |
| Ordered logit / probit | `./references/glm-discrete.md` |
| Mixed effects (MixedLM) | `./references/linear-models.md` |
| Constant term pitfall | `./references/gotchas.md` |
| Convergence warnings | `./references/gotchas.md` |
| predict() issues | `./references/gotchas.md` |
| Formula parsing (patsy) | `./references/gotchas.md` |
| summary() vs summary2() | `./references/gotchas.md` |
| NaN / missing data | `./references/gotchas.md` |
| DataFrame index issues | `./references/gotchas.md` |
| statsmodels vs pyfixest | `./references/gotchas.md` |

## Citation

When this library is used as a primary analytical tool, include in the report's
Software & Tools references:

> Seabold, S. & Perktold, J. (2010). "Statsmodels: Econometric and Statistical Modeling with Python." *Proceedings of the 9th Python in Science Conference*.

**Cite when:** statsmodels is used for GLM estimation, time series modeling, or statistical hypothesis testing central to the analysis.
**Do not cite when:** Only used for post-estimation diagnostics supporting another library's primary estimation.

For method-specific citations (e.g., individual estimators or techniques),
consult the reference files in this skill and `agent_reference/CITATION_REFERENCE.md`.

Related Skills

More skills from DAAF-Contribution-Community/daaf:

svy: Complex survey analysis: strata/PSU/weights, variance estimation (Taylor, BRR, jackknife, bootstrap), survey GLM, domain analysis, calibration. Polars-native. Use for NHANES, CPS, ACS PUMS, BRFSS, DHS. Non-survey regression: statsmodels/pyfixest.

stata-python-translation: Stata-to-Python translation for data analysis. Maps Stata commands (reghdfe, xtreg, ivregress, margins, esttab, svy:) to Python (polars, pyfixest, statsmodels, svy). Use when user has Stata background or requests Stata-equivalent code comments.

skill-authoring: Guide for creating and auditing DAAF skills (SKILL.md). Covers frontmatter, metadata vocabulary, progressive disclosure, decision trees, reference files. Use when creating, reviewing, or debugging skill loading. For agent files, use agent-authoring.

science-communication: Translating technical findings for non-technical audiences. Narrative frameworks (Pyramid Principle, SCQA), plain-language translation, executive summaries, policy briefs, causal language. Use when presenting to stakeholders or reviewing deliverables.

r-python-translation: R-to-Python translation for data analysis. Maps R packages (tidyverse, ggplot2, fixest, survey, sf, plm) to Python equivalents (polars, plotnine, pyfixest, svy, geopandas). Use when user has R background or requests R-equivalent code comments.

pyfixest: Fast high-dimensional fixed effects: OLS, Poisson, IV with multi-way FE; DiD (TWFE, did2s, Sun-Abraham); clustered SEs; etable/coefplot/iplot. Use for FE regressions or DiD. For panel RE/between use linearmodels; for GLM without FE use statsmodels.

polars: Polars DataFrame library for high-performance data manipulation. Lazy/eager execution, expressions, I/O (CSV, Parquet, JSON), aggregations, joins, string/datetime ops, pandas interop. Use for Polars DataFrames or reading/writing Parquet files.

plotnine: plotnine static visualization (ggplot2 syntax for Python). Geoms, aesthetics, scales, coordinates, facets, themes. Use for static publication-quality figures with grammar-of-graphics syntax. For interactive charts use plotly; for maps use geopandas.

plotly: Plotly interactive visualization. Express and Graph Objects: scatter, line, bar, heatmap, 3D, geographic charts; subplots; styling; export. Use when interactivity (hover/zoom) is needed. For static figures use plotnine; for GIS use geopandas.

marimo: Reactive Python notebook system. Cell reactivity, UI elements (sliders, dropdowns, tables), SQL cells, plotting, app deployment. Use when assembling Stage 9 notebooks, building data apps, or converting Jupyter to marimo .py format.

linearmodels: Panel data, IV/GMM, system regression. PanelOLS (FE/RE), BetweenOLS, Fama-MacBeth, IV2SLS/LIML/GMM, SUR, 3SLS, Driscoll-Kraay SEs. Use for RE/between, system estimation, or GMM. Complements pyfixest (FE + DiD) and statsmodels (GLM + time series).

geopandas: Spatial data: GeoDataFrames, spatial joins, CRS/projections, choropleth/interactive maps, spatial autocorrelation, PySAL. Use for geographic data, spatial files (Shapefile, GeoPackage, GeoParquet), or spatial stats. For charts without GIS use plotly.