survey-data-processing
Clean, recode, and prepare survey response data for analysis
Best use case
survey-data-processing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using survey-data-processing can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/survey-data-processing/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How survey-data-processing Compares
| Feature / Agent | survey-data-processing | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Clean, recode, and prepare survey response data for analysis
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Survey Data Processing
A skill for cleaning, recoding, and preparing survey response data for statistical analysis. Covers handling common survey data issues such as incomplete responses, attention check failures, reverse-coded items, scale construction, open-ended response coding, and export to analysis-ready formats compatible with SPSS, Stata, and R.
## Survey Data Quality Assessment
### Initial Inspection Workflow
Platforms like Qualtrics, SurveyMonkey, REDCap, and Google Forms each have their own export formats and quirks, so the first step is always standardization.
```python
import pandas as pd
import numpy as np


def assess_survey_quality(df, duration_col="duration_seconds",
                          min_duration=60):
    """
    Generate a survey data quality report.

    Checks:
    - Overall completion (at least 80% of columns answered)
    - Response duration (speeders)
    - Most frequently skipped questions

    Straight-line responding and attention check failures are
    handled by the separate functions in the next section.
    """
    report = {}

    # Overall completion
    total_respondents = len(df)
    complete = df.dropna(thresh=int(len(df.columns) * 0.8))
    report["total_responses"] = total_respondents
    report["substantially_complete"] = len(complete)
    # Guard against an empty DataFrame to avoid dividing by zero
    rate = len(complete) / total_respondents * 100 if total_respondents else 0.0
    report["completion_rate"] = f"{rate:.1f}%"

    # Duration analysis
    if duration_col in df.columns:
        durations = df[duration_col].dropna()
        report["median_duration_seconds"] = durations.median()
        report["speeders"] = int((durations < min_duration).sum())
        report["speeder_pct"] = f"{(durations < min_duration).mean()*100:.1f}%"

    # Missing data per question
    missing_by_col = df.isna().sum().sort_values(ascending=False)
    report["most_skipped_questions"] = missing_by_col.head(10).to_dict()

    return report
```
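The standardization step that opens this section could look roughly like the sketch below. It is illustrative only: `standardize_columns`, `rename_map`, and the example column names are hypothetical and not part of the skill, and real platform exports may need extra handling (for example, additional header rows).

```python
import re

import numpy as np
import pandas as pd


def standardize_columns(df, rename_map=None):
    """Return a copy with snake_case column names and blank strings as NaN.

    rename_map is a hypothetical per-platform mapping from export column
    names to project-wide names.
    """
    out = df.rename(columns=rename_map or {}).copy()

    def to_snake(name):
        # Collapse runs of non-alphanumeric characters into "_" and lowercase
        return re.sub(r"[^0-9a-zA-Z]+", "_", str(name)).strip("_").lower()

    out.columns = [to_snake(c) for c in out.columns]
    # Treat empty or whitespace-only strings as missing values
    out = out.replace(r"^\s*$", np.nan, regex=True)
    return out


# Example column names are made up for illustration
raw = pd.DataFrame({
    "Duration (in seconds)": [312, 45],
    "Q1 - Satisfaction": ["4", " "],
})
clean = standardize_columns(
    raw, rename_map={"Duration (in seconds)": "duration_seconds"}
)
print(list(clean.columns))
```

Running this once per export, before any quality checks, keeps later functions free of platform-specific column names.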
### Identifying Low-Quality Responses
```python
def detect_straightlining(df, likert_columns, threshold=0.9):
    """
    Detect respondents who select the same answer for nearly
    all Likert-scale questions (straight-line responding).

    A respondent is flagged if the proportion of their most
    common response meets or exceeds the threshold.
    """
    flagged = []
    for idx, row in df[likert_columns].iterrows():
        responses = row.dropna()
        if len(responses) == 0:
            continue
        # value_counts() sorts descending, so iloc[0] is the modal count
        most_common_pct = responses.value_counts().iloc[0] / len(responses)
        if most_common_pct >= threshold:
            flagged.append(idx)
    return flagged


def check_attention_items(df, attention_checks):
    """
    Validate attention check (trap) questions.

    Args:
        attention_checks: dict of {column_name: correct_answer}
            Example: {"q15_attention": 4, "q32_trap": "strongly agree"}

    Returns the index labels of respondents who failed any check.
    Missing answers count as failures, and string comparisons are
    case-sensitive.
    """
    failed = pd.Series(False, index=df.index)
    for col, correct in attention_checks.items():
        failed = failed | (df[col] != correct)
    return df.index[failed].tolist()
```
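One way to use these checks together is to record them as flags per respondent instead of dropping rows immediately, so exclusion decisions stay visible and reversible. The sketch below assumes this design; `build_quality_flags` and the example column names are hypothetical, and the thresholds simply reuse the defaults shown above.

```python
import pandas as pd


def build_quality_flags(df, likert_columns, attention_checks,
                        duration_col="duration_seconds", min_duration=60,
                        straightline_threshold=0.9):
    """Return one boolean column per quality check plus a flag count."""
    flags = pd.DataFrame(index=df.index)

    # Speeders: finished faster than the minimum plausible duration
    flags["speeder"] = df[duration_col] < min_duration

    # Straightlining: share of the most common answer across Likert items
    def top_share(row):
        answers = row.dropna()
        if answers.empty:
            return 0.0
        return answers.value_counts().iloc[0] / len(answers)

    shares = df[likert_columns].apply(top_share, axis=1)
    flags["straightliner"] = shares >= straightline_threshold

    # Attention checks: any wrong or missing answer counts as a failure
    failed = pd.Series(False, index=df.index)
    for col, correct in attention_checks.items():
        failed = failed | (df[col] != correct)
    flags["failed_attention"] = failed

    flags["n_flags"] = flags.sum(axis=1)
    return flags


survey = pd.DataFrame({
    "duration_seconds": [300, 40, 250],
    "q1": [3, 4, 1],
    "q2": [3, 4, 5],
    "q3": [3, 4, 2],
    "q_attention": [4, 4, 2],
})
flags = build_quality_flags(survey, ["q1", "q2", "q3"], {"q_attention": 4})
print(flags)
```

Keeping `n_flags` as a column lets the exclusion rule (for example, two or more flags) be stated and reported separately from the detection logic.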
## Recoding and Transformation
### Reverse Coding
Many validated psychological scales include reverse-coded items to detect acquiescence bias. These must be recoded before computing scale scores.
```python
def reverse_code(df, columns, scale_max, scale_min=1):
    """
    Reverse-code specified columns for Likert-type scales.

    Formula: reversed = (scale_max + scale_min) - original

    Example for a 1-5 scale:
        1 -> 5, 2 -> 4, 3 -> 3, 4 -> 2, 5 -> 1
    """
    df_recoded = df.copy()
    for col in columns:
        # Missing values stay missing because NaN propagates
        df_recoded[col] = (scale_max + scale_min) - df[col]
    return df_recoded


# Example usage with a Big Five personality scale
# (assumes df is the survey DataFrame and contains these columns)
reverse_items = {
    "extraversion": ["ext_2", "ext_4", "ext_6"],
    "neuroticism": ["neur_1", "neur_3", "neur_5"],
    "agreeableness": ["agree_3", "agree_5"],
}

# For a 1-7 Likert scale:
for construct, items in reverse_items.items():
    df = reverse_code(df, items, scale_max=7, scale_min=1)
```
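The reversal formula only gives meaningful results when every value lies within the scale range. A possible guard is sketched below, assuming (hypothetically) that some exports use a numeric code such as 99 for "no answer"; `find_out_of_range` and the example values are illustrative, not part of the skill.

```python
import pandas as pd


def find_out_of_range(df, columns, scale_min, scale_max):
    """Return a boolean DataFrame marking values outside the scale range.

    Missing values compare as False, so they are not reported as errors.
    """
    values = df[columns]
    return (values < scale_min) | (values > scale_max)


# 99 is an assumed "no answer" code used only for this illustration
responses = pd.DataFrame({"ext_2": [1, 7, 99], "ext_4": [4, None, 2]})
items = ["ext_2", "ext_4"]

bad = find_out_of_range(responses, items, scale_min=1, scale_max=7)
print(bad.sum())

# Set out-of-range values to missing, then apply the reversal formula
cleaned = responses[items].mask(bad)
reversed_items = (7 + 1) - cleaned
print(reversed_items)
```

Without this step, a value of 99 on a 1-7 scale would be "reversed" to -91 and silently distort any scale score built from it.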
### Scale Construction
```python
def compute_scale_scores(df, scale_definitions, method="mean"):
    """
    Compute composite scale scores from individual items.

    Args:
        scale_definitions: dict mapping scale name to list of columns
        method: "mean" or "sum"

    Returns:
        DataFrame with new scale score columns
    """
    if method not in ("mean", "sum"):
        raise ValueError("method must be 'mean' or 'sum'")

    df = df.copy()  # avoid mutating the caller's DataFrame
    for scale_name, items in scale_definitions.items():
        if method == "mean":
            # mean() skips missing items by default
            df[scale_name] = df[items].mean(axis=1)
        else:
            # sum() treats missing items as 0, which lowers the score
            df[scale_name] = df[items].sum(axis=1)

        # Also compute Cronbach's alpha for reliability
        alpha = cronbachs_alpha(df[items])
        print(f"{scale_name}: alpha = {alpha:.3f} "
              f"(n_items = {len(items)})")
    return df


def cronbachs_alpha(item_df):
    """
    Compute Cronbach's alpha for internal consistency reliability.
    Values above 0.70 are generally considered acceptable.
    Rows with any missing item are dropped (listwise deletion).
    """
    item_df = item_df.dropna()
    n_items = item_df.shape[1]
    if n_items < 2:
        return np.nan

    item_variances = item_df.var(axis=0, ddof=1)
    total_variance = item_df.sum(axis=1).var(ddof=1)
    if total_variance == 0:
        return np.nan  # alpha is undefined when total scores do not vary

    alpha = (n_items / (n_items - 1)) * (
        1 - item_variances.sum() / total_variance
    )
    return alpha
```
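Because a row mean skips missing items, a respondent who answered a single item still receives a scale score. A common safeguard is to require a minimum number of answered items; the sketch below assumes that rule. `scale_mean_min_valid`, the `min_valid` cutoff, and the item names are illustrative choices, not part of the skill.

```python
import pandas as pd


def scale_mean_min_valid(df, items, min_valid):
    """Mean of items per respondent, or NaN if fewer than min_valid answered."""
    n_answered = df[items].notna().sum(axis=1)
    scores = df[items].mean(axis=1)
    # Keep the score only where enough items were answered
    return scores.where(n_answered >= min_valid)


answers = pd.DataFrame({
    "ext_1": [4, 2, 5],
    "ext_2": [5, None, None],
    "ext_3": [3, 4, None],
})
scores = scale_mean_min_valid(answers, ["ext_1", "ext_2", "ext_3"], min_valid=2)
print(scores)
```

The right cutoff depends on the instrument; whatever value is chosen belongs in the codebook alongside the other recoding decisions.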
## Open-Ended Response Processing
### Coding Qualitative Responses
```python
import re


def code_open_responses(df, text_column, codebook):
    """
    Apply a predefined codebook to open-ended responses using
    keyword matching. For research-quality coding, this should
    be supplemented with manual coding by trained raters.

    Matching is by substring, so "time" also matches "sometimes".

    Args:
        codebook: dict mapping code names to keyword lists
            Example: {
                "financial_concern": ["money", "cost", "expensive", "afford"],
                "time_constraint": ["time", "busy", "schedule", "hours"],
                "quality_issue": ["quality", "broken", "defect", "poor"],
            }
    """
    df = df.copy()  # avoid mutating the caller's DataFrame
    for code_name, keywords in codebook.items():
        # Escape keywords so characters like "+" or "." match literally
        pattern = "|".join(re.escape(k.lower()) for k in keywords)
        df[f"code_{code_name}"] = (
            df[text_column]
            .str.lower()
            .str.contains(pattern, na=False)
            .astype(int)
        )
    return df
```
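Substring matching can over-code responses, for example "time" inside "sometimes". One possible refinement, sketched here under the assumption that whole-word matching is wanted, wraps the keywords in word boundaries. `keyword_pattern` and `matches_any` are hypothetical helpers, not part of the skill.

```python
import re

import pandas as pd


def keyword_pattern(keywords):
    """Compile a case-insensitive regex matching any keyword as a whole word."""
    escaped = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"\b(?:{escaped})\b", flags=re.IGNORECASE)


def matches_any(text, pattern):
    """Return True if text is a string containing a whole-word match."""
    return isinstance(text, str) and pattern.search(text) is not None


comments = pd.Series([
    "I sometimes forget to respond",
    "No TIME during the week",
    None,
])
pattern = keyword_pattern(["time", "busy"])
coded = comments.map(lambda text: matches_any(text, pattern))
print(coded.tolist())
```

The trade-off is that inflected forms ("times", "busier") no longer match unless they are listed explicitly in the codebook.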
### Inter-Rater Reliability
```
When multiple coders classify open-ended responses:
Cohen's Kappa (2 raters):
- 0.00-0.20: slight agreement (values below 0: poor)
- 0.21-0.40: fair
- 0.41-0.60: moderate
- 0.61-0.80: substantial
- 0.81-1.00: almost perfect
Fleiss' Kappa (3+ raters):
- Same interpretation scale as Cohen's
- Use when more than two raters code the same responses
Process:
1. Develop codebook with definitions and examples
2. Train coders on 10-20 practice responses
3. Code 20% of responses independently (overlap set)
4. Calculate inter-rater reliability on the overlap set
5. If kappa < 0.70, discuss disagreements and refine codebook
6. Repeat until acceptable reliability is achieved
7. Divide remaining responses among coders
```
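For step 4 of the process above, Cohen's kappa for two raters can be computed directly from its definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The following is a minimal sketch; `cohens_kappa` and the example codes are illustrative, and an established library implementation may be preferable in practice.

```python
import pandas as pd


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same responses."""
    a = pd.Series(list(rater_a))
    b = pd.Series(list(rater_b))
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("Both raters must code the same non-empty set")

    # Observed agreement: share of responses given the same code
    observed = (a == b).mean()

    # Chance agreement: product of each rater's marginal proportions
    categories = list(set(a) | set(b))
    p_a = a.value_counts(normalize=True).reindex(categories, fill_value=0.0)
    p_b = b.value_counts(normalize=True).reindex(categories, fill_value=0.0)
    expected = float((p_a * p_b).sum())

    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single code
    return float((observed - expected) / (1 - expected))


coder_1 = ["cost", "cost", "time", "time"]
coder_2 = ["cost", "time", "time", "time"]
kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

In this toy overlap set the coders agree on 3 of 4 responses (p_o = 0.75) and chance agreement is 0.50, so kappa is 0.50, which falls in the "moderate" band listed above.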
## Data Reshaping for Analysis
### Wide to Long Format
Survey data is typically exported in wide format (one row per respondent, one column per question). Many analyses require long format.
```python
def reshape_repeated_measures(df, id_col, time_points,
                              measure_prefix):
    """
    Reshape repeated-measures survey data from wide to long.

    Example: columns q1_pre, q1_post -> long format with
    time column ("pre", "post") and value column.
    """
    value_vars = [f"{measure_prefix}_{t}" for t in time_points]
    long_df = pd.melt(
        df,
        id_vars=[id_col],
        value_vars=value_vars,
        var_name="time_point",
        value_name=measure_prefix
    )
    # Clean time_point column; regex=False treats the prefix literally
    long_df["time_point"] = (
        long_df["time_point"]
        .str.replace(f"{measure_prefix}_", "", regex=False)
    )
    return long_df
```
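When several measures share the same time points, pandas' `pd.wide_to_long` can reshape them in one call instead of melting each measure separately. The sketch below assumes column names of the form `<measure>_<time>`; the DataFrame and its column names are made up for illustration.

```python
import pandas as pd

wide = pd.DataFrame({
    "respondent_id": [1, 2],
    "q1_pre": [3, 4],
    "q1_post": [5, 4],
    "q2_pre": [2, 1],
    "q2_post": [3, 2],
})

# stubnames are the measure prefixes; the suffix regex captures "pre"/"post"
long_df = pd.wide_to_long(
    wide,
    stubnames=["q1", "q2"],
    i="respondent_id",
    j="time_point",
    sep="_",
    suffix=r"\w+",
).reset_index()

print(long_df)
```

The result has one row per respondent and time point, with `q1` and `q2` as separate value columns, which is the layout most mixed-model and repeated-measures routines expect.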
## Export for Statistical Software
```
Export formats by software:
SPSS (.sav):
- Use pyreadstat: pyreadstat.write_sav(df, "output.sav")
- Include variable labels (column_labels) and value labels (variable_value_labels)
- Set measurement level (nominal, ordinal, scale) via variable_measure
Stata (.dta):
- Use pandas: df.to_stata("output.dta")
- Include variable labels via the variable_labels dict argument of to_stata
R (.csv with codebook):
- Export CSV plus a separate codebook document
- Or write .rds with pyreadr (pyreadstat does not write .rds)
- Include factor level definitions
General best practices:
- Include a unique respondent ID column
- Use numeric codes for categorical variables (with labels)
- Document all recoding in a companion codebook
- Save both raw and processed versions
- Include a timestamp column for data versioning
```
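As a rough illustration of the Stata and CSV-plus-codebook routes above, the sketch below uses only pandas. `export_with_codebook`, the file names, and the example labels are hypothetical; the SPSS and .rds routes need the third-party packages named above and are not shown.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_with_codebook(df, out_dir, variable_labels):
    """Write a labelled .dta file, a CSV copy, and a CSV codebook."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    dta_path = out_dir / "survey_processed.dta"
    csv_path = out_dir / "survey_processed.csv"
    codebook_path = out_dir / "codebook.csv"

    # Stata file with variable labels; write_index=False drops the row index
    df.to_stata(dta_path, write_index=False, variable_labels=variable_labels)

    # Plain CSV plus a codebook describing each variable, for R and others
    df.to_csv(csv_path, index=False)
    codebook = pd.DataFrame({
        "variable": list(df.columns),
        "label": [variable_labels.get(c, "") for c in df.columns],
        "dtype": [str(t) for t in df.dtypes],
    })
    codebook.to_csv(codebook_path, index=False)
    return dta_path, csv_path, codebook_path


processed = pd.DataFrame({
    "respondent_id": [1, 2],
    "satisfaction": [4.0, 5.0],
})
labels = {
    "respondent_id": "Unique respondent ID",
    "satisfaction": "Overall satisfaction (1-5)",
}
out_dir = tempfile.mkdtemp()  # temporary folder used only for this example
dta_path, csv_path, codebook_path = export_with_codebook(
    processed, out_dir, labels
)
print(dta_path.name, csv_path.name, codebook_path.name)
```

Stata restricts variable names and label lengths, so column names may need to be shortened or cleaned before this step.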
Proper survey data processing is essential for valid statistical inference. Decisions made during cleaning and recoding directly affect research conclusions, making transparent documentation of every step a methodological requirement rather than a convenience.