data-analysis
Scientific data analysis including data cleaning, exploratory data analysis (EDA), statistical testing, regression, and reporting. Uses Python with pandas, scipy, statsmodels, scikit-learn. Use when user asks to analyze data, clean a dataset, run statistics, do EDA, fit a model, or process CSV/Excel files. Triggers on "analyze this data", "clean my dataset", "run regression", "EDA", "descriptive statistics", "data processing", "correlation analysis".
Best use case
data-analysis is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using data-analysis should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/data-analysis/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How data-analysis Compares
| Feature / Agent | data-analysis | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Scientific data analysis including data cleaning, exploratory data analysis (EDA), statistical testing, regression, and reporting. Uses Python with pandas, scipy, statsmodels, scikit-learn. Use when user asks to analyze data, clean a dataset, run statistics, do EDA, fit a model, or process CSV/Excel files. Triggers on "analyze this data", "clean my dataset", "run regression", "EDA", "descriptive statistics", "data processing", "correlation analysis".
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Data Analysis
Scientific data analysis with Python. All scripts use the venv at `/Users/zhangmingda/clawd/.venv`.
## Setup
```bash
source /Users/zhangmingda/clawd/.venv/bin/activate
```
## Workflow
### 1. Data Loading
```python
import pandas as pd
import numpy as np
# CSV
df = pd.read_csv('data.csv')
# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# JSON
df = pd.read_json('data.json')
# Clipboard (from user paste)
# Save user's data to a temp file first, then read
# Quick inspection
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(df.dtypes)
print(df.describe())
print(f"Missing values:\n{df.isnull().sum()}")
```
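The clipboard comment in the block above describes saving pasted data to a temp file before reading it. A minimal sketch of that step follows, assuming the paste is comma-separated text; the helper name `load_pasted_text` and the sample rows are hypothetical and not part of the skill.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_pasted_text(text: str, sep: str = ",") -> pd.DataFrame:
    """Write pasted text to a temporary file, then read it back with pandas."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir) / "pasted.csv"
        # Strip stray blank lines around the paste before saving
        tmp_path.write_text(text.strip() + "\n", encoding="utf-8")
        return pd.read_csv(tmp_path, sep=sep)


# Hypothetical pasted content, used only to exercise the helper
pasted = """
id,score,group
1,3.5,a
2,4.1,b
3,2.9,a
"""
df_pasted = load_pasted_text(pasted)
print(df_pasted.shape)
```

For tab-separated pastes (for example, cells copied from a spreadsheet), passing `sep="\t"` should work the same way.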
### 2. Data Cleaning
```python
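# Missing values: assign the result back, since dropna returns a new DataFrame
df = df.dropna(subset=['critical_column'])
# Assign instead of using inplace=True on a column, which may not update df
df['col'] = df['col'].fillna(df['col'].median())
# Duplicates
df = df.drop_duplicates()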
# Outliers (IQR method)
Q1, Q3 = df['col'].quantile([0.25, 0.75])
IQR = Q3 - Q1
mask = (df['col'] >= Q1 - 1.5*IQR) & (df['col'] <= Q3 + 1.5*IQR)
df_clean = df[mask]
# Type conversion
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')
```
### 3. Exploratory Data Analysis
```python
import matplotlib.pyplot as plt
import seaborn as sns
# Numeric columns to plot
numeric_cols = df.select_dtypes(include='number').columns.tolist()
# Distribution of the first four numeric columns
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for i, col in enumerate(numeric_cols[:4]):
    ax = axes[i//2, i%2]
    sns.histplot(df[col], kde=True, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.savefig('distributions.png', dpi=150)
plt.close(fig)
# Correlation matrix (drawn on its own figure)
plt.figure(figsize=(8, 6))
corr = df[numeric_cols].corr()
sns.heatmap(corr, annot=True, cmap='RdBu_r', center=0, fmt='.2f')
plt.savefig('correlation.png', dpi=150)
plt.close()
# Pairplot for key variables; key_cols is a user-chosen list that must include 'group'
grid = sns.pairplot(df[key_cols], hue='group')
grid.savefig('pairplot.png', dpi=150)
```
### 4. Statistical Tests
Choose test based on:
- **Data type**: continuous vs categorical
- **Distribution**: normal vs non-normal (Shapiro-Wilk test)
- **Groups**: 2 vs 3+ groups
- **Pairing**: independent vs paired/repeated
| Scenario | Normal | Non-normal |
|----------|--------|------------|
| 2 independent groups | Independent t-test | Mann-Whitney U |
| 2 paired groups | Paired t-test | Wilcoxon signed-rank |
| 3+ independent groups | One-way ANOVA | Kruskal-Wallis |
| 3+ paired groups | Repeated measures ANOVA | Friedman |
| Association (continuous) | Pearson r | Spearman ρ |
| Association (categorical) | Chi-square | Fisher's exact |
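The "2 independent groups" row of the table can be sketched as a small helper that runs Shapiro-Wilk on each group and then picks the test. This is an illustrative sketch, not the skill's own code: the function name `compare_two_groups`, the `alpha` threshold, and the synthetic data are assumptions.

```python
import numpy as np
from scipy import stats


def compare_two_groups(a, b, alpha=0.05):
    """Pick an independent two-group test from a Shapiro-Wilk normality check."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Treat the data as normal only if neither group rejects normality
    normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    if normal:
        result = stats.ttest_ind(a, b)
        name = "Independent t-test"
    else:
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        name = "Mann-Whitney U"
    return name, float(result.statistic), float(result.pvalue)


# Synthetic groups, for illustration only
rng = np.random.default_rng(0)
group1 = rng.normal(loc=10.0, scale=2.0, size=40)
group2 = rng.normal(loc=11.0, scale=2.0, size=40)
name, stat, p = compare_two_groups(group1, group2)
print(f"{name}: statistic={stat:.3f}, p={p:.4f}")
```

A normality test alone is a blunt selection rule, especially for very small or very large samples, so it is worth looking at the histograms from the EDA step as well.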
```python
from scipy import stats
# Normality test
stat, p = stats.shapiro(df['col'])
print(f"Shapiro-Wilk: W={stat:.4f}, p={p:.4f}")
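# t-test (assumes equal variances; pass equal_var=False for Welch's t-test)
t, p = stats.ttest_ind(group1, group2)
# Effect size (Cohen's d); averaging the two variances suits equal group sizes
# Requires numpy (import numpy as np); group1 and group2 are pandas Series
d = (group1.mean() - group2.mean()) / np.sqrt((group1.std()**2 + group2.std()**2) / 2)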
# ANOVA
f, p = stats.f_oneway(g1, g2, g3)
# Chi-square
chi2, p, dof, expected = stats.chi2_contingency(pd.crosstab(df['a'], df['b']))
# Correlation
r, p = stats.pearsonr(df['x'], df['y'])
```
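When several of these tests are run on the same dataset, the Tips section recommends Bonferroni or FDR correction. A minimal sketch using `multipletests` from statsmodels follows; the p-values are made up for illustration.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from five separate tests
pvals = [0.001, 0.012, 0.025, 0.047, 0.200]

# Bonferroni: multiplies each p-value by the number of tests (capped at 1)
reject_bonf, p_bonf, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg FDR: less conservative, controls the false discovery rate
reject_fdr, p_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

for raw, pb, pf in zip(pvals, p_bonf, p_fdr):
    print(f"raw={raw:.3f}  bonferroni={pb:.3f}  fdr_bh={pf:.3f}")
```

With these inputs Bonferroni keeps only the smallest p-value significant, while Benjamini-Hochberg keeps the first three. Report the corrected p-values alongside the method used.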
### 5. Regression
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
# OLS
model = smf.ols('y ~ x1 + x2 + C(group)', data=df).fit()
print(model.summary())
# Logistic
model = smf.logit('outcome ~ x1 + x2', data=df).fit()
print(model.summary())
# Mixed effects
model = smf.mixedlm('y ~ x1 + x2', data=df, groups=df['subject']).fit()
```
### 6. Reporting
Always report:
- Sample size (N) and any exclusions
- Descriptive statistics (M, SD or Median, IQR)
- Test statistic, degrees of freedom, p-value
- Effect size with confidence interval
- Assumptions checked (normality, homogeneity of variance)
Format: "A significant difference was found between groups, t(48) = 2.31, p = .025, Cohen's d = 0.65, 95% CI [0.08, 1.22]."
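The reporting format above can be produced programmatically. The sketch below is one possible way to do it for an independent t-test; the function name `report_t_test` is hypothetical, and the confidence interval uses a common large-sample approximation to the standard error of Cohen's d, which is an assumption rather than something the skill specifies.

```python
import numpy as np
from scipy import stats


def report_t_test(group1, group2):
    """Format an independent t-test result as a one-sentence report."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    t, p = stats.ttest_ind(g1, g2)
    dof = n1 + n2 - 2
    # Cohen's d with the pooled sample standard deviation (ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / dof)
    d = (g1.mean() - g2.mean()) / pooled_sd
    # Assumption: large-sample approximation to the standard error of d
    se_d = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    lo, hi = d - 1.96 * se_d, d + 1.96 * se_d
    # Drop the leading zero of the p-value, as in the format above
    p_text = "p < .001" if p < 0.001 else f"p = {p:.3f}".replace("0.", ".", 1)
    verdict = "A significant" if p < 0.05 else "No significant"
    return (f"{verdict} difference was found between groups, "
            f"t({dof}) = {t:.2f}, {p_text}, Cohen's d = {d:.2f}, "
            f"95% CI [{lo:.2f}, {hi:.2f}].")


# Synthetic data, for illustration only
report = report_t_test(np.arange(10), np.arange(10) + 5)
print(report)
```

With equal group sizes the pooled standard deviation here matches the simple-average form used in the Statistical Tests section. Sample size, exclusions, and the assumption checks still need to be reported separately.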
## Tips
- Always check assumptions before parametric tests
- Report effect sizes, not just p-values
- Use Bonferroni or FDR correction for multiple comparisons
- Visualize data before and after analysis
- Save all outputs as files the user can download
Related Skills
world-bank-data
World Bank Open Data API for development indicators. Use when: user asks about GDP, population, poverty, health, or education statistics by country. NOT for: real-time financial data or stock prices.
wikidata-knowledge
Query Wikidata for structured knowledge using SPARQL and entity search. Use when: (1) finding structured facts about entities (people, places, organizations), (2) querying relationships between entities, (3) cross-referencing external identifiers (Wikipedia, VIAF, GND, ORCID), (4) building knowledge graphs from linked data. NOT for: full-text article content (use Wikipedia API), scientific literature (use semantic-scholar), geospatial data (use OpenStreetMap).
uniprot-database
Direct REST API access to UniProt. Protein searches, FASTA retrieval, ID mapping, Swiss-Prot/TrEMBL. For Python workflows with multiple databases, prefer bioservices (unified interface to 40+ services). Use this for direct HTTP/REST work or UniProt-specific control.
string-database
Query STRING API for protein-protein interactions (59M proteins, 20B interactions). Network analysis, GO/KEGG enrichment, interaction discovery, 5000+ species, for systems biology.
statistical-analysis
Guided statistical analysis with test selection and reporting. Use when you need help choosing appropriate tests for your data, assumption checking, power analysis, and APA-formatted results. Best for academic research reporting, test selection guidance. For implementing specific models programmatically use statsmodels.
social-science-analysis
Social science research methods including survey design, qualitative analysis, content analysis, network analysis, psychometrics, and mixed methods. Covers sociology, psychology, political science, education, and communication studies. Use when user designs surveys, analyzes qualitative data, does content analysis, builds scales, or uses mixed methods. Triggers on "survey design", "qualitative analysis", "content analysis", "Likert scale", "thematic analysis", "grounded theory", "factor analysis", "SEM", "structural equation", "psychometrics", "interview coding".
scipy-analysis
Scientific computing and statistical analysis with SciPy, NumPy, and pandas. Use when: (1) statistical hypothesis testing, (2) optimization problems, (3) signal processing, (4) numerical integration, (5) data manipulation and analysis. NOT for: symbolic math (use sympy-math), machine learning (use sklearn directly), or visualization (use matplotlib-viz).
reactome-database
Query Reactome REST API for pathway analysis, enrichment, gene-pathway mapping, disease pathways, molecular interactions, expression analysis, for systems biology studies.
pubmed-database
Direct REST API access to PubMed. Advanced Boolean/MeSH queries, E-utilities API, batch processing, citation management. For Python workflows, prefer biopython (Bio.Entrez). Use this for direct HTTP/REST work or custom API implementations.
pubchem-database
Query PubChem via PUG-REST API/PubChemPy (110M+ compounds). Search by name/CID/SMILES, retrieve properties, similarity/substructure searches, bioactivity, for cheminformatics.
pdb-database
Access RCSB PDB for 3D protein/nucleic acid structures. Search by text/sequence/structure, download coordinates (PDB/mmCIF), retrieve metadata, for structural biology and drug discovery.
patent-analysis
Conducts patent landscape analysis including prior art searches, patent claim interpretation, freedom-to-operate assessment, and intellectual property strategy for scientific inventions; trigger when users discuss patents, prior art, IP protection, or technology licensing.