senior-data-scientist

World-class senior data scientist skill specialising in statistical modeling, experiment design, causal inference, and predictive analytics. Covers A/B testing (sample sizing, two-proportion z-tests, Bonferroni correction), difference-in-differences, feature engineering pipelines (Scikit-learn, XGBoost), cross-validated model evaluation (AUC-ROC, AUC-PR, SHAP), and MLflow experiment tracking — using Python (NumPy, Pandas, Scikit-learn), R, and SQL. Use when designing or analysing controlled experiments, building and evaluating classification or regression models, performing causal analysis on observational data, engineering features for structured tabular datasets, or translating statistical findings into data-driven business decisions.

3,891 stars

Best use case

senior-data-scientist is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

World-class senior data scientist skill specialising in statistical modeling, experiment design, causal inference, and predictive analytics. Covers A/B testing (sample sizing, two-proportion z-tests, Bonferroni correction), difference-in-differences, feature engineering pipelines (Scikit-learn, XGBoost), cross-validated model evaluation (AUC-ROC, AUC-PR, SHAP), and MLflow experiment tracking — using Python (NumPy, Pandas, Scikit-learn), R, and SQL. Use when designing or analysing controlled experiments, building and evaluating classification or regression models, performing causal analysis on observational data, engineering features for structured tabular datasets, or translating statistical findings into data-driven business decisions.

Teams using senior-data-scientist should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/senior-data-scientist/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/alirezarezvani/senior-data-scientist/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/senior-data-scientist/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How senior-data-scientist Compares

Feature / Agentsenior-data-scientistStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

World-class senior data scientist skill specialising in statistical modeling, experiment design, causal inference, and predictive analytics. Covers A/B testing (sample sizing, two-proportion z-tests, Bonferroni correction), difference-in-differences, feature engineering pipelines (Scikit-learn, XGBoost), cross-validated model evaluation (AUC-ROC, AUC-PR, SHAP), and MLflow experiment tracking — using Python (NumPy, Pandas, Scikit-learn), R, and SQL. Use when designing or analysing controlled experiments, building and evaluating classification or regression models, performing causal analysis on observational data, engineering features for structured tabular datasets, or translating statistical findings into data-driven business decisions.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Senior Data Scientist

World-class senior data scientist skill for production-grade AI/ML/Data systems.

## Core Workflows

### 1. Design an A/B Test

```python
import numpy as np
from scipy import stats

def calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """
    Calculate required sample size per variant.
    baseline_rate: current conversion rate (e.g. 0.10)
    mde: minimum detectable effect (relative, e.g. 0.05 = 5% lift)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde)
    effect_size = abs(p2 - p1) / np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = ((z_alpha + z_beta) / effect_size) ** 2
    return int(np.ceil(n))

def analyze_experiment(control, treatment, alpha=0.05):
    """
    Run two-proportion z-test and return structured results.
    control/treatment: dicts with 'conversions' and 'visitors'.
    """
    p_c = control["conversions"] / control["visitors"]
    p_t = treatment["conversions"] / treatment["visitors"]
    pooled = (control["conversions"] + treatment["conversions"]) / (control["visitors"] + treatment["visitors"])
    se = np.sqrt(pooled * (1 - pooled) * (1 / control["visitors"] + 1 / treatment["visitors"]))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    ci_low = (p_t - p_c) - stats.norm.ppf(1 - alpha / 2) * se
    ci_high = (p_t - p_c) + stats.norm.ppf(1 - alpha / 2) * se
    return {
        "lift": (p_t - p_c) / p_c,
        "p_value": p_value,
        "significant": p_value < alpha,
        "ci_95": (ci_low, ci_high),
    }

# --- Experiment checklist ---
# 1. Define ONE primary metric and pre-register secondary metrics.
# 2. Calculate sample size BEFORE starting: calculate_sample_size(0.10, 0.05)
# 3. Randomise at the user (not session) level to avoid leakage.
# 4. Run for at least 1 full business cycle (typically 2 weeks).
# 5. Check for sample ratio mismatch: abs(n_control - n_treatment) / expected < 0.01
# 6. Analyze with analyze_experiment() and report lift + CI, not just p-value.
# 7. Apply Bonferroni correction if testing multiple metrics: alpha / n_metrics
```

### 2. Build a Feature Engineering Pipeline

```python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

def build_feature_pipeline(numeric_cols, categorical_cols, date_cols=None):
    """
    Returns a fitted-ready ColumnTransformer for structured tabular data.
    """
    numeric_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale",  StandardScaler()),
    ])
    categorical_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ])
    transformers = [
        ("num", numeric_pipeline, numeric_cols),
        ("cat", categorical_pipeline, categorical_cols),
    ]
    return ColumnTransformer(transformers, remainder="drop")

def add_time_features(df, date_col):
    """Extract cyclical and lag features from a datetime column."""
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col])
    df["dow_sin"] = np.sin(2 * np.pi * df[date_col].dt.dayofweek / 7)
    df["dow_cos"] = np.cos(2 * np.pi * df[date_col].dt.dayofweek / 7)
    df["month_sin"] = np.sin(2 * np.pi * df[date_col].dt.month / 12)
    df["month_cos"] = np.cos(2 * np.pi * df[date_col].dt.month / 12)
    df["is_weekend"] = (df[date_col].dt.dayofweek >= 5).astype(int)
    return df

# --- Feature engineering checklist ---
# 1. Never fit transformers on the full dataset — fit on train, transform test.
# 2. Log-transform right-skewed numeric features before scaling.
# 3. For high-cardinality categoricals (>50 levels), use target encoding or embeddings.
# 4. Generate lag/rolling features BEFORE the train/test split to avoid leakage.
# 5. Document each feature's business meaning alongside its code.
```

### 3. Train, Evaluate, and Select a Prediction Model

```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, roc_auc_score, average_precision_score
import xgboost as xgb
import mlflow

SCORERS = {
    "roc_auc":  make_scorer(roc_auc_score, needs_proba=True),
    "avg_prec": make_scorer(average_precision_score, needs_proba=True),
}

def evaluate_model(model, X, y, cv=5):
    """
    Cross-validate and return mean ± std for each scorer.
    Use StratifiedKFold for classification to preserve class balance.
    """
    cv_results = cross_validate(
        model, X, y,
        cv=StratifiedKFold(n_splits=cv, shuffle=True, random_state=42),
        scoring=SCORERS,
        return_train_score=True,
    )
    summary = {}
    for metric in SCORERS:
        test_scores = cv_results[f"test_{metric}"]
        summary[metric] = {"mean": test_scores.mean(), "std": test_scores.std()}
        # Flag overfitting: large gap between train and test score
        train_mean = cv_results[f"train_{metric}"].mean()
        summary[metric]["overfit_gap"] = train_mean - test_scores.mean()
    return summary

def train_and_log(model, X_train, y_train, X_test, y_test, run_name):
    """Train model and log all artefacts to MLflow."""
    with mlflow.start_run(run_name=run_name):
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_test)[:, 1]
        metrics = {
            "roc_auc":  roc_auc_score(y_test, proba),
            "avg_prec": average_precision_score(y_test, proba),
        }
        mlflow.log_params(model.get_params())
        mlflow.log_metrics(metrics)
        mlflow.sklearn.log_model(model, "model")
        return metrics

# --- Model evaluation checklist ---
# 1. Always report AUC-PR alongside AUC-ROC for imbalanced datasets.
# 2. Check overfit_gap > 0.05 as a warning sign of overfitting.
# 3. Calibrate probabilities (Platt scaling / isotonic) before production use.
# 4. Compute SHAP values to validate feature importance makes business sense.
# 5. Run a baseline (e.g. DummyClassifier) and verify the model beats it.
# 6. Log every run to MLflow — never rely on notebook output for comparison.
```

### 4. Causal Inference: Difference-in-Differences

```python
import statsmodels.formula.api as smf

def diff_in_diff(df, outcome, treatment_col, post_col, controls=None):
    """
    Estimate ATT via OLS DiD with optional covariates.
    df must have: outcome, treatment_col (0/1), post_col (0/1).
    Returns the interaction coefficient (treatment × post) and its p-value.
    """
    covariates = " + ".join(controls) if controls else ""
    formula = (
        f"{outcome} ~ {treatment_col} * {post_col}"
        + (f" + {covariates}" if covariates else "")
    )
    result = smf.ols(formula, data=df).fit(cov_type="HC3")
    interaction = f"{treatment_col}:{post_col}"
    return {
        "att":     result.params[interaction],
        "p_value": result.pvalues[interaction],
        "ci_95":   result.conf_int().loc[interaction].tolist(),
        "summary": result.summary(),
    }

# --- Causal inference checklist ---
# 1. Validate parallel trends in pre-period before trusting DiD estimates.
# 2. Use HC3 robust standard errors to handle heteroskedasticity.
# 3. For panel data, cluster SEs at the unit level (add groups= param to fit).
# 4. Consider propensity score matching if groups differ at baseline.
# 5. Report the ATT with confidence interval, not just statistical significance.
```

## Reference Documentation

- **Statistical Methods:** `references/statistical_methods_advanced.md`
- **Experiment Design Frameworks:** `references/experiment_design_frameworks.md`
- **Feature Engineering Patterns:** `references/feature_engineering_patterns.md`

## Common Commands

```bash
# Testing & linting
python -m pytest tests/ -v --cov=src/
python -m black src/ && python -m pylint src/

# Training & evaluation
python scripts/train.py --config prod.yaml
python scripts/evaluate.py --model best.pth

# Deployment
docker build -t service:v1 .
kubectl apply -f k8s/
helm upgrade service ./charts/

# Monitoring & health
kubectl logs -f deployment/service
python scripts/health_check.py
```

Related Skills

Database Engineering Mastery

3891
from openclaw/skills

> Complete database design, optimization, migration, and operations system. From schema design to production monitoring — covers PostgreSQL, MySQL, SQLite, and general SQL patterns.

Coding & Development

Data Room Builder

3891
from openclaw/skills

Build a structured virtual data room checklist and folder hierarchy for fundraising, M&A, or due diligence.

Workflow & Productivity

Data Privacy & Protection Program

3891
from openclaw/skills

You are a **Data Privacy Officer (DPO) agent** — a comprehensive privacy program architect. You help organizations build, operate, and mature privacy programs that comply with global regulations (GDPR, CCPA/CPRA, LGPD, PIPEDA, POPIA, APPI, PDPA) while enabling business growth.

Privacy & Compliance

Data Migration Planner

3891
from openclaw/skills

Plan, execute, and validate data migrations between systems. Covers schema mapping, ETL pipeline design, rollback strategies, and post-migration validation.

Workflow & Productivity

Data Governance Framework

3891
from openclaw/skills

Assess, score, and remediate your organization's data governance posture across 6 domains.

Security

Data Analyst — AfrexAI ⚡📊

3891
from openclaw/skills

**Transform raw data into decisions. Not just charts — answers.**

Data & Research

data-analysis-partner

3891
from openclaw/skills

智能数据分析 Skill,输入 CSV/Excel 文件和分析需求,输出带交互式 ECharts 图表的 HTML 自包含分析报告

Data & Research

Made-in-China Trade Data Analyst

3891
from openclaw/skills

**Short Description**

ercdata

3891
from openclaw/skills

Store, verify, and manage AI data on the Ethereum blockchain (Base network) using the ERCData standard. Use when an agent needs to store data fingerprints on-chain, verify data integrity, create audit trails, manage access control for private data, or interact with the ERCData smart contract. Supports public and private storage, EIP-712 verification, snapshots, and batch operations.

Security

Data Engineering Command Center

3880
from openclaw/skills

Complete methodology for designing, building, operating, and scaling data pipelines and infrastructure. Zero dependencies — pure agent skill.

Data Engineering

ths-financial-data

3891
from openclaw/skills

该skill用于获取股票市场数据,包括实时行情、中文名称查询、键盘缩写转换、资金流向和日K线数据。使用thsdk库提供同花顺数据接口支持。支持自动将中文、缩写、短代码转换为thsdk所需的完整ths_code格式。当匹配到多只股票时,会返回候选列表供用户选择。如未安装thsdk,会自动安装。

datafast-analytics

3891
from openclaw/skills

Query DataFast website analytics and visitor data via the DataFast API for metrics, time series, realtime stats, breakdowns, visitor details, and goal/payment management.