experiment-tracking-setup

Guide for experiment tracking tool setup (MLflow, Weights & Biases, etc.), reproducibility assurance, model registry, and experiment comparison methodology. Use this skill for ML experiment management involving 'experiment tracking', 'MLflow', 'W&B', 'Weights and Biases', 'reproducibility', 'model registry', 'experiment comparison', 'hyperparameter logging', etc. Enhances the training-manager's experiment management capabilities. Note: model architecture design and feature engineering are outside this skill's scope.


by revfactory


Best use case

experiment-tracking-setup is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using experiment-tracking-setup should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/experiment-tracking-setup/SKILL.md --create-dirs "https://raw.githubusercontent.com/revfactory/harness-100/main/en/31-ml-experiment/.claude/skills/experiment-tracking-setup/skill.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/experiment-tracking-setup/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How experiment-tracking-setup Compares

Feature / Agent	experiment-tracking-setup	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

SKILL.md Source

# Experiment Tracking Setup — Experiment Tracking and Reproducibility Guide

A practical guide for ML experiment tracking, reproducibility assurance, and model version management.

## MLflow Setup

### Basic Structure

```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from xgboost import XGBClassifier

# Assumes X_train, X_test, y_train, y_test are already loaded (binary labels)
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("order-prediction")

params = {"n_estimators": 500, "max_depth": 6, "learning_rate": 0.1}
model = XGBClassifier(**params)

with mlflow.start_run(run_name="xgboost-v2"):
    # Parameter logging (the same dict that configures the model)
    mlflow.log_params({"model": "XGBClassifier", **params})

    # Training
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # Metric logging
    mlflow.log_metrics({
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions),
        "precision": precision_score(y_test, predictions),
        "recall": recall_score(y_test, predictions),
    })

    # Save model
    mlflow.sklearn.log_model(model, "model")

    # Save artifacts (these files must already exist on disk)
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.log_artifact("feature_importance.csv")
```

### Auto-logging

```python
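import mlflow

# Framework-specific auto-logging; enable only the frameworks you use,
# and call autolog() before training starts
mlflow.sklearn.autolog()     # scikit-learn
mlflow.xgboost.autolog()     # XGBoost
mlflow.lightgbm.autolog()    # LightGBM
mlflow.pytorch.autolog()     # PyTorch Lightning (plain PyTorch training loops are not covered)
mlflow.tensorflow.autolog()  # TensorFlow / Keras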
```

## Reproducibility Assurance Checklist

### Required Recording Items

```python
import hashlib
import platform
import subprocess
import sys

import mlflow
import torch


def git_output(*args):
    """Run a git command and return its trimmed stdout."""
    return subprocess.check_output(["git", *args]).decode().strip()


# Hash the dataset; the file is closed as soon as it has been read
with open("data.csv", "rb") as f:
    data_hash = hashlib.md5(f.read()).hexdigest()

# Assumes X_train and X_test are already defined
reproducibility_info = {
    # Environment
    "python_version": sys.version,
    "os": platform.platform(),
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "N/A",

    # Seeds
    "random_seed": 42,
    "numpy_seed": 42,
    "torch_seed": 42,

    # Data
    "data_version": "v2.1",
    "data_hash": data_hash,
    "train_size": len(X_train),
    "test_size": len(X_test),
    "split_method": "StratifiedKFold(5)",

    # Code
    "git_commit": git_output("rev-parse", "HEAD"),
    "git_branch": git_output("branch", "--show-current"),
}
mlflow.log_params(reproducibility_info)
```
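Reading the whole file into memory, as the `data_hash` entry above does, can be costly for large datasets. A minimal sketch of a streaming alternative follows. The helper name `file_sha256`, the chunk size, and the choice of SHA-256 instead of MD5 are assumptions for illustration, not part of the original skill.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes (EOF)
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string could replace the `data_hash` value, for example `"data_hash": file_sha256("data.csv")`.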

### Seed Fixing

```python
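import os
import random

import numpy as np
import torch


def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Only affects subprocesses started afterwards; to fix hash randomization
    # for this process, set PYTHONHASHSEED before launching Python
    os.environ['PYTHONHASHSEED'] = str(seed)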
```
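To check that seeding behaves as expected, it can help to run a small stochastic routine twice and compare the outputs. The sketch below uses only the standard-library `random` module. `sample_with_seed` is a hypothetical helper, and the same kind of check would apply to NumPy or PyTorch generators.

```python
import random


def sample_with_seed(seed, n=5):
    """Draw n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return [rng.random() for _ in range(n)]


first = sample_with_seed(42)
second = sample_with_seed(42)
other = sample_with_seed(43)

print("same seed reproduces the sequence:", first == second)
print("different seed changes the sequence:", first != other)
```

A check like this is cheap enough to run at the start of a training script, before any expensive work begins.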

### Dependency Pinning

```bash
# requirements.txt with exact versions
pip freeze > requirements.txt

# pip-compile (recommended)
pip-compile requirements.in --generate-hashes

# conda
conda env export --no-builds > environment.yml
```
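Pinned requirement files describe the intended environment. It can also be useful to record what was actually installed when a run executed. The following is a minimal sketch using the standard-library `importlib.metadata`. The function names and the output file name `installed_versions.json` are assumptions for illustration.

```python
import json
from importlib import metadata


def installed_versions():
    """Map each installed distribution name to its version string."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that lack a name
            versions[name] = str(dist.version)
    return dict(sorted(versions.items(), key=lambda item: item[0].lower()))


def write_versions(path="installed_versions.json"):
    """Write the installed versions to a JSON file and return the mapping."""
    versions = installed_versions()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(versions, fh, indent=2)
    return versions
```

The resulting file could then be attached to a run with `mlflow.log_artifact("installed_versions.json")`.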

## Model Registry

### MLflow Model Registry Workflow

```
Experiment
└── Run
    └── Model Artifact
        └── Model Registration (Model Registry)
            ├── Stage: Staging → Validation
            ├── Stage: Production → Deployment
            └── Stage: Archived → Archive
```

```python
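import mlflow

# run_id identifies the run that logged the model, e.g. captured as
# `run.info.run_id` inside `with mlflow.start_run() as run:`

# Register model
mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="order-prediction-model"
)

# Stage transition (stages are deprecated as of MLflow 2.9 in favor of
# model version aliases; `version` must already exist in the registry)
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="order-prediction-model",
    version=3,
    stage="Production"
)

# Load Production model
model = mlflow.pyfunc.load_model("models:/order-prediction-model/Production")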
```
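MLflow 2.9 and later deprecate registry stages in favor of model version aliases. The sketch below shows an alias-based counterpart to the stage transition above, assuming an MLflow version that provides `MlflowClient.set_registered_model_alias`. The helper name `promote_to_alias` and the alias `champion` are illustrative choices, not part of the original skill.

```python
def promote_to_alias(client, name, version, alias="champion"):
    """Point `alias` at the given model version and return its model URI.

    `client` is expected to be an mlflow.tracking.MlflowClient, or any object
    exposing the same set_registered_model_alias(name, alias, version) method.
    """
    client.set_registered_model_alias(name, alias, str(version))
    return f"models:/{name}@{alias}"


# Usage against a real registry (requires a reachable tracking server):
#   import mlflow
#   client = mlflow.tracking.MlflowClient()
#   uri = promote_to_alias(client, "order-prediction-model", 3)
#   model = mlflow.pyfunc.load_model(uri)
```

Taking the client as a parameter keeps the helper easy to exercise with a stub object in unit tests, without a running server.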

## Experiment Comparison Framework

### Statistical Verification

```python
from scipy import stats

# Compare 5-fold CV results (both models must be scored on the same folds)
model_a_scores = [0.85, 0.87, 0.84, 0.86, 0.88]
model_b_scores = [0.82, 0.84, 0.83, 0.81, 0.85]

# Paired t-test
t_stat, p_value = stats.ttest_rel(model_a_scores, model_b_scores)
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant difference")
```
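A p-value alone does not say how large the improvement is. As a complement, the sketch below computes the mean per-fold difference and the paired t statistic by hand with the standard library, using the same scores as above. `paired_t` is a hypothetical helper, and it assumes both models were evaluated on identical folds.

```python
import math
import statistics


def paired_t(a, b):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation, n - 1)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / std_err, len(diffs) - 1


model_a_scores = [0.85, 0.87, 0.84, 0.86, 0.88]
model_b_scores = [0.82, 0.84, 0.83, 0.81, 0.85]

mean_diff, t_stat, dof = paired_t(model_a_scores, model_b_scores)
print(f"mean difference: {mean_diff:.3f}, t = {t_stat:.3f}, df = {dof}")
```

Reporting the mean difference next to the p-value makes it easier to judge whether a statistically significant gain is also practically meaningful.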

### Experiment Comparison Table

```markdown
| Experiment | Model | F1 | Precision | Recall | Training Time | Inference Time |
|-----------|-------|-----|-----------|--------|--------------|---------------|
| exp-001 | LogReg (baseline) | 0.78 | 0.80 | 0.76 | 2s | 0.1ms |
| exp-002 | XGBoost | 0.85 | 0.87 | 0.83 | 45s | 0.5ms |
| exp-003 | LightGBM | 0.86 | 0.88 | 0.84 | 20s | 0.3ms |
| exp-004 | LightGBM + Optuna | 0.88 | 0.89 | 0.87 | 2h | 0.3ms |
| exp-005 | Stacking (top3) | 0.89 | 0.90 | 0.88 | 3h | 1.2ms |
```
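A table like the one above can be generated from run results rather than maintained by hand. The following is a minimal sketch, assuming results are available as a list of dictionaries. The function name `to_markdown_table` and the row layout are assumptions, and the sample values are copied from the first two rows of the table above.

```python
def to_markdown_table(rows, columns):
    """Render a list of dicts as a Markdown table with the given column order."""
    header = "| " + " | ".join(columns) + " |"
    divider = "|" + "|".join("---" for _ in columns) + "|"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


results = [
    {"Experiment": "exp-001", "Model": "LogReg (baseline)", "F1": 0.78},
    {"Experiment": "exp-002", "Model": "XGBoost", "F1": 0.85},
]
table = to_markdown_table(results, ["Experiment", "Model", "F1"])
print(table)
```

Missing keys render as empty cells, so runs that did not record a given metric still appear in the table.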

## Project Structure Template

```
ml-project/
├── data/
│   ├── raw/              # Original data (do not modify)
│   ├── processed/        # Preprocessed
│   └── external/         # External data
├── notebooks/            # Exploratory analysis
├── src/
│   ├── data/             # Data loading/preprocessing
│   ├── features/         # Feature engineering
│   ├── models/           # Model definitions
│   └── evaluation/       # Evaluation logic
├── configs/              # Hyperparameter YAML
├── models/               # Trained models
├── reports/              # Analysis reports
├── requirements.txt
└── Makefile              # Reproducible execution
```
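Hyperparameter files kept under `configs/` are often nested, while `mlflow.log_params` takes a flat mapping of names to values. A minimal sketch of flattening a nested config into dotted keys follows. The helper name `flatten` and the example config are assumptions for illustration. Loading the YAML itself is left out.

```python
def flatten(config, prefix=""):
    """Flatten a nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into sub-sections, carrying the dotted prefix along
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


config = {
    "model": {"n_estimators": 500, "max_depth": 6},
    "training": {"seed": 42},
}
flat_params = flatten(config)
print(flat_params)
```

The flattened dictionary could then be passed to `mlflow.log_params(flat_params)` inside an active run.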
