experiment-tracking-setup
Guide for experiment tracking tool setup (MLflow, Weights & Biases, etc.), reproducibility assurance, model registry, and experiment comparison methodology. Use this skill for ML experiment management involving 'experiment tracking', 'MLflow', 'W&B', 'Weights and Biases', 'reproducibility', 'model registry', 'experiment comparison', 'hyperparameter logging', etc. Enhances the training-manager's experiment management capabilities. Note: model architecture design and feature engineering are outside this skill's scope.
Best use case
experiment-tracking-setup is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using experiment-tracking-setup should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/experiment-tracking-setup/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How experiment-tracking-setup Compares
| Feature / Agent | experiment-tracking-setup | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
# Experiment Tracking Setup — Experiment Tracking and Reproducibility Guide
A practical guide for ML experiment tracking, reproducibility assurance, and model version management.
## MLflow Setup
### Basic Structure
```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Assumes `model` (e.g. an XGBClassifier) and the splits
# X_train, y_train, X_test, y_test are defined earlier.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("order-prediction")

with mlflow.start_run(run_name="xgboost-v2"):
    # Parameter logging
    mlflow.log_params({
        "model": "XGBClassifier",
        "n_estimators": 500,
        "max_depth": 6,
        "learning_rate": 0.1,
    })

    # Training
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # Metric logging
    mlflow.log_metrics({
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions),
        "precision": precision_score(y_test, predictions),
        "recall": recall_score(y_test, predictions),
    })

    # Save model
    mlflow.sklearn.log_model(model, "model")

    # Save artifacts (these files must already exist on disk)
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.log_artifact("feature_importance.csv")
```
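Hyperparameters kept in nested configuration files (such as the YAML under configs/ in the project template) usually need flattening before logging, because a nested dict value would typically end up stored as one stringified parameter. A minimal sketch, assuming a hypothetical helper named `flatten_params` and dot-separated keys (neither comes from MLflow):

```python
def flatten_params(config, parent_key="", sep="."):
    """Flatten a nested dict into a single-level dict with dotted keys.

    Hypothetical helper for illustration; not part of MLflow.
    """
    flat = {}
    for key, value in config.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections, carrying the key prefix along
            flat.update(flatten_params(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Illustrative config; the section names are assumptions
config = {
    "model": {"name": "XGBClassifier", "max_depth": 6},
    "train": {"learning_rate": 0.1},
}
flat_config = flatten_params(config)
print(flat_config)
```

The flattened dict can then be passed to `mlflow.log_params(flat_config)` inside an active run.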
### Auto-logging
```python
import mlflow

# Framework-specific auto-logging; call before training starts
mlflow.sklearn.autolog()     # scikit-learn
mlflow.xgboost.autolog()     # XGBoost
mlflow.lightgbm.autolog()    # LightGBM
mlflow.pytorch.autolog()     # PyTorch (Lightning-based training)
mlflow.tensorflow.autolog()  # TensorFlow
```
## Reproducibility Assurance Checklist
### Required Recording Items
```python
import hashlib
import platform
import subprocess
import sys

import mlflow
import torch

# Assumes X_train and X_test are defined earlier.
with open("data.csv", "rb") as f:
    data_hash = hashlib.md5(f.read()).hexdigest()

reproducibility_info = {
    # Environment
    "python_version": sys.version,
    "os": platform.platform(),
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "N/A",
    # Seeds
    "random_seed": 42,
    "numpy_seed": 42,
    "torch_seed": 42,
    # Data
    "data_version": "v2.1",
    "data_hash": data_hash,
    "train_size": len(X_train),
    "test_size": len(X_test),
    "split_method": "StratifiedKFold(5)",
    # Code
    "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
    "git_branch": subprocess.check_output(["git", "branch", "--show-current"]).decode().strip(),
}
mlflow.log_params(reproducibility_info)
```
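Reading a whole dataset into memory just to hash it can be costly for large files. A minimal sketch of chunked hashing, assuming a hypothetical helper named `file_sha256`; it uses SHA-256 rather than the MD5 in the block above, and either works as a data fingerprint:

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file so the example runs anywhere
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"id,label\n1,0\n2,1\n")
    demo_path = tmp.name

print(file_sha256(demo_path))
os.remove(demo_path)
```

The resulting digest can be logged as the `data_hash` value in place of the in-memory version.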
### Seed Fixing
```python
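import os
import random

import numpy as np
import torch


def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Only affects child processes; to fix hashing in the current
    # interpreter, set PYTHONHASHSEED before Python starts.
    os.environ['PYTHONHASHSEED'] = str(seed)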
```
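Global seeds can be disturbed by any call that consumes random numbers between seeding and use. One complementary option, sketched here with a hypothetical `split_indices` helper, is to give each random step its own `random.Random(seed)` instance so the result depends only on the recorded seed:

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) from a shuffle driven by a local RNG."""
    rng = random.Random(seed)  # independent of the global random state
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_a, test_a = split_indices(100, seed=42)
random.random()  # consuming global randomness does not affect the split
train_b, test_b = split_indices(100, seed=42)
print(train_a == train_b and test_a == test_b)
```

NumPy and PyTorch offer comparable local generators, so the same pattern can be applied there.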
### Dependency Pinning
```bash
# requirements.txt with exact versions
pip freeze > requirements.txt
# pip-compile (recommended)
pip-compile requirements.in --generate-hashes
# conda
conda env export --no-builds > environment.yml
```
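Alongside the lock files above, the package versions actually installed at run time can be recorded with each run. A minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` and the package list are illustrative assumptions:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Example package list; adjust to the project's actual dependencies
print(installed_versions(["numpy", "scikit-learn", "mlflow"]))
```

The returned dict is flat, so it can be logged as run parameters next to the other reproducibility fields.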
## Model Registry
### MLflow Model Registry Workflow
```
Experiment
└── Run
    └── Model Artifact
        └── Model Registration (Model Registry)
            ├── Stage: Staging → Validation
            ├── Stage: Production → Deployment
            └── Stage: Archived → Archive
```
```python
import mlflow

# Register model (run_id is the ID of the run that logged the model)
mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="order-prediction-model",
)

# Stage transition
# Note: stages are deprecated since MLflow 2.9 in favor of model version aliases.
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="order-prediction-model",
    version=3,
    stage="Production",
)

# Load Production model
model = mlflow.pyfunc.load_model("models:/order-prediction-model/Production")
```
## Experiment Comparison Framework
### Statistical Verification
```python
from scipy import stats

# Compare 5-fold CV results (scores must come from the same folds, in the same order)
model_a_scores = [0.85, 0.87, 0.84, 0.86, 0.88]
model_b_scores = [0.82, 0.84, 0.83, 0.81, 0.85]

# Paired t-test
t_stat, p_value = stats.ttest_rel(model_a_scores, model_b_scores)
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant difference")
```
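The paired t-test works on per-fold differences: for differences d_i over n folds, the statistic is t = mean(d) / (s_d / sqrt(n)), where s_d is the sample standard deviation of the differences. A minimal standard-library sketch of that computation on the same scores, using a hypothetical helper named `paired_t_statistic`; it also returns the mean difference, which shows how large the gap is rather than only whether it is significant:

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (mean difference, t statistic) for paired samples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (std_diff / math.sqrt(len(diffs)))
    return mean_diff, t_stat


model_a_scores = [0.85, 0.87, 0.84, 0.86, 0.88]
model_b_scores = [0.82, 0.84, 0.83, 0.81, 0.85]
mean_diff, t_stat = paired_t_statistic(model_a_scores, model_b_scores)
print(f"mean difference: {mean_diff:.3f}, t = {t_stat:.2f}")
```

With 5 folds the test has 4 degrees of freedom; `stats.ttest_rel` should report the same statistic together with the p-value.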
### Experiment Comparison Table
```markdown
| Experiment | Model | F1 | Precision | Recall | Training Time | Inference Time |
|-----------|-------|-----|-----------|--------|--------------|---------------|
| exp-001 | LogReg (baseline) | 0.78 | 0.80 | 0.76 | 2s | 0.1ms |
| exp-002 | XGBoost | 0.85 | 0.87 | 0.83 | 45s | 0.5ms |
| exp-003 | LightGBM | 0.86 | 0.88 | 0.84 | 20s | 0.3ms |
| exp-004 | LightGBM + Optuna | 0.88 | 0.89 | 0.87 | 2h | 0.3ms |
| exp-005 | Stacking (top3) | 0.89 | 0.90 | 0.88 | 3h | 1.2ms |
```
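A table like this tends to drift from the logged results when maintained by hand. A minimal sketch of generating it from run records, assuming a hypothetical `to_markdown_table` helper and rows supplied as plain dicts (fetching them from the tracking server is left out):

```python
def to_markdown_table(rows, columns):
    """Render a list of dicts as a Markdown table with the given column order."""
    header = "| " + " | ".join(columns) + " |"
    divider = "|" + "|".join("---" for _ in columns) + "|"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


# Illustrative rows taken from the table above
runs = [
    {"Experiment": "exp-001", "Model": "LogReg (baseline)", "F1": 0.78},
    {"Experiment": "exp-002", "Model": "XGBoost", "F1": 0.85},
]
table = to_markdown_table(runs, ["Experiment", "Model", "F1"])
print(table)
```

Missing keys render as empty cells, so runs that lack a metric still produce a well-formed row.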
## Project Structure Template
```
ml-project/
├── data/
│   ├── raw/           # Original data (do not modify)
│   ├── processed/     # Preprocessed
│   └── external/      # External data
├── notebooks/         # Exploratory analysis
├── src/
│   ├── data/          # Data loading/preprocessing
│   ├── features/      # Feature engineering
│   ├── models/        # Model definitions
│   └── evaluation/    # Evaluation logic
├── configs/           # Hyperparameter YAML
├── models/            # Trained models
├── reports/           # Analysis reports
├── requirements.txt
└── Makefile           # Reproducible execution
```