ml-experiment
A full ML pipeline where an agent team collaborates to perform data preparation, model design, training, evaluation, and deployment readiness. Use this skill for 'design an ML experiment', 'train a model', 'machine learning project', 'build a deep learning model', 'classification model', 'regression model', 'data preprocessing', 'model evaluation', 'hyperparameter tuning', 'MLOps setup', 'XGBoost model', 'PyTorch model', and other ML experiment tasks. It also supports data-preprocessing-only or evaluation-only requests. Note: direct deployment to model serving infrastructure (SageMaker/Vertex AI), large-scale distributed training cluster management, and real-time inference service operation are outside this skill's scope.
Best use case
ml-experiment is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using ml-experiment should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/ml-experiment/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How ml-experiment Compares
| Feature / Agent | ml-experiment | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It runs a full ML pipeline in which an agent team handles data preparation, model design, training, evaluation, and deployment readiness. It can also run only the data-preprocessing or evaluation stages. Direct deployment to serving infrastructure, distributed training cluster management, and real-time inference operation are out of scope.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# ML Experiment — Full ML Pipeline
An agent team collaborates to perform the full ML experiment lifecycle: data preparation → model design → training → evaluation → deployment readiness.
## Execution Mode
**Agent Team** — 5 members communicate directly via SendMessage and cross-validate.
## Agent Composition
| Agent | File | Role | Type |
|-------|------|------|------|
| data-engineer | `.claude/agents/data-engineer.md` | Collection, preprocessing, feature engineering | general-purpose |
| model-designer | `.claude/agents/model-designer.md` | Architecture, hyperparameters, loss functions | general-purpose |
| training-manager | `.claude/agents/training-manager.md` | Experiment tracking, checkpoints, reproducibility | general-purpose |
| evaluation-analyst | `.claude/agents/evaluation-analyst.md` | Metrics, bias verification, interpretability | general-purpose |
| experiment-reviewer | `.claude/agents/experiment-reviewer.md` | Cross-validation, reproducibility, final report | general-purpose |
## Workflow
### Phase 1: Preparation (Orchestrator performs directly)
1. Extract from user input:
- **Problem Definition**: Classification/regression/generation/recommendation/time-series, etc.
- **Data**: Data source, files, format, scale
- **Target Metric**: Specific goals such as accuracy, F1, RMSE
- **Constraints** (optional): Framework, GPU, inference speed, model size
- **Existing Code** (optional): Existing models, preprocessing code, experiment results
2. Create `_workspace/` directory at the project root
3. Organize input and save to `_workspace/00_input.md`
4. If existing files are present, copy them to `_workspace/` and skip the corresponding phase
5. **Determine execution mode** based on request scope
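The workspace steps in Phase 1 can be sketched as below. This is a minimal illustration, not the skill's actual implementation: the function name `prepare_workspace` and the bullet layout of `00_input.md` are assumptions; only the `_workspace/` directory and the `00_input.md` filename come from the steps above.

```python
import shutil
from pathlib import Path


def prepare_workspace(project_root, request_fields, existing_files=()):
    """Create _workspace/, write 00_input.md, and copy user-provided files.

    request_fields maps a heading (e.g. "Problem Definition") to the value
    extracted from the user's request. The Markdown layout is an assumption.
    """
    workspace = Path(project_root) / "_workspace"
    workspace.mkdir(parents=True, exist_ok=True)

    # Organize the extracted request into a simple Markdown bullet list.
    lines = ["# Input", ""]
    for heading, value in request_fields.items():
        lines.append(f"- **{heading}**: {value}")
    (workspace / "00_input.md").write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Copy existing artifacts (code, models, results) so their phases can be skipped.
    copied = []
    for src in existing_files:
        dest = workspace / Path(src).name
        shutil.copy2(src, dest)
        copied.append(dest)
    return workspace, copied
```

Returning the copied paths lets the orchestrator decide which phases to skip in step 5.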
### Phase 2: Team Assembly and Execution
| Order | Task | Owner | Dependencies | Output |
|-------|------|-------|-------------|--------|
| 1 | Data Preparation | data-engineer | None | `_workspace/01_data_preparation.md` |
| 2 | Model Design | model-designer | Task 1 | `_workspace/02_model_design.md` |
| 3 | Training Setup | training-manager | Tasks 1, 2 | `_workspace/03_training_config.md` |
| 4 | Evaluation Analysis | evaluation-analyst | Tasks 1, 2, 3 | `_workspace/04_evaluation_report.md` |
| 5 | Experiment Review | experiment-reviewer | Tasks 1-4 | `_workspace/05_review_report.md` |
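The Dependencies column forms a directed acyclic graph, so a valid execution order can be derived mechanically. The sketch below uses the standard-library `graphlib` module (Python 3.9+). The dictionaries mirror the table; the helper name `execution_order` is hypothetical.

```python
from graphlib import TopologicalSorter

# Task number -> set of prerequisite tasks, copied from the Phase 2 table.
TASK_DEPENDENCIES = {
    1: set(),
    2: {1},
    3: {1, 2},
    4: {1, 2, 3},
    5: {1, 2, 3, 4},
}

TASK_OWNERS = {
    1: "data-engineer",
    2: "model-designer",
    3: "training-manager",
    4: "evaluation-analyst",
    5: "experiment-reviewer",
}


def execution_order(dependencies):
    """Return task numbers in an order that respects every dependency.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

Because each task here depends on all earlier ones, the order is strictly sequential; a looser dependency table would let `TopologicalSorter` expose tasks that could run in parallel.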
**Inter-team communication flow:**
- data-engineer completes → sends feature/shape/data characteristics to model-designer, the data loader to training-manager, and the class distribution to evaluation-analyst
- model-designer completes → sends model code and the hyperparameter space to training-manager, and the model structure and evaluation metrics to evaluation-analyst
- training-manager completes → sends training curves, the best model, and experiment logs to evaluation-analyst
- evaluation-analyst completes → sends the evaluation report to experiment-reviewer
- experiment-reviewer cross-validates all outputs. If 🔴 must-fix issues are found, it sends correction requests to the relevant agent → rework → re-verify (up to 2 times)
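The bounded review loop can be sketched as follows. This assumes hypothetical callables: `review(outputs)` returns a list of `(agent, issue)` must-fix findings, and `request_fix(agent, issue, outputs)` returns updated outputs. In the real skill these exchanges happen through SendMessage rather than function calls.

```python
MAX_REWORK_ROUNDS = 2  # "up to 2 times" per the workflow description


def review_with_rework(review, request_fix, outputs):
    """Review outputs and send must-fix issues back for at most two rework rounds.

    review and request_fix are hypothetical interfaces (see lead-in).
    Returns the final outputs and any must-fix issues that remain open.
    """
    issues = review(outputs)
    rounds = 0
    while issues and rounds < MAX_REWORK_ROUNDS:
        for agent, issue in issues:
            outputs = request_fix(agent, issue, outputs)
        rounds += 1
        issues = review(outputs)  # re-verify after each rework round
    return outputs, issues
```

Any issues still returned after the second round are the ones to record in the review report instead of looping indefinitely.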
### Phase 3: Integration and Final Outputs
1. Check all files in `_workspace/`
2. Verify that all 🔴 must-fix items from the review report have been addressed
3. Report final summary to the user:
- Data Preparation — `01_data_preparation.md`
- Model Design — `02_model_design.md`
- Training Configuration — `03_training_config.md`
- Evaluation Report — `04_evaluation_report.md`
- Review Report — `05_review_report.md`
- Experiment Code — `experiment_code/`
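Step 1 of Phase 3 (checking all files in `_workspace/`) can be sketched as a simple presence check. The expected names come from the list above; the function name `missing_outputs` is hypothetical. Partial modes (e.g. Data Mode) would pass a shorter `expected` list.

```python
from pathlib import Path

# Final outputs listed in Phase 3; experiment_code is a directory inside _workspace/.
EXPECTED_OUTPUTS = [
    "01_data_preparation.md",
    "02_model_design.md",
    "03_training_config.md",
    "04_evaluation_report.md",
    "05_review_report.md",
    "experiment_code",
]


def missing_outputs(workspace, expected=EXPECTED_OUTPUTS):
    """Return the expected outputs that are not present in the workspace."""
    root = Path(workspace)
    return [name for name in expected if not (root / name).exists()]
```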
## Scale-Based Modes
| User Request Pattern | Execution Mode | Agents Deployed |
|---------------------|---------------|-----------------|
| "Design the full ML experiment" | **Full Pipeline** | All 5 |
| "Preprocess the data" | **Data Mode** | data-engineer + reviewer |
| "Design the model architecture" | **Model Mode** | model-designer + reviewer |
| "Evaluate this model" (existing results) | **Evaluation Mode** | evaluation-analyst + reviewer |
| "Review this experiment" | **Review Mode** | reviewer only |
**Leveraging existing files**: If the user provides preprocessing code, trained models, etc., skip the corresponding steps.
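The mode table can be expressed as a lookup from mode to deployed agents. The keyword rules below are a hypothetical illustration only. The orchestrator is expected to interpret the request itself rather than match substrings.

```python
# Agents deployed per execution mode, following the Scale-Based Modes table.
MODE_AGENTS = {
    "full": ["data-engineer", "model-designer", "training-manager",
             "evaluation-analyst", "experiment-reviewer"],
    "data": ["data-engineer", "experiment-reviewer"],
    "model": ["model-designer", "experiment-reviewer"],
    "evaluation": ["evaluation-analyst", "experiment-reviewer"],
    "review": ["experiment-reviewer"],
}

# Hypothetical keyword rules, checked in order; the first match wins.
KEYWORD_RULES = [
    ("full ml experiment", "full"),
    ("preprocess", "data"),
    ("architecture", "model"),
    ("evaluate", "evaluation"),
    ("review", "review"),
]


def select_mode(request):
    """Pick an execution mode for a request, defaulting to the full pipeline."""
    text = request.lower()
    for keyword, mode in KEYWORD_RULES:
        if keyword in text:
            return mode
    return "full"
```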
## Data Transfer Protocol
| Strategy | Method | Purpose |
|----------|--------|---------|
| File-based | `_workspace/` directory | Primary output storage and sharing |
| Code-based | `_workspace/experiment_code/` | Executable code |
| Message-based | SendMessage | Real-time key information transfer, correction requests |
File naming convention: `{order}_{agent}_{output}.{extension}`
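A helper that applies the naming convention might look like this. Zero-padding `{order}` to two digits is an assumption based on the `01_` to `05_` prefixes in the Phase 2 table. Note that those table filenames (e.g. `01_data_preparation.md`) omit the `{agent}` segment. The function name is hypothetical.

```python
def output_filename(order, agent, output, extension):
    """Build a filename following {order}_{agent}_{output}.{extension}.

    The order is zero-padded to two digits (an assumption, see lead-in),
    and a leading dot on the extension is tolerated.
    """
    return f"{order:02d}_{agent}_{output}.{extension.lstrip('.')}"
```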
## Error Handling
| Error Type | Strategy |
|-----------|----------|
| Data not provided | Recommend public datasets + provide synthetic data generation code |
| No GPU | CPU-optimized settings + prioritize lightweight models |
| Problem type unclear | Infer from data characteristics + request user confirmation |
| Training divergence | Suggest learning-rate reduction, gradient clipping, or batch size adjustment |
| Agent failure | 1 retry → proceed without that output if failed, note omission in review report |
| 🔴 found in review | Send correction request to relevant agent → rework → re-verify (up to 2 times) |
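For the training-divergence row, gradient clipping by global norm rescales all gradients together so that their combined L2 norm does not exceed a threshold. In PyTorch this is typically done with `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`. The plain-Python sketch below only shows the arithmetic on lists of floats and is not a drop-in replacement.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale gradient vectors so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm or total_norm == 0.0:
        # Already within bounds: return unchanged copies.
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads], total_norm
```

Scaling every gradient by the same factor keeps the update direction and only shrinks its magnitude, which is why clipping is a common first response to exploding losses.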
## Test Scenarios
### Normal Flow
**Prompt**: "Build a survival prediction classification model using the Kaggle Titanic dataset. Target F1 score above 0.85."
**Expected Results**:
- Data: EDA (missing values, distributions, correlations), preprocessing pipeline (Imputer+Scaler+Encoder), stratified split
- Model: Baseline (LogisticRegression) + XGBoost + RandomForest design
- Training: Optuna hyperparameter tuning, MLflow experiment tracking
- Evaluation: Confusion Matrix, SHAP analysis, model comparison, statistical verification
- Review: Absence of data leakage confirmed, reproducibility confirmed, validity of conclusions verified
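To check the "F1 above 0.85" target, the evaluation reduces to the binary F1 formula, F1 = 2 · precision · recall / (precision + recall). In practice `sklearn.metrics.f1_score` would be used; the dependency-free sketch below just makes the computation explicit. The helper names are hypothetical, and `positive=1` assumes "survived" is encoded as 1.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Binary F1 = 2 * precision * recall / (precision + recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both 0 (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


TARGET_F1 = 0.85  # from the test prompt


def meets_target(y_true, y_pred, target=TARGET_F1):
    """True if the predictions reach the target F1 score."""
    return f1_score_binary(y_true, y_pred) >= target
```

With three true positives, one false positive and one false negative, precision and recall are both 0.75, so F1 is 0.75 and the target is not met.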
### Existing File Flow
**Prompt**: "Evaluate this trained model and suggest improvement directions" + model file attached
**Expected Results**:
- Copy existing model to `_workspace/`
- Evaluation mode: evaluation-analyst + reviewer deployed
- Performance analysis, error analysis, improvement recommendations provided
### Error Flow
**Prompt**: "Build a machine learning model, but I don't have data yet"
**Expected Results**:
- Request problem type confirmation
- Recommend 3-5 public datasets (UCI/Kaggle/HuggingFace)
- Provide synthetic data generation code
- State "Full pipeline can be executed after data acquisition"
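The "synthetic data generation code" could be as simple as the sketch below: two Gaussian clusters with a fixed seed so that runs are reproducible. The function name and cluster parameters are illustrative assumptions. With scikit-learn installed, `sklearn.datasets.make_classification` offers a richer alternative.

```python
import random


def make_synthetic_classification(n_samples=200, n_features=4, seed=42):
    """Generate a toy, balanced binary classification dataset.

    Class 0 is centred at -1 and class 1 at +1 on every feature, with unit
    standard deviation. A fixed seed makes the dataset reproducible.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so the classes are balanced
        center = 1.0 if label == 1 else -1.0
        X.append([rng.gauss(center, 1.0) for _ in range(n_features)])
        y.append(label)
    return X, y
```

The dataset is large enough to run the whole pipeline end to end (split, train, evaluate) before real data is acquired.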
## Agent Extension Skills
| Skill | Path | Enhanced Agent | Role |
|-------|------|---------------|------|
| feature-engineering-cookbook | `.claude/skills/feature-engineering-cookbook/skill.md` | data-engineer | Numeric/categorical/time-series transformations, feature selection, data leakage prevention |
| model-selection-guide | `.claude/skills/model-selection-guide/skill.md` | model-designer, evaluation-analyst | Model recommendations by problem, hyperparameter tuning, ensembles |
| experiment-tracking-setup | `.claude/skills/experiment-tracking-setup/skill.md` | training-manager | MLflow setup, reproducibility, model registry, experiment comparison |

Related Skills
experiment-tracking-setup
Guide for experiment tracking tool setup (MLflow, Weights & Biases, etc.), reproducibility assurance, model registry, and experiment comparison methodology. Use this skill for ML experiment management involving 'experiment tracking', 'MLflow', 'W&B', 'Weights and Biases', 'reproducibility', 'model registry', 'experiment comparison', 'hyperparameter logging', etc. Enhances the training-manager's experiment management capabilities. Note: model architecture design and feature engineering are outside this skill's scope.
sustainability-audit
Full audit pipeline for ESG/sustainability where an agent team collaborates to generate environmental, social, and governance assessments along with an integrated report and improvement plan. Use this skill for requests such as 'run an ESG audit', 'write a sustainability report', 'ESG assessment', 'carbon emissions calculation', 'ESG rating diagnosis', 'governance review', 'social responsibility assessment', 'GRI report', 'TCFD disclosure', 'ESG improvement plan', and other ESG/sustainability tasks. Also supports assessment of specific pillars (E/S/G) only or improving existing reports. However, actual on-site audit execution, third-party verification certificate issuance, ESG rating agency score changes, and carbon credit trading are outside the scope of this skill.
materiality-assessment
ESG materiality assessment matrix. Referenced by the esg-reporter and improvement-planner agents when evaluating ESG issue materiality and setting priorities. Use for 'materiality assessment', 'importance analysis', or 'Materiality Matrix' requests. Stakeholder surveys and external certification are out of scope.
ghg-protocol
GHG Protocol detailed guide. Referenced by the environmental-analyst agent when calculating and reporting greenhouse gas emissions. Use for 'GHG Protocol', 'carbon emissions', 'Scope 1/2/3', or 'carbon footprint' requests. Carbon credit trading and CDM project execution are out of scope.
citation-standards
Academic citation and reference standards guide. Referenced by the paper-writer and submission-preparer agents when composing citations and references. Use for 'citation format', 'APA', or 'references' requests. Original paper retrieval and professional database access are out of scope.
academic-paper
Full research pipeline for academic paper writing where an agent team collaborates to generate research design, experiment protocols, analysis, manuscript writing, and submission preparation. Use this skill for requests such as 'write an academic paper', 'research paper writing', 'help me write a paper', 'design a study', 'run statistical analysis', 'prepare journal submission', 'manuscript writing', 'research methodology design', 'hypothesis testing', 'academic writing', and other academic research paper tasks. Also supports analysis, rewriting, and submission preparation when existing data or drafts are available. However, actual data collection execution, official IRB submission, journal system login and upload, and running actual statistical software are outside the scope of this skill.
product-copy-formulas
Product copy formula library. Referenced by the detail-page-writer and marketing-manager agents when writing purchase-driving copy. Use for 'product copy', 'marketing copy', or 'ad copy' requests. Ad placement and design mockup creation are out of scope.
ecommerce-launcher
Full launch pipeline for e-commerce products where an agent team collaborates to generate product planning, detail pages, pricing strategy, marketing, and CS setup all at once. Use this skill for requests such as 'launch an e-commerce product', 'prepare a product launch', 'register a product on Naver Smart Store', 'launch on Coupang', 'create a detail page', 'develop a pricing strategy', 'create a marketing plan', 'launch prep', 'product planning brief', 'e-commerce CS manual', and other e-commerce product launch tasks. Also supports supplementing pricing/marketing/CS even when existing briefs or detail pages are provided. However, actual platform API integration (automated product registration), payment system development, logistics system integration, and real-time order management are outside the scope of this skill.
conversion-optimization
Purchase conversion optimization framework. Referenced by the detail-page-writer and pricing-strategist agents when designing detail pages and pricing with a conversion focus. Use for 'conversion rate optimization', 'CRO', or 'purchase psychology' requests. A/B testing tool setup and funnel automation are out of scope.
real-estate-analyst
Real estate investment analysis pipeline. An agent team collaborates to produce market research, location analysis, profitability analysis, risk assessment, and investment reports. Use this skill for requests such as 'analyze this real estate', 'apartment investment analysis', 'studio apartment yield', 'real estate market research', 'location analysis', 'real estate investment report', 'buy vs lease', 'reconstruction investment analysis', 'commercial property yield analysis', and other general real estate investment analysis tasks. Actual purchase contracts, brokerage services, interior design, and property management are outside the scope of this skill.
location-scoring
Location scoring scorecard. Referenced by the location-analyst agent for systematic real estate location evaluation. Use for requests involving 'location analysis', 'location assessment', or 'commercial area analysis'. On-site inspections and surveying are out of scope.
cap-rate-calculator
Real estate yield calculator. Reference formulas and models used by the profitability-analyst agent for quantitative investment return analysis. Use for requests involving 'Cap Rate', 'yield analysis', 'DCF', or 'cash flow analysis'. Tax advisory and loan underwriting are out of scope.