sklearn-model-trainer

Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.

509 stars

Best use case

sklearn-model-trainer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using sklearn-model-trainer should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/sklearn-model-trainer/SKILL.md --create-dirs "https://raw.githubusercontent.com/a5c-ai/babysitter/main/library/specializations/data-science-ml/skills/sklearn-model-trainer/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/sklearn-model-trainer/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How sklearn-model-trainer Compares

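| Feature / Agent | sklearn-model-trainer | Standard Approach |
|-----------------|-----------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |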

Frequently Asked Questions

What does this skill do?

Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.

Where can I find the source code?

The source code is in the a5c-ai/babysitter repository on GitHub, under library/specializations/data-science-ml/skills/sklearn-model-trainer/.

SKILL.md Source

# Scikit-learn Model Trainer

Train machine learning models using scikit-learn with cross-validation, hyperparameter tuning, and pipeline construction.

## Overview

This skill provides comprehensive capabilities for training machine learning models using scikit-learn. It supports the full model development workflow from data preprocessing through model training, evaluation, and serialization.

## Capabilities

### Model Training
- Train classification models (LogisticRegression, RandomForest, SVM, etc.)
- Train regression models (LinearRegression, GradientBoosting, etc.)
- Train clustering models (KMeans, DBSCAN, etc.)
- Support for ensemble methods (VotingClassifier, Stacking, etc.)
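
The ensemble support listed above can be sketched with a soft-voting classifier. This is a minimal illustration, not the skill's own implementation: the synthetic dataset from `make_classification` and the choice of base estimators are assumptions made only so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data stands in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Soft voting averages predicted class probabilities, so every
# base estimator must implement predict_proba
ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('nb', GaussianNB()),
    ],
    voting='soft',
)
ensemble.fit(X_train, y_train)

test_accuracy = ensemble.score(X_test, y_test)
print(f"Ensemble test accuracy: {test_accuracy:.3f}")
```

Switching to `voting='hard'` would use majority class votes instead of averaged probabilities, which also works with estimators that lack `predict_proba`.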

### Cross-Validation
- K-fold cross-validation
- Stratified K-fold for imbalanced datasets
- Time series split for temporal data
- Leave-one-out and leave-p-out validation
- Custom cross-validation strategies
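
As a rough sketch of two of the strategies above, the following compares a stratified splitter on imbalanced data with a time-ordered splitter. The synthetic dataset and the `balanced_accuracy` metric are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score

# Imbalanced synthetic data, roughly a 90/10 class split (assumption for illustration)
X, y = make_classification(
    n_samples=200, n_features=8, weights=[0.9, 0.1], random_state=42
)
model = LogisticRegression(max_iter=1000)

# Stratified K-fold keeps the class ratio roughly constant in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(
    model, X, y, cv=skf, scoring='balanced_accuracy'
)
print(f"Stratified CV balanced accuracy: {stratified_scores.mean():.3f}")

# TimeSeriesSplit never shuffles: each training window ends before its test window
tscv = TimeSeriesSplit(n_splits=4)
time_splits = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(time_splits):
    print(f"fold {fold}: train size={len(train_idx)}, test size={len(test_idx)}")
```

Any splitter object with a `split` method can be passed as `cv`, which is how custom strategies plug into `cross_val_score` and the search classes.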

### Hyperparameter Tuning
- GridSearchCV for exhaustive search
- RandomizedSearchCV for random sampling
- Halving search strategies for efficiency
- Custom scoring functions
- Multi-metric evaluation
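
Random sampling, a custom scoring function, and multi-metric evaluation can be combined in a single search. The sketch below assumes a synthetic dataset and an F2 scorer picked purely as an example of a custom metric; the parameter ranges are illustrative, not recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data stands in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Custom scorer: F2 weights recall more heavily than precision
f2_scorer = make_scorer(fbeta_score, beta=2)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        'n_estimators': randint(20, 100),  # sampled uniformly from [20, 100)
        'max_depth': [3, 5, None],
    },
    n_iter=5,                 # number of sampled parameter settings
    cv=3,
    scoring={'accuracy': 'accuracy', 'f2': f2_scorer},
    refit='f2',               # with several metrics, refit names the one used to pick the best model
    random_state=42,
)
search.fit(X, y)

print(f"Best parameters: {search.best_params_}")
print(f"Best mean F2: {search.best_score_:.3f}")
```

With multi-metric scoring, `cv_results_` holds one set of columns per metric (for example `mean_test_accuracy` and `mean_test_f2`), so the metrics can be compared after a single search.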

### Pipeline Construction
- Feature preprocessing pipelines
- Column transformers for heterogeneous data
- Feature selection integration
- Composite pipelines with caching
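
A pipeline that combines feature selection with transformer caching might look like the sketch below. The temporary cache directory, the synthetic data, and the choice of `SelectKBest` are assumptions for illustration only.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration)
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=42
)

# Fitted transformers are cached on disk in this directory
cache_dir = mkdtemp()
try:
    cached_pipeline = Pipeline(
        steps=[
            ('scaler', StandardScaler()),
            ('select', SelectKBest(score_func=f_classif, k=5)),
            ('classifier', LogisticRegression(max_iter=1000)),
        ],
        memory=cache_dir,
    )
    cached_pipeline.fit(X, y)

    # Inspect fitted steps through named_steps rather than the original objects
    n_selected = int(cached_pipeline.named_steps['select'].get_support().sum())
    print(f"Features kept by SelectKBest: {n_selected}")
finally:
    # Remove the cache once it is no longer needed
    rmtree(cache_dir)
```

Caching tends to pay off during hyperparameter searches, where the same preprocessing would otherwise be refitted for every candidate of the final estimator.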

### Model Serialization
- Save models with joblib (recommended)
- Pickle serialization
- ONNX export for interoperability
- Model versioning support
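
One simple way to pair joblib serialization with versioning is to store the model alongside metadata. The dictionary keys and the file name below are a hypothetical convention, not something defined by scikit-learn or by this skill.

```python
import tempfile
from pathlib import Path

import joblib
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Bundle the model with metadata; these keys are a hypothetical convention
artifact = {
    'model': model,
    'sklearn_version': sklearn.__version__,
    'model_version': '1.0.0',
}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'model-v1.0.0.joblib'
    joblib.dump(artifact, model_path)
    restored = joblib.load(model_path)

# Flag a mismatch between the training and loading environments
if restored['sklearn_version'] != sklearn.__version__:
    print("Warning: model was saved with a different scikit-learn version")

predictions_match = bool((restored['model'].predict(X) == model.predict(X)).all())
print(f"Round-trip predictions match: {predictions_match}")
```

Recording the scikit-learn version matters because pickled estimators are not guaranteed to load correctly across library versions. Files produced by joblib or pickle should only be loaded from trusted sources, since loading can execute arbitrary code.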

## Prerequisites

### Installation
```bash
# Quote the requirement so the shell does not treat ">=" as output redirection
pip install "scikit-learn>=1.0.0" joblib pandas numpy
```

### Optional Dependencies
```bash
# For ONNX export
pip install skl2onnx onnxruntime

# For additional preprocessing
pip install category_encoders imbalanced-learn
```

## Usage Patterns

### Basic Model Training
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
import joblib

# Split data (X and y are assumed to be an already-loaded feature matrix and label vector)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
model.fit(X_train, y_train)

# Cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Save model
joblib.dump(model, 'model.joblib')
```

### Pipeline with Preprocessing
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

# Define preprocessing
numeric_features = ['age', 'income', 'score']
categorical_features = ['category', 'region']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Create full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier())
])

# Train
pipeline.fit(X_train, y_train)
```

### Hyperparameter Tuning with GridSearchCV
```python
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 10, None],
    'classifier__learning_rate': [0.01, 0.1, 0.2]
}

# Grid search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Get best model
best_model = grid_search.best_estimator_
```

### Feature Selection
```python
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.ensemble import RandomForestClassifier

# Method 1: SelectFromModel
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='median'
)
X_selected = selector.fit_transform(X_train, y_train)

# Method 2: Recursive Feature Elimination
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=10,
    step=1
)
X_rfe = rfe.fit_transform(X_train, y_train)

# Get selected feature names (assumes X_train is a pandas DataFrame)
selected_features = X_train.columns[rfe.support_].tolist()
```

## Integration with Babysitter SDK

### Task Definition Example
```javascript
const sklearnTrainingTask = defineTask({
  name: 'sklearn-model-training',
  description: 'Train a scikit-learn model with cross-validation',

  inputs: {
    modelType: { type: 'string', required: true },
    trainDataPath: { type: 'string', required: true },
    targetColumn: { type: 'string', required: true },
    hyperparameters: { type: 'object', default: {} },
    cvFolds: { type: 'number', default: 5 },
    scoringMetric: { type: 'string', default: 'accuracy' }
  },

  outputs: {
    modelPath: { type: 'string' },
    cvScores: { type: 'array' },
    bestScore: { type: 'number' },
    featureImportances: { type: 'object' }
  },

  async run(inputs, taskCtx) {
    return {
      kind: 'skill',
      title: `Train ${inputs.modelType} model`,
      skill: {
        name: 'sklearn-model-trainer',
        context: {
          operation: 'train_with_cv',
          modelType: inputs.modelType,
          trainDataPath: inputs.trainDataPath,
          targetColumn: inputs.targetColumn,
          hyperparameters: inputs.hyperparameters,
          cvFolds: inputs.cvFolds,
          scoringMetric: inputs.scoringMetric
        }
      },
      io: {
        inputJsonPath: `tasks/${taskCtx.effectId}/input.json`,
        outputJsonPath: `tasks/${taskCtx.effectId}/result.json`
      }
    };
  }
});
```

## Model Selection Guide

### Classification Models

| Model | Use Case | Pros | Cons |
|-------|----------|------|------|
| LogisticRegression | Binary/multiclass, interpretable | Fast, interpretable | Linear boundary |
| RandomForestClassifier | General purpose | Robust, handles nonlinearity | Can overfit |
| GradientBoostingClassifier | High accuracy needed | State-of-the-art performance | Slower training |
| SVC | Small/medium datasets | Effective in high dimensions | Slow on large data |
| XGBClassifier (from the separate `xgboost` package) | Competition/production | Fast, accurate | Many hyperparameters |

### Regression Models

| Model | Use Case | Pros | Cons |
|-------|----------|------|------|
| LinearRegression | Baseline, interpretable | Simple, fast | Assumes linearity |
| Ridge/Lasso | Regularization needed | Prevents overfitting | Still linear |
| RandomForestRegressor | General purpose | Handles nonlinearity | Can overfit |
| GradientBoostingRegressor | High accuracy | Excellent performance | Slower |
| SVR | Small datasets | Robust to outliers | Slow scaling |
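
The tables above are rules of thumb. In practice, candidates are often compared empirically on the data at hand. The sketch below does this with cross-validation for three of the listed classifiers; the synthetic dataset and the candidate settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

candidates = {
    # Scaling sits inside the pipeline so it is refitted on each training fold
    'LogisticRegression': make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    'RandomForestClassifier': RandomForestClassifier(
        n_estimators=50, random_state=42
    ),
    'GradientBoostingClassifier': GradientBoostingClassifier(
        n_estimators=50, random_state=42
    ),
}

# Mean cross-validated accuracy per candidate
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring='accuracy')
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

best_name = max(results, key=results.get)
print(f"Best candidate by mean CV accuracy: {best_name}")
```

Small differences between mean scores are often within fold-to-fold noise, so the standard deviation is worth reading alongside the mean before picking a winner.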

## Best Practices

1. **Always Use Pipelines**: Prevent data leakage by including preprocessing in pipelines
2. **Stratified Splits**: Use stratified sampling for imbalanced classification
3. **Cross-Validation**: Never tune hyperparameters on test data
4. **Feature Scaling**: Apply appropriate scaling for distance-based models
5. **Random Seeds**: Set random_state for reproducibility
6. **Model Persistence**: Use joblib over pickle for large numpy arrays
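
Several of these practices can be seen together in one short sketch: a stratified split, a fixed random seed, and scaling placed inside a pipeline for a distance-based model. The synthetic dataset and the choice of `KNeighborsClassifier` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# k-NN is distance-based, so features are scaled inside the pipeline
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
pipeline.fit(X_train, y_train)

# The scaler statistics come from the training split only, so no
# information from the test split leaks into preprocessing
scaler_mean = pipeline.named_steps['scaler'].mean_
uses_train_only = bool(np.allclose(scaler_mean, X_train.mean(axis=0)))
print(f"Scaler fitted on training data only: {uses_train_only}")
print(f"Test accuracy: {pipeline.score(X_test, y_test):.3f}")
```

Fitting the scaler on the full dataset before splitting would instead let test-set statistics influence training, which is the leakage the first practice warns about.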

## References

- [Scikit-learn Documentation](https://scikit-learn.org/stable/)
- [Scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html)
- [Claude Scientific Skills - sklearn](https://github.com/K-Dense-AI/claude-scientific-skills)
- [ML Models as MCP Tools](https://medium.com/@premlaknaboina/how-to-wrap-machine-learning-models-as-mcp-tools-1e510b21f1f9)
