sklearn-model-trainer
Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.
Best use case
sklearn-model-trainer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.
Teams using sklearn-model-trainer should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/sklearn-model-trainer/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How sklearn-model-trainer Compares
| Feature / Agent | sklearn-model-trainer | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Scikit-learn Model Trainer
Train machine learning models using scikit-learn with cross-validation, hyperparameter tuning, and pipeline construction.
## Overview
This skill provides comprehensive capabilities for training machine learning models using scikit-learn. It supports the full model development workflow from data preprocessing through model training, evaluation, and serialization.
## Capabilities
### Model Training
- Train classification models (LogisticRegression, RandomForest, SVM, etc.)
- Train regression models (LinearRegression, GradientBoosting, etc.)
- Train clustering models (KMeans, DBSCAN, etc.)
- Support for ensemble methods (VotingClassifier, Stacking, etc.)
### Cross-Validation
- K-fold cross-validation
- Stratified K-fold for imbalanced datasets
- Time series split for temporal data
- Leave-one-out and leave-p-out validation
- Custom cross-validation strategies
### Hyperparameter Tuning
- GridSearchCV for exhaustive search
- RandomizedSearchCV for random sampling
- Halving search strategies for efficiency
- Custom scoring functions
- Multi-metric evaluation
### Pipeline Construction
- Feature preprocessing pipelines
- Column transformers for heterogeneous data
- Feature selection integration
- Composite pipelines with caching
### Model Serialization
- Save models with joblib (recommended)
- Pickle serialization
- ONNX export for interoperability
- Model versioning support
## Prerequisites
### Installation
```bash
pip install scikit-learn>=1.0.0 joblib pandas numpy
```
### Optional Dependencies
```bash
# For ONNX export
pip install skl2onnx onnxruntime
# For additional preprocessing
pip install category_encoders imbalanced-learn
```
## Usage Patterns
### Basic Model Training
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
import joblib
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train model
model = RandomForestClassifier(
n_estimators=100,
max_depth=10,
random_state=42
)
model.fit(X_train, y_train)
# Cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
# Save model
joblib.dump(model, 'model.joblib')
```
### Pipeline with Preprocessing
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
# Define preprocessing
numeric_features = ['age', 'income', 'score']
categorical_features = ['category', 'region']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
]
)
# Create full pipeline
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', GradientBoostingClassifier())
])
# Train
pipeline.fit(X_train, y_train)
```
### Hyperparameter Tuning with GridSearchCV
```python
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'classifier__n_estimators': [50, 100, 200],
'classifier__max_depth': [3, 5, 10, None],
'classifier__learning_rate': [0.01, 0.1, 0.2]
}
# Grid search
grid_search = GridSearchCV(
pipeline,
param_grid,
cv=5,
scoring='f1_weighted',
n_jobs=-1,
verbose=2
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
# Get best model
best_model = grid_search.best_estimator_
```
### Feature Selection
```python
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.ensemble import RandomForestClassifier
# Method 1: SelectFromModel
selector = SelectFromModel(
RandomForestClassifier(n_estimators=100, random_state=42),
threshold='median'
)
X_selected = selector.fit_transform(X_train, y_train)
# Method 2: Recursive Feature Elimination
rfe = RFE(
estimator=RandomForestClassifier(n_estimators=100, random_state=42),
n_features_to_select=10,
step=1
)
X_rfe = rfe.fit_transform(X_train, y_train)
# Get selected features
selected_features = X.columns[rfe.support_].tolist()
```
## Integration with Babysitter SDK
### Task Definition Example
```javascript
const sklearnTrainingTask = defineTask({
name: 'sklearn-model-training',
description: 'Train a scikit-learn model with cross-validation',
inputs: {
modelType: { type: 'string', required: true },
trainDataPath: { type: 'string', required: true },
targetColumn: { type: 'string', required: true },
hyperparameters: { type: 'object', default: {} },
cvFolds: { type: 'number', default: 5 },
scoringMetric: { type: 'string', default: 'accuracy' }
},
outputs: {
modelPath: { type: 'string' },
cvScores: { type: 'array' },
bestScore: { type: 'number' },
featureImportances: { type: 'object' }
},
async run(inputs, taskCtx) {
return {
kind: 'skill',
title: `Train ${inputs.modelType} model`,
skill: {
name: 'sklearn-model-trainer',
context: {
operation: 'train_with_cv',
modelType: inputs.modelType,
trainDataPath: inputs.trainDataPath,
targetColumn: inputs.targetColumn,
hyperparameters: inputs.hyperparameters,
cvFolds: inputs.cvFolds,
scoringMetric: inputs.scoringMetric
}
},
io: {
inputJsonPath: `tasks/${taskCtx.effectId}/input.json`,
outputJsonPath: `tasks/${taskCtx.effectId}/result.json`
}
};
}
});
```
## Model Selection Guide
### Classification Models
| Model | Use Case | Pros | Cons |
|-------|----------|------|------|
| LogisticRegression | Binary/multiclass, interpretable | Fast, interpretable | Linear boundary |
| RandomForestClassifier | General purpose | Robust, handles nonlinearity | Can overfit |
| GradientBoostingClassifier | High accuracy needed | State-of-art performance | Slower training |
| SVC | Small/medium datasets | Effective in high dimensions | Slow on large data |
| XGBClassifier | Competition/production | Fast, accurate | Many hyperparameters |
### Regression Models
| Model | Use Case | Pros | Cons |
|-------|----------|------|------|
| LinearRegression | Baseline, interpretable | Simple, fast | Assumes linearity |
| Ridge/Lasso | Regularization needed | Prevents overfitting | Still linear |
| RandomForestRegressor | General purpose | Handles nonlinearity | Can overfit |
| GradientBoostingRegressor | High accuracy | Excellent performance | Slower |
| SVR | Small datasets | Robust to outliers | Slow scaling |
## Best Practices
1. **Always Use Pipelines**: Prevent data leakage by including preprocessing in pipelines
2. **Stratified Splits**: Use stratified sampling for imbalanced classification
3. **Cross-Validation**: Never tune hyperparameters on test data
4. **Feature Scaling**: Apply appropriate scaling for distance-based models
5. **Random Seeds**: Set random_state for reproducibility
6. **Model Persistence**: Use joblib over pickle for large numpy arrays
## References
- [Scikit-learn Documentation](https://scikit-learn.org/stable/)
- [Scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html)
- [Claude Scientific Skills - sklearn](https://github.com/K-Dense-AI/claude-scientific-skills)
- [ML Models as MCP Tools](https://medium.com/@premlaknaboina/how-to-wrap-machine-learning-models-as-mcp-tools-1e510b21f1f9)Related Skills
model
Inspect or change Babysitter model-routing policy by phase.
threat-modeler
Generate threat models using STRIDE, PASTA, or VAST methodologies
urdf-sdf-model
Expert skill for robot model creation and validation in URDF and SDF formats. Generate URDF files with proper link-joint hierarchy, create Xacro macros, calculate inertial properties, configure joint types, and validate models.
topic-modeling-text-mining
Apply LDA, NMF, and other computational methods to discover patterns in large text corpora with appropriate parameter tuning
systems-dynamics-modeler
Skill for building and simulating systems dynamics models
vqc-trainer
Variational quantum classifier training skill with gradient optimization
noise-modeler
Quantum noise modeling skill for simulation and hardware characterization
pymc-bayesian-modeler
PyMC probabilistic programming skill for hierarchical Bayesian models in physics data analysis
comsol-multiphysics-modeler
COMSOL finite element skill for multiphysics simulations including electromagnetics, heat transfer, and fluid dynamics
environmental-fate-modeler
Environmental nanosafety skill for modeling nanomaterial environmental fate and transport
cad-modeling
Expert skill for parametric 3D CAD model development with design intent and configuration management
stan-bayesian-modeling
Stan probabilistic programming for Bayesian inference