pytorch-trainer

PyTorch model training skill with custom training loops, gradient management, and GPU optimization.

509 stars

Best use case

pytorch-trainer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using pytorch-trainer should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/pytorch-trainer/SKILL.md --create-dirs "https://raw.githubusercontent.com/a5c-ai/babysitter/main/library/specializations/data-science-ml/skills/pytorch-trainer/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/pytorch-trainer/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How pytorch-trainer Compares

| Feature / Agent | pytorch-trainer | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

PyTorch model training skill with custom training loops, gradient management, and GPU optimization.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# pytorch-trainer

## Overview

PyTorch model training skill with custom training loops, gradient management, GPU optimization, and integration with experiment tracking systems.

## Capabilities

- Custom training loop execution
- Learning rate scheduling (StepLR, CosineAnnealingLR, OneCycleLR, etc.)
- Gradient clipping and accumulation
- Mixed precision training (AMP)
- Checkpoint management and resumption
- DataLoader optimization
- Multi-GPU training (DataParallel, DistributedDataParallel)
- Early stopping with patience
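
Two of the capabilities above, early stopping with patience and gradient accumulation, come down to simple bookkeeping that can be sketched without PyTorch. The following is a minimal illustration, not the skill's actual implementation: `EarlyStopping`, `min_delta`, and `optimizer_step_batches` are hypothetical names, and in a real loop the returned batch indices would be the points where `optimizer.step()` and `optimizer.zero_grad()` are called.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs (hypothetical helper)."""
    patience: int = 3
    min_delta: float = 0.0        # assumed knob: minimum drop that counts as an improvement
    best: float = float("inf")
    stale_epochs: int = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


def optimizer_step_batches(num_batches: int, accum_steps: int) -> list[int]:
    """Return the 0-based batch indices after which the optimizer would step.

    Gradients accumulate over `accum_steps` batches; a final partial group
    at the end of the epoch still triggers a step.
    """
    return [
        i for i in range(num_batches)
        if (i + 1) % accum_steps == 0 or i + 1 == num_batches
    ]


stopper = EarlyStopping(patience=2)
val_losses = [0.90, 0.70, 0.72, 0.71, 0.69]
stopped_at = next(
    (epoch for epoch, loss in enumerate(val_losses) if stopper.step(loss)),
    None,
)
print(stopped_at)                     # 3: two epochs in a row failed to beat 0.70
print(optimizer_step_batches(10, 4))  # [3, 7, 9]
```

With `batchSize: 32` and `gradientAccumulation: 4`, each optimizer step would see gradients from up to 128 samples, which is the usual reason to combine the two settings.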

## Target Processes

- Model Training Pipeline with Experiment Tracking
- Distributed Training Orchestration
- AutoML Pipeline Orchestration

## Tools and Libraries

- PyTorch
- PyTorch Lightning (optional)
- torchvision, torchaudio, torchtext
- CUDA toolkit

## Input Schema

```json
{
  "type": "object",
  "required": ["modelPath", "dataConfig", "trainingConfig"],
  "properties": {
    "modelPath": {
      "type": "string",
      "description": "Path to model definition file"
    },
    "dataConfig": {
      "type": "object",
      "properties": {
        "trainPath": { "type": "string" },
        "valPath": { "type": "string" },
        "batchSize": { "type": "integer" },
        "numWorkers": { "type": "integer" }
      }
    },
    "trainingConfig": {
      "type": "object",
      "properties": {
        "epochs": { "type": "integer" },
        "learningRate": { "type": "number" },
        "optimizer": { "type": "string" },
        "scheduler": { "type": "string" },
        "mixedPrecision": { "type": "boolean" },
        "gradientClipping": { "type": "number" },
        "gradientAccumulation": { "type": "integer" }
      }
    },
    "checkpointConfig": {
      "type": "object",
      "properties": {
        "saveDir": { "type": "string" },
        "saveEvery": { "type": "integer" },
        "resumeFrom": { "type": "string" }
      }
    }
  }
}
```
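
As a rough illustration of what the schema above requires, the sketch below checks a config dict using only the Python standard library. The field names and kinds are copied from the schema; the `matches` and `validate_input` helpers are hypothetical, and the type check is a simplification of full JSON Schema validation (for example, it does not accept `1.0` as an integer).

```python
# Field kinds copied from the Input Schema; the checker itself is a hypothetical helper.
REQUIRED = ("modelPath", "dataConfig", "trainingConfig")
SECTIONS = {
    "dataConfig": {"trainPath": "string", "valPath": "string",
                   "batchSize": "integer", "numWorkers": "integer"},
    "trainingConfig": {"epochs": "integer", "learningRate": "number",
                       "optimizer": "string", "scheduler": "string",
                       "mixedPrecision": "boolean", "gradientClipping": "number",
                       "gradientAccumulation": "integer"},
    "checkpointConfig": {"saveDir": "string", "saveEvery": "integer",
                         "resumeFrom": "string"},
}


def matches(value, kind: str) -> bool:
    """Simplified JSON Schema type check for the four kinds used above."""
    if kind == "boolean":
        return isinstance(value, bool)
    if isinstance(value, bool):
        return False  # bool is an int subclass in Python; reject it for other kinds
    if kind == "integer":
        return isinstance(value, int)
    if kind == "number":
        return isinstance(value, (int, float))
    return isinstance(value, str)  # "string"


def validate_input(config: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    errors = [f"missing required field: {k}" for k in REQUIRED if k not in config]
    if "modelPath" in config and not matches(config["modelPath"], "string"):
        errors.append("modelPath: expected string")
    for section, fields in SECTIONS.items():
        values = config.get(section)
        if values is None:
            continue
        if not isinstance(values, dict):
            errors.append(f"{section}: expected object")
            continue
        for name, kind in fields.items():
            if name in values and not matches(values[name], kind):
                errors.append(f"{section}.{name}: expected {kind}")
    return errors


good = {
    "modelPath": "models/resnet.py",
    "dataConfig": {"trainPath": "data/train", "valPath": "data/val",
                   "batchSize": 32, "numWorkers": 4},
    "trainingConfig": {"epochs": 100, "learningRate": 0.001, "mixedPrecision": True},
}
bad = {"modelPath": "models/resnet.py", "dataConfig": {"batchSize": "32"}}

print(validate_input(good))  # []
print(validate_input(bad))
```

Note that only the three top-level fields are required; every nested property, and `checkpointConfig` as a whole, is optional in the schema as written.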

## Output Schema

```json
{
  "type": "object",
  "required": ["status", "metrics", "checkpointPath"],
  "properties": {
    "status": {
      "type": "string",
      "enum": ["success", "error", "early_stopped"]
    },
    "metrics": {
      "type": "object",
      "properties": {
        "trainLoss": { "type": "number" },
        "valLoss": { "type": "number" },
        "trainAccuracy": { "type": "number" },
        "valAccuracy": { "type": "number" },
        "epochsTrained": { "type": "integer" },
        "trainingTime": { "type": "number" }
      }
    },
    "checkpointPath": {
      "type": "string"
    },
    "learningCurve": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "epoch": { "type": "integer" },
          "trainLoss": { "type": "number" },
          "valLoss": { "type": "number" }
        }
      }
    }
  }
}
```
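
To make the shape of a conforming result concrete, here is a small sketch that assembles one from per-epoch records. It rests on assumptions the schema does not state: `build_result` is a hypothetical helper, the reported losses are taken from the last epoch, `trainingTime` is treated as seconds, and the checkpoint path is invented for illustration.

```python
def build_result(learning_curve, checkpoint_path, training_time, early_stopped=False):
    """Assemble a dict with the three required output fields plus the learning curve."""
    if not learning_curve:
        # No completed epoch: report an error with empty metrics.
        return {"status": "error", "metrics": {},
                "checkpointPath": checkpoint_path, "learningCurve": []}
    last = learning_curve[-1]
    return {
        "status": "early_stopped" if early_stopped else "success",
        "metrics": {
            "trainLoss": last["trainLoss"],   # assumption: metrics reflect the final epoch
            "valLoss": last["valLoss"],
            "epochsTrained": len(learning_curve),
            "trainingTime": training_time,    # assumption: seconds
        },
        "checkpointPath": checkpoint_path,
        "learningCurve": learning_curve,
    }


curve = [
    {"epoch": 1, "trainLoss": 0.92, "valLoss": 0.95},
    {"epoch": 2, "trainLoss": 0.61, "valLoss": 0.70},
    {"epoch": 3, "trainLoss": 0.48, "valLoss": 0.72},
]
result = build_result(curve, "checkpoints/epoch_3.pt", 184.5, early_stopped=True)
print(result["status"], result["metrics"]["epochsTrained"])  # early_stopped 3
```

The accuracy fields (`trainAccuracy`, `valAccuracy`) are left out here because the schema does not mark any property inside `metrics` as required.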

## Usage Example

```javascript
{
  kind: 'skill',
  title: 'Train PyTorch model',
  skill: {
    name: 'pytorch-trainer',
    context: {
      modelPath: 'models/resnet.py',
      dataConfig: {
        trainPath: 'data/train',
        valPath: 'data/val',
        batchSize: 32,
        numWorkers: 4
      },
      trainingConfig: {
        epochs: 100,
        learningRate: 0.001,
        optimizer: 'AdamW',
        scheduler: 'cosine',
        mixedPrecision: true,
        gradientClipping: 1.0
      }
    }
  }
}
```
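
The example sets `scheduler: 'cosine'` with `learningRate: 0.001` over 100 epochs. The source does not say how that string is interpreted; assuming it maps to plain cosine annealing over the full run with a minimum learning rate of 0 (the closed form documented for `torch.optim.lr_scheduler.CosineAnnealingLR`), the per-epoch learning rate could be sketched as follows. `cosine_lr` is a hypothetical helper written with the standard library only.

```python
import math


def cosine_lr(epoch: int, base_lr: float, total_epochs: int, min_lr: float = 0.0) -> float:
    """Closed-form cosine annealing: decays from base_lr at epoch 0 to min_lr at total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cosine)


# Values taken from the usage example: learningRate 0.001, epochs 100.
schedule = [cosine_lr(epoch, 0.001, 100) for epoch in range(101)]

for epoch in (0, 25, 50, 75, 100):
    print(f"epoch {epoch:3d}: lr = {schedule[epoch]:.6f}")
```

The rate starts at the configured value, passes through half of it at the midpoint, and reaches the minimum at the final epoch.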

Related Skills

All skills below are from a5c-ai/babysitter (509 stars).

  • vqc-trainer: Variational quantum classifier training skill with gradient optimization.
  • calibration-trainer: Probability calibration training skill for improving forecast accuracy and reducing overconfidence.
  • tensorflow-trainer: TensorFlow/Keras model training skill with callbacks, distributed strategies, and TensorBoard integration.
  • sklearn-model-trainer: Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.
  • ray-distributed-trainer: Distributed computing skill using Ray for parallel training, hyperparameter search, and resource management.
  • process-builder: Scaffold new babysitter process definitions following SDK patterns, proper structure, and best practices. Guides the 3-phase workflow from research to implementation.

Workflow & Productivity

  • babysitter: Orchestrate via @babysitter. Use this skill when asked to babysit a run, orchestrate a process, or whenever it is called explicitly (babysit, babysitter, orchestrate, orchestrate a run, workflow, etc.).
  • yolo: Run Babysitter autonomously with minimal manual interruption.
  • user-install: Install the user-level Babysitter Codex setup.
  • team-install: Install the team-pinned Babysitter Codex workspace setup.
  • retrospect: Summarize or retrospect on a completed Babysitter run.
  • resume: Resume an existing Babysitter run from Codex.