ray-distributed-trainer

Distributed computing skill using Ray for parallel training, hyperparameter search, and resource management.

509 stars

Best use case

ray-distributed-trainer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using ray-distributed-trainer should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/ray-distributed-trainer/SKILL.md --create-dirs "https://raw.githubusercontent.com/a5c-ai/babysitter/main/library/specializations/data-science-ml/skills/ray-distributed-trainer/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it at .claude/skills/ray-distributed-trainer/SKILL.md inside your project.
3. Restart your AI agent. It will auto-discover the skill.

How ray-distributed-trainer Compares

Feature / Agent	ray-distributed-trainer	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Distributed computing skill using Ray for parallel training, hyperparameter search, and resource management.

Where can I find the source code?

The source code is in the a5c-ai/babysitter repository on GitHub, under library/specializations/data-science-ml/skills/ray-distributed-trainer.

SKILL.md Source

# ray-distributed-trainer

## Overview

Distributed computing skill using Ray for parallel training, hyperparameter search, and resource management across clusters.

## Capabilities

- Ray Train for distributed training
- Ray Tune for hyperparameter search at scale
- Cluster resource management
- Fault tolerance and checkpointing
- Actor-based parallelism
- Integration with PyTorch and TensorFlow
- Elastic training support
- Multi-node orchestration

## Target Processes

- Distributed Training Orchestration
- AutoML Pipeline Orchestration
- Model Training Pipeline

## Tools and Libraries

- Ray
- Ray Train
- Ray Tune
- Ray Cluster

## Input Schema

```json
{
  "type": "object",
  "required": ["mode", "config"],
  "properties": {
    "mode": {
      "type": "string",
      "enum": ["train", "tune", "cluster"],
      "description": "Ray operation mode"
    },
    "config": {
      "type": "object",
      "properties": {
        "numWorkers": { "type": "integer" },
        "useGpu": { "type": "boolean" },
        "resourcesPerWorker": {
          "type": "object",
          "properties": {
            "cpu": { "type": "number" },
            "gpu": { "type": "number" }
          }
        }
      }
    },
    "trainConfig": {
      "type": "object",
      "properties": {
        "trainerPath": { "type": "string" },
        "framework": { "type": "string", "enum": ["pytorch", "tensorflow", "xgboost"] },
        "scalingConfig": { "type": "object" }
      }
    },
    "tuneConfig": {
      "type": "object",
      "properties": {
        "searchSpace": { "type": "object" },
        "scheduler": { "type": "string" },
        "numSamples": { "type": "integer" },
        "metric": { "type": "string" },
        "mode": { "type": "string", "enum": ["min", "max"] }
      }
    }
  }
}
```
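
The schema above marks only `mode` and `config` as required and restricts three fields to fixed enums. As a rough illustration of what that implies for callers, here is a minimal pre-flight check using only the Python standard library. The function name `validate_input` and the error messages are hypothetical and not part of the skill; a full JSON Schema validator would cover more cases than this sketch does.

```python
VALID_MODES = {"train", "tune", "cluster"}
VALID_FRAMEWORKS = {"pytorch", "tensorflow", "xgboost"}
VALID_OBJECTIVES = {"min", "max"}


def validate_input(payload: dict) -> list:
    """Return a list of problems; an empty list means these checks passed.

    Hypothetical helper: it mirrors the required keys and enums from the
    input schema, not the skill's own validation logic.
    """
    errors = []

    # "mode" and "config" are the only keys the schema marks as required.
    for key in ("mode", "config"):
        if key not in payload:
            errors.append(f"missing required key: {key}")

    if "mode" in payload and payload["mode"] not in VALID_MODES:
        errors.append(f"mode must be one of {sorted(VALID_MODES)}")

    config = payload.get("config", {})
    if not isinstance(config, dict):
        errors.append("config must be an object")
        config = {}

    num_workers = config.get("numWorkers")
    # bool is a subclass of int in Python, so reject it explicitly.
    if num_workers is not None and (
        isinstance(num_workers, bool) or not isinstance(num_workers, int)
    ):
        errors.append("config.numWorkers must be an integer")

    if "useGpu" in config and not isinstance(config["useGpu"], bool):
        errors.append("config.useGpu must be a boolean")

    framework = payload.get("trainConfig", {}).get("framework")
    if framework is not None and framework not in VALID_FRAMEWORKS:
        errors.append(f"trainConfig.framework must be one of {sorted(VALID_FRAMEWORKS)}")

    # tuneConfig.mode is the optimization direction, distinct from the top-level mode.
    objective = payload.get("tuneConfig", {}).get("mode")
    if objective is not None and objective not in VALID_OBJECTIVES:
        errors.append("tuneConfig.mode must be 'min' or 'max'")

    return errors
```

Note that the schema uses the name `mode` twice: the top-level `mode` selects the Ray operation, while `tuneConfig.mode` sets the optimization direction for the tuning metric.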

## Output Schema

```json
{
  "type": "object",
  "required": ["status", "results"],
  "properties": {
    "status": {
      "type": "string",
      "enum": ["success", "error", "partial"]
    },
    "results": {
      "type": "object",
      "properties": {
        "bestConfig": { "type": "object" },
        "bestMetric": { "type": "number" },
        "numTrials": { "type": "integer" },
        "completedTrials": { "type": "integer" }
      }
    },
    "checkpointPath": {
      "type": "string"
    },
    "clusterStatus": {
      "type": "object",
      "properties": {
        "numNodes": { "type": "integer" },
        "totalCpu": { "type": "number" },
        "totalGpu": { "type": "number" }
      }
    },
    "trainingTime": {
      "type": "number"
    }
  }
}
```
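
A consumer of this output might want a one-line summary for logs. The sketch below reads only fields named in the output schema. The helper name `summarize_result` and the summary format are assumptions made for illustration, and the source does not specify how `status` relates to the trial counts.

```python
def summarize_result(output: dict) -> str:
    """Build a short log line from an output payload (hypothetical helper)."""
    results = output.get("results", {})
    parts = [f"status={output.get('status', 'unknown')}"]

    completed = results.get("completedTrials")
    total = results.get("numTrials")
    # Only report progress when both counts are present and total is non-zero.
    if completed is not None and total:
        parts.append(f"trials={completed}/{total}")

    if "bestMetric" in results:
        parts.append(f"bestMetric={results['bestMetric']}")

    if "checkpointPath" in output:
        parts.append(f"checkpoint={output['checkpointPath']}")

    return " ".join(parts)
```

Because only `status` and `results` are required, the helper treats every other field as optional and omits it from the summary when absent.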

## Usage Example

```javascript
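// Skill task definition. The variable name `tuneTask` is illustrative, not part of the skill.
const tuneTask = {
  kind: 'skill',
  title: 'Distributed hyperparameter tuning',
  skill: {
    name: 'ray-distributed-trainer',
    context: {
      // Top-level mode selects the Ray operation: 'train', 'tune', or 'cluster'.
      mode: 'tune',
      config: {
        numWorkers: 4,
        useGpu: true,
        resourcesPerWorker: { cpu: 2, gpu: 1 }
      },
      tuneConfig: {
        searchSpace: {
          lr: { type: 'loguniform', min: 1e-5, max: 1e-1 },
          batchSize: { type: 'choice', values: [16, 32, 64] }
        },
        scheduler: 'asha',
        numSamples: 100,
        metric: 'val_loss',
        // tuneConfig.mode is the optimization direction for the metric.
        mode: 'min'
      }
    }
  }
};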
```
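
The source does not show how the skill turns this context into Ray calls. The sketch below is one plausible mapping, not the skill's actual implementation. It assumes `config` maps to keyword arguments for `ray.train.ScalingConfig` (available under that import path in recent Ray 2.x releases) and that `searchSpace` entries map to `tune.loguniform` and `tune.choice`. The helper names `scaling_kwargs`, `build_search_space`, and `ray_samplers` are hypothetical. Ray is imported lazily so the pure-Python helpers can run without it installed.

```python
def scaling_kwargs(config: dict) -> dict:
    """Translate the camelCase `config` into ScalingConfig-style keyword arguments."""
    kwargs = {}
    if "numWorkers" in config:
        kwargs["num_workers"] = config["numWorkers"]
    if "useGpu" in config:
        kwargs["use_gpu"] = config["useGpu"]
    resources = config.get("resourcesPerWorker")
    if resources:
        # Ray names its built-in resources "CPU" and "GPU".
        kwargs["resources_per_worker"] = {
            name.upper(): amount for name, amount in resources.items()
        }
    return kwargs


def build_search_space(spec: dict, samplers: dict) -> dict:
    """Convert {type, min, max} / {type, values} entries into sampler objects.

    `samplers` maps a type name to a callable, so the same code works with
    Ray Tune's functions or with stand-ins during testing.
    """
    space = {}
    for name, param in spec.items():
        kind = param["type"]
        if kind == "loguniform":
            space[name] = samplers["loguniform"](param["min"], param["max"])
        elif kind == "choice":
            space[name] = samplers["choice"](param["values"])
        else:
            raise ValueError(f"unsupported search space type: {kind}")
    return space


def ray_samplers() -> dict:
    """Return the real Ray Tune samplers (requires Ray to be installed)."""
    from ray import tune

    return {"loguniform": tune.loguniform, "choice": tune.choice}
```

With Ray installed, `build_search_space(tune_config["searchSpace"], ray_samplers())` would yield a dictionary usable as a Tune `param_space`, and `ScalingConfig(**scaling_kwargs(config))` would build the scaling configuration. Only the two sampler types that appear in the usage example are handled; any other type raises `ValueError` rather than guessing.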
