ray-distributed-trainer
Distributed computing skill using Ray for parallel training, hyperparameter search, and resource management.
Best use case
ray-distributed-trainer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using ray-distributed-trainer should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/ray-distributed-trainer/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How ray-distributed-trainer Compares
| Feature / Agent | ray-distributed-trainer | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Distributed computing skill using Ray for parallel training, hyperparameter search, and resource management.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# ray-distributed-trainer
## Overview
Distributed computing skill using Ray for parallel training, hyperparameter search, and resource management across clusters.
## Capabilities
- Ray Train for distributed training
- Ray Tune for hyperparameter search at scale
- Cluster resource management
- Fault tolerance and checkpointing
- Actor-based parallelism
- Integration with PyTorch and TensorFlow
- Elastic training support
- Multi-node orchestration
## Target Processes
- Distributed Training Orchestration
- AutoML Pipeline Orchestration
- Model Training Pipeline
## Tools and Libraries
- Ray
- Ray Train
- Ray Tune
- Ray Cluster
## Input Schema
```json
{
  "type": "object",
  "required": ["mode", "config"],
  "properties": {
    "mode": {
      "type": "string",
      "enum": ["train", "tune", "cluster"],
      "description": "Ray operation mode"
    },
    "config": {
      "type": "object",
      "properties": {
        "numWorkers": { "type": "integer" },
        "useGpu": { "type": "boolean" },
        "resourcesPerWorker": {
          "type": "object",
          "properties": {
            "cpu": { "type": "number" },
            "gpu": { "type": "number" }
          }
        }
      }
    },
    "trainConfig": {
      "type": "object",
      "properties": {
        "trainerPath": { "type": "string" },
        "framework": { "type": "string", "enum": ["pytorch", "tensorflow", "xgboost"] },
        "scalingConfig": { "type": "object" }
      }
    },
    "tuneConfig": {
      "type": "object",
      "properties": {
        "searchSpace": { "type": "object" },
        "scheduler": { "type": "string" },
        "numSamples": { "type": "integer" },
        "metric": { "type": "string" },
        "mode": { "type": "string", "enum": ["min", "max"] }
      }
    }
  }
}
```
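A caller may want to check a request against the input schema before dispatching it. The following is a minimal sketch and not part of the skill itself: `validateInput` is a hypothetical helper that hand-checks only the two `required` fields, the integer type of `numWorkers`, and the `mode` and `framework` enums. A full JSON Schema validator would cover the remaining constraints.

```javascript
// Hypothetical helper (not part of the skill): checks a request against
// a subset of the input schema's constraints.
const MODES = ['train', 'tune', 'cluster'];
const FRAMEWORKS = ['pytorch', 'tensorflow', 'xgboost'];

function validateInput(input) {
  const errors = [];
  if (input === null || typeof input !== 'object') {
    return ['input must be an object'];
  }
  // "mode" and "config" are the two required top-level properties.
  if (!MODES.includes(input.mode)) {
    errors.push(`mode must be one of: ${MODES.join(', ')}`);
  }
  if (input.config === null || typeof input.config !== 'object') {
    errors.push('config must be an object');
  } else if (
    input.config.numWorkers !== undefined &&
    !Number.isInteger(input.config.numWorkers)
  ) {
    errors.push('config.numWorkers must be an integer');
  }
  // trainConfig.framework is optional but restricted to an enum.
  const framework = input.trainConfig && input.trainConfig.framework;
  if (framework !== undefined && !FRAMEWORKS.includes(framework)) {
    errors.push(`trainConfig.framework must be one of: ${FRAMEWORKS.join(', ')}`);
  }
  return errors;
}

console.log(validateInput({ mode: 'tune', config: { numWorkers: 4 } })); // prints []
console.log(validateInput({ mode: 'serve' })); // two errors: bad mode, missing config
```

Returning a list of messages rather than throwing on the first problem lets the caller report every issue in one pass.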
## Output Schema
```json
{
  "type": "object",
  "required": ["status", "results"],
  "properties": {
    "status": {
      "type": "string",
      "enum": ["success", "error", "partial"]
    },
    "results": {
      "type": "object",
      "properties": {
        "bestConfig": { "type": "object" },
        "bestMetric": { "type": "number" },
        "numTrials": { "type": "integer" },
        "completedTrials": { "type": "integer" }
      }
    },
    "checkpointPath": {
      "type": "string"
    },
    "clusterStatus": {
      "type": "object",
      "properties": {
        "numNodes": { "type": "integer" },
        "totalCpu": { "type": "number" },
        "totalGpu": { "type": "number" }
      }
    },
    "trainingTime": {
      "type": "number"
    }
  }
}
```
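To show how a consumer might read a result shaped like the output schema, here is a small sketch. `summarizeResult` is a hypothetical helper, and the sample values are invented for illustration only. The schema does not state a unit for `trainingTime`, so the sketch prints the number as-is.

```javascript
// Hypothetical helper (not part of the skill): condenses a result object
// shaped like the output schema into a one-line summary. Every field
// other than "status" and "results" is treated as optional.
function summarizeResult(output) {
  const { status, results = {}, checkpointPath, trainingTime } = output;
  const parts = [`status=${status}`];

  const { numTrials, completedTrials, bestMetric } = results;
  if (Number.isInteger(numTrials) && numTrials > 0 && Number.isInteger(completedTrials)) {
    // Share of trials that finished, useful when status is "partial".
    const pct = Math.round((completedTrials / numTrials) * 100);
    parts.push(`trials=${completedTrials}/${numTrials} (${pct}%)`);
  }
  if (typeof bestMetric === 'number') parts.push(`bestMetric=${bestMetric}`);
  if (typeof trainingTime === 'number') parts.push(`trainingTime=${trainingTime}`);
  if (checkpointPath) parts.push(`checkpoint=${checkpointPath}`);
  return parts.join(' ');
}

// Illustrative values only; not taken from a real run.
const example = {
  status: 'partial',
  results: { bestConfig: { lr: 0.001 }, bestMetric: 0.25, numTrials: 100, completedTrials: 80 },
  trainingTime: 120,
};
console.log(summarizeResult(example));
// prints: status=partial trials=80/100 (80%) bestMetric=0.25 trainingTime=120
```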
## Usage Example
```javascript
// Assigned to a variable (the name is illustrative) so the snippet parses as JavaScript.
const skillTask = {
  kind: 'skill',
  title: 'Distributed hyperparameter tuning',
  skill: {
    name: 'ray-distributed-trainer',
    context: {
      mode: 'tune',
      config: {
        numWorkers: 4,
        useGpu: true,
        resourcesPerWorker: { cpu: 2, gpu: 1 }
      },
      tuneConfig: {
        searchSpace: {
          lr: { type: 'loguniform', min: 1e-5, max: 1e-1 },
          batchSize: { type: 'choice', values: [16, 32, 64] }
        },
        scheduler: 'asha',
        numSamples: 100,
        metric: 'val_loss',
        mode: 'min'
      }
    }
  }
};
```
Related Skills
distributed-caching
Expert skill for distributed cache design, implementation, and optimization using Redis and Memcached. Design cache architectures, configure eviction policies, implement caching patterns (cache-aside, write-through, write-behind), monitor cache performance, and optimize memory usage.
vqc-trainer
Variational quantum classifier training skill with gradient optimization
calibration-trainer
Probability calibration training skill for improving forecast accuracy and reducing overconfidence
tensorflow-trainer
TensorFlow/Keras model training skill with callbacks, distributed strategies, and TensorBoard integration.
sklearn-model-trainer
Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.
pytorch-trainer
PyTorch model training skill with custom training loops, gradient management, and GPU optimization.
process-builder
Scaffold new babysitter process definitions following SDK patterns, proper structure, and best practices. Guides the 3-phase workflow from research to implementation.
babysitter
Orchestrate via @babysitter. Use this skill when asked to babysit a run, orchestrate a process or whenever it is called explicitly. (babysit, babysitter, orchestrate, orchestrate a run, workflow, etc.)
yolo
Run Babysitter autonomously with minimal manual interruption.
user-install
Install the user-level Babysitter Codex setup.
team-install
Install the team-pinned Babysitter Codex workspace setup.
retrospect
Summarize or retrospect on a completed Babysitter run.