miles: Enterprise-Grade RL for Large-Scale Model Training

miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.

by ComeOnOliver

Best use case

miles: Enterprise-Grade RL for Large-Scale Model Training is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using miles should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/miles/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/Orchestra-Research/AI-Research-SKILLs/miles/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/miles/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

SKILL.md Source

# miles: Enterprise-Grade RL for Large-Scale Model Training

miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.

## When to Use miles

**Choose miles when you need:**
- Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
- FP8 or INT4 quantization-aware training
- Bit-wise identical train-inference alignment
- Speculative RL for maximum throughput
- Production stability with enterprise support

**Consider alternatives when:**
- You want the research-grade original → use **slime**
- You need flexible backend swapping → use **verl**
- You want PyTorch-native abstractions → use **torchforge**

## Key Features

### Low-Precision Training
- **Unified FP8**: End-to-end FP8 for both inference and training
- **INT4 QAT**: 1TB models on single-machine VRAM (H200)
- **Rollout Routing Replay (R3)**: Bit-wise expert alignment for MoE

### Performance Optimizations
- **Speculative RL**: 25%+ rollout speedup with online SFT draft models
- **Zero-Copy Weight Sync**: CUDA IPC zero-copy mapping
- **Partial Rollout**: Recycle half-finished trajectories

### Train-Inference Alignment
- **TIS/MIS**: Truncated/Masked Importance Sampling for off-policy correction
- **Kernel-level optimization**: FlashAttention-3, DeepGEMM integration
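
The TIS/MIS bullet above can be illustrated with a small sketch. It assumes the common formulation in which the per-token importance ratio is `exp(train_log_prob - rollout_log_prob)`; the function names and the single upper `cap` are illustrative choices, not the miles API, and the exact thresholds miles applies may differ.

```python
import math


def importance_ratios(train_log_probs, rollout_log_probs):
    """Per-token ratio pi_train / pi_rollout computed from log-probabilities."""
    return [math.exp(t - r) for t, r in zip(train_log_probs, rollout_log_probs)]


def truncated_is(ratios, cap):
    """TIS (assumed form): clip each ratio at `cap` so no single token dominates."""
    return [min(w, cap) for w in ratios]


def masked_is(ratios, cap):
    """MIS (assumed form): zero the weight of tokens whose ratio exceeds `cap`."""
    return [w if w <= cap else 0.0 for w in ratios]


ratios = importance_ratios([-1.0, -0.5, -2.0], [-1.0, -1.5, -1.0])
tis_weights = truncated_is(ratios, cap=2.0)  # second token is clipped to 2.0
mis_weights = masked_is(ratios, cap=2.0)     # second token is dropped entirely
```

Truncation keeps every token in the loss with a bounded weight, while masking removes the most off-policy tokens altogether.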

## Installation

```bash
# Recommended: Docker
docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it radixark/miles:latest /bin/bash

# From source
git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .
```

## Quick Start

miles inherits slime's configuration system. Basic training:

```bash
python train.py \
    --advantage-estimator grpo \
    --model-name qwen3-30b-a3b \
    --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
    --rollout-batch-size 512 \
    --n-samples-per-prompt 8
```
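
To make the `--advantage-estimator grpo` and `--n-samples-per-prompt` flags more concrete, here is a minimal sketch of group-relative advantage normalization as it is usually described for GRPO. The function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration; the implementation inside miles may differ in detail.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, n_samples_per_prompt, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    if len(rewards) % n_samples_per_prompt != 0:
        raise ValueError("rewards must contain whole groups")
    advantages = []
    for start in range(0, len(rewards), n_samples_per_prompt):
        group = rewards[start:start + n_samples_per_prompt]
        mu = mean(group)
        sigma = pstdev(group)  # population std over the group (an assumption)
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two prompts with four samples each (the Quick Start uses eight per prompt).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
adv = grpo_advantages(rewards, n_samples_per_prompt=4)
```

In the second group every sample receives the same reward, so all advantages are zero and that prompt contributes no gradient signal.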

---

## Workflow 1: Large MoE Training

Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.

### Prerequisites Checklist
- [ ] H100/H200 GPUs with FP8 support
- [ ] MoE model (DeepSeek V3, Qwen3-MoE)
- [ ] Docker environment with miles

### Step 1: Environment Setup

```bash
# FP8 block scaling (recommended for stability)
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
```

### Step 2: Configure Training

```bash
python train.py \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --hf-checkpoint /path/to/deepseek-v3 \
    --advantage-estimator grpo \
    --tensor-model-parallel-size 8 \
    --expert-model-parallel-size 4 \
    --prompt-data /path/to/data.jsonl \
    --num-rollout 3000
```

### Verification Checklist
- [ ] Model loads without errors
- [ ] Routing decisions are consistent
- [ ] No NaN/Inf in loss values

---

## Workflow 2: Speculative RL Training

Use this workflow for maximum rollout throughput with EAGLE speculative decoding.

### How Speculative RL Works

1. Small draft model generates candidate tokens
2. Target model verifies in parallel
3. Draft model updated via online SFT to track policy
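
The first two steps can be sketched with a toy greedy draft-and-verify loop. This is a simplification, not EAGLE itself: EAGLE drafts at the feature level and SGLang verifies in a single batched pass, whereas the hypothetical `draft_next` and `target_next` callables below simply return the next token id for a context.

```python
def speculative_step(draft_next, target_next, context, num_draft_tokens):
    """One greedy draft-and-verify step; returns the tokens accepted this step."""
    # 1. The draft model proposes a short continuation autoregressively.
    proposed = []
    for _ in range(num_draft_tokens):
        proposed.append(draft_next(context + proposed))

    # 2. The target model checks each position (batched in a real engine).
    accepted = []
    for token in proposed:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
    # Every draft token matched: the target contributes one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


def target_next(ctx):
    """Toy target policy: count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def drifted_draft_next(ctx):
    """Toy draft that has drifted from the target after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


aligned = speculative_step(target_next, target_next, [1], num_draft_tokens=4)
drifted = speculative_step(drifted_draft_next, target_next, [1], num_draft_tokens=4)
```

An aligned draft yields five tokens per target pass here, while the drifted draft yields three. That gap is why step 3 keeps the draft model tracking the policy.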

### Step 1: Enable Speculative Decoding

miles supports EAGLE speculative decoding via SGLang:

```bash
python train.py \
    --actor-num-gpus-per-node 8 \
    --hf-checkpoint /path/to/target-model \
    --sglang-speculative-algorithm EAGLE \
    --sglang-speculative-num-steps 3 \
    --sglang-speculative-eagle-topk 1 \
    --sglang-speculative-num-draft-tokens 4 \
    --sglang-speculative-draft-model-path /path/to/draft-model \
    --advantage-estimator grpo \
    --prompt-data /path/to/data.jsonl
```

### Step 2: Enable Online MTP Training (Optional)

To train the draft model with online SFT during RL training, add:

```bash
# Append these flags to the train.py command from Step 1
--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2
```

**Note**: Online MTP training requires a torch dist checkpoint with MTP weights. Add `--mtp-num-layers 1` during checkpoint conversion from HuggingFace.

### Expected Speedup

- **Standard rollout**: Baseline
- **Speculative RL**: 25-40% faster rollout
- **With partial rollout**: Additional 10-15% throughput

---

## Configuration Reference

miles inherits all slime arguments. See [slime API Reference](../slime/references/api-reference.md) for the complete list.

### Cluster Resources (from slime)

```bash
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate
```

### Megatron Parallelism (from slime)

```bash
--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4    # MoE expert parallelism
```
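
As a sanity check on these sizes, the sketch below computes how many data-parallel replicas a given GPU count leaves once tensor, pipeline, and context parallelism are fixed. It relies on the standard Megatron relation `world_size = TP x PP x CP x DP`; the helper name is hypothetical, and expert parallelism is left out because how it composes with the other dimensions depends on the Megatron version in use.

```python
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel, context_parallel=1):
    """Number of data-parallel replicas left after model parallelism is applied."""
    model_parallel = tensor_parallel * pipeline_parallel * context_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel


# The sizes above (TP=8, PP=2) need at least 16 training GPUs per replica.
replicas_on_32_gpus = data_parallel_size(32, tensor_parallel=8, pipeline_parallel=2)
```

With these settings a single 8-GPU node is too small, so the helper raises instead of returning a fractional replica count.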

### Speculative Decoding (miles-specific)

```bash
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path
```

### Online MTP Training (miles-specific)

```bash
--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2
```

---

## Key Features (Conceptual)

The following features are documented in miles, but the specific CLI flags may vary between versions. Consult the miles repository for the latest configuration.

### Unified FP8 Pipeline

End-to-end FP8 for both sampling and training. Running both sides at the same precision removes the quantization-induced train-inference discrepancy that can cause RL collapse in MoE models.

### Rollout Routing Replay (R3)

Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.

**How R3 Works**:
1. During SGLang inference, expert routing decisions are recorded
2. Routing decisions stored in `sample.rollout_routed_experts`
3. During Megatron training, routing is replayed instead of recomputed
4. Ensures identical expert selection between train and inference
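
The record-and-replay idea can be sketched with a toy top-k router. The function names and the per-token `list[list[int]]` layout are assumptions modeled on the `rollout_routed_experts` field; the real implementation operates on tensors inside SGLang and Megatron.

```python
def top_k_experts(router_logits, k):
    """Indices of the k largest logits, as a router would select them."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    return sorted(order[:k])


def route(router_logits_per_token, k, replay=None):
    """Pick experts per token; when `replay` is given, reuse it instead of recomputing."""
    if replay is not None:
        return [list(experts) for experts in replay]
    return [top_k_experts(logits, k) for logits in router_logits_per_token]


# Inference and training see slightly different logits for the same token.
inference_logits = [[0.10, 0.90, 0.80, 0.20]]
training_logits = [[0.10, 0.79, 0.80, 0.81]]

recorded = route(inference_logits, k=2)                    # stored during rollout
recomputed = route(training_logits, k=2)                   # without R3: may differ
replayed = route(training_logits, k=2, replay=recorded)    # with R3: identical
```

Small numerical differences are enough to flip a top-k decision, which is why replaying the recorded indices is more reliable than recomputing them.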

### INT4 Quantization-Aware Training

Enables single-machine deployment of 1TB+ models (e.g., on H200).

**Memory Savings with INT4**:

| Model Size | BF16 VRAM | INT4 VRAM | Reduction |
|------------|-----------|-----------|-----------|
| 70B | 140GB | 45GB | 3.1x |
| 235B | 470GB | 150GB | 3.1x |
| 671B | 1.3TB | 420GB | 3.1x |
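
The BF16 column follows directly from two bytes per parameter, which the sketch below reproduces. The helper is illustrative and counts raw weight storage only. Raw INT4 storage would be 35 GB for a 70B model, so the table's 45 GB presumably includes overhead such as quantization scales or layers kept at higher precision; the source does not break this down.

```python
def weight_memory_gb(num_params_billion, bits_per_param):
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring scales and activations."""
    return num_params_billion * bits_per_param / 8


bf16_70b = weight_memory_gb(70, 16)    # 140.0, matches the table
bf16_235b = weight_memory_gb(235, 16)  # 470.0, matches the table
bf16_671b = weight_memory_gb(671, 16)  # 1342.0, about 1.3 TB
int4_70b_raw = weight_memory_gb(70, 4)  # 35.0, below the table's 45 GB
```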

### Train-Inference Alignment

miles achieves "exactly 0 KL divergence" between training and inference through:
- Flash Attention 3
- DeepGEMM
- Batch-invariant kernels from Thinking Machines Lab
- `torch.compile` integration
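
One way to check this claim on your own runs is to compare the per-token log-probabilities recorded at rollout time with those recomputed by the trainer. The sketch below is an assumption about how such a check could look, not a miles utility; it uses the mean log-ratio over sampled tokens as a simple estimate of the KL divergence from the rollout policy to the training policy.

```python
def logprob_mismatch(rollout_log_probs, train_log_probs):
    """Return (max absolute difference, sampled KL estimate) over response tokens."""
    if not rollout_log_probs or len(rollout_log_probs) != len(train_log_probs):
        raise ValueError("expected two non-empty sequences of equal length")
    diffs = [r - t for r, t in zip(rollout_log_probs, train_log_probs)]
    max_abs = max(abs(d) for d in diffs)
    kl_estimate = sum(diffs) / len(diffs)  # mean of log pi_rollout - log pi_train
    return max_abs, kl_estimate


aligned = logprob_mismatch([-1.0, -2.0], [-1.0, -2.0])
mismatched = logprob_mismatch([-1.0, -2.0], [-1.5, -2.0])
```

With bit-wise aligned kernels both numbers should be exactly zero; any nonzero value points to a remaining numerical difference between the two engines.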

---

## Sample Data Structure

miles uses the same `Sample` dataclass as slime with the `rollout_routed_experts` field for MoE routing replay:

```python
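from __future__ import annotations  # keep annotations lazy so `Status` need not be imported here

from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str | list[dict]      # raw text or chat-style message list
    tokens: list[int]             # token ids
    response: str                 # generated text
    reward: float | dict          # scalar reward or a dict of reward components
    loss_mask: list[int]          # per-token mask for the loss
    status: Status                # slime's sample-status enum (see the API reference)
    metadata: dict
    rollout_log_probs: list[float]           # log-probs recorded at rollout time
    rollout_routed_experts: list[list[int]]  # MoE routing for R3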

See [slime API Reference](../slime/references/api-reference.md) for the complete Sample definition.

---

## Common Issues and Solutions

### Issue: FP8 Training Collapse

**Symptoms**: Loss explodes, NaN values

**Solutions**:
- Use block scaling: `export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1`
- Reduce learning rate: `--lr 5e-7`
- Ensure MoE routing is consistent between train/inference

### Issue: Speculative Draft Drift

**Symptoms**: Draft-token acceptance rate declines over time

**Solutions**:
- Enable online MTP training to keep draft model aligned
- Reduce speculative steps: `--sglang-speculative-num-steps 2`
- Use CPU backup: `--sglang-enable-draft-weights-cpu-backup`

### Issue: Train-Inference Mismatch

**Symptoms**: Policy divergence, reward collapse

**Solutions**:
- Use TIS for off-policy correction: `--use-tis --tis-threshold 0.9`
- Verify log probs match between SGLang and Megatron
- Enable R3 for MoE models

---

## Supported Models

| Family | Models | MoE Support |
|--------|--------|-------------|
| DeepSeek | R1, V3, V3.2 | Full |
| Qwen | 2, 2.5, 3 (including MoE) | Full |
| Llama | 3, 3.1, 3.3, 4 | Dense only |
| Gemma | 2, 3, 3N | Dense only |
| GLM | 4.5, 4.6, 4.7 | Dense only |
| MiniMax | M2, M2.1 | Full |

---

## Resources

- **GitHub**: https://github.com/radixark/miles
- **Introduction Blog**: https://lmsys.org/blog/2025-11-19-miles/
- **Slime (upstream)**: https://github.com/THUDM/slime
- **SGLang**: https://github.com/sgl-project/sglang
