TorchTitan - PyTorch Native Distributed LLM Pretraining
Best use case
The TorchTitan skill is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using it should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/torchtitan/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
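For example, a minimal shell sketch of the manual installation (the local download path is a placeholder; use the SKILL.md link from the top of this page):

```bash
# Manual install sketch -- adjust the source path to wherever you downloaded SKILL.md.
mkdir -p .claude/skills/torchtitan
cp ~/Downloads/SKILL.md .claude/skills/torchtitan/SKILL.md
# Restart your AI agent so it re-scans .claude/skills/ and picks up the new skill.
```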
How TorchTitan - PyTorch Native Distributed LLM Pretraining Compares
| Feature | TorchTitan Skill | Standard Approach |
|---|---|---|
| Platform Support | Claude Code, Cursor, Codex (via SKILL.md) | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Low (single SKILL.md file) | N/A |
Frequently Asked Questions
What does this skill do?
It gives your AI agent a packaged workflow for TorchTitan, PyTorch's official platform for large-scale distributed LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), covering installation, single-node and multi-node training, Float8, and troubleshooting.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# TorchTitan - PyTorch Native Distributed LLM Pretraining

## Quick start

TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.

**Installation**:

```bash
# From PyPI (stable)
pip install torchtitan

# From source (latest features, requires PyTorch nightly)
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
```

**Download tokenizer**:

```bash
# Get HF token from https://huggingface.co/settings/tokens
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
```

**Start training on 8 GPUs**:

```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
```

## Common workflows

### Workflow 1: Pretrain Llama 3.1 8B on single node

Copy this checklist:

```
Single Node Pretraining:
- [ ] Step 1: Download tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint
```

**Step 1: Download tokenizer**

```bash
python scripts/download_hf_assets.py \
  --repo_id meta-llama/Llama-3.1-8B \
  --assets tokenizer \
  --hf_token=YOUR_HF_TOKEN
```

**Step 2: Configure training**

Edit or create a TOML config file:

```toml
# llama3_8b_custom.toml
[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"

[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"

[optimizer]
name = "AdamW"
lr = 3e-4

[lr_scheduler]
warmup_steps = 200

[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0
steps = 1000
dataset = "c4"

[parallelism]
data_parallel_shard_degree = -1  # Use all GPUs for FSDP

[activation_checkpoint]
mode = "selective"
selective_ac_option = "op"

[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
```

**Step 3: Launch training**

```bash
# 8 GPUs on single node
CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh

# Or explicitly with torchrun
torchrun --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_8b_custom.toml
```

**Step 4: Monitor and checkpoint**

TensorBoard logs are saved to `./outputs/tb/`:

```bash
tensorboard --logdir ./outputs/tb
```

### Workflow 2: Multi-node training with SLURM

```
Multi-Node Training:
- [ ] Step 1: Configure parallelism for scale
- [ ] Step 2: Set up SLURM script
- [ ] Step 3: Submit job
- [ ] Step 4: Resume from checkpoint
```

**Step 1: Configure parallelism for scale**

For 70B model on 256 GPUs (32 nodes):

```toml
[parallelism]
data_parallel_shard_degree = 32  # FSDP across 32 ranks
tensor_parallel_degree = 8       # TP within node
pipeline_parallel_degree = 1     # No PP for 70B
context_parallel_degree = 1      # Increase for long sequences
```

**Step 2: Set up SLURM script**

```bash
#!/bin/bash
#SBATCH --job-name=llama70b
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

srun torchrun \
  --nnodes=32 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  -m torchtitan.train \
  --job.config_file ./llama3_70b.toml
```

**Step 3: Submit job**

```bash
sbatch multinode_trainer.slurm
```

**Step 4: Resume from checkpoint**

Training auto-resumes if checkpoint exists in configured folder.
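Note that the SLURM script in Step 2 references `$MASTER_ADDR` and `$MASTER_PORT` without defining them. A minimal sketch of one common way to set them inside the job script, assuming a standard SLURM environment (the port number is arbitrary as long as it is free on the head node):

```bash
# Derive the rendezvous endpoint from the SLURM allocation (common pattern, not TorchTitan-specific).
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500  # any free TCP port on the head node
```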
### Workflow 3: Enable Float8 training for H100s

Float8 provides 30-50% speedup on H100 GPUs.

```
Float8 Training:
- [ ] Step 1: Install torchao
- [ ] Step 2: Configure Float8
- [ ] Step 3: Launch with compile
```

**Step 1: Install torchao**

```bash
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
```

**Step 2: Configure Float8**

Add to your TOML config:

```toml
[model]
converters = ["quantize.linear.float8"]

[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output"]  # Exclude output layer

[compile]
enable = true
components = ["model", "loss"]
```

**Step 3: Launch with compile**

```bash
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
  --model.converters="quantize.linear.float8" \
  --quantize.linear.float8.enable_fsdp_float8_all_gather \
  --compile.enable
```

### Workflow 4: 4D parallelism for 405B models

```
4D Parallelism (FSDP + TP + PP + CP):
- [ ] Step 1: Create seed checkpoint
- [ ] Step 2: Configure 4D parallelism
- [ ] Step 3: Launch on 512 GPUs
```

**Step 1: Create seed checkpoint**

Required for consistent initialization across PP stages:

```bash
NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
  --checkpoint.enable \
  --checkpoint.create_seed_checkpoint \
  --parallelism.data_parallel_shard_degree 1 \
  --parallelism.tensor_parallel_degree 1 \
  --parallelism.pipeline_parallel_degree 1
```

**Step 2: Configure 4D parallelism**

```toml
[parallelism]
data_parallel_shard_degree = 8   # FSDP
tensor_parallel_degree = 8       # TP within node
pipeline_parallel_degree = 8     # PP across nodes
context_parallel_degree = 1      # CP for long sequences

[training]
local_batch_size = 32
seq_len = 8192
```

**Step 3: Launch on 512 GPUs**

```bash
# 64 nodes x 8 GPUs = 512 GPUs
srun torchrun --nnodes=64 --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_405b.toml
```

## When to use vs alternatives

**Use TorchTitan when:**

- Pretraining LLMs from scratch (8B to 405B+)
- Need PyTorch-native solution without third-party dependencies
- Require composable 4D parallelism (FSDP2, TP, PP, CP)
- Training on H100s with Float8 support
- Want interoperable checkpoints with torchtune/HuggingFace

**Use alternatives instead:**

- **Megatron-LM**: Maximum performance for NVIDIA-only deployments
- **DeepSpeed**: Broader ZeRO optimization ecosystem, inference support
- **Axolotl/TRL**: Fine-tuning rather than pretraining
- **LitGPT**: Educational, smaller-scale training

## Common issues

**Issue: Out of memory on large models**

Enable activation checkpointing and reduce batch size:

```toml
[activation_checkpoint]
mode = "full"  # Instead of "selective"

[training]
local_batch_size = 1
```

Or use gradient accumulation:

```toml
[training]
local_batch_size = 1
global_batch_size = 32  # Accumulates gradients
```

**Issue: TP causes high memory with async collectives**

Set environment variable:

```bash
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
```

**Issue: Float8 training not faster**

Float8 only benefits large GEMMs. Filter small layers:

```toml
[quantize.linear.float8]
filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
```

**Issue: Checkpoint loading fails after parallelism change**

Use DCP's resharding capability:

```bash
# Convert sharded checkpoint to single file
python -m torch.distributed.checkpoint.format_utils \
  dcp_to_torch checkpoint/step-1000 checkpoint.pt
```

**Issue: Pipeline parallelism initialization**

Create seed checkpoint first (see Workflow 4, Step 1).
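Before launching any of the configurations above, it can also help to sanity-check that the parallelism degrees fit the job size and that the batch-size settings imply the gradient accumulation you expect. The sketch below is not part of TorchTitan; it assumes all degrees are set explicitly (no `-1` values), that the degrees should multiply to the GPU count, and that global batch size follows the usual convention of local batch size x data-parallel degree x accumulation steps.

```bash
#!/bin/bash
# Hypothetical pre-launch sanity check -- not a TorchTitan tool.
# Numbers below mirror the 70B / 256-GPU example from Workflow 2.
NGPU=256
DP_SHARD=32; TP=8; PP=1; CP=1
LOCAL_BS=1; GLOBAL_BS=32   # assumes GLOBAL_BS is divisible by LOCAL_BS * DP_SHARD

MESH=$((DP_SHARD * TP * PP * CP))
if [ "$MESH" -ne "$NGPU" ]; then
  echo "Parallelism degrees multiply to $MESH, but the job has $NGPU GPUs" >&2
  exit 1
fi

# Gradient accumulation steps = global batch / (local batch * data-parallel degree)
ACCUM=$((GLOBAL_BS / (LOCAL_BS * DP_SHARD)))
echo "OK: mesh covers $MESH GPUs, gradient accumulation steps = $ACCUM"
```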
## Supported models

| Model | Sizes | Status |
|-------|-------|--------|
| Llama 3.1 | 8B, 70B, 405B | Production |
| Llama 4 | Various | Experimental |
| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
| GPT-OSS | 20B, 120B (MoE) | Experimental |
| Qwen 3 | Various | Experimental |
| Flux | Diffusion | Experimental |

## Performance benchmarks (H100)

| Model | GPUs | Parallelism | TPS/GPU | Techniques |
|-------|------|-------------|---------|------------|
| Llama 8B | 8 | FSDP | 5,762 | Baseline |
| Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
| Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
| Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |

## Advanced topics

**FSDP2 configuration**: See [references/fsdp.md](references/fsdp.md) for detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.

**Float8 training**: See [references/float8.md](references/float8.md) for tensorwise vs rowwise scaling recipes.

**Checkpointing**: See [references/checkpoint.md](references/checkpoint.md) for HuggingFace conversion and async checkpointing.

**Adding custom models**: See [references/custom-models.md](references/custom-models.md) for TrainSpec protocol.

## Resources

- GitHub: https://github.com/pytorch/torchtitan
- Paper: https://arxiv.org/abs/2410.06511
- ICLR 2025: https://iclr.cc/virtual/2025/poster/29620
- PyTorch Forum: https://discuss.pytorch.org/c/distributed/torchtitan/44
Related Skills
setting-up-distributed-tracing
Automates the setup of distributed tracing for microservices. It helps developers implement end-to-end request visibility by configuring context propagation, span creation, trace collection, and analysis. Use this skill when the user re...
pytorch-model-trainer
PyTorch Model Trainer - Auto-activating skill for ML Training. Triggers on: "pytorch model trainer". Part of the ML Training skill category.
distributed-training-setup
Distributed Training Setup - Auto-activating skill for ML Training. Triggers on: "distributed training setup". Part of the ML Training skill category.
defold-native-extension-editing
Defold native extension development. Use when creating or editing C/C++ (.c, .cpp, .h, .hpp), JavaScript (.js), or manifest files in native extension directories (src/, include/, lib/, api/).
java-add-graalvm-native-image-support
GraalVM Native Image expert that adds native image support to Java applications, builds the project, analyzes build errors, applies fixes, and iterates until successful compilation using Oracle best practices.
upgrading-react-native
Upgrades React Native apps to newer versions by applying rn-diff-purge template diffs, updating package.json dependencies, migrating native iOS and Android configuration, resolving CocoaPods and Gradle changes, and handling breaking API updates. Use when upgrading React Native, bumping RN version, updating from RN 0.x to 0.y, or migrating Expo SDK alongside a React Native upgrade.
react-native-brownfield-migration
Provides an incremental adoption strategy to migrate native iOS or Android apps to React Native or Expo using @callstack/react-native-brownfield for initial setup. Use when planning migration steps, packaging XCFramework/AAR artifacts, and integrating them into host apps.
react-native-design
Master React Native styling, navigation, and Reanimated animations for cross-platform mobile development. Use when building React Native apps, implementing navigation patterns, or creating performant animations.
vercel-react-native-skills
React Native and Expo best practices for building performant mobile apps. Use when building React Native components, optimizing list performance, implementing animations, or working with native modules. Triggers on tasks involving React Native, Expo, mobile performance, or native platform APIs.
native-data-fetching
Use when implementing or debugging ANY network request, API call, or data fetching. Covers fetch API, axios, React Query, SWR, error handling, caching strategies, offline support.
building-native-ui
Complete guide for building beautiful apps with Expo Router. Covers fundamentals, styling, components, navigation, animations, patterns, and native tabs.
orchestration-native-invoke
Invoke external AI CLIs via native Task agents (Claude, Codex, Gemini, Cursor). Primary mode for multi-provider orchestration with fork-terminal fallback for auth.