distributed-llm-pretraining-torchtitan
Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.
Best use case
distributed-llm-pretraining-torchtitan is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.
Teams using distributed-llm-pretraining-torchtitan should expect a more consistent output, faster repeated execution, less prompt rewriting, better workflow continuity with your supporting tools.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
- You already have the supporting tools or dependencies needed by this skill.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/model-architecture-torchtitan/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How distributed-llm-pretraining-torchtitan Compares
| Feature / Agent | distributed-llm-pretraining-torchtitan | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
SKILL.md Source
# TorchTitan - PyTorch Native Distributed LLM Pretraining ## Quick start TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs. **Installation**: ```bash # From PyPI (stable) pip install torchtitan # From source (latest features, requires PyTorch nightly) git clone https://github.com/pytorch/torchtitan cd torchtitan pip install -r requirements.txt ``` **Download tokenizer**: ```bash # Get HF token from https://huggingface.co/settings/tokens python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=... ``` **Start training on 8 GPUs**: ```bash CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh ``` ## Common workflows ### Workflow 1: Pretrain Llama 3.1 8B on single node Copy this checklist: ``` Single Node Pretraining: - [ ] Step 1: Download tokenizer - [ ] Step 2: Configure training - [ ] Step 3: Launch training - [ ] Step 4: Monitor and checkpoint ``` **Step 1: Download tokenizer** ```bash python scripts/download_hf_assets.py \ --repo_id meta-llama/Llama-3.1-8B \ --assets tokenizer \ --hf_token=YOUR_HF_TOKEN ``` **Step 2: Configure training** Edit or create a TOML config file: ```toml # llama3_8b_custom.toml [job] dump_folder = "./outputs" description = "Llama 3.1 8B training" [model] name = "llama3" flavor = "8B" hf_assets_path = "./assets/hf/Llama-3.1-8B" [optimizer] name = "AdamW" lr = 3e-4 [lr_scheduler] warmup_steps = 200 [training] local_batch_size = 2 seq_len = 8192 max_norm = 1.0 steps = 1000 dataset = "c4" [parallelism] data_parallel_shard_degree = -1 # Use all GPUs for FSDP [activation_checkpoint] mode = "selective" selective_ac_option = "op" [checkpoint] enable = true folder = "checkpoint" interval = 500 ``` **Step 3: Launch training** ```bash # 8 GPUs on single node CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh # Or explicitly with torchrun torchrun --nproc_per_node=8 \ -m torchtitan.train \ --job.config_file ./llama3_8b_custom.toml ``` **Step 4: Monitor and checkpoint** TensorBoard logs are saved to `./outputs/tb/`: ```bash tensorboard --logdir ./outputs/tb ``` ### Workflow 2: Multi-node training with SLURM ``` Multi-Node Training: - [ ] Step 1: Configure parallelism for scale - [ ] Step 2: Set up SLURM script - [ ] Step 3: Submit job - [ ] Step 4: Resume from checkpoint ``` **Step 1: Configure parallelism for scale** For 70B model on 256 GPUs (32 nodes): ```toml [parallelism] data_parallel_shard_degree = 32 # FSDP across 32 ranks tensor_parallel_degree = 8 # TP within node pipeline_parallel_degree = 1 # No PP for 70B context_parallel_degree = 1 # Increase for long sequences ``` **Step 2: Set up SLURM script** ```bash #!/bin/bash #SBATCH --job-name=llama70b #SBATCH --nodes=32 #SBATCH --ntasks-per-node=8 #SBATCH --gpus-per-node=8 srun torchrun \ --nnodes=32 \ --nproc_per_node=8 \ --rdzv_backend=c10d \ --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \ -m torchtitan.train \ --job.config_file ./llama3_70b.toml ``` **Step 3: Submit job** ```bash sbatch multinode_trainer.slurm ``` **Step 4: Resume from checkpoint** Training auto-resumes if checkpoint exists in configured folder. ### Workflow 3: Enable Float8 training for H100s Float8 provides 30-50% speedup on H100 GPUs. ``` Float8 Training: - [ ] Step 1: Install torchao - [ ] Step 2: Configure Float8 - [ ] Step 3: Launch with compile ``` **Step 1: Install torchao** ```bash USE_CPP=0 pip install git+https://github.com/pytorch/ao.git ``` **Step 2: Configure Float8** Add to your TOML config: ```toml [model] converters = ["quantize.linear.float8"] [quantize.linear.float8] enable_fsdp_float8_all_gather = true precompute_float8_dynamic_scale_for_fsdp = true filter_fqns = ["output"] # Exclude output layer [compile] enable = true components = ["model", "loss"] ``` **Step 3: Launch with compile** ```bash CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \ --model.converters="quantize.linear.float8" \ --quantize.linear.float8.enable_fsdp_float8_all_gather \ --compile.enable ``` ### Workflow 4: 4D parallelism for 405B models ``` 4D Parallelism (FSDP + TP + PP + CP): - [ ] Step 1: Create seed checkpoint - [ ] Step 2: Configure 4D parallelism - [ ] Step 3: Launch on 512 GPUs ``` **Step 1: Create seed checkpoint** Required for consistent initialization across PP stages: ```bash NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \ --checkpoint.enable \ --checkpoint.create_seed_checkpoint \ --parallelism.data_parallel_shard_degree 1 \ --parallelism.tensor_parallel_degree 1 \ --parallelism.pipeline_parallel_degree 1 ``` **Step 2: Configure 4D parallelism** ```toml [parallelism] data_parallel_shard_degree = 8 # FSDP tensor_parallel_degree = 8 # TP within node pipeline_parallel_degree = 8 # PP across nodes context_parallel_degree = 1 # CP for long sequences [training] local_batch_size = 32 seq_len = 8192 ``` **Step 3: Launch on 512 GPUs** ```bash # 64 nodes x 8 GPUs = 512 GPUs srun torchrun --nnodes=64 --nproc_per_node=8 \ -m torchtitan.train \ --job.config_file ./llama3_405b.toml ``` ## When to use vs alternatives **Use TorchTitan when:** - Pretraining LLMs from scratch (8B to 405B+) - Need PyTorch-native solution without third-party dependencies - Require composable 4D parallelism (FSDP2, TP, PP, CP) - Training on H100s with Float8 support - Want interoperable checkpoints with torchtune/HuggingFace **Use alternatives instead:** - **Megatron-LM**: Maximum performance for NVIDIA-only deployments - **DeepSpeed**: Broader ZeRO optimization ecosystem, inference support - **Axolotl/TRL**: Fine-tuning rather than pretraining - **LitGPT**: Educational, smaller-scale training ## Common issues **Issue: Out of memory on large models** Enable activation checkpointing and reduce batch size: ```toml [activation_checkpoint] mode = "full" # Instead of "selective" [training] local_batch_size = 1 ``` Or use gradient accumulation: ```toml [training] local_batch_size = 1 global_batch_size = 32 # Accumulates gradients ``` **Issue: TP causes high memory with async collectives** Set environment variable: ```bash export TORCH_NCCL_AVOID_RECORD_STREAMS=1 ``` **Issue: Float8 training not faster** Float8 only benefits large GEMMs. Filter small layers: ```toml [quantize.linear.float8] filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"] ``` **Issue: Checkpoint loading fails after parallelism change** Use DCP's resharding capability: ```bash # Convert sharded checkpoint to single file python -m torch.distributed.checkpoint.format_utils \ dcp_to_torch checkpoint/step-1000 checkpoint.pt ``` **Issue: Pipeline parallelism initialization** Create seed checkpoint first (see Workflow 4, Step 1). ## Supported models | Model | Sizes | Status | |-------|-------|--------| | Llama 3.1 | 8B, 70B, 405B | Production | | Llama 4 | Various | Experimental | | DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental | | GPT-OSS | 20B, 120B (MoE) | Experimental | | Qwen 3 | Various | Experimental | | Flux | Diffusion | Experimental | ## Performance benchmarks (H100) | Model | GPUs | Parallelism | TPS/GPU | Techniques | |-------|------|-------------|---------|------------| | Llama 8B | 8 | FSDP | 5,762 | Baseline | | Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% | | Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel | | Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel | ## Advanced topics **FSDP2 configuration**: See [references/fsdp.md](references/fsdp.md) for detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents. **Float8 training**: See [references/float8.md](references/float8.md) for tensorwise vs rowwise scaling recipes. **Checkpointing**: See [references/checkpoint.md](references/checkpoint.md) for HuggingFace conversion and async checkpointing. **Adding custom models**: See [references/custom-models.md](references/custom-models.md) for TrainSpec protocol. ## Resources - GitHub: https://github.com/pytorch/torchtitan - Paper: https://arxiv.org/abs/2410.06511 - ICLR 2025: https://iclr.cc/virtual/2025/poster/29620 - PyTorch Forum: https://discuss.pytorch.org/c/distributed/torchtitan/44
Related Skills
async-python-patterns
Comprehensive guidance for implementing asynchronous Python applications using asyncio, concurrent programming patterns, and async/await for building high-performance, non-blocking systems.
slack-automation
Automate Slack workspace operations including messaging, search, channel management, and reaction workflows through Composio's Slack toolkit.
linear-automation
Automate Linear tasks via Rube MCP (Composio): issues, projects, cycles, teams, labels. Always search tools first for current schemas.
jira-automation
Automate Jira tasks via Rube MCP (Composio): issues, projects, sprints, boards, comments, users. Always search tools first for current schemas.
gitops-workflow
Complete guide to implementing GitOps workflows with ArgoCD and Flux for automated Kubernetes deployments.
github-automation
Automate GitHub repositories, issues, pull requests, branches, CI/CD, and permissions via Rube MCP (Composio). Manage code workflows, review PRs, search code, and handle deployments programmatically.
github-actions-templates
Production-ready GitHub Actions workflow patterns for testing, building, and deploying applications.
zustand-store-ts
Create Zustand stores following established patterns with proper TypeScript types and middleware.
zod-validation-expert
Expert in Zod — TypeScript-first schema validation. Covers parsing, custom errors, refinements, type inference, and integration with React Hook Form, Next.js, and tRPC.
tanstack-query-expert
Expert in TanStack Query (React Query) — asynchronous state management. Covers data fetching, stale time configuration, mutations, optimistic updates, and Next.js App Router (SSR) integration.
tailwind-design-system
Build production-ready design systems with Tailwind CSS, including design tokens, component variants, responsive patterns, and accessibility.
sveltekit
Build full-stack web applications with SvelteKit — file-based routing, SSR, SSG, API routes, and form actions in one framework.