pytorch-lightning-guide
PyTorch Lightning framework for scalable model training and research
Best use case
pytorch-lightning-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
PyTorch Lightning framework for scalable model training and research
Teams using pytorch-lightning-guide should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/pytorch-lightning-guide/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How pytorch-lightning-guide Compares
| Feature / Agent | pytorch-lightning-guide | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
PyTorch Lightning framework for scalable model training and research
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# PyTorch Lightning Guide
## Overview
PyTorch Lightning is a deep learning framework with over 31,000 GitHub stars that provides a high-level interface for PyTorch, enabling researchers to focus on model design rather than engineering boilerplate. Developed by Lightning AI, it decouples the science (model architecture, loss functions, data processing) from the engineering (distributed training, mixed precision, gradient accumulation, checkpointing) through a structured `LightningModule` abstraction.
For academic researchers, Lightning eliminates the need to write repetitive training loops, device management code, and distributed training logic. You define your model, training step, and data loaders, and Lightning handles everything else -- from single GPU to multi-node distributed training, from FP32 to mixed precision, from local development to cloud deployment. This means faster iteration on research ideas with production-quality training infrastructure.
Lightning is used extensively in AI research labs and has become a standard tool for reproducible deep learning experiments. It integrates seamlessly with experiment tracking tools like Weights & Biases, MLflow, and TensorBoard, and supports all PyTorch-compatible model architectures.
## Installation and Setup
```bash
# Install PyTorch Lightning
pip install lightning
# Or install with specific extras
pip install lightning[extra]
# For development/research with all features
pip install lightning[all]
```
Lightning requires Python 3.9+ and PyTorch 2.1+. For GPU training, ensure your PyTorch installation includes CUDA support:
```bash
# Check GPU availability
python -c "import torch; print(torch.cuda.is_available())"
```
Verify your installation:
```python
import lightning as L
print(L.__version__)
```
## Core Architecture
### The LightningModule
The `LightningModule` is the central abstraction. It organizes your PyTorch code into clearly defined methods:
```python
import lightning as L
import torch
import torch.nn.functional as F
from torch import nn
class ResearchModel(L.LightningModule):
def __init__(self, input_dim, hidden_dim, output_dim, lr=1e-3):
super().__init__()
self.save_hyperparameters()
self.encoder = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
)
self.classifier = nn.Linear(hidden_dim, output_dim)
self.lr = lr
def forward(self, x):
features = self.encoder(x)
return self.classifier(features)
def training_step(self, batch, batch_idx):
x, y = batch
logits = self(x)
loss = F.cross_entropy(logits, y)
acc = (logits.argmax(dim=-1) == y).float().mean()
self.log("train_loss", loss, prog_bar=True)
self.log("train_acc", acc, prog_bar=True)
return loss
def validation_step(self, batch, batch_idx):
x, y = batch
logits = self(x)
loss = F.cross_entropy(logits, y)
acc = (logits.argmax(dim=-1) == y).float().mean()
self.log("val_loss", loss, prog_bar=True)
self.log("val_acc", acc, prog_bar=True)
def configure_optimizers(self):
optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=self.trainer.max_epochs
)
return [optimizer], [scheduler]
```
### The LightningDataModule
Encapsulate all data processing in a reusable `LightningDataModule`:
```python
class ResearchDataModule(L.LightningDataModule):
def __init__(self, data_dir, batch_size=32, num_workers=4):
super().__init__()
self.data_dir = data_dir
self.batch_size = batch_size
self.num_workers = num_workers
def setup(self, stage=None):
# Load and split data
dataset = load_research_dataset(self.data_dir)
self.train_data, self.val_data, self.test_data = random_split(
dataset, [0.8, 0.1, 0.1]
)
def train_dataloader(self):
return DataLoader(self.train_data, batch_size=self.batch_size,
shuffle=True, num_workers=self.num_workers)
def val_dataloader(self):
return DataLoader(self.val_data, batch_size=self.batch_size,
num_workers=self.num_workers)
```
### The Trainer
The `Trainer` orchestrates everything with a rich set of configuration options:
```python
trainer = L.Trainer(
max_epochs=100,
accelerator="gpu",
devices=4,
strategy="ddp",
precision="16-mixed",
gradient_clip_val=1.0,
accumulate_grad_batches=4,
callbacks=[
L.callbacks.EarlyStopping(monitor="val_loss", patience=10),
L.callbacks.ModelCheckpoint(monitor="val_loss", save_top_k=3),
L.callbacks.LearningRateMonitor(),
],
logger=L.loggers.WandbLogger(project="my-research"),
)
# Train the model
trainer.fit(model, datamodule=data_module)
# Test with best checkpoint
trainer.test(model, datamodule=data_module, ckpt_path="best")
```
## Advanced Research Features
### Distributed Training Strategies
Lightning supports multiple distributed training strategies out of the box:
- **DDP (Distributed Data Parallel)**: Standard multi-GPU training
- **FSDP (Fully Sharded Data Parallel)**: Memory-efficient training for large models
- **DeepSpeed**: ZeRO optimization stages 1, 2, and 3
```python
# FSDP for large model training
trainer = L.Trainer(
strategy="fsdp",
devices=8,
precision="bf16-mixed",
)
```
### Custom Training Loops
Override the training loop for non-standard research workflows like GANs, reinforcement learning, or meta-learning:
```python
class GANModule(L.LightningModule):
def training_step(self, batch, batch_idx):
optimizer_g, optimizer_d = self.optimizers()
# Train discriminator
real_loss = self.discriminator_loss(batch, real=True)
fake_loss = self.discriminator_loss(batch, real=False)
d_loss = (real_loss + fake_loss) / 2
optimizer_d.zero_grad()
self.manual_backward(d_loss)
optimizer_d.step()
# Train generator
g_loss = self.generator_loss(batch)
optimizer_g.zero_grad()
self.manual_backward(g_loss)
optimizer_g.step()
@property
def automatic_optimization(self):
return False
```
### Profiling and Debugging
Built-in profiling tools help identify bottlenecks:
```python
trainer = L.Trainer(
profiler="advanced", # or "simple", "pytorch"
detect_anomaly=True,
overfit_batches=10, # Quick sanity check
)
```
## Experiment Reproducibility
Lightning has built-in support for reproducibility, which is critical for academic research:
```python
# Seed everything for reproducibility
L.seed_everything(42, workers=True)
# Hyperparameters are automatically saved
model = ResearchModel(input_dim=768, hidden_dim=256, output_dim=10)
# model.hparams is automatically populated and logged
# Checkpoints include full training state
# Resume training from a checkpoint
trainer.fit(model, ckpt_path="path/to/checkpoint.ckpt")
```
The `save_hyperparameters()` call in your module's `__init__` automatically tracks all constructor arguments, making experiment comparison straightforward.
## References
- Repository: https://github.com/Lightning-AI/pytorch-lightning
- Documentation: https://lightning.ai/docs/pytorch/stable/
- Lightning AI platform: https://lightning.ai/
- Migration guide from vanilla PyTorch: https://lightning.ai/docs/pytorch/stable/starter/converting.htmlRelated Skills
thuthesis-guide
Write Tsinghua University theses using the ThuThesis LaTeX template
thesis-writing-guide
Templates, formatting rules, and strategies for thesis and dissertation writing
thesis-template-guide
Set up LaTeX templates for PhD and Master's thesis documents
sjtuthesis-guide
Write SJTU theses using the SJTUThesis LaTeX template with full compliance
novathesis-guide
LaTeX thesis template supporting multiple universities and formats
graphical-abstract-guide
Create SVG graphical abstracts for journal paper submissions
beamer-presentation-guide
Guide to creating academic presentations with LaTeX Beamer
plagiarism-detection-guide
Use plagiarism detection tools and ensure manuscript originality
paper-polish-guide
Review and polish LaTeX research papers for clarity and style
grammar-checker-guide
Use grammar and style checking tools to polish academic manuscripts
conciseness-editing-guide
Eliminate wordiness and redundancy in academic prose for clarity
academic-translation-guide
Academic translation, post-editing, and Chinglish correction guide