deep-learning-papers-guide
Annotated deep learning paper implementations with code walkthroughs
Best use case
deep-learning-papers-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Annotated deep learning paper implementations with code walkthroughs
Teams using deep-learning-papers-guide should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/deep-learning-papers-guide/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How deep-learning-papers-guide Compares
| Feature / Agent | deep-learning-papers-guide | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Annotated deep learning paper implementations with code walkthroughs
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Deep Learning Papers Guide
## Overview
Understanding deep learning architectures requires more than reading papers -- it requires reading and writing code. The annotated_deep_learning_paper_implementations repository (65,800+ stars) provides line-by-line annotated implementations of seminal deep learning papers in PyTorch, making it one of the most valuable learning resources in the field.
This guide organizes the key architectures by category, provides implementation patterns for the most important building blocks, and offers strategies for going from paper to working code. Whether you are implementing a Transformer variant for your research, understanding a GAN architecture for your experiments, or teaching a deep learning course, these patterns accelerate the process.
The focus is on practical understanding: what each component does, why it is designed that way, and how to implement it correctly in PyTorch.
## Core Architecture Families
### Transformer Architectures
The Transformer (Vaswani et al., 2017) is the foundation of modern NLP and increasingly of computer vision.
#### Multi-Head Self-Attention
```python
import torch
import torch.nn as nn
import math
class MultiHeadAttention(nn.Module):
def __init__(self, d_model: int, n_heads: int):
super().__init__()
assert d_model % n_heads == 0
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
self.W_o = nn.Linear(d_model, d_model)
def forward(self, query, key, value, mask=None):
batch_size = query.size(0)
# Linear projections and reshape to (batch, heads, seq, d_k)
Q = self.W_q(query).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
K = self.W_k(key).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
V = self.W_v(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
# Scaled dot-product attention
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
attn = torch.softmax(scores, dim=-1)
context = torch.matmul(attn, V)
# Concatenate heads and project
context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
return self.W_o(context)
```
#### Transformer Encoder Block
```python
class TransformerBlock(nn.Module):
def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
super().__init__()
self.attention = MultiHeadAttention(d_model, n_heads)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.ffn = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.GELU(),
nn.Dropout(dropout),
nn.Linear(d_ff, d_model),
nn.Dropout(dropout)
)
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
# Pre-norm variant (used in GPT-2, ViT, modern architectures)
attn_out = self.attention(self.norm1(x), self.norm1(x), self.norm1(x), mask)
x = x + self.dropout(attn_out)
x = x + self.ffn(self.norm2(x))
return x
```
### Convolutional Neural Networks
#### ResNet Bottleneck Block
```python
class BottleneckBlock(nn.Module):
expansion = 4
def __init__(self, in_channels, out_channels, stride=1, downsample=None):
super().__init__()
self.conv1 = nn.Conv2d(in_channels, out_channels, 1, bias=False)
self.bn1 = nn.BatchNorm2d(out_channels)
self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
stride=stride, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(out_channels)
self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, 1, bias=False)
self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
self.relu = nn.ReLU(inplace=True)
self.downsample = downsample
def forward(self, x):
identity = x
out = self.relu(self.bn1(self.conv1(x)))
out = self.relu(self.bn2(self.conv2(out)))
out = self.bn3(self.conv3(out))
if self.downsample is not None:
identity = self.downsample(x)
out += identity
return self.relu(out)
```
## Key Architecture Comparison
| Architecture | Year | Parameters | Key Innovation | Primary Domain |
|-------------|------|------------|----------------|---------------|
| ResNet | 2015 | 25M (ResNet-50) | Skip connections | Vision |
| Transformer | 2017 | Varies | Self-attention | NLP |
| BERT | 2018 | 340M (Large) | Masked language modeling | NLP |
| GPT-2 | 2019 | 1.5B | Autoregressive generation | NLP |
| ViT | 2020 | 86M (Base) | Patch-based image tokenization | Vision |
| Diffusion | 2020 | Varies | Iterative denoising | Generation |
| LLaMA | 2023 | 7B-70B | Efficient open LLM | NLP |
## Training Patterns
### Standard Training Loop
```python
def train_epoch(model, dataloader, optimizer, criterion, device):
model.train()
total_loss = 0
for batch_idx, (data, targets) in enumerate(dataloader):
data, targets = data.to(device), targets.to(device)
optimizer.zero_grad()
outputs = model(data)
loss = criterion(outputs, targets)
loss.backward()
# Gradient clipping (crucial for Transformers)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
total_loss += loss.item()
return total_loss / len(dataloader)
```
### Learning Rate Scheduling
```python
# Cosine annealing with warmup (standard for Transformers)
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=1000)
cosine = CosineAnnealingLR(optimizer, T_max=50000)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[1000])
```
## From Paper to Code: A Methodology
1. **Read the paper twice.** First pass for high-level understanding; second pass for implementation details.
2. **Identify the core algorithm.** Usually in Section 3 or 4 of the paper.
3. **List all hyperparameters.** Create a config dict before writing any code.
4. **Implement bottom-up.** Start with the smallest building blocks (attention, normalization), then compose.
5. **Test each component in isolation.** Verify tensor shapes and gradients at each level.
6. **Reproduce a known result first.** Match the paper's numbers on a small dataset before scaling.
7. **Use the official implementation as reference.** But write your own code for understanding.
## Best Practices
- **Always verify tensor shapes.** Add assert statements for dimensions during development.
- **Use mixed precision training.** `torch.cuda.amp` provides 2x speedup with minimal accuracy loss.
- **Log everything.** Use Weights & Biases or TensorBoard for experiment tracking.
- **Start small.** Debug on a tiny dataset before running on the full one.
- **Read the appendix.** Critical details (learning rates, initialization, data augmentation) are often in the supplementary material.
- **Join the community.** Papers With Code, Reddit r/MachineLearning, and Twitter/X are where implementation details are discussed.
## References
- [annotated_deep_learning_paper_implementations](https://github.com/labmlai/annotated_deep_learning_paper_implementations) -- Line-by-line annotated implementations (65,800+ stars)
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) -- Original Transformer paper
- [Deep Residual Learning](https://arxiv.org/abs/1512.03385) -- ResNet paper
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) -- Jay Alammar's visual guide
- [Papers With Code](https://paperswithcode.com/) -- Paper-implementation pairs with benchmarksRelated Skills
thuthesis-guide
Write Tsinghua University theses using the ThuThesis LaTeX template
thesis-writing-guide
Templates, formatting rules, and strategies for thesis and dissertation writing
thesis-template-guide
Set up LaTeX templates for PhD and Master's thesis documents
sjtuthesis-guide
Write SJTU theses using the SJTUThesis LaTeX template with full compliance
novathesis-guide
LaTeX thesis template supporting multiple universities and formats
graphical-abstract-guide
Create SVG graphical abstracts for journal paper submissions
beamer-presentation-guide
Guide to creating academic presentations with LaTeX Beamer
plagiarism-detection-guide
Use plagiarism detection tools and ensure manuscript originality
paper-polish-guide
Review and polish LaTeX research papers for clarity and style
grammar-checker-guide
Use grammar and style checking tools to polish academic manuscripts
conciseness-editing-guide
Eliminate wordiness and redundancy in academic prose for clarity
academic-translation-guide
Academic translation, post-editing, and Chinglish correction guide