deep-learning-papers-guide

Annotated deep learning paper implementations with code walkthroughs

191 stars

Best use case

deep-learning-papers-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Annotated deep learning paper implementations with code walkthroughs

Teams using deep-learning-papers-guide should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/deep-learning-papers-guide/SKILL.md --create-dirs "https://raw.githubusercontent.com/wentorai/research-plugins/main/skills/domains/ai-ml/deep-learning-papers-guide/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/deep-learning-papers-guide/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How deep-learning-papers-guide Compares

Feature / Agent	deep-learning-papers-guide	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Annotated deep learning paper implementations with code walkthroughs

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Deep Learning Papers Guide

## Overview

Understanding deep learning architectures requires more than reading papers -- it requires reading and writing code. The annotated_deep_learning_paper_implementations repository (65,800+ stars) provides line-by-line annotated implementations of seminal deep learning papers in PyTorch, making it one of the most valuable learning resources in the field.

This guide organizes the key architectures by category, provides implementation patterns for the most important building blocks, and offers strategies for going from paper to working code. Whether you are implementing a Transformer variant for your research, understanding a GAN architecture for your experiments, or teaching a deep learning course, these patterns accelerate the process.

The focus is on practical understanding: what each component does, why it is designed that way, and how to implement it correctly in PyTorch.

## Core Architecture Families

### Transformer Architectures

The Transformer (Vaswani et al., 2017) is the foundation of modern NLP and increasingly of computer vision.

#### Multi-Head Self-Attention

```python
import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Linear projections and reshape to (batch, heads, seq, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        context = torch.matmul(attn, V)

        # Concatenate heads and project
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(context)
```

#### Transformer Encoder Block

```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm variant (used in GPT-2, ViT, modern architectures)
        attn_out = self.attention(self.norm1(x), self.norm1(x), self.norm1(x), mask)
        x = x + self.dropout(attn_out)
        x = x + self.ffn(self.norm2(x))
        return x
```

### Convolutional Neural Networks

#### ResNet Bottleneck Block

```python
class BottleneckBlock(nn.Module):
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        return self.relu(out)
```

## Key Architecture Comparison

| Architecture | Year | Parameters | Key Innovation | Primary Domain |
|-------------|------|------------|----------------|---------------|
| ResNet | 2015 | 25M (ResNet-50) | Skip connections | Vision |
| Transformer | 2017 | Varies | Self-attention | NLP |
| BERT | 2018 | 340M (Large) | Masked language modeling | NLP |
| GPT-2 | 2019 | 1.5B | Autoregressive generation | NLP |
| ViT | 2020 | 86M (Base) | Patch-based image tokenization | Vision |
| Diffusion | 2020 | Varies | Iterative denoising | Generation |
| LLaMA | 2023 | 7B-70B | Efficient open LLM | NLP |

## Training Patterns

### Standard Training Loop

```python
def train_epoch(model, dataloader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for batch_idx, (data, targets) in enumerate(dataloader):
        data, targets = data.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, targets)
        loss.backward()

        # Gradient clipping (crucial for Transformers)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        total_loss += loss.item()

    return total_loss / len(dataloader)
```

### Learning Rate Scheduling

```python
# Cosine annealing with warmup (standard for Transformers)
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=1000)
cosine = CosineAnnealingLR(optimizer, T_max=50000)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[1000])
```

## From Paper to Code: A Methodology

1. **Read the paper twice.** First pass for high-level understanding; second pass for implementation details.
2. **Identify the core algorithm.** Usually in Section 3 or 4 of the paper.
3. **List all hyperparameters.** Create a config dict before writing any code.
4. **Implement bottom-up.** Start with the smallest building blocks (attention, normalization), then compose.
5. **Test each component in isolation.** Verify tensor shapes and gradients at each level.
6. **Reproduce a known result first.** Match the paper's numbers on a small dataset before scaling.
7. **Use the official implementation as reference.** But write your own code for understanding.

## Best Practices

- **Always verify tensor shapes.** Add assert statements for dimensions during development.
- **Use mixed precision training.** `torch.cuda.amp` provides 2x speedup with minimal accuracy loss.
- **Log everything.** Use Weights & Biases or TensorBoard for experiment tracking.
- **Start small.** Debug on a tiny dataset before running on the full one.
- **Read the appendix.** Critical details (learning rates, initialization, data augmentation) are often in the supplementary material.
- **Join the community.** Papers With Code, Reddit r/MachineLearning, and Twitter/X are where implementation details are discussed.

## References

- [annotated_deep_learning_paper_implementations](https://github.com/labmlai/annotated_deep_learning_paper_implementations) -- Line-by-line annotated implementations (65,800+ stars)
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) -- Original Transformer paper
- [Deep Residual Learning](https://arxiv.org/abs/1512.03385) -- ResNet paper
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) -- Jay Alammar's visual guide
- [Papers With Code](https://paperswithcode.com/) -- Paper-implementation pairs with benchmarks