llm-from-scratch-guide

Build a ChatGPT-like LLM from scratch using PyTorch step by step

191 stars

Best use case

llm-from-scratch-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Build a ChatGPT-like LLM from scratch using PyTorch step by step

Teams using llm-from-scratch-guide should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/llm-from-scratch-guide/SKILL.md --create-dirs "https://raw.githubusercontent.com/wentorai/research-plugins/main/skills/domains/ai-ml/llm-from-scratch-guide/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/llm-from-scratch-guide/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How llm-from-scratch-guide Compares

Feature / Agent	llm-from-scratch-guide	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Build a ChatGPT-like LLM from scratch using PyTorch step by step

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# LLM From Scratch Guide

## Overview

LLMs-from-scratch is a comprehensive educational repository with over 87,000 stars on GitHub that teaches you how to build a ChatGPT-like large language model from the ground up using PyTorch. Created by Sebastian Raschka, a machine learning researcher and author, the project provides a complete pipeline covering data preparation, tokenization, attention mechanisms, pretraining, and instruction finetuning.

Unlike tutorials that treat LLMs as black boxes, this project demystifies every component by walking through the full implementation. Each chapter corresponds to a Jupyter notebook with clear explanations, diagrams, and runnable code. The repository accompanies the book "Build a Large Language Model (From Scratch)" and serves as a standalone learning resource for researchers and engineers who want deep understanding of transformer-based language models.

The project is particularly valuable for academic researchers who need to understand the internals of LLMs for their own research, whether that involves modifying architectures, running ablation studies, or developing domain-specific language models for scientific applications.

## Installation and Setup

Clone the repository and set up a Python environment with the required dependencies:

```bash
git clone https://github.com/rasbt/LLMs-from-scratch.git
cd LLMs-from-scratch

# Create a virtual environment
python -m venv llm-env
source llm-env/bin/activate

# Install dependencies
pip install -r requirements.txt
```

The project requires Python 3.10+ and PyTorch 2.0+. For GPU-accelerated training, ensure you have CUDA installed. The notebooks can also run on CPU for smaller model configurations, though training times will be significantly longer.

Key dependencies include:

- **PyTorch** >= 2.0 for model implementation and training
- **tiktoken** for BPE tokenization compatible with OpenAI models
- **matplotlib** for training visualization
- **jupyter** for interactive notebook execution

## Core Learning Pipeline

The project is organized into sequential chapters that build on each other:

### Chapter 1: Understanding Large Language Models
Covers the conceptual foundations of LLMs, including the transformer architecture, the difference between encoder and decoder models, and how pretraining and finetuning work at a high level.

### Chapter 2: Working with Text Data
Implements text tokenization from scratch, including byte-pair encoding (BPE). You build a custom tokenizer and learn how text is converted to numerical representations:

```python
# Tokenization example from the project
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
text = "Large language models are fascinating."
token_ids = tokenizer.encode(text)
decoded = tokenizer.decode(token_ids)
```

### Chapter 3: Coding Attention Mechanisms
Implements self-attention, multi-head attention, and causal (masked) attention from scratch. This is the core computational primitive of transformers:

```python
# Simplified multi-head attention
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, num_heads, dropout=0.0):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        self.out_proj = nn.Linear(d_out, d_out)
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
```

### Chapter 4: Implementing a GPT Model
Assembles the full GPT architecture using the attention mechanism, layer normalization, feed-forward networks, and positional embeddings.

### Chapter 5: Pretraining on Unlabeled Data
Trains the GPT model on a text corpus using next-token prediction. Covers the training loop, loss computation, learning rate scheduling, and gradient clipping.

### Chapter 6: Finetuning for Text Classification
Adapts the pretrained model for downstream classification tasks, demonstrating how to add a classification head and finetune on labeled data.

### Chapter 7: Instruction Finetuning
Converts the pretrained model into an instruction-following assistant using supervised finetuning on instruction-response pairs, similar to how ChatGPT is trained.

## Research Applications

This resource is invaluable for several research scenarios:

- **Architecture ablation studies**: Modify individual components (attention heads, layer count, embedding dimensions) and measure their impact on performance
- **Domain-specific pretraining**: Use the pipeline to pretrain models on scientific corpora (biomedical literature, physics papers, chemical databases)
- **Tokenizer research**: Experiment with different tokenization strategies for specialized vocabularies
- **Efficient training methods**: Test techniques like gradient accumulation, mixed precision, and learning rate warmup
- **Interpretability research**: Inspect attention patterns and intermediate representations at every layer

For researchers working with limited compute, the project includes configurations for small models (124M parameters) that can be trained on a single GPU in reasonable time, making it practical for experimentation and prototyping.

## Integration with Research Workflows

Combine this project with other tools in your research stack:

- Use **Weights & Biases** or **MLflow** for experiment tracking during pretraining runs
- Export trained models to **Hugging Face Hub** for sharing and reproducibility
- Integrate with **PyTorch Lightning** for distributed training across multiple GPUs
- Apply **LoRA** or **QLoRA** adapters from the bonus chapters for parameter-efficient finetuning

The bonus materials in the repository cover additional topics like DPO (Direct Preference Optimization), loading pretrained weights from Hugging Face, and converting models between different formats.

## References

- Repository: https://github.com/rasbt/LLMs-from-scratch
- Book: "Build a Large Language Model (From Scratch)" by Sebastian Raschka (Manning, 2024)
- Author's blog: https://sebastianraschka.com/blog/
- PyTorch documentation: https://pytorch.org/docs/stable/