Knowledge Distillation: Compressing LLMs
## When to Use This Skill
**Best use case:** Knowledge Distillation: Compressing LLMs is best used when you need a repeatable AI agent workflow instead of a one-off prompt. Teams using this skill should expect more consistent output, faster repeated execution, and less prompt rewriting.
**When to use this skill**
- You want a reusable workflow that can be run more than once with consistent structure.

**When not to use this skill**
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
## Installation
**Claude Code / Cursor / Codex (manual installation)**
- Download SKILL.md from GitHub
- Place it at `.claude/skills/knowledge-distillation/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
## How Knowledge Distillation: Compressing LLMs Compares
| Feature / Agent | Knowledge Distillation: Compressing LLMs | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
## Frequently Asked Questions
**What does this skill do?**
It provides a reusable workflow for compressing large language models with knowledge distillation: training a smaller student model to imitate a larger teacher using techniques such as temperature scaling, soft targets, and reverse KLD (MiniLLM).

**Where can I find the source code?**
You can find the source code on GitHub using the link provided at the top of the page.
## SKILL.md Source
# Knowledge Distillation: Compressing LLMs
## When to Use This Skill
Use Knowledge Distillation when you need to:
- **Compress models** from 70B → 7B while retaining 90%+ performance
- **Transfer capabilities** from proprietary models (GPT-4) to open-source (LLaMA, Mistral)
- **Reduce inference costs** by deploying smaller student models
- **Create specialized models** by distilling domain-specific knowledge
- **Improve small models** using synthetic data from large teachers
**Key Techniques**: Temperature scaling, soft targets, reverse KLD (MiniLLM), logit distillation, response distillation
**Papers**: Hinton et al. 2015 (arXiv 1503.02531), MiniLLM (arXiv 2306.08543), KD Survey (arXiv 2402.13116)
## Installation
```bash
# Standard transformers
pip install transformers datasets accelerate
# For training
pip install torch deepspeed wandb
# Optional: MiniLLM implementation
git clone https://github.com/microsoft/LMOps
cd LMOps/minillm
pip install -e .
```
## Quick Start
### Basic Knowledge Distillation
```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# 1. Load teacher (large) and student (small) models
teacher = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # Large teacher
    torch_dtype=torch.float16,
    device_map="auto"
)
student = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # Small student
    torch_dtype=torch.float16,
    device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

# 2. Define distillation loss
def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """
    Combine hard loss (cross-entropy) with soft loss (KL divergence).

    Args:
        temperature: Softens probability distributions (higher = softer)
        alpha: Weight for distillation loss (1 - alpha for hard loss)
    """
    # Hard loss: standard cross-entropy with true labels
    # (for causal LMs, shift logits/labels by one position before this step)
    hard_loss = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))

    # Soft loss: KL divergence between student and teacher
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_targets, reduction='batchmean') * (temperature ** 2)

    # Combined loss
    return alpha * soft_loss + (1 - alpha) * hard_loss

# 3. Training loop (assumes `dataloader` yields tokenized batches with input_ids, attention_mask, labels)
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
for batch in dataloader:
    # Teacher forward (no grad)
    with torch.no_grad():
        teacher_outputs = teacher(**batch)
        teacher_logits = teacher_outputs.logits

    # Student forward
    student_outputs = student(**batch)
    student_logits = student_outputs.logits

    # Compute distillation loss
    loss = distillation_loss(
        student_logits,
        teacher_logits,
        batch['labels'],
        temperature=2.0,
        alpha=0.7  # 70% soft, 30% hard
    )

    # Backward and optimize
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
### MiniLLM (Reverse KLD)
**Source**: arXiv 2306.08543 (2024)
**Innovation**: Use reverse KLD instead of forward KLD for better generative model distillation.
```python
def reverse_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """
    Reverse KL divergence: KL(Student || Teacher), as used in MiniLLM.
    Mode-seeking, which suits generative models better than forward KL.
    """
    # Student distribution q and teacher distribution p (log-space for stability)
    log_q = F.log_softmax(student_logits / temperature, dim=-1)
    log_p = F.log_softmax(teacher_logits / temperature, dim=-1)
    q = log_q.exp()

    # Reverse KL: expectation under the student, so the student is penalized
    # for placing mass where the teacher assigns little probability
    reverse_kl = (q * (log_q - log_p)).sum(dim=-1).mean()
    return reverse_kl * (temperature ** 2)

# Training with a token-level reverse-KLD objective
# (the full MiniLLM recipe optimizes a sequence-level reverse KLD with policy gradients;
# this per-token form is a simplified approximation)
for batch in dataloader:
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits

    # Reverse KLD (better for generation)
    loss = reverse_kl_loss(student_logits, teacher_logits, temperature=1.0)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
**Why reverse KL?**
- **Forward KL** (standard KD): KL(teacher || student) is mass-covering, so the student spreads probability over the whole teacher distribution and can overestimate the teacher's low-probability regions
- **Reverse KL** (MiniLLM): KL(student || teacher) is mode-seeking, so a capacity-limited student concentrates on the teacher's major modes instead of smearing mass over all of them
- Better suited to open-ended text generation, where over-generating unlikely continuations hurts quality (see the toy comparison below)
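To make the mode-seeking versus mass-covering distinction concrete, here is a small self-contained sketch (a toy example, not taken from the MiniLLM codebase) comparing the two divergences when a unimodal student is fit against a bimodal teacher:

```python
import torch

# Toy bimodal "teacher" over 4 tokens and two candidate "students"
teacher = torch.tensor([0.45, 0.05, 0.45, 0.05])
student_mode = torch.tensor([0.85, 0.05, 0.05, 0.05])    # commits to one teacher mode
student_spread = torch.tensor([0.25, 0.25, 0.25, 0.25])  # spreads mass everywhere

def forward_kl(p, q):   # KL(teacher || student): mass-covering
    return (p * (p.log() - q.log())).sum()

def reverse_kl(p, q):   # KL(student || teacher): mode-seeking
    return (q * (q.log() - p.log())).sum()

for name, q in [("mode-committed", student_mode), ("spread-out", student_spread)]:
    print(f"{name}: forward KL = {forward_kl(teacher, q).item():.3f}, "
          f"reverse KL = {reverse_kl(teacher, q).item():.3f}")

# Forward KL penalizes the mode-committed student (it misses one teacher mode),
# while reverse KL prefers it (it never puts much mass where the teacher has little).
```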
### Response Distillation
```python
# Generate synthetic data from the teacher, then train the student to imitate it

# 1. Generate synthetic responses from the teacher
prompts = ["Explain AI:", "What is ML?", "Define NLP:"]
teacher_responses = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors='pt').to(teacher.device)
    outputs = teacher.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    teacher_responses.append(response)

# 2. Train the student on the teacher's responses (standard fine-tuning)
#    (tokenize these texts into a datasets.Dataset before passing them to Trainer;
#     see the sketch after this block)
train_dataset = [
    {"text": f"{prompt}\n{response}"}
    for prompt, response in zip(prompts, teacher_responses)
]

# 3. Fine-tune the student
trainer = Trainer(
    model=student,
    args=TrainingArguments(output_dir="./student", num_train_epochs=3, learning_rate=2e-5),
    train_dataset=train_dataset,
)
trainer.train()
```
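As written, `train_dataset` is a plain list of raw texts, which `Trainer` cannot consume directly. A minimal conversion sketch, assuming the Hugging Face `datasets` library is installed (the `max_length` value is an arbitrary choice):

```python
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling

# Llama tokenizers ship without a pad token; reuse EOS so the collator can pad
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Build a Dataset from the synthetic texts and tokenize it
raw = Dataset.from_list(train_dataset)
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

# mlm=False makes the collator derive causal-LM labels from input_ids
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
trainer = Trainer(
    model=student,
    args=TrainingArguments(output_dir="./student", num_train_epochs=3, learning_rate=2e-5),
    train_dataset=tokenized,
    data_collator=collator,
)
```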
## Core Concepts
### 1. Temperature Scaling
**Purpose**: Soften probability distributions to expose teacher's uncertainty.
```python
import torch

# Low temperature (T=1): sharp distribution
logits = torch.tensor([3.0, 2.0, 1.0])
probs_T1 = torch.softmax(logits / 1.0, dim=-1)  # [0.67, 0.24, 0.09]

# High temperature (T=4): soft distribution
probs_T4 = torch.softmax(logits / 4.0, dim=-1)  # [0.42, 0.33, 0.25]

# Higher T reveals more information about the relative ranking of non-top tokens
```
**Rule**: Use T=2-5 for distillation (2 is common default).
### 2. Loss Function Components
```python
# Total loss = alpha * soft_loss + (1 - alpha) * hard_loss

# Soft loss: learn from the teacher's knowledge (forward KL, teacher as target)
soft_loss = KL(teacher || student)

# Hard loss: learn from ground-truth labels
hard_loss = CrossEntropy(student_output, true_labels)

# Typical values:
alpha = 0.5  # Balanced
alpha = 0.7  # More emphasis on teacher
alpha = 0.3  # More emphasis on labels
```
### 3. Forward vs Reverse KLD
```python
# Forward KL: KL(Teacher || Student)
# - Mass-covering: student spreads probability to cover all teacher modes
# - Can overestimate the teacher's low-probability regions
# - The standard choice for classification-style distillation

# Reverse KL: KL(Student || Teacher)
# - Mode-seeking: student concentrates on the teacher's major modes
# - Avoids placing mass where the teacher assigns little probability
# - Better for open-ended generation (MiniLLM)
```
## Training Strategies
### Strategy 1: Logit Distillation
```python
# Train the student to match the teacher's logits directly
def logit_distillation_trainer(student, teacher, dataloader, temperature=2.0):
    optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
    for epoch in range(3):
        for batch in dataloader:
            # Get logits (teacher without gradients)
            with torch.no_grad():
                teacher_logits = teacher(**batch).logits
            student_logits = student(**batch).logits

            # MSE on logits (alternative to KLD)
            loss = F.mse_loss(student_logits, teacher_logits)
            # Or use KLD:
            # loss = F.kl_div(
            #     F.log_softmax(student_logits / temperature, dim=-1),
            #     F.softmax(teacher_logits / temperature, dim=-1),
            #     reduction='batchmean'
            # ) * (temperature ** 2)

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return student
```
### Strategy 2: Two-Stage Distillation
```python
# NOTE: distill() and fine_tune() are placeholders for your own training routines
#       (see the sketch after this block)

# Stage 1: Distill general behavior from the teacher
student = distill(teacher, student, epochs=5)

# Stage 2: Fine-tune on task-specific data
student = fine_tune(student, task_data, epochs=3)

# Results in better task performance than single-stage distillation
```
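For illustration, here is one way the two placeholder stages could be fleshed out, reusing `distillation_loss` from the Quick Start. The signatures (both take explicit dataloaders), learning rates, and epoch counts are assumptions, not part of the original skill:

```python
def distill(teacher, student, dataloader, epochs=5, temperature=2.0, alpha=0.7):
    """Stage 1: match the teacher's soft targets on generic data."""
    optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
    for _ in range(epochs):
        for batch in dataloader:  # batches must include labels
            with torch.no_grad():
                teacher_logits = teacher(**batch).logits
            student_logits = student(**batch).logits
            loss = distillation_loss(student_logits, teacher_logits, batch["labels"],
                                     temperature=temperature, alpha=alpha)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return student

def fine_tune(student, task_dataloader, epochs=3):
    """Stage 2: plain supervised fine-tuning on task-specific data."""
    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
    for _ in range(epochs):
        for batch in task_dataloader:  # batches must include labels
            loss = student(**batch).loss  # hard cross-entropy computed by the model
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return student
```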
### Strategy 3: Multi-Teacher Distillation
```python
# Learn from multiple expert teachers
def multi_teacher_distillation(student, teachers, batch):
    """Distill from an ensemble of teachers."""
    teacher_logits_list = []

    # Get logits from all teachers
    with torch.no_grad():
        for teacher in teachers:
            logits = teacher(**batch).logits
            teacher_logits_list.append(logits)

    # Average teacher predictions
    avg_teacher_logits = torch.stack(teacher_logits_list).mean(dim=0)

    # Student learns from the ensemble
    student_logits = student(**batch).logits
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(avg_teacher_logits, dim=-1),
        reduction='batchmean'
    )
    return loss
```
## Production Deployment
### Complete Training Script
```python
import torch
import torch.nn.functional as F
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)

def train_distilled_model(
    teacher_name="meta-llama/Llama-2-70b-hf",
    student_name="meta-llama/Llama-2-7b-hf",
    train_dataset=None,  # pre-tokenized dataset (input_ids / attention_mask)
    output_dir="./distilled-llama-7b",
    temperature=2.0,
    alpha=0.7,
):
    # Load models
    teacher = AutoModelForCausalLM.from_pretrained(teacher_name, torch_dtype=torch.float16, device_map="auto")
    student = AutoModelForCausalLM.from_pretrained(student_name, torch_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(teacher_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

    # Custom trainer with distillation
    class DistillationTrainer(Trainer):
        def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
            # Student forward
            outputs_student = model(**inputs)
            student_logits = outputs_student.logits

            # Teacher forward (no grad)
            with torch.no_grad():
                outputs_teacher = teacher(**inputs)
                teacher_logits = outputs_teacher.logits

            # Distillation (soft) loss
            soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
            soft_student = F.log_softmax(student_logits / temperature, dim=-1)
            soft_loss = F.kl_div(soft_student, soft_targets, reduction='batchmean') * (temperature ** 2)

            # Hard loss (cross-entropy computed by the model from labels)
            hard_loss = outputs_student.loss

            # Combined
            loss = alpha * soft_loss + (1 - alpha) * hard_loss
            return (loss, outputs_student) if return_outputs else loss

    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        warmup_steps=500,
        logging_steps=100,
        save_steps=1000,
        bf16=True,
        gradient_checkpointing=True,
    )

    # Train
    trainer = DistillationTrainer(
        model=student,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    student.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

# Usage
train_distilled_model(
    teacher_name="meta-llama/Llama-2-70b-hf",
    student_name="meta-llama/Llama-2-7b-hf",
    train_dataset=train_dataset,  # your tokenized corpus
    temperature=2.0,
    alpha=0.7
)
```
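Once training finishes, the distilled student loads and generates like any other causal LM. A minimal usage sketch (the prompt is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./distilled-llama-7b", torch_dtype=torch.float16, device_map="auto")
tok = AutoTokenizer.from_pretrained("./distilled-llama-7b")

inputs = tok("Explain knowledge distillation in one paragraph:", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```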
## Best Practices
### 1. Hyperparameter Selection
```python
# Temperature
T = 1.0 # Sharp (less knowledge transfer)
T = 2.0 # Standard (good balance)
T = 5.0 # Soft (more knowledge transfer)
# Alpha (weight)
alpha = 0.5 # Balanced
alpha = 0.7 # Emphasize teacher knowledge
alpha = 0.9 # Strong distillation
# Rule: Higher T + higher alpha = stronger distillation
```
### 2. Model Size Ratio
```python
# Good teacher/student size ratios
# 70B -> 7B  (10x): excellent
# 13B -> 1B  (13x): good
#  7B -> 1B  ( 7x): acceptable

# Avoid too large a gap
# 70B -> 1B  (70x): too large, distillation becomes ineffective
```
### 3. Data Quality
```python
# Best: mix teacher-generated data with real data (see the sketch below)
train_data_mix = {
    "teacher_generated": 0.7,  # diverse, high-quality synthetic responses
    "real_data": 0.3,          # ground truth
}

# Avoid: only real data (does not fully exploit the teacher's knowledge)
```
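One possible way to realize such a mix with the Hugging Face `datasets` library; the example lists (`teacher_generated_examples`, `real_examples`) and the exact 70/30 arithmetic are illustrative assumptions:

```python
from datasets import Dataset, concatenate_datasets

# Two tokenized datasets: one built from teacher generations, one from real labels
synthetic = Dataset.from_list(teacher_generated_examples)  # hypothetical list of examples
real = Dataset.from_list(real_examples)                    # hypothetical list of examples

# Subsample the real data so the final mix is roughly 70% synthetic / 30% real, then shuffle
n_real = int(len(synthetic) * 0.3 / 0.7)
mixed = concatenate_datasets([
    synthetic,
    real.shuffle(seed=0).select(range(min(n_real, len(real)))),
]).shuffle(seed=0)
```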
## Evaluation
```python
from transformers import pipeline

# Compare student vs teacher on the same prompts
teacher_pipe = pipeline("text-generation", model=teacher, tokenizer=tokenizer)
student_pipe = pipeline("text-generation", model=student, tokenizer=tokenizer)

prompts = ["Explain quantum computing:", "What is AI?"]
for prompt in prompts:
    teacher_out = teacher_pipe(prompt, max_new_tokens=100)
    student_out = student_pipe(prompt, max_new_tokens=100)
    print(f"Prompt: {prompt}")
    print(f"Teacher: {teacher_out[0]['generated_text']}")
    print(f"Student: {student_out[0]['generated_text']}")
    # calculate_similarity is a placeholder for your own metric (see the sketch below)
    print(f"Match quality: {calculate_similarity(teacher_out, student_out):.2f}")
```
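`calculate_similarity` is not defined by this skill; one simple stand-in is a unigram Jaccard overlap between the two generations (ROUGE-L or embedding cosine similarity are common, more robust alternatives):

```python
def calculate_similarity(teacher_out, student_out):
    """Crude unigram-overlap similarity between teacher and student generations (0..1)."""
    t_tokens = set(teacher_out[0]["generated_text"].lower().split())
    s_tokens = set(student_out[0]["generated_text"].lower().split())
    if not t_tokens or not s_tokens:
        return 0.0
    return len(t_tokens & s_tokens) / len(t_tokens | s_tokens)
```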
## Resources
- **Hinton et al. 2015 (Foundational)**: https://arxiv.org/abs/1503.02531
- **MiniLLM (Reverse KLD)**: https://arxiv.org/abs/2306.08543
- **KD Survey for LLMs (2024)**: https://arxiv.org/abs/2402.13116
- **MiniLLM GitHub**: https://github.com/microsoft/LMOps/tree/main/minillm

## Related Skills
update-llms
Update the llms.txt file in the root folder to reflect changes in documentation or specifications following the llms.txt specification at https://llmstxt.org/
create-llms
Create an llms.txt file from scratch based on repository structure following the llms.txt specification at https://llmstxt.org/
knowledge-base
A professional knowledge-base management system designed to address the "curse of knowledge" and cognitive bias. It builds a structured Markdown knowledge base by making tacit knowledge explicit, scanning code to extract domain concepts, and integrating industry best practices.
notion-knowledge-capture
Capture conversations and decisions into structured Notion pages; use when turning chats/notes into wiki entries, how-tos, decisions, or FAQs with proper linking.
recursive-knowledge
Process large document corpora (1000+ docs, millions of tokens) through knowledge graph construction and stateful multi-hop reasoning. Use when (1) User provides a large corpus exceeding context limits, (2) Questions require connections across multiple documents, (3) Multi-hop reasoning needed for complex queries, (4) User wants persistent queryable knowledge from documents. Replaces brute-force document stuffing with intelligent graph traversal.
project-knowledge
Load project architecture and structure knowledge. Use when you need to understand how this project is organized.
reconnaissance-knowledge
Comprehensive knowledge about network reconnaissance and service enumeration. Provides methodologies for port scanning, service fingerprinting, web directory discovery, and vulnerability identification. Includes best practices for structured data collection.
privilege-escalation-knowledge
Comprehensive knowledge about Linux privilege escalation. Provides methodologies for enumerating and exploiting privesc vectors including SUID binaries, sudo permissions, capabilities, kernel exploits, cron jobs, and common misconfigurations. Includes systematic approach to capturing root flags.
exploitation-knowledge
Comprehensive knowledge about vulnerability exploitation and initial access. Provides expertise on finding and adapting exploits, adapting proof-of-concepts, gaining shells, and capturing user flags. Covers reverse shells, file uploads, SQL injection, and RCE vulnerabilities.
knowledge
Display knowledge base status and recent learnings
Ollama — Run LLMs Locally
You are an expert in Ollama, the tool for running open-source LLMs locally. You help developers run Llama, Mistral, Gemma, Phi, CodeLlama, and other models on their machine with a simple CLI and REST API — enabling private AI development, offline inference, fine-tuning experiments, and cost-free prototyping without sending data to cloud APIs.
DSPy — Programming (Not Prompting) LLMs
You are an expert in DSPy, the Stanford framework that replaces prompt engineering with programming. You help developers define LLM tasks as typed signatures, compose them into modules, and automatically optimize prompts/few-shot examples using teleprompters — so instead of manually crafting prompts, you write Python code and DSPy finds the best prompts for your task.