peft-fine-tuning

Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when a user asks to fine-tune a language model, train a custom LLM, adapt a model to their data, use LoRA or QLoRA, fine-tune Llama or Mistral, or train a model on consumer GPUs. Covers PEFT methods for 7B-70B parameter models.

Best use case

peft-fine-tuning is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using peft-fine-tuning should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/peft-fine-tuning/SKILL.md --create-dirs "https://raw.githubusercontent.com/TerminalSkills/skills/main/skills/peft-fine-tuning/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/peft-fine-tuning/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How peft-fine-tuning Compares

Feature / Agent            peft-fine-tuning    Standard Approach
Platform Support           Not specified       Limited / Varies
Context Awareness          High                Baseline
Installation Complexity    Unknown             N/A

Frequently Asked Questions

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# PEFT Fine-Tuning

## Overview

Fine-tune large language models efficiently using Parameter-Efficient Fine-Tuning (PEFT) methods. Train 7B to 70B parameter models on consumer GPUs (16-48 GB VRAM) using LoRA, QLoRA, and 25+ adapter methods from the Hugging Face PEFT library. Avoid the cost and hardware requirements of full fine-tuning while achieving comparable results.

## Instructions

When a user asks to fine-tune a model, determine the approach:

### Task A: Set up the environment

```bash
pip install torch transformers datasets peft accelerate bitsandbytes trl
# For Flash Attention 2 (recommended for speed)
pip install flash-attn --no-build-isolation
```

Verify GPU availability:
```python
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    # total_memory is reported in bytes
    print(f"VRAM: {props.total_memory / 1e9:.1f} GB")
```
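
One way to turn the VRAM reading into a choice between the tasks below is a small helper. This is only a sketch: `suggest_method` is a hypothetical name, and the 16 GB and 8 GB thresholds are simply the figures from the Task B and Task C headings, which assume a model of roughly 8B parameters.

```python
def suggest_method(vram_gb: float) -> str:
    """Map available VRAM to a fine-tuning approach (hypothetical helper).

    Thresholds mirror the Task B / Task C headings and assume a ~8B model.
    """
    if vram_gb >= 16:
        return "lora"    # Task B: 16-bit base model + LoRA adapters
    if vram_gb >= 8:
        return "qlora"   # Task C: 4-bit base model + LoRA adapters
    return "insufficient"


for vram in (24.0, 12.0, 6.0):
    print(f"{vram:.0f} GB -> {suggest_method(vram)}")
```

Larger models shift both thresholds upward, so treat the result as a starting point rather than a guarantee that training will fit.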

### Task B: Fine-tune with LoRA (16+ GB VRAM)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from trl import SFTTrainer

# 1. Load base model
model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,                          # Rank (8-64; higher = more capacity)
    lora_alpha=32,                 # Scaling factor (usually 2x rank)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
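# Prints roughly: trainable params: ~41.9M || all params: ~8.07B || trainable%: ~0.52
# (r=16 on all seven projection modules; attention-only targets would give ~13.6M)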

# 3. Load and format dataset
dataset = load_dataset("your-dataset")

def format_prompt(example):
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"

# 4. Train
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
)

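# Depending on the installed TRL version, the sequence-length limit may need to be
# set on SFTConfig (as max_seq_length or max_length) instead of being passed here.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    formatting_func=format_prompt,
    max_seq_length=2048,
)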

trainer.train()
trainer.save_model("./lora-adapter")
```
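
The trainable-parameter count that `print_trainable_parameters()` reports can be checked by hand, since LoRA adds `r * (in_features + out_features)` weights per targeted module. The sketch below does that arithmetic with the published Llama 3.1 8B layer shapes. The helper name `lora_param_count` is hypothetical, and the adapter-size line assumes the adapter is stored at 2 bytes per parameter.

```python
# Llama 3.1 8B layer shapes used for the (in_features, out_features) pairs below.
HIDDEN = 4096          # model hidden size
INTERMEDIATE = 14336   # MLP intermediate size
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32

MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, target_modules):
    """LoRA adds A (r x in) and B (out x r) per module: r * (in + out) params."""
    per_layer = sum(r * (i + o) for i, o in (MODULE_SHAPES[m] for m in target_modules))
    return per_layer * NUM_LAYERS


all_linear = lora_param_count(16, list(MODULE_SHAPES))
attention_only = lora_param_count(16, ["q_proj", "k_proj", "v_proj", "o_proj"])

print(f"all linear layers: {all_linear:,}")      # 41,943,040
print(f"attention only:    {attention_only:,}")  # 13,631,488
print(f"adapter size at 2 bytes/param: {all_linear * 2 / 1e6:.1f} MB")  # 83.9 MB
```

With `r=16` on all seven projection modules the count is about 41.9M. A figure near 13.6M corresponds to targeting the attention projections only.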

### Task C: Fine-tune with QLoRA (8+ GB VRAM)

QLoRA quantizes the base model to 4-bit, dramatically reducing memory:

```python
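import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Prepare the quantized model for training: freezes the base weights, upcasts
# norm layers to fp32, and enables gradient checkpointing by default
model = prepare_model_for_kbit_training(model)

# Apply LoRA on top of quantized model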
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
# Now fine-tune with the same SFTTrainer setup from Task B
```

VRAM requirements with QLoRA:
- 7B model: ~6 GB
- 13B model: ~10 GB
- 70B model: ~36 GB
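
As a rough sanity check on these figures, the base weights alone take about half a byte per parameter at 4-bit precision. The sketch below computes only that weight footprint. `quantized_weight_gb` is a hypothetical helper, and it deliberately ignores quantization constants, LoRA adapters, optimizer state, and activations, which account for the remainder.

```python
def quantized_weight_gb(params_billions, bits_per_param=4.0):
    """Memory for the base weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9


for size in (7, 13, 70):
    print(f"{size}B: {quantized_weight_gb(size):.1f} GB of 4-bit weights")
```

The gap between these numbers and the totals listed above is the budget left for everything else, so long sequences or large batches can push real usage higher, especially for the 70B case.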

### Task D: Merge and export the fine-tuned model

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Merge adapter weights into base model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.save_pretrained("./merged-model")

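# Convert to GGUF for Ollama/llama.cpp using the converter script from the llama.cpp repository:
# git clone https://github.com/ggml-org/llama.cpp
# pip install -r llama.cpp/requirements.txt
# python llama.cpp/convert_hf_to_gguf.py ./merged-model --outfile model.gguf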
```
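
To make the merge step less opaque, here is a toy illustration of the arithmetic behind it, assuming standard LoRA scaling of `lora_alpha / r`: the merged weight is `W + (alpha / r) * B @ A`. The helpers `matmul` and `merge_lora` are hypothetical and use plain Python lists. This is not how PEFT implements the merge internally, only the same calculation at a tiny scale.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy sizes: out=2, in=3, rank r=1
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, -1.0, 0.0]]   # r x in
B = [[2.0], [1.0]]       # out x r
alpha, r = 2, 1

merged = merge_lora(W, A, B, alpha, r)

# The merged weight gives the same output as base + scaled adapter path.
x = [[1.0], [2.0], [3.0]]
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]

print(merged)
print(matmul(merged, x), unmerged)
```

Because the adapter is folded into the weight matrix, the merged model needs no PEFT dependency at inference time and adds no extra latency.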

### Task E: Prepare a custom dataset

```python
from datasets import Dataset, load_dataset

# Format: instruction-response pairs
data = [
    {"instruction": "Summarize this contract clause.", "input": "...", "output": "..."},
    {"instruction": "Extract the key dates.", "input": "...", "output": "..."},
]

# Create Hugging Face dataset
dataset = Dataset.from_list(data)
dataset = dataset.train_test_split(test_size=0.1)

# Or load from JSONL file
dataset = load_dataset("json", data_files="training_data.jsonl")
```
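
Before loading a JSONL file for training, it can help to write and validate it with the standard library. The sketch below assumes each record needs non-empty `instruction` and `output` strings and treats `input` as optional. The helper names `write_jsonl` and `load_valid_records` are hypothetical, and the record contents are placeholder text.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("instruction", "output")  # "input" is treated as optional here


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def load_valid_records(path):
    """Read a JSONL file, keeping records whose required fields are non-empty strings."""
    valid = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            rec = json.loads(line)
            if all(isinstance(rec.get(k), str) and rec[k].strip() for k in REQUIRED_KEYS):
                valid.append(rec)
    return valid


records = [
    {"instruction": "Summarize this contract clause.", "input": "Clause text", "output": "Summary"},
    {"instruction": "Extract the key dates.", "input": "Contract text", "output": ""},  # dropped: empty output
]

path = os.path.join(tempfile.mkdtemp(), "training_data.jsonl")
write_jsonl(records, path)
clean = load_valid_records(path)
print(f"kept {len(clean)} of {len(records)} records")
```

The cleaned list can then be passed to `Dataset.from_list`, or written back out and loaded with `load_dataset("json", ...)` as shown above.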

## Examples

### Example 1: Fine-tune Llama 3.1 8B for customer support

**User request:** "Fine-tune Llama 8B on our support ticket data"

```python
# Format support tickets as instruction pairs
def format_support(example):
    return (
        f"### Customer Query:\n{example['question']}\n\n"
        f"### Support Response:\n{example['answer']}"
    )

# Use QLoRA so training fits on GPUs with as little as 8 GB VRAM
# Train for 3 epochs with lr=2e-4, rank=16
# Indicative result: ~2 hours on an RTX 4090 (varies with dataset size); adapter size in the tens of MB
```

### Example 2: Domain-adapt a model for medical text

**User request:** "Adapt Mistral 7B to understand medical terminology"

Use continued pre-training with LoRA on a medical corpus, then instruction-tune on medical QA pairs. Set `r=32` for higher capacity on specialized domains.

### Example 3: Fine-tune a 70B model with QLoRA on 2x A100

**User request:** "Fine-tune Llama 70B on our internal documents"

Use QLoRA with `device_map="auto"` to shard across GPUs. Set `per_device_train_batch_size=1` with `gradient_accumulation_steps=16`. Expect ~24 hours for 3 epochs on 10K samples.
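
The batch settings and time estimate above imply a step budget that is easy to work out. The sketch below assumes a single training process (with `device_map="auto"` the model is split across GPUs, so the GPU count does not multiply the batch size) and takes the 24-hour figure at face value.

```python
import math

samples = 10_000
per_device_batch = 1
grad_accum = 16
epochs = 3
wall_clock_hours = 24  # estimate quoted above

# One process drives both GPUs, so the effective batch is batch size x accumulation.
effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(samples / effective_batch)
total_steps = steps_per_epoch * epochs
seconds_per_step = wall_clock_hours * 3600 / total_steps

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
print(f"implied time per step: {seconds_per_step:.1f} s")
```

If measured step time differs noticeably from the implied value, scale the wall-clock estimate accordingly before committing to a long run.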

## Guidelines

- Start with QLoRA if VRAM is limited; it matches LoRA quality in most benchmarks.
- Use rank `r=16` as a default. Increase to `r=32-64` for complex domain adaptation; decrease to `r=8` for simple style tuning.
- Set `lora_alpha = 2 * r` as a starting point, then tune it if training is unstable.
- Target all linear layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`) for best results.
- Use a cosine learning rate scheduler with 3% warmup for stable training.
- Monitor training loss: it should decrease steadily. If it plateaus early, increase rank or learning rate.
- Evaluate on a held-out test set after each epoch to detect overfitting.
- Save checkpoints every epoch; adapter files are small (~30-100 MB).
- Clean, well-formatted training data matters more than quantity. 1,000 high-quality examples often beat 10,000 noisy ones.
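
The cosine schedule with 3% warmup mentioned above can be sketched in a few lines to see what learning rate each step receives. This is an illustrative re-implementation, not the library's code: `cosine_lr` is a hypothetical helper that assumes linear warmup from zero followed by a single half-cosine decay to zero.

```python
import math


def cosine_lr(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, math.ceil(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000
for s in (0, 15, 30, 515, 1000):
    print(f"step {s:4d}: lr = {cosine_lr(s, total):.2e}")
```

The short warmup keeps early updates small while the adapter weights are still near their initialization, and the decay lowers the rate smoothly toward the end of training.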
