PEFT Fine-Tuning

## Overview

25 stars

Best use case

PEFT Fine-Tuning is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

## Overview

Teams using PEFT Fine-Tuning should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/peft-fine-tuning/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/TerminalSkills/skills/peft-fine-tuning/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/peft-fine-tuning/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How PEFT Fine-Tuning Compares

Feature / AgentPEFT Fine-TuningStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

## Overview

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# PEFT Fine-Tuning

## Overview

Fine-tune large language models efficiently using Parameter-Efficient Fine-Tuning (PEFT) methods. Train 7B to 70B parameter models on consumer GPUs (16-48 GB VRAM) using LoRA, QLoRA, and 25+ adapter methods from the Hugging Face PEFT library. Avoid the cost and hardware requirements of full fine-tuning while achieving comparable results.

## Instructions

When a user asks to fine-tune a model, determine the approach:

### Task A: Set up the environment

```bash
pip install torch transformers datasets peft accelerate bitsandbytes trl
# For Flash Attention 2 (recommended for speed)
pip install flash-attn --no-build-isolation
```

Verify GPU availability:
```python
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB")
```

### Task B: Fine-tune with LoRA (16+ GB VRAM)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from trl import SFTTrainer

# 1. Load base model
model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,                          # Rank (8-64; higher = more capacity)
    lora_alpha=32,                 # Scaling factor (usually 2x rank)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13.6M || all params: 8.03B || 0.17%

# 3. Load and format dataset
dataset = load_dataset("your-dataset")

def format_prompt(example):
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"

# 4. Train
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    formatting_func=format_prompt,
    max_seq_length=2048,
)

trainer.train()
trainer.save_model("./lora-adapter")
```

### Task C: Fine-tune with QLoRA (8+ GB VRAM)

QLoRA quantizes the base model to 4-bit, dramatically reducing memory:

```python
from transformers import BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Apply LoRA on top of quantized model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
# Now fine-tune with the same SFTTrainer setup from Task B
```

VRAM requirements with QLoRA:
- 7B model: ~6 GB
- 13B model: ~10 GB
- 70B model: ~36 GB

### Task D: Merge and export the fine-tuned model

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Merge adapter weights into base model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.save_pretrained("./merged-model")

# Convert to GGUF for Ollama/llama.cpp
# pip install llama-cpp-python
# python -m llama_cpp.convert ./merged-model --outfile model.gguf
```

### Task E: Prepare a custom dataset

```python
from datasets import Dataset
import json

# Format: instruction-response pairs
data = [
    {"instruction": "Summarize this contract clause.", "input": "...", "output": "..."},
    {"instruction": "Extract the key dates.", "input": "...", "output": "..."},
]

# Create Hugging Face dataset
dataset = Dataset.from_list(data)
dataset = dataset.train_test_split(test_size=0.1)

# Or load from JSONL file
dataset = load_dataset("json", data_files="training_data.jsonl")
```

## Examples

### Example 1: Fine-tune Llama 3.1 8B for customer support

**User request:** "Fine-tune Llama 8B on our support ticket data"

```python
# Format support tickets as instruction pairs
def format_support(example):
    return (
        f"### Customer Query:\n{example['question']}\n\n"
        f"### Support Response:\n{example['answer']}"
    )

# Use QLoRA for 8GB VRAM GPUs
# Train for 3 epochs with lr=2e-4, rank=16
# Result: ~2 hours on RTX 4090, adapter size ~30 MB
```

### Example 2: Domain-adapt a model for medical text

**User request:** "Adapt Mistral 7B to understand medical terminology"

Use continued pre-training with LoRA on a medical corpus, then instruction-tune on medical QA pairs. Set `r=32` for higher capacity on specialized domains.

### Example 3: Fine-tune a 70B model with QLoRA on 2x A100

**User request:** "Fine-tune Llama 70B on our internal documents"

Use QLoRA with `device_map="auto"` to shard across GPUs. Set `per_device_train_batch_size=1` with `gradient_accumulation_steps=16`. Expect ~24 hours for 3 epochs on 10K samples.

## Guidelines

- Start with QLoRA if VRAM is limited; it matches LoRA quality in most benchmarks.
- Use rank `r=16` as a default. Increase to `r=32-64` for complex domain adaptation; decrease to `r=8` for simple style tuning.
- Always set `lora_alpha = 2 * r` as a starting point.
- Target all linear layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`) for best results.
- Use a cosine learning rate scheduler with 3% warmup for stable training.
- Monitor training loss: it should decrease steadily. If it plateaus early, increase rank or learning rate.
- Evaluate on a held-out test set after each epoch to detect overfitting.
- Save checkpoints every epoch; adapter files are small (~30-100 MB).
- Clean, well-formatted training data matters more than quantity. 1,000 high-quality examples often beat 10,000 noisy ones.

Related Skills

tuning-hyperparameters

25
from ComeOnOliver/skillshub

Optimize machine learning model hyperparameters using grid search, random search, or Bayesian optimization. Finds best parameter configurations to maximize performance. Use when asked to "tune hyperparameters" or "optimize model". Trigger with relevant phrases based on skill purpose.

fathom-cost-tuning

25
from ComeOnOliver/skillshub

Optimize Fathom API usage and plan selection. Trigger with phrases like "fathom cost", "fathom pricing", "fathom plan".

exa-performance-tuning

25
from ComeOnOliver/skillshub

Optimize Exa API performance with search type selection, caching, and parallelization. Use when experiencing slow responses, implementing caching strategies, or optimizing request throughput for Exa integrations. Trigger with phrases like "exa performance", "optimize exa", "exa latency", "exa caching", "exa slow", "exa fast".

evernote-performance-tuning

25
from ComeOnOliver/skillshub

Optimize Evernote integration performance. Use when improving response times, reducing API calls, or scaling Evernote integrations. Trigger with phrases like "evernote performance", "optimize evernote", "evernote speed", "evernote caching".

evernote-cost-tuning

25
from ComeOnOliver/skillshub

Optimize Evernote integration costs and resource usage. Use when managing API quotas, reducing storage usage, or optimizing upload limits. Trigger with phrases like "evernote cost", "evernote quota", "evernote limits", "evernote upload".

elevenlabs-performance-tuning

25
from ComeOnOliver/skillshub

Optimize ElevenLabs TTS latency with model selection, streaming, caching, and audio format tuning. Use when experiencing slow TTS responses, implementing real-time voice features, or optimizing audio generation throughput. Trigger: "elevenlabs performance", "optimize elevenlabs", "elevenlabs latency", "elevenlabs slow", "fast TTS", "reduce elevenlabs latency", "TTS streaming".

documenso-performance-tuning

25
from ComeOnOliver/skillshub

Optimize Documenso integration performance with caching, batching, and efficient patterns. Use when improving response times, reducing API calls, or optimizing bulk document operations. Trigger with phrases like "documenso performance", "optimize documenso", "documenso caching", "documenso batch operations".

documenso-cost-tuning

25
from ComeOnOliver/skillshub

Optimize Documenso usage costs and manage subscription efficiency. Use when analyzing costs, optimizing document usage, or managing Documenso subscription tiers. Trigger with phrases like "documenso costs", "documenso pricing", "optimize documenso spending", "documenso usage".

deepgram-performance-tuning

25
from ComeOnOliver/skillshub

Optimize Deepgram API performance for faster transcription and lower latency. Use when improving transcription speed, reducing latency, or optimizing audio processing pipelines. Trigger: "deepgram performance", "speed up deepgram", "optimize transcription", "deepgram latency", "deepgram faster", "deepgram throughput".

deepgram-cost-tuning

25
from ComeOnOliver/skillshub

Optimize Deepgram costs and usage for budget-conscious deployments. Use when reducing transcription costs, implementing usage controls, or optimizing pricing tier utilization. Trigger: "deepgram cost", "reduce deepgram spending", "deepgram pricing", "deepgram budget", "optimize deepgram usage", "deepgram billing".

databricks-performance-tuning

25
from ComeOnOliver/skillshub

Optimize Databricks cluster and query performance. Use when jobs are running slowly, optimizing Spark configurations, or improving Delta Lake query performance. Trigger with phrases like "databricks performance", "spark tuning", "databricks slow", "optimize databricks", "cluster performance".

databricks-cost-tuning

25
from ComeOnOliver/skillshub

Optimize Databricks costs with cluster policies, spot instances, and monitoring. Use when reducing cloud spend, implementing cost controls, or analyzing Databricks usage costs. Trigger with phrases like "databricks cost", "reduce databricks spend", "databricks billing", "databricks cost optimization", "cluster cost".