# Model Merging: Combining Pre-trained Models
## When to Use This Skill

Model Merging: Combining Pre-trained Models is best used when you need a repeatable AI agent workflow rather than a one-off prompt. Teams using it should expect more consistent output, faster repeated execution, and less prompt rewriting.

**When to use this skill**
- You want a reusable workflow that can be run more than once with consistent structure.

**When not to use this skill**
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
## Installation

### Claude Code / Cursor / Codex

Manual installation:
- Download `SKILL.md` from GitHub
- Place it at `.claude/skills/model-merging/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
## Frequently Asked Questions

**What does this skill do?**
It provides a reusable workflow for merging pre-trained language models with mergekit, covering linear, SLERP, task arithmetic, TIES, and DARE methods.

**Where can I find the source code?**
The full SKILL.md source is reproduced below; mergekit itself lives at https://github.com/arcee-ai/mergekit.
## SKILL.md Source
# Model Merging: Combining Pre-trained Models
## When to Use This Skill
Use Model Merging when you need to:
- **Combine capabilities** from multiple fine-tuned models without retraining
- **Create specialized models** by blending domain-specific expertise (math + coding + chat)
- **Improve performance** beyond single models (often +5-10% on benchmarks)
- **Reduce training costs** - no GPUs needed, merges run on CPU
- **Experiment rapidly** - create new model variants in minutes, not days
- **Preserve multiple skills** - merge without catastrophic forgetting
**Success Stories**: Marcoro14-7B-slerp (a merged model that topped the Open LLM Leaderboard in 02/2024); many leading HuggingFace models are produced by merging
**Tools**: mergekit (Arcee AI), LazyMergekit, Model Soup
## Installation
```bash
# Install mergekit from source
git clone https://github.com/arcee-ai/mergekit.git
cd mergekit
pip install -e .

# Or via pip
pip install mergekit

# Optional: Transformers library for loading merged models
pip install transformers torch
```
## Quick Start
### Simple Linear Merge
```yaml
# config.yml - Merge two models with equal weights
merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.5
dtype: bfloat16
```
```bash
# Run the merge
mergekit-yaml config.yml ./merged-model --cuda

# Smoke-test the merged model
python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('./merged-model')"
```
### SLERP Merge (Best for 2 Models)
```yaml
# config.yml - Spherical interpolation
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.1  # slerp in mergekit expects a base model
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t: 0.5  # Interpolation factor (0 = model1, 1 = model2)
dtype: bfloat16
```
## Core Concepts
### 1. Merge Methods
**Linear (Model Soup)**
- Simple weighted average of parameters
- Fast, works well for similar models
- Can merge 2+ models
```python
merged_weights = w1 * model1_weights + w2 * model2_weights + w3 * model3_weights
# where w1 + w2 + w3 = 1
```
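For intuition, here is a minimal runnable sketch of a linear merge over raw PyTorch state dicts (`linear_merge` is a hypothetical helper; in practice mergekit does this for you, shard by shard):

```python
import torch

def linear_merge(state_dicts, weights):
    """Weighted average of parameter tensors (model soup)."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        # Same-architecture checkpoints share parameter names and shapes
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged
```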
**SLERP (Spherical Linear Interpolation)**
- Interpolates along sphere in weight space
- Preserves magnitude of weight vectors
- Best for merging 2 models
- Smoother than linear
```python
# SLERP formula
merged = (sin((1-t)*θ) / sin(θ)) * model1 + (sin(t*θ) / sin(θ)) * model2
# where θ = arccos(dot(model1, model2))
# t ∈ [0, 1]
```
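The same formula as runnable NumPy, with the usual guard that falls back to linear interpolation when the two vectors are nearly parallel (this sketch treats each model as one flattened weight vector; mergekit applies it tensor by tensor):

```python
import numpy as np

def slerp(v1, v2, t, eps=1e-8):
    """Spherical linear interpolation between weight vectors v1 and v2."""
    n1 = v1 / np.linalg.norm(v1)
    n2 = v2 / np.linalg.norm(v2)
    theta = np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0))
    if theta < eps:  # nearly parallel: SLERP degenerates to LERP
        return (1 - t) * v1 + t * v2
    return (np.sin((1 - t) * theta) * v1 + np.sin(t * theta) * v2) / np.sin(theta)
```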
**Task Arithmetic**
- Extract "task vectors" (fine-tuned - base)
- Combine task vectors, add to base
- Good for merging multiple specialized models
```python
# Task vector
task_vector = finetuned_model - base_model
# Merge multiple task vectors
merged = base_model + α₁*task_vector₁ + α₂*task_vector₂
```
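A minimal sketch of the same idea on state dicts (`task_arithmetic` is a hypothetical helper; assumes all models were fine-tuned from the same base architecture):

```python
import torch

def task_arithmetic(base, finetuned, alphas):
    """merged = base + sum_i alpha_i * (finetuned_i - base)."""
    merged = {name: p.clone().float() for name, p in base.items()}
    for sd, alpha in zip(finetuned, alphas):
        for name in merged:
            merged[name] += alpha * (sd[name].float() - base[name].float())
    return merged
```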
**TIES-Merging**
- Task arithmetic + sparsification
- Resolves sign conflicts in parameters
- Best for merging many task-specific models
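A simplified single-tensor sketch of the three TIES steps (trim, elect sign, disjoint merge); the published method and mergekit's implementation differ in details, so treat this as illustration only:

```python
import torch

def ties_merge_tensor(base, deltas, density=0.5):
    """Merge task-vector deltas for one parameter tensor, TIES-style."""
    trimmed = []
    for d in deltas:
        k = max(1, int(d.numel() * density))  # keep top-`density` fraction by magnitude
        thresh = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= thresh, d, torch.zeros_like(d)))
    stacked = torch.stack(trimmed)
    sign = torch.sign(stacked.sum(dim=0))          # elect the majority sign
    agree = (torch.sign(stacked) == sign).float()  # keep only agreeing deltas
    merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
    return base + merged_delta
```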
**DARE (Drop And REscale)**
- Randomly drops fine-tuned parameters
- Rescales remaining parameters
- Reduces redundancy, maintains performance
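A minimal sketch of the drop-and-rescale step on one task vector, where `density` is taken to mean the fraction of deltas kept (matching mergekit's parameter):

```python
import torch

def dare(delta, density=0.5):
    """Randomly keep `density` of the deltas; rescale so the expected update is unchanged."""
    mask = torch.bernoulli(torch.full_like(delta, density))
    return delta * mask / density
```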
### 2. Configuration Structure
```yaml
# Basic structure
merge_method: <method>    # linear, slerp, ties, dare_ties, task_arithmetic
base_model: <path>        # Optional: base model for task arithmetic
models:
  - model: <path/to/model1>
    parameters:
      weight: <float>     # Merge weight
      density: <float>    # For TIES/DARE
  - model: <path/to/model2>
    parameters:
      weight: <float>
parameters:               # Method-specific parameters
dtype: <dtype>            # bfloat16, float16, float32

# Optional
slices:                   # Layer-wise merging
tokenizer:                # Tokenizer configuration
```
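Before running a merge, a small sanity check on the config can catch weighting mistakes early. A sketch using PyYAML; the file name and the weights-sum-to-1 rule for linear merges are assumptions here, not a mergekit requirement:

```python
import yaml

with open("config.yml") as f:
    cfg = yaml.safe_load(f)

if cfg.get("merge_method") == "linear":
    total = sum(m["parameters"]["weight"] for m in cfg.get("models", []))
    if abs(total - 1.0) > 1e-6:
        print(f"warning: linear merge weights sum to {total:.3f}, not 1.0")
```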
## Merge Methods Guide
### Linear Merge
**Best for**: Simple model combinations, equal weighting
```yaml
merge_method: linear
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      weight: 0.4
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.3
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      weight: 0.3
dtype: bfloat16
```
### SLERP Merge
**Best for**: Two models, smooth interpolation
```yaml
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.1  # slerp in mergekit expects a base model
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t: 0.5  # 0.0 = first model, 1.0 = second model
dtype: bfloat16
```
**Layer-specific SLERP:**
```yaml
merge_method: slerp
base_model: model_a  # slerp in mergekit expects a base model
slices:
  - sources:
      - model: model_a
        layer_range: [0, 32]
      - model: model_b
        layer_range: [0, 32]
parameters:
  t:
    - filter: self_attn  # Attention layers
      value: 0.3
    - filter: mlp        # MLP layers
      value: 0.7
    - value: 0.5         # Default for other layers
dtype: bfloat16
```
### Task Arithmetic
**Best for**: Combining specialized skills
```yaml
merge_method: task_arithmetic
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1       # Math
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B # Chat
    parameters:
      weight: 0.3
  - model: ajibawa-2023/Code-Mistral-7B      # Code
    parameters:
      weight: 0.2
dtype: bfloat16
```
### TIES-Merging
**Best for**: Many models, resolving conflicts
```yaml
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5  # Keep top 50% of parameters
      weight: 1.0
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 1.0
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      density: 0.5
      weight: 1.0
parameters:
  normalize: true
dtype: bfloat16
```
### DARE Merge
**Best for**: Reducing redundancy
```yaml
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5  # Keep 50% of deltas, drop the rest
      weight: 0.6
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 0.4
parameters:
  int8_mask: true  # Use int8 for masks (saves memory)
dtype: bfloat16
```
## Advanced Patterns
### Layer-wise Merging
```yaml
# Different models for different layers
merge_method: passthrough
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 16]   # First half
  - sources:
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [16, 32]  # Second half
dtype: bfloat16
```
### MoE from Merged Models
```yaml
# Create a Mixture of Experts
merge_method: moe
base_model: mistralai/Mistral-7B-v0.1
experts:
  - source_model: WizardLM/WizardMath-7B-V1.1
    positive_prompts:
      - "math"
      - "calculate"
  - source_model: teknium/OpenHermes-2.5-Mistral-7B
    positive_prompts:
      - "chat"
      - "conversation"
  - source_model: ajibawa-2023/Code-Mistral-7B
    positive_prompts:
      - "code"
      - "python"
dtype: bfloat16
```
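If memory serves, MoE configs like the one above are executed with mergekit's separate `mergekit-moe` entry point rather than `mergekit-yaml`; check the mergekit README for the exact invocation. A typical call looks like:

```bash
mergekit-moe config.yml ./merged-moe-model
```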
### Tokenizer Merging
```yaml
merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
  - model: custom/specialized-model
tokenizer:
  source: "union"  # Combine vocabularies from both models
  tokens:
    <|special_token|>:
      source: "custom/specialized-model"
```
## Best Practices
### 1. Model Compatibility
```python
# ✅ Good: Same architecture
models = [
    "mistralai/Mistral-7B-v0.1",
    "teknium/OpenHermes-2.5-Mistral-7B",  # Both Mistral 7B
]

# ❌ Bad: Different architectures
models = [
    "meta-llama/Llama-2-7b-hf",   # Llama
    "mistralai/Mistral-7B-v0.1",  # Mistral (incompatible!)
]
```
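To automate this check, one can compare the model configs before merging. A sketch using transformers' `AutoConfig`; the `model_type` and `hidden_size` attributes hold for Llama/Mistral-style configs:

```python
from transformers import AutoConfig

def check_compatible(model_ids):
    """Raise if the models differ in architecture family or width."""
    configs = [AutoConfig.from_pretrained(m) for m in model_ids]
    ref = configs[0]
    for model_id, cfg in zip(model_ids[1:], configs[1:]):
        if cfg.model_type != ref.model_type or cfg.hidden_size != ref.hidden_size:
            raise ValueError(f"{model_id} is incompatible with {model_ids[0]}")

check_compatible(["mistralai/Mistral-7B-v0.1", "teknium/OpenHermes-2.5-Mistral-7B"])
```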
### 2. Weight Selection
```yaml
# ✅ Good: Weights sum to 1.0
models:
  - model: model_a
    parameters:
      weight: 0.6
  - model: model_b
    parameters:
      weight: 0.4  # 0.6 + 0.4 = 1.0

# ⚠️ Acceptable: Weights don't sum to 1 (for task arithmetic)
models:
  - model: model_a
    parameters:
      weight: 0.8
  - model: model_b
    parameters:
      weight: 0.8  # May boost performance
```
### 3. Method Selection
```python
# Choose merge method based on use case:
# 2 models, smooth blend → SLERP
merge_method = "slerp"
# 3+ models, simple average → Linear
merge_method = "linear"
# Multiple task-specific models → Task Arithmetic or TIES
merge_method = "ties"
# Want to reduce redundancy → DARE
merge_method = "dare_ties"
```
### 4. Density Tuning (TIES/DARE)
```yaml
# Start conservative (keep more parameters)
parameters:
  density: 0.8  # Keep 80%

# If performance holds up, increase sparsity
parameters:
  density: 0.5  # Keep 50%

# If performance degrades, reduce sparsity
parameters:
  density: 0.9  # Keep 90%
```
### 5. Layer-specific Merging
```yaml
# Preserve the base model's beginning and end
merge_method: passthrough
slices:
  - sources:
      - model: base_model
        layer_range: [0, 2]    # Keep first layers
  - sources:
      - model: merged_middle   # Merged middle layers
        layer_range: [2, 30]
  - sources:
      - model: base_model
        layer_range: [30, 32]  # Keep last layers
```
## Evaluation & Testing
### Benchmark Merged Models
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load merged model
model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")

# Test on various tasks
test_prompts = {
    "math": "Calculate: 25 * 17 =",
    "code": "Write a Python function to reverse a string:",
    "chat": "What is the capital of France?",
}

for task, prompt in test_prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    print(f"{task}: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
```
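Beyond spot checks, perplexity on held-out text gives a quick quantitative signal; lower perplexity than either parent model on in-domain text is an encouraging early sign. A minimal sketch (the sample text is a placeholder; use domain-relevant data in practice):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")

text = "The quick brown fox jumps over the lazy dog."  # placeholder sample
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss  # mean causal-LM loss
print(f"perplexity: {torch.exp(loss).item():.2f}")
```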
### Common Benchmarks
- **Open LLM Leaderboard**: General capabilities
- **MT-Bench**: Multi-turn conversation
- **MMLU**: Multitask accuracy
- **HumanEval**: Code generation
- **GSM8K**: Math reasoning
## Production Deployment
### Save and Upload
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load merged model
model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")

# Upload to the HuggingFace Hub (requires a logged-in HF account)
model.push_to_hub("username/my-merged-model")
tokenizer.push_to_hub("username/my-merged-model")
```
### Quantize Merged Model
```bash
# Convert to GGUF (convert.py from the llama.cpp repo)
python convert.py ./merged-model --outtype f16 --outfile merged-model.gguf

# Quantize with GPTQ (script name is illustrative; use a tool such as AutoGPTQ)
python quantize_gptq.py ./merged-model --bits 4 --group_size 128
```
## Common Pitfalls
### ❌ Pitfall 1: Merging Incompatible Models
```yaml
# Wrong: Different architectures
models:
  - model: meta-llama/Llama-2-7b     # Llama architecture
  - model: mistralai/Mistral-7B      # Mistral architecture
```
**Fix**: Only merge models with the same architecture
### ❌ Pitfall 2: Over-weighting One Model
```yaml
# Suboptimal: One model dominates
models:
  - model: model_a
    parameters:
      weight: 0.95  # Too high
  - model: model_b
    parameters:
      weight: 0.05  # Too low
```
**Fix**: Use more balanced weights (0.3-0.7 range)
### ❌ Pitfall 3: Not Evaluating
```bash
# Wrong: Merge and deploy without testing
mergekit-yaml config.yml ./merged-model
# Deploy immediately (risky!)
```
**Fix**: Always benchmark before deploying
## Resources
- **mergekit GitHub**: https://github.com/arcee-ai/mergekit
- **HuggingFace Tutorial**: https://huggingface.co/blog/mlabonne/merge-models
- **LazyMergekit**: Automated merging notebook
- **TIES Paper**: https://arxiv.org/abs/2306.01708
- **DARE Paper**: https://arxiv.org/abs/2311.03099
## See Also
- `references/methods.md` - Deep dive into merge algorithms
- `references/examples.md` - Real-world merge configurations
- `references/evaluation.md` - Benchmarking and testing strategies