gemma-tuner-multimodal

Fine-tune Gemma 4 and 3n models with audio, images, and text on Apple Silicon using PyTorch and Metal Performance Shaders.

by Aradotso

Best use case

gemma-tuner-multimodal is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using gemma-tuner-multimodal should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/gemma-tuner-multimodal/SKILL.md --create-dirs "https://raw.githubusercontent.com/Aradotso/trending-skills/main/skills/gemma-tuner-multimodal/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/gemma-tuner-multimodal/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How gemma-tuner-multimodal Compares

Feature / Agent	gemma-tuner-multimodal	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Fine-tune Gemma 4 and 3n models with audio, images, and text on Apple Silicon using PyTorch and Metal Performance Shaders.

Where can I find the source code?

The skill file is hosted on GitHub in the Aradotso/trending-skills repository. The tool it describes is at https://github.com/mattmireles/gemma-tuner-multimodal.

SKILL.md Source

# Gemma Multimodal Fine-Tuner

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

Fine-tune Gemma 4 and Gemma 3n models on text, images, and audio data entirely on Apple Silicon (MPS), with support for streaming large datasets from GCS/BigQuery without filling local storage.

---

## What It Does

- **Text LoRA**: instruction-tuning or completion fine-tuning from local CSV
- **Image + Text LoRA**: captioning and VQA from local CSV
- **Audio + Text LoRA**: speech-to-text fine-tuning from local CSV, which the project describes as the only Apple-Silicon-native path for this modality
- **Cloud streaming**: train on terabytes from GCS/BigQuery without local copy
- **MPS-native**: no NVIDIA GPU required — runs on MacBook Pro/Air/Mac Studio

---

## Installation

### Prerequisites
- macOS 12.3+ with Apple Silicon (arm64)
- Python 3.10+ (native arm64, not Rosetta)
- Hugging Face account with Gemma access

```bash
# Install Python 3.12 if needed
brew install python@3.12

# Create venv
python3.12 -m venv .venv
source .venv/bin/activate

# Verify arm64 (must show arm64, not x86_64)
python -c "import platform; print(platform.machine())"

# Install PyTorch
pip install torch torchaudio

# Clone and install
git clone https://github.com/mattmireles/gemma-tuner-multimodal
cd gemma-tuner-multimodal
pip install -e .

# For Gemma 4 support (separate venv recommended)
pip install -r requirements/requirements-gemma4.txt
```
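
The prerequisites above can also be checked from Python before installing anything heavy. This is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper written for this example, not part of gemma-tuner-multimodal.

```python
import platform
import sys


def check_environment(machine, python_version, macos_version):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if machine != "arm64":
        problems.append(f"expected arm64, got {machine!r} (Rosetta or Intel Python?)")
    if tuple(python_version[:2]) < (3, 10):
        problems.append("Python 3.10+ is required")
    # macos_version is a string such as "14.5"; it is empty on non-macOS systems.
    parts = [int(p) for p in macos_version.split(".")[:2] if p.isdigit()]
    if not parts:
        problems.append("not running on macOS")
    elif tuple(parts + [0])[:2] < (12, 3):
        problems.append("macOS 12.3+ is required for MPS")
    return problems


if __name__ == "__main__":
    issues = check_environment(
        platform.machine(), sys.version_info, platform.mac_ver()[0]
    )
    print("OK" if not issues else "\n".join(issues))
```

Once PyTorch is installed, `torch.backends.mps.is_available()` is the runtime check that the Metal backend is actually usable.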

### Authenticate with Hugging Face
```bash
huggingface-cli login
# Or set environment variable:
export HF_TOKEN=your_token_here
```

---

## CLI Commands

```bash
# Check system is ready
gemma-macos-tuner system-check

# Guided setup wizard (recommended for first run)
gemma-macos-tuner wizard

# Prepare dataset
gemma-macos-tuner prepare <dataset-profile>

# Fine-tune a model
gemma-macos-tuner finetune <profile> --json-logging

# Evaluate a run
gemma-macos-tuner evaluate <profile-or-run>

# Export merged HF/SafeTensors (merges LoRA when adapter_config.json present)
gemma-macos-tuner export <run-dir-or-profile>

# Build a blacklist of bad samples from recorded errors
gemma-macos-tuner blacklist <profile>

# List training runs
gemma-macos-tuner runs list
```

---

## Configuration (`config/config.ini`)

The config is a hierarchical INI file, layered as defaults → groups → models → datasets → profiles. A profile selects one model and one dataset by name.

```ini
[defaults]
output_dir = output
batch_size = 2
gradient_accumulation_steps = 8
learning_rate = 2e-4
num_train_epochs = 3

[model:gemma-3n-e2b-it]
group = gemma
base_model = google/gemma-3n-E2B-it

[model:gemma-4-e2b-it]
group = gemma
base_model = google/gemma-4-E2B-it

[dataset:my-audio-dataset]
data_dir = data/datasets/my-audio-dataset
audio_column = audio_path
text_column = transcript

[profile:my-audio-profile]
model = gemma-3n-e2b-it
dataset = my-audio-dataset
modality = audio
lora_r = 16
lora_alpha = 32
lora_dropout = 0.05
max_seq_length = 512
```

Set the `GEMMA_TUNER_CONFIG` environment variable to point to a config file outside the repo root:
```bash
export GEMMA_TUNER_CONFIG=/path/to/my/config.ini
```
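
To make the layering concrete, here is a rough sketch of how such a hierarchy could be resolved with the standard library's `configparser`. It assumes that later layers override earlier ones and that group sections are named `group:<name>`. Both are assumptions for illustration, and `resolve_profile` and `config_path` are hypothetical helpers, not the tool's own loader.

```python
import configparser
import os

SAMPLE = """
[defaults]
batch_size = 2
learning_rate = 2e-4

[model:gemma-3n-e2b-it]
group = gemma
base_model = google/gemma-3n-E2B-it

[dataset:my-audio-dataset]
text_column = transcript

[profile:my-audio-profile]
model = gemma-3n-e2b-it
dataset = my-audio-dataset
modality = audio
batch_size = 1
"""


def config_path():
    # GEMMA_TUNER_CONFIG wins; otherwise fall back to the in-repo default.
    return os.environ.get("GEMMA_TUNER_CONFIG", "config/config.ini")


def resolve_profile(parser, profile):
    """Merge sections in the assumed order; later sections override earlier."""
    prof = dict(parser[f"profile:{profile}"])
    model = dict(parser[f"model:{prof['model']}"])
    layers = [
        "defaults",
        f"group:{model.get('group', '')}",  # hypothetical section name
        f"model:{prof['model']}",
        f"dataset:{prof['dataset']}",
        f"profile:{profile}",
    ]
    merged = {}
    for name in layers:
        if parser.has_section(name):
            merged.update(parser[name])
    return merged


parser = configparser.ConfigParser()
parser.read_string(SAMPLE)
settings = resolve_profile(parser, "my-audio-profile")
print(settings["batch_size"], settings["base_model"])
```

In this sketch the profile's `batch_size = 1` replaces the default of 2, while `learning_rate` is inherited unchanged. Note that `configparser` returns every value as a string.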

---

## Modality Configuration

### Text-Only Fine-Tuning

**Instruction tuning** (user/assistant pairs):
```ini
[profile:text-instruction]
model = gemma-3n-e2b-it
dataset = my-text-dataset
modality = text
text_sub_mode = instruction
prompt_column = prompt
text_column = response
max_seq_length = 2048
lora_r = 16
lora_alpha = 32
```

**Completion tuning** (full sequence trained):
```ini
[profile:text-completion]
model = gemma-3n-e2b-it
dataset = my-text-dataset
modality = text
text_sub_mode = completion
text_column = text
max_seq_length = 2048
```

**CSV format** for instruction tuning (`data/datasets/my-text-dataset/train.csv`):
```csv
prompt,response
"What is photosynthesis?","Photosynthesis is the process by which plants..."
"Explain LoRA fine-tuning","LoRA (Low-Rank Adaptation) is a parameter-efficient..."
```
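
Responses often contain commas, quotes, or line breaks, which break hand-written CSV. A small sketch of generating the file with Python's `csv` module instead; the sample rows and the `write_instruction_csv` helper are made up for illustration.

```python
import csv
import io

rows = [
    {"prompt": "What is photosynthesis?",
     "response": 'Plants convert light, water, and CO2 into sugar ("food").'},
    {"prompt": "Explain LoRA fine-tuning",
     "response": "LoRA trains small low-rank matrices.\nThe base weights stay frozen."},
]


def write_instruction_csv(rows, handle):
    # QUOTE_ALL wraps every field in quotes, so embedded commas, quotes,
    # and newlines survive a round trip through a CSV reader.
    writer = csv.DictWriter(handle, fieldnames=["prompt", "response"],
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer and read it back to confirm nothing was lost.
buffer = io.StringIO()
write_instruction_csv(rows, buffer)
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(len(round_trip), "rows written and read back")
```

When writing to disk, open the file with `open(path, "w", newline="", encoding="utf-8")` so the `csv` module controls line endings.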

### Image Fine-Tuning

```ini
[profile:image-caption]
model = gemma-3n-e2b-it
dataset = my-image-dataset
modality = image
image_sub_mode = captioning
image_token_budget = 256
prompt_column = prompt
text_column = caption
max_seq_length = 512
```

**CSV format** (`data/datasets/my-image-dataset/train.csv`):
```csv
image_path,prompt,caption
/data/images/img1.jpg,Describe this image,A dog sitting on a green lawn...
/data/images/img2.jpg,What is shown here,A bar chart showing quarterly revenue...
```

### Audio Fine-Tuning

```ini
[profile:audio-asr]
model = gemma-3n-e2b-it
dataset = my-audio-dataset
modality = audio
audio_column = audio_path
text_column = transcript
max_seq_length = 512
lora_r = 16
lora_alpha = 32
lora_dropout = 0.05
```

**CSV format** (`data/datasets/my-audio-dataset/train.csv`):
```csv
audio_path,transcript
/data/audio/recording1.wav,The patient presents with acute respiratory symptoms
/data/audio/recording2.wav,Counsel objects to the characterization of the evidence
```
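
Before a long run it can help to scan the manifest for rows that point at missing files or have empty transcripts. The following is a sketch with the standard library only; `find_bad_rows` is a hypothetical helper and is independent of the tool's own `blacklist` command. The same check works for image manifests by passing `path_column="image_path"` and `text_column="caption"`.

```python
import csv
import os


def find_bad_rows(csv_path, path_column="audio_path", text_column="transcript"):
    """Return (line_number, reason) pairs for rows that would fail to load."""
    bad = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = {path_column, text_column} - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        for row in reader:
            # Short rows yield None for absent fields, hence the "or ''".
            if not os.path.isfile(row[path_column] or ""):
                bad.append((reader.line_num, "file not found"))
            elif not (row[text_column] or "").strip():
                bad.append((reader.line_num, "empty transcript"))
    return bad
```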

---

## Supported Models

| Model Key | Hugging Face ID | Notes |
|---|---|---|
| `gemma-3n-e2b-it` | `google/gemma-3n-E2B-it` | Default, ~2B instruct |
| `gemma-3n-e4b-it` | `google/gemma-3n-E4B-it` | ~4B instruct |
| `gemma-4-e2b-it` | `google/gemma-4-E2B-it` | Needs requirements-gemma4.txt |
| `gemma-4-e4b-it` | `google/gemma-4-E4B-it` | Needs requirements-gemma4.txt |
| `gemma-4-e2b` | `google/gemma-4-E2B` | Base, needs Gemma 4 stack |
| `gemma-4-e4b` | `google/gemma-4-E4B` | Base, needs Gemma 4 stack |

Add custom models with a `[model:your-name]` section using `group = gemma`.

---

## Dataset Directory Layout

```
data/
└── datasets/
    └── <dataset-name>/
        ├── train.csv       # required
        ├── validation.csv  # optional
        └── test.csv        # optional
```
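
If you start from a single CSV, the three files can be produced with a deterministic split. This is a sketch only: the 80/10/10 fractions, the fixed seed, and the helper names are assumptions chosen for illustration, not defaults of the tool.

```python
import csv
import random
from pathlib import Path


def split_rows(rows, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle deterministically and cut into train/validation/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    return {
        "validation": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }


def write_splits(source_csv, dataset_dir, **kwargs):
    """Read one CSV and write train.csv, validation.csv, and test.csv."""
    with open(source_csv, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)
    out = Path(dataset_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, part in split_rows(rows, **kwargs).items():
        with open(out / f"{name}.csv", "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(part)
```

For example, `write_splits("all.csv", "data/datasets/my-dataset")` would fill the layout shown above, where `all.csv` is a placeholder for your own source file.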

---

## Output Layout

```
output/
└── {run-id}-{profile}/
    ├── metadata.json
    ├── metrics.json
    ├── checkpoint-*/
    └── adapter_model/      # LoRA artifacts
```
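
For scripting, the layout above can be walked with `pathlib`. This sketch relies only on the file and directory names shown in the tree. The contents of `metrics.json` are not documented here, so the file is loaded without interpretation, and `summarize_runs` is a hypothetical helper rather than part of the package.

```python
import json
from pathlib import Path


def summarize_runs(output_dir="output"):
    """Yield one dict per run directory found under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        return
    for run_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        metrics_file = run_dir / "metrics.json"
        yield {
            "run": run_dir.name,
            "has_adapter": (run_dir / "adapter_model").is_dir(),
            "checkpoints": sorted(p.name for p in run_dir.glob("checkpoint-*")),
            # Schema unknown: return the parsed JSON as-is, or None if absent.
            "metrics": json.loads(metrics_file.read_text()) if metrics_file.is_file() else None,
        }


for run in summarize_runs():
    print(run["run"], run["has_adapter"], run["checkpoints"])
```

For interactive use, `gemma-macos-tuner runs list` covers the same ground.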

---

## Python API Examples

### Running Fine-Tuning Programmatically

```python
from gemma_tuner.core.config import load_config
from gemma_tuner.core.ops import run_finetune

# Load config
config = load_config("config/config.ini")

# Run fine-tuning for a profile
run_finetune(profile="my-audio-profile", config=config, json_logging=True)
```

### Using Device Utilities

```python
from gemma_tuner.utils.device import get_device, memory_hint

device = get_device()   # Returns "mps", "cuda", or "cpu"
print(f"Training on: {device}")

hint = memory_hint(model_key="gemma-3n-e2b-it")
print(hint)
```

### Loading and Inspecting Datasets

```python
from gemma_tuner.utils.dataset_utils import load_csv_dataset

train_df, val_df = load_csv_dataset(
    data_dir="data/datasets/my-text-dataset",
    text_column="response",
    prompt_column="prompt"
)
print(f"Train samples: {len(train_df)}, Val samples: {len(val_df)}")
```

### Custom LoRA Config

```python
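from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model onto the Apple GPU; requires Hugging Face access to Gemma.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3n-E2B-it",
    torch_dtype="auto",
    device_map="mps"
)

# Adapt only the attention projections; r sets the adapter rank and
# lora_alpha / r is the scaling applied to the low-rank update.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM"
)

# Wrap the model so that only the LoRA weights are trainable.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()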
```
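
As a rough guide to what `r` and `lora_alpha` cost, the arithmetic behind a LoRA adapter can be worked out by hand. The layer shape below is hypothetical; real Gemma projection sizes differ per layer and per model.

```python
def lora_trainable_params(in_features, out_features, r):
    # LoRA adds A (r x in_features) and B (out_features x r) per adapted layer.
    return r * (in_features + out_features)


def lora_scaling(lora_alpha, r):
    # PEFT's default scaling of the low-rank update is lora_alpha / r.
    return lora_alpha / r


# Hypothetical 2048 x 2048 projection, not an actual Gemma dimension.
print(lora_trainable_params(2048, 2048, r=16))  # 65536
print(lora_scaling(lora_alpha=32, r=16))        # 2.0
```

Doubling `r` doubles the trainable parameters per adapted layer. Doubling `lora_alpha` alongside it, as the `lora_r = 32`, `lora_alpha = 64` profile does, keeps the scaling factor at 2.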

---

## Common Patterns

### Full Workflow: Text Instruction Tuning

```bash
# 1. Prepare your data
mkdir -p data/datasets/my-dataset
cp train.csv data/datasets/my-dataset/
cp validation.csv data/datasets/my-dataset/

# 2. Add profile to config/config.ini
cat >> config/config.ini << 'EOF'
[dataset:my-dataset]
data_dir = data/datasets/my-dataset

[profile:my-text-run]
model = gemma-3n-e2b-it
dataset = my-dataset
modality = text
text_sub_mode = instruction
prompt_column = prompt
text_column = response
max_seq_length = 2048
lora_r = 16
lora_alpha = 32
EOF

# 3. Prepare dataset
gemma-macos-tuner prepare my-dataset

# 4. Fine-tune
gemma-macos-tuner finetune my-text-run --json-logging

# 5. Export merged weights
gemma-macos-tuner export my-text-run
```
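
Steps 3 to 5 can also be chained from Python so that a failing step stops the run. This sketch uses only the CLI subcommands shown above and `subprocess`; the `pipeline_commands` and `run_pipeline` helpers are hypothetical.

```python
import shutil
import subprocess


def pipeline_commands(dataset, profile, cli="gemma-macos-tuner"):
    """Return the CLI invocations from steps 3-5 as argv lists."""
    return [
        [cli, "prepare", dataset],
        [cli, "finetune", profile, "--json-logging"],
        [cli, "export", profile],
    ]


def run_pipeline(dataset, profile):
    for argv in pipeline_commands(dataset, profile):
        print("running:", " ".join(argv))
        # check=True raises CalledProcessError at the first failing step.
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    if shutil.which("gemma-macos-tuner"):
        run_pipeline("my-dataset", "my-text-run")
    else:
        print("gemma-macos-tuner not found on PATH; activate the venv first")
```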

### GCS Streaming for Large Datasets

```ini
[dataset:large-audio-gcs]
source = gcs
gcs_bucket = my-bucket
gcs_prefix = audio-training-data/
audio_column = audio_path
text_column = transcript

[profile:large-audio-run]
model = gemma-3n-e4b-it
dataset = large-audio-gcs
modality = audio
lora_r = 32
lora_alpha = 64
```

Set credentials:
```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
gemma-macos-tuner finetune large-audio-run
```
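
A quick preflight check on the credentials can save a failed start. This sketch assumes a service-account key file, as in the example path above; other credential types would need a different check. `check_gcp_credentials` is a hypothetical helper.

```python
import json
import os


def check_gcp_credentials(env=os.environ):
    """Return an error string, or None if the key file looks usable."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return f"credentials file not found: {path}"
    try:
        with open(path, encoding="utf-8") as handle:
            key = json.load(handle)
    except (OSError, ValueError) as exc:
        return f"cannot read credentials file: {exc}"
    # Service-account key files carry "type": "service_account".
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return "file is not a service-account key (type != 'service_account')"
    return None


print(check_gcp_credentials() or "credentials look OK")
```

This only validates the local file. It does not confirm that the account can read the bucket.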

### Add a Custom Gemma Checkpoint

```ini
[model:my-custom-gemma]
group = gemma
base_model = my-org/my-gemma-checkpoint

[profile:custom-run]
model = my-custom-gemma
dataset = my-dataset
modality = text
text_sub_mode = instruction
```

---

## Troubleshooting

### Wrong architecture (x86_64 instead of arm64)
```bash
python -c "import platform; print(platform.machine())"
# Must be arm64 — if x86_64, reinstall Python natively:
brew install python@3.12
python3.12 -m venv .venv && source .venv/bin/activate
```

### MPS out of memory
- Reduce `batch_size` (try 1)
- Increase `gradient_accumulation_steps` to compensate
- Use a smaller model (`e2b` instead of `e4b`)
- Reduce `max_seq_length`
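
The first two bullets work together: the number of samples per optimizer step is `batch_size × gradient_accumulation_steps`, so lowering one and raising the other keeps it constant. A small worked example, where `rebalance` is a hypothetical helper:

```python
def rebalance(batch_size, gradient_accumulation_steps, new_batch_size):
    """Keep batch_size * gradient_accumulation_steps constant."""
    effective = batch_size * gradient_accumulation_steps
    if effective % new_batch_size:
        raise ValueError("new_batch_size must divide the effective batch size")
    return new_batch_size, effective // new_batch_size


# Values from the [defaults] section: 2 x 8 = 16 samples per optimizer step.
print(rebalance(2, 8, new_batch_size=1))  # (1, 16)
```

With the defaults shown earlier, dropping to `batch_size = 1` means setting `gradient_accumulation_steps = 16`.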

### Gemma 4 model not loading
```bash
# Gemma 4 requires the updated Transformers stack
pip install -r requirements/requirements-gemma4.txt
# Use a separate venv if you also need Gemma 3n
```

### Config not found outside repo root
```bash
export GEMMA_TUNER_CONFIG=/absolute/path/to/config/config.ini
gemma-macos-tuner finetune my-profile
```

### Hugging Face auth errors
```bash
huggingface-cli login
# Or:
export HF_TOKEN=your_hf_token
# Accept Gemma license at: https://huggingface.co/google/gemma-3n-E2B-it
```

### System check before debugging anything else
```bash
gemma-macos-tuner system-check
```

### Audio tower loaded even for text-only runs
This is a known v1 issue — USM audio tower weights stay in memory even for `modality = text`. See `README/KNOWN_ISSUES.md`. Workaround: use a smaller model variant to stay within RAM budget.

---

## Architecture Reference

| File | Role |
|---|---|
| `gemma_tuner/cli_typer.py` | Main CLI entrypoint (`gemma-macos-tuner`) |
| `gemma_tuner/core/ops.py` | Dispatches prepare/finetune/evaluate/export |
| `gemma_tuner/scripts/finetune.py` | Router: Gemma models → `models/gemma/finetune.py` |
| `gemma_tuner/models/gemma/finetune.py` | Core training loop with LoRA |
| `gemma_tuner/scripts/export.py` | Merges LoRA → HF/SafeTensors tree |
| `gemma_tuner/utils/device.py` | MPS/CUDA/CPU selection and memory hints |
| `gemma_tuner/utils/dataset_utils.py` | CSV loading, blacklist/protection semantics |
| `gemma_tuner/wizard/` | Interactive CLI wizard (questionary + Rich) |
| `config/config.ini` | Hierarchical INI configuration |

Related Skills

gemma-gem-browser-ai

from Aradotso/trending-skills

Build and extend Gemma Gem, an on-device AI browser assistant Chrome extension running Google's Gemma 4 model via WebGPU with no cloud dependencies.

zeroboot-vm-sandbox

from Aradotso/trending-skills

Sub-millisecond VM sandboxes for AI agents using copy-on-write KVM forking via Zeroboot

yourvpndead-vpn-detection

from Aradotso/trending-skills

Android app that detects VPN/proxy servers (VLESS/xray/sing-box) via local SOCKS5 vulnerability, exposing exit IPs and server configs without root

xata-postgres-platform

from Aradotso/trending-skills

Expert skill for Xata open-source cloud-native Postgres platform with copy-on-write branching, scale-to-zero, and Kubernetes deployment

x-mentor-skill-nuwa

from Aradotso/trending-skills

AI-powered X (Twitter) content strategy skill that distills methodologies from 6 top creators + open-source algorithm data into actionable writing, growth, and monetization guidance.

wx-favorites-report

from Aradotso/trending-skills

End-to-end pipeline to extract, decrypt, and visualize WeChat Mac favorites from encrypted SQLite DB into an interactive HTML report.

wterm-web-terminal

from Aradotso/trending-skills

Web terminal emulator with Zig/WASM core, DOM rendering, and React/vanilla JS bindings

worldmonitor-intelligence-dashboard

from Aradotso/trending-skills

Real-time global intelligence dashboard with AI-powered news aggregation, geopolitical monitoring, and infrastructure tracking

witr-process-inspector

from Aradotso/trending-skills

CLI and TUI tool that explains why processes, services, and ports are running by tracing causality chains across supervisors, containers, and shells.

wildworld-dataset

from Aradotso/trending-skills

WildWorld large-scale action-conditioned world modeling dataset with 108M+ frames from a photorealistic ARPG game, featuring per-frame annotations, 450+ actions, and explicit state information for generative world modeling research.

whatcable-macos-usb-inspector

from Aradotso/trending-skills

macOS menu bar app that identifies USB-C cable capabilities and charging diagnostics using IOKit