flash-moe-inference
Run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming
Best use case
flash-moe-inference is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming
Teams using flash-moe-inference should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/flash-moe-inference/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How flash-moe-inference Compares
| Feature / Agent | flash-moe-inference | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Marketing
Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.
AI Agents for Startups
Explore AI agent skills for startup validation, product research, growth experiments, documentation, and fast execution with small teams.
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
SKILL.md Source
# Flash-MoE Inference Engine
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
Flash-MoE is a pure C/Objective-C/Metal inference engine that runs **Qwen3.5-397B-A17B** (397B parameter Mixture-of-Experts) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand — no Python, no ML frameworks, just C, Objective-C, and hand-tuned Metal shaders.
## Requirements
- **Hardware**: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free
- **OS**: macOS 26+ (Darwin 25+)
- **Tools**: Xcode Command Line Tools, Python 3.x (for weight extraction only)
- **Model**: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace)
## Installation & Build
```bash
# Clone the repo
git clone https://github.com/danveloper/flash-moe
cd flash-moe/metal_infer
# Build everything
make
# Verify build artifacts
ls infer chat main
```
The Makefile compiles `infer.m`, `chat.m`, `main.m` with Metal shader compilation for `shaders.metal`.
## Weight Preparation
### Step 1: Extract non-expert weights
```bash
# From the metal_infer/ directory
# Point to your downloaded Qwen3.5-397B safetensors directory
python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/
# Produces:
# model_weights.bin (~5.5GB, mmap'd at runtime)
# model_weights.json (tensor manifest)
# vocab.bin (vocabulary)
# tokenizer.bin (BPE tokenizer data)
```
### Step 2: Pack expert weights (4-bit, production)
```bash
# From repo root
python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/
# Produces packed_experts/ directory (~209GB)
# Each expert is a separate file: layer_XX_expert_YYYY.bin
```
### Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)
```bash
# Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)
python3 metal_infer/repack_experts_2bit.py \
metal_infer/packed_experts/ \
metal_infer/packed_experts_2bit/
```
## Key Commands
### Basic inference
```bash
cd metal_infer
# 4-bit inference (production quality, tool calling works)
./infer --prompt "Explain quantum computing" --tokens 100
# 2-bit inference (faster, breaks JSON/tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit
# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
```
### Interactive chat with tool calling
```bash
./chat
# Opens TUI with full tool calling support
# Uses 4-bit experts by default
```
### MoE-only benchmark (measures expert throughput)
```bash
./main
# Runs pure expert forward-pass benchmark
# Reports tokens/sec without attention overhead
```
## Project Structure
```
flash-moe/
├── paper/
│ └── flash_moe.pdf # Full technical paper
├── metal_infer/
│ ├── infer.m # Complete inference engine (~7000 lines)
│ ├── shaders.metal # Metal compute kernels (~1200 lines)
│ ├── chat.m # Interactive chat TUI
│ ├── tokenizer.h # Single-header C BPE tokenizer (449 lines)
│ ├── main.m # MoE-only benchmark
│ ├── Makefile
│ ├── extract_weights.py # Safetensors → model_weights.bin
│ ├── repack_experts_2bit.py # 4-bit → 2-bit requantization
│ ├── train_predictor.py # Expert routing prediction analysis
│ ├── model_weights.bin # Non-expert weights (mmap'd)
│ ├── model_weights.json # Tensor manifest
│ ├── vocab.bin
│ ├── tokenizer.bin
│ ├── packed_experts/ # 4-bit expert files (209GB)
│ └── packed_experts_2bit/ # 2-bit expert files (120GB, optional)
├── repack_experts.py # 4-bit expert packing from safetensors
├── progress.py # Results visualization
└── results.tsv # Experiment log
```
## Architecture Overview
The model has **60 transformer layers**:
- 45 GatedDeltaNet (linear attention) layers
- 15 standard full attention layers
- Each layer: 512 experts, K=4 activated per token + 1 shared expert
- Hidden dimension: 4096
### Per-layer pipeline (4.28ms average at 4-bit)
```
CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU]
→ CPU: flush results [0.01ms CPU]
→ CMD2: o_proj + norm + routing + shared [0.55ms GPU]
→ CPU: softmax + topK routing [0.003ms]
→ I/O: parallel pread K=4 experts [2.41ms SSD]
→ CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]
```
## Metal Shader Kernels
The `shaders.metal` file contains hand-written kernels. Key kernels:
```metal
// 4-bit dequantized matrix-vector multiply (FMA-optimized)
// Key insight: fma(nibble, scale*x, bias*x) instead of (nibble*scale + bias)*x
// Pre-compute scale*x and bias*x to fuse dequant+multiply in one FMA instruction
kernel void matvec_4bit_fma(
device const uint8_t* weights [[buffer(0)]],
device const float* scales [[buffer(1)]],
device const float* biases [[buffer(2)]],
device const float* x [[buffer(3)]],
device float* out [[buffer(4)]],
uint tid [[thread_position_in_threadgroup]],
uint gid [[threadgroup_position_in_grid]])
{
// ... tiled SIMD-reduced FMA kernel
// 12% faster than naive (nibble * scale + bias) * x
}
// Fused SwiGLU activation
kernel void swiglu(device float* gate [[buffer(0)]],
device const float* up [[buffer(1)]],
uint gid [[thread_position_in_grid]])
{
float g = gate[gid];
gate[gid] = (g / (1.0f + exp(-g))) * up[gid];
}
// RMS normalization (two-pass)
kernel void rms_norm_pass1(...) // sum of squares reduction
kernel void rms_norm_pass2(...) // apply normalization
// GPU RoPE (fused with Q deinterleave and K normalization)
kernel void rope_qk(...)
// MoE combine + residual + sigmoid gate (fused)
kernel void moe_combine_residual(...)
```
## SSD Expert Streaming Pattern
The core innovation — loading only K=4 active experts per layer from SSD:
```objc
// Parallel expert loading using GCD dispatch groups
// From infer.m (conceptual pattern)
dispatch_group_t group = dispatch_group_create();
dispatch_queue_t ioQueue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);
for (int k = 0; k < K_EXPERTS; k++) {
int expert_id = top_k_indices[k];
dispatch_group_async(group, ioQueue, ^{
// Each expert: ~6.75MB at 4-bit
char path[256];
snprintf(path, sizeof(path),
"packed_experts/layer_%02d_expert_%04d.bin",
layer, expert_id);
int fd = open(path, O_RDONLY);
// pread() — non-blocking, OS page cache handles LRU
pread(fd, expert_buffer[k], expert_size, 0);
close(fd);
});
}
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
// GPU compute follows — serial pipeline is hardware-optimal on Apple Silicon
```
**Why `pread()` not `mmap()`**: mmap incurs per-page fault overhead on cold data (~5x slower). Direct `pread()` with OS page cache achieves ~71% hit rate naturally.
## GatedDeltaNet Linear Attention (BLAS)
The recurrence update uses Accelerate BLAS — 64% faster than scalar:
```objc
// GatedDeltaNet state update per head (conceptual pattern)
// state: 128×128 float matrix, 64 heads
// From infer.m
#import <Accelerate/Accelerate.h>
for (int h = 0; h < 64; h++) {
float* S = state + h * 128 * 128; // 128×128 state matrix
float* q = Q + h * 128;
float* k = K + h * 128;
float* v = V + h * 128;
// β·(k⊗v) outer product update
// cblas_sger: S += beta * (k ⊗ v)
cblas_sger(CblasRowMajor, 128, 128,
beta[h], k, 1, v, 1, S, 128);
// Decay: S = alpha * S
cblas_sscal(128 * 128, alpha[h], S, 1);
// Output: o = S @ q
cblas_sgemv(CblasRowMajor, CblasNoTrans,
128, 128, 1.0f, S, 128, q, 1, 0.0f,
output + h * 128, 1);
}
```
## Performance Configuration
### 4-bit (production default)
- **Quality**: Excellent — full tool calling, correct JSON
- **Speed**: 4.36 tok/s
- **Disk**: 209GB
### 2-bit (speed testing only)
- **Quality**: Good — but breaks JSON/tool calling (`\name\` instead of `"name"`)
- **Speed**: 5.74 tok/s (7.05 peak single token with warm cache)
- **Disk**: 120GB
- Uses `F_NOCACHE` flag to avoid page cache thrashing
## What NOT to Try (Learned from 58 Experiments)
| Approach | Why it fails |
|----------|-------------|
| `mmap()` expert files | Per-page fault overhead: 5x slower than `pread()` |
| `dispatch_io` | `dispatch_data` management overhead: -70% |
| `F_RDADVISE` prefetch | SSD DMA + GPU share memory controller — concurrent access: -73% GPU speed |
| Custom Metal LRU cache | GPU memory pressure: -38% vs OS page cache |
| LZ4 expert compression | Decompress overhead > warm cache savings: -13% |
| Temporal expert prediction | 25% hit rate, wastes SSD bandwidth: -18% |
| Speculative early routing | Cache pollution: -38% |
| MTP speculative decoding | MoE I/O scales per-token (unlike dense models): break-even |
| Spin-poll GPU wait | CPU thermal throttle competes with GPU: -23% |
| Parallel SSD + GPU overlap | Unified memory controller arbitration: net negative |
**Key principle**: On Apple Silicon, GPU DMA and SSD DMA share the same memory controller. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.
## Troubleshooting
### Build fails
```bash
# Ensure Xcode CLI tools are installed
xcode-select --install
# Check Metal compiler is available
xcrun -sdk macosx metal --version
```
### Out of memory
The engine is designed to use ~6GB active:
- 5.5GB: `model_weights.bin` (mmap'd, read-only)
- ~200MB: Metal scratch buffers
- Remaining ~42GB: OS page cache for expert data
If you see OOM, check for other processes consuming unified memory:
```bash
sudo memory_pressure
vm_stat
```
### Slow performance
```bash
# Check SSD speed — needs ~17GB/s for target performance
# Run with timing to identify bottleneck
./infer --prompt "Hello" --tokens 5 --timing
# Verify packed_experts/ is on internal SSD, not external drive
diskutil info /
```
### Wrong expert directory
```bash
# Default paths expected by infer.m:
# metal_infer/packed_experts/ (4-bit)
# metal_infer/packed_experts_2bit/ (2-bit)
# Ensure you're running from metal_infer/ directory
cd metal_infer
./infer --prompt "test"
```
### Tool calling broken
Use 4-bit, not 2-bit. The 2-bit quantization corrupts quote characters in JSON output, making tool calling unreliable. Always use the default 4-bit configuration for agentic workloads.
## Memory Safety
The engine explicitly manages all allocations:
- No unbounded caches
- Expert data never accumulates in GPU memory
- `model_weights.bin` is mmap'd read-only — kernel manages pages
- Expert files are opened/read/closed per inference stepRelated Skills
mlx-local-inference
Use when calling local AI on this Mac — text generation, embeddings, speech-to-text, OCR, or image understanding. LLM/VLM via oMLX gateway at localhost:8000/v1. Embedding/ASR/OCR via Python libraries (mlx-lm, mlx-vlm, mlx-audio). Works offline. Use instead of cloud APIs for privacy or low latency.
astrai-inference-router
Route all LLM calls through Astrai for 40%+ cost savings with intelligent routing and privacy controls
---
name: article-factory-wechat
humanizer
Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases.
find-skills
Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.
tavily-search
Use Tavily API for real-time web search and content extraction. Use when: user needs real-time web search results, research, or current information from the web. Requires Tavily API key.
baidu-search
Search the web using Baidu AI Search Engine (BDSE). Use for live information, documentation, or research topics.
agent-autonomy-kit
Stop waiting for prompts. Keep working.
Meeting Prep
Never walk into a meeting unprepared again. Your agent researches all attendees before calendar events—pulling LinkedIn profiles, recent company news, mutual connections, and conversation starters. Generates a briefing doc with talking points, icebreakers, and context so you show up informed and confident. Triggered automatically before meetings or on-demand. Configure research depth, advance timing, and output format. Walking into meetings blind is amateur hour—missed connections, generic small talk, zero leverage. Use when setting up meeting intelligence, researching specific attendees, generating pre-meeting briefs, or automating your prep workflow.
self-improvement
Captures learnings, errors, and corrections to enable continuous improvement. Use when: (1) A command or operation fails unexpectedly, (2) User corrects Claude ('No, that's wrong...', 'Actually...'), (3) User requests a capability that doesn't exist, (4) An external API or tool fails, (5) Claude realizes its knowledge is outdated or incorrect, (6) A better approach is discovered for a recurring task. Also review learnings before major tasks.
botlearn-healthcheck
botlearn-healthcheck — BotLearn autonomous health inspector for OpenClaw instances across 5 domains (hardware, config, security, skills, autonomy); triggers on system check, health report, diagnostics, or scheduled heartbeat inspection.
linkedin-cli
A bird-like LinkedIn CLI for searching profiles, checking messages, and summarizing your feed using session cookies.