mlx-local-inference

Use when calling local AI on this Mac — text generation, embeddings, speech-to-text, OCR, or image understanding. LLM/VLM via oMLX gateway at localhost:8000/v1. Embedding/ASR/OCR via Python libraries (mlx-lm, mlx-vlm, mlx-audio). Works offline. Use instead of cloud APIs for privacy or low latency.

3,891 stars

Best use case

mlx-local-inference is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using mlx-local-inference should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/mlx-local-inference/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/bendusy/mlx-local-inference/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/mlx-local-inference/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How mlx-local-inference Compares

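| Feature / Agent | mlx-local-inference | Standard Approach |
|-----------------|---------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |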

Frequently Asked Questions

What does this skill do?

It routes AI calls to models running locally on an Apple Silicon Mac. LLM and VLM requests go through the oMLX gateway at localhost:8000/v1, while embedding, speech-to-text, and OCR run directly through Python libraries (mlx-lm, mlx-audio, mlx-vlm). Everything works offline, which makes it a substitute for cloud APIs when privacy or low latency matters.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# MLX Local Inference Stack

Local AI inference on Apple Silicon. **oMLX** handles LLM/VLM with continuous batching.
Python libraries handle Embedding/ASR/OCR directly via `uv`.

## Architecture

```
┌─────────────────────────────────────┐
│  oMLX (localhost:8000/v1)           │
│  - LLM (Qwen3.5-35B, etc.)          │
│  - VLM (vision-language models)     │
│  - Continuous batching + SSD cache  │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│  Python Libraries (via uv run)      │
│  - mlx-lm: Embedding                │
│  - mlx-vlm: OCR (PaddleOCR-VL)      │
│  - mlx-audio: ASR (Qwen3-ASR)       │
└─────────────────────────────────────┘
```

## Models

| Capability | Implementation | Model | Size |
|-----------|---------------|-------|------|
| 💬 LLM | oMLX API | `Qwen3.5-35B-A3B-4bit` | ~20 GB |
| 👁️ VLM | oMLX API | Any mlx-vlm model | varies |
| 📐 Embed | mlx-lm (uv) | `Qwen3-Embedding-0.6B-4bit-DWQ` | ~1 GB |
| 🎤 ASR | mlx-audio (uv) | `Qwen3-ASR-1.7B-8bit` | ~1.5 GB |
| 👁️ OCR | mlx-vlm (uv) | `PaddleOCR-VL-1.5-6bit` | ~3.3 GB |

## Usage

### LLM / Vision-Language (via oMLX API)

```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

# Text generation
resp = client.chat.completions.create(
    model="Qwen3.5-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Hello"}]
)
print(resp.choices[0].message.content)
```
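
The heading above also covers vision-language models, but the example only sends text. The following is a minimal sketch of an image request, assuming oMLX accepts OpenAI-style `image_url` content parts carrying a base64 data URL (the source does not confirm this). The helper names `build_vision_messages` and `describe_image` and the model id `my-vlm-model` are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def build_vision_messages(image_path: str, question: str) -> list[dict]:
    """Build OpenAI-style chat messages containing one inline base64 image."""
    path = Path(image_path).expanduser()
    # Fall back to JPEG when the extension is not recognised
    mime = mimetypes.guess_type(path.name)[0] or "image/jpeg"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }]


def describe_image(image_path: str, question: str, model: str = "my-vlm-model") -> str:
    """Send the image to the local gateway (assumed OpenAI-compatible for images)."""
    from openai import OpenAI  # imported lazily so the helper above has no dependencies

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
    resp = client.chat.completions.create(
        model=model,  # hypothetical id: use one reported by /v1/models
        messages=build_vision_messages(image_path, question),
    )
    return resp.choices[0].message.content
```

Calling `describe_image("photo.jpg", "Describe this image.")` would then return the model's answer as a string, provided a VLM is loaded in oMLX under the given id.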

---

### Embeddings (via mlx-lm + uv)

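```bash
uv run --with mlx-lm python -c "
import os
import mlx.core as mx
from mlx_lm import load

# load() does not expand '~', so resolve the path first
model, tokenizer = load(os.path.expanduser('~/models/Qwen3-Embedding-0.6B-4bit-DWQ'))
text = 'text to embed'
tokens = mx.array([tokenizer.encode(text)])
# Calling the model directly returns logits; the inner transformer returns hidden states.
# Assumes a Qwen3-style mlx-lm model that exposes it as model.model.
hidden = model.model(tokens)
embeddings = hidden.mean(axis=1)  # mean-pool over the token dimension
print(embeddings.shape)
"
```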
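
Embedding vectors are usually compared with cosine similarity. The sketch below shows that comparison in plain Python on toy 3-dimensional vectors. The helper names are hypothetical, and it assumes each embedding has first been converted to a flat list of floats (for an MLX array, something like `embeddings[0].tolist()`).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / norm


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
print(rank_by_similarity(query, docs))  # prints [1, 0, 2]
```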

---

### ASR — Speech-to-Text (via mlx-audio + uv)

> **Important:** Must run with `--python 3.11` to avoid OpenMP threading issues (`SIGSEGV`).

```bash
uv run --python 3.11 --with mlx-audio python -m mlx_audio.stt.generate \
  --model ~/models/Qwen3-ASR-1.7B-8bit \
  --audio "audio.wav" \
  --output-path /tmp/asr_result \
  --format txt \
  --language zh \
  --verbose
```
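
When transcription has to be triggered from a script rather than a terminal, the same command can be wrapped with `subprocess`. This is a sketch only: the flags are copied from the invocation above (minus `--verbose`), the function names are hypothetical, and it makes no assumption about the name of the file the CLI writes.

```python
import os
import subprocess


def build_asr_command(audio, output_path, language="zh",
                      model="~/models/Qwen3-ASR-1.7B-8bit"):
    """Mirror the CLI invocation above as an argument list (no shell involved)."""
    return [
        "uv", "run", "--python", "3.11", "--with", "mlx-audio",
        "python", "-m", "mlx_audio.stt.generate",
        "--model", os.path.expanduser(model),  # no shell here, so expand '~' ourselves
        "--audio", audio,
        "--output-path", output_path,
        "--format", "txt",
        "--language", language,
    ]


def transcribe(audio, output_path, language="zh"):
    """Run the ASR command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_asr_command(audio, output_path, language), check=True)
```

Passing the arguments as a list avoids shell quoting problems with audio file names that contain spaces.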

---

### OCR (via mlx-vlm + uv)

> **Important:** The `generate` function parameter order must be `(model, processor, prompt, image)`.

```bash
cat << 'PY_EOF' > run_ocr.py
import os
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = os.path.expanduser("~/models/PaddleOCR-VL-1.5-6bit")
model, processor = load(model_path)
prompt = apply_chat_template(processor, config=model.config, prompt="OCR:", num_images=1)

output = generate(model, processor, prompt, "document.jpg", max_tokens=512, temp=0.0)
print(output.text)
PY_EOF

uv run --python 3.11 --with mlx-vlm python run_ocr.py
```

---

## Service Management (oMLX only)

```bash
# Check running models
curl http://localhost:8000/v1/models

# Restart oMLX
launchctl kickstart -k gui/$(id -u)/com.omlx-server
```
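
The `curl` check can also be done from Python before sending requests. The sketch below assumes `/v1/models` returns the usual OpenAI-style list shape (`{"data": [{"id": ...}, ...]}`), which the source does not show; the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload.get("data", []) if "id" in m]


def list_local_models(base_url="http://localhost:8000/v1", timeout=3.0):
    """Return the model ids served by the gateway, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return parse_model_ids(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON
        return None
```

A `None` result suggests the server is down and may need the `launchctl kickstart` restart shown above.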

## Model Storage Strategy

**All models are stored in `~/models/` using an oMLX-compatible structure:**

```
~/models/
├── Qwen3-Embedding-0.6B-4bit-DWQ/
├── Qwen3-ASR-1.7B-8bit/
├── PaddleOCR-VL-1.5-6bit/
└── Qwen3.5-35B-A3B-4bit/
```
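
A quick way to confirm this layout before running anything is to check that each directory exists. This is a small sketch with a hypothetical helper name; the directory names are taken from the tree above.

```python
from pathlib import Path

# Directory names from the layout above
EXPECTED_MODELS = [
    "Qwen3-Embedding-0.6B-4bit-DWQ",
    "Qwen3-ASR-1.7B-8bit",
    "PaddleOCR-VL-1.5-6bit",
    "Qwen3.5-35B-A3B-4bit",
]


def missing_models(root="~/models", expected=EXPECTED_MODELS):
    """Return the expected model directories that are absent under root."""
    base = Path(root).expanduser()
    return [name for name in expected if not (base / name).is_dir()]
```

`missing_models()` returns an empty list when every model directory is present.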

## Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- `uv` installed (`curl -LsSf https://astral.sh/uv/install.sh | sh`)

Related Skills

ocr-local

3891
from openclaw/skills

Extract text from images using Tesseract.js OCR (100% local, no API key required). Supports Chinese (simplified/traditional) and English.

General Utilities

local-researcher

3891
from openclaw/skills

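A fully local deep-research assistant skill. Uses a local LLM via Ollama or LMStudio to run iterative web research and produce Markdown reports with cited sources. Triggered when the user needs privacy-first research, local document analysis, or a structured research report.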

release-note-localizer

3891
from openclaw/skills

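Converts release notes into Chinese, English, customer-facing, and technical versions while keeping terminology consistent. Use for localization, release-notes, and translation workflows; do not use for machine-translating sensitive contract clauses or as a substitute for professional legal translation.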

astrai-inference-router

3891
from openclaw/skills

Route all LLM calls through Astrai for 40%+ cost savings with intelligent routing and privacy controls

local-whisper

3891
from openclaw/skills

Local speech-to-text using OpenAI Whisper. Runs fully offline after model download. High quality transcription with multiple model sizes.

local-stt

3891
from openclaw/skills

Local STT with selectable backends - Parakeet (best accuracy) or Whisper (fastest, multilingual).

local-qrcode

3891
from openclaw/skills

Generate QR codes locally from text/URL to PNG image or ASCII art. Pure local generation using qrcode library. No API key required. Use when users need to create QR codes for links, text, or any content.

local-password

3891
from openclaw/skills

Generate secure random passwords and check password strength. Supports custom length and character types (uppercase, lowercase, numbers, symbols). Pure local operation, no external dependencies. Use when users need to generate new secure passwords or check password strength.

agent-browser-local

3891
from openclaw/skills

Headless browser automation CLI optimized for AI agents with accessibility tree snapshots and ref-based element selection

cloud-local-bridge

3891
from openclaw/skills

Provides a two-way communication bridge between cloud OpenClaw and local OpenClaw. Supports natural-language pairing, command execution, and file sync.

local-bookmark-librarian

3891
from openclaw/skills

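Deduplicates and recategorizes locally exported bookmark or link lists, producing a topic index and maintenance suggestions. Use for bookmarks, links, and knowledge workflows; do not use for directly modifying browser configuration or deleting links the user has not confirmed.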

local-rag-index-planner

3880
from openclaw/skills

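Plans the directory layout, chunking granularity, naming, update schedule, and access boundaries of a local knowledge base, rather than piling on RAG directly. Use for RAG, indexing, and knowledge workflows; do not use for directly deploying a vector database or ignoring permission isolation.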