mlx-local-inference
Use when calling local AI on this Mac — text generation, embeddings, speech-to-text, OCR, or image understanding. LLM/VLM via oMLX gateway at localhost:8000/v1. Embedding/ASR/OCR via Python libraries (mlx-lm, mlx-vlm, mlx-audio). Works offline. Use instead of cloud APIs for privacy or low latency.
Best use case
mlx-local-inference is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using mlx-local-inference should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/mlx-local-inference/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How mlx-local-inference Compares
| Feature / Agent | mlx-local-inference | Standard Approach |
|---|---|---|
| Platform Support | Apple Silicon Mac (M1/M2/M3/M4) | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It runs AI tasks locally on this Mac: text generation, embeddings, speech-to-text, OCR, and image understanding. LLM and VLM requests go through the oMLX gateway at localhost:8000/v1; embedding, ASR, and OCR run through Python libraries (mlx-lm, mlx-vlm, mlx-audio). It works offline and can replace cloud APIs when privacy or low latency matters.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# MLX Local Inference Stack
Local AI inference on Apple Silicon. **oMLX** handles LLM/VLM with continuous batching.
Python libraries handle Embedding/ASR/OCR directly via `uv`.
## Architecture
```
┌─────────────────────────────────────┐
│ oMLX (localhost:8000/v1) │
│ - LLM (Qwen3.5-35B, etc.) │
│ - VLM (vision-language models) │
│ - Continuous batching + SSD cache │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Python Libraries (via uv run) │
│ - mlx-lm: Embedding │
│ - mlx-vlm: OCR (PaddleOCR-VL) │
│ - mlx-audio: ASR (Qwen3-ASR) │
└─────────────────────────────────────┘
```
## Models
| Capability | Implementation | Model | Size |
|-----------|---------------|-------|------|
| 💬 LLM | oMLX API | `Qwen3.5-35B-A3B-4bit` | ~20 GB |
| 👁️ VLM | oMLX API | Any mlx-vlm model | varies |
| 📐 Embed | mlx-lm (uv) | `Qwen3-Embedding-0.6B-4bit-DWQ` | ~1 GB |
| 🎤 ASR | mlx-audio (uv) | `Qwen3-ASR-1.7B-8bit` | ~1.5 GB |
| 👁️ OCR | mlx-vlm (uv) | `PaddleOCR-VL-1.5-6bit` | ~3.3 GB |
## Usage
### LLM / Vision-Language (via oMLX API)
```python
from openai import OpenAI

# Point the OpenAI client at the local oMLX gateway
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

# Text generation
resp = client.chat.completions.create(
    model="Qwen3.5-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```
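The heading above also covers vision-language models, while the example only sends text. A minimal sketch of an image request follows, assuming oMLX accepts the OpenAI-style `image_url` content part with a base64 data URL (this document does not confirm it). The helper names and the `your-vlm-model` placeholder are hypothetical.

```python
import base64


def image_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_vision_messages(question: str, image_bytes: bytes) -> list:
    """Build one user message holding a text part and an image part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_data_url(image_bytes)}},
        ],
    }]


def describe_image(path: str, model: str = "your-vlm-model") -> str:
    """Send an image to the local gateway. Needs a running oMLX server."""
    from openai import OpenAI  # imported lazily so the helpers work without it

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
    with open(path, "rb") as f:
        messages = build_vision_messages("Describe this image.", f.read())
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content
```

Replace `your-vlm-model` with an ID reported by `curl http://localhost:8000/v1/models`.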
---
### Embeddings (via mlx-lm + uv)
```bash
uv run --with mlx-lm python -c "
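import os
import mlx.core as mx
from mlx_lm import load
# Expand ~ in Python: the shell leaves it untouched inside this quoted script
model, tokenizer = load(os.path.expanduser('~/models/Qwen3-Embedding-0.6B-4bit-DWQ'))
text = 'text to embed'
# Token IDs as a (1, seq_len) batch
input_ids = mx.array([tokenizer.encode(text)])
# Assumption: the backbone is exposed as model.model and returns hidden states
# (true of the mlx-lm Qwen3 implementation); model(...) itself returns logits
hidden = model.model(input_ids)
embeddings = hidden.mean(axis=1)  # mean-pool over the sequence dimension
print(embeddings.shape)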
"
```
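Embedding vectors are usually compared with each other rather than inspected directly. A minimal sketch of cosine similarity in plain Python follows, assuming each embedding has already been converted to a flat list of floats. The function name is hypothetical and is not part of mlx-lm.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Scores close to 1 indicate vectors pointing in nearly the same direction.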
---
### ASR — Speech-to-Text (via mlx-audio + uv)
> **Important:** Must run with `--python 3.11` to avoid OpenMP threading issues (`SIGSEGV`).
```bash
uv run --python 3.11 --with mlx-audio python -m mlx_audio.stt.generate \
  --model ~/models/Qwen3-ASR-1.7B-8bit \
  --audio "audio.wav" \
  --output-path /tmp/asr_result \
  --format txt \
  --language zh \
  --verbose
```
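When an agent calls the ASR step from Python, the command above can be assembled as an argument list and run with `subprocess`. This is a minimal sketch that uses only the flags shown above and leaves out `--verbose`. The function names and default values are hypothetical conveniences, not part of mlx-audio.

```python
import os
import subprocess


def build_asr_command(audio_path, model_dir="~/models/Qwen3-ASR-1.7B-8bit",
                      output_path="/tmp/asr_result", language="zh"):
    """Return the uv command as a list, with ~ expanded in the model path."""
    return [
        "uv", "run", "--python", "3.11", "--with", "mlx-audio",
        "python", "-m", "mlx_audio.stt.generate",
        "--model", os.path.expanduser(model_dir),
        "--audio", audio_path,
        "--output-path", output_path,
        "--format", "txt",
        "--language", language,
    ]


def transcribe(audio_path, **kwargs):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_asr_command(audio_path, **kwargs), check=True)
```

Passing a list instead of a shell string avoids quoting problems with audio file names that contain spaces.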
---
### OCR (via mlx-vlm + uv)
> **Important:** The `generate` function parameter order must be `(model, processor, prompt, image)`.
```bash
cat << 'PY_EOF' > run_ocr.py
import os
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model_path = os.path.expanduser("~/models/PaddleOCR-VL-1.5-6bit")
model, processor = load(model_path)
prompt = apply_chat_template(processor, config=model.config, prompt="OCR:", num_images=1)
output = generate(model, processor, prompt, "document.jpg", max_tokens=512, temp=0.0)
print(output.text)
PY_EOF
uv run --python 3.11 --with mlx-vlm python run_ocr.py
```
---
## Service Management (oMLX only)
```bash
# Check running models
curl http://localhost:8000/v1/models
# Restart oMLX
launchctl kickstart -k gui/$(id -u)/com.omlx-server
```
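The same check can be made from Python before sending requests. This is a minimal sketch using only the standard library. It assumes `/v1/models` returns the OpenAI-style list shape (`{"data": [{"id": ...}]}`), which this document does not confirm, and both function names are hypothetical.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_local_models(base_url="http://localhost:8000/v1", timeout=5.0):
    """Query the gateway; raises URLError if the server is not reachable."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

A `URLError` from `list_local_models()` suggests the server is down, in which case the restart command above is the next step.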
## Model Storage Strategy
**All models stored in `~/models/` using oMLX-compatible structure:**
```
~/models/
├── Qwen3-Embedding-0.6B-4bit-DWQ/
├── Qwen3-ASR-1.7B-8bit/
├── PaddleOCR-VL-1.5-6bit/
└── Qwen3.5-35B-A3B-4bit/
```
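A quick way to confirm this layout is to check that each expected directory exists. This is a minimal sketch using the directory names from the tree above; the function name and the `EXPECTED_MODELS` constant are hypothetical.

```python
from pathlib import Path

# Directory names taken from the storage layout above
EXPECTED_MODELS = [
    "Qwen3-Embedding-0.6B-4bit-DWQ",
    "Qwen3-ASR-1.7B-8bit",
    "PaddleOCR-VL-1.5-6bit",
    "Qwen3.5-35B-A3B-4bit",
]


def missing_models(root="~/models", expected=EXPECTED_MODELS):
    """Return the expected model directories that are absent under root."""
    base = Path(root).expanduser()
    return [name for name in expected if not (base / name).is_dir()]
```

An empty list from `missing_models()` means every model directory named in this document is present.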
## Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- `uv` installed (`curl -LsSf https://astral.sh/uv/install.sh | sh`)
Related Skills
ocr-local
Extract text from images using Tesseract.js OCR (100% local, no API key required). Supports Chinese (simplified/traditional) and English.
local-researcher
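A fully local deep-research assistant skill. Uses a local LLM via Ollama or LMStudio to run iterative web research and produce Markdown reports with cited sources. Triggers when the user needs privacy-first research, local document analysis, or a structured research report.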
release-note-localizer
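Converts release notes into Chinese, English, customer-facing, and technical versions while keeping terminology consistent; use for localization, release-notes, translation workflows; do not use for machine-translating sensitive contract clauses or replacing professional legal translation.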
astrai-inference-router
Route all LLM calls through Astrai for 40%+ cost savings with intelligent routing and privacy controls
local-whisper
Local speech-to-text using OpenAI Whisper. Runs fully offline after model download. High quality transcription with multiple model sizes.
local-stt
Local STT with selectable backends - Parakeet (best accuracy) or Whisper (fastest, multilingual).
local-qrcode
Generate QR codes locally from text/URL to PNG image or ASCII art. Pure local generation using qrcode library. No API key required. Use when users need to create QR codes for links, text, or any content.
local-password
Generate secure random passwords and check password strength. Supports custom length and character types (uppercase, lowercase, numbers, symbols). Pure local operation, no external dependencies. Use when users need to generate new secure passwords or check password strength.
agent-browser-local
Headless browser automation CLI optimized for AI agents with accessibility tree snapshots and ref-based element selection
cloud-local-bridge
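Implements a two-way communication bridge between a cloud OpenClaw and a local OpenClaw. Supports natural-language pairing, command execution, and file sync.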
local-bookmark-librarian
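Deduplicates and recategorizes locally exported bookmarks or link lists, producing a topic index and maintenance suggestions; use for bookmarks, links, knowledge workflows; do not use for directly modifying browser configuration or deleting links the user has not confirmed.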
local-rag-index-planner
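Plans the directory layout, chunk granularity, naming, update timing, and access boundaries of a local knowledge base, rather than piling on RAG directly; use for rag, indexing, knowledge workflows; do not use for directly deploying a vector database or ignoring permission isolation.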