mac-code-local-ai-agent
Run a free 35B AI coding agent on Apple Silicon Macs using local LLMs via llama.cpp or MLX with web search, shell, and file tools.
Best use case
mac-code-local-ai-agent is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Run a free 35B AI coding agent on Apple Silicon Macs using local LLMs via llama.cpp or MLX with web search, shell, and file tools.
Teams using mac-code-local-ai-agent should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/mac-code-local-ai-agent/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How mac-code-local-ai-agent Compares
| Feature / Agent | mac-code-local-ai-agent | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Run a free 35B AI coding agent on Apple Silicon Macs using local LLMs via llama.cpp or MLX with web search, shell, and file tools.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
SKILL.md Source
# mac-code — Free Local AI Agent on Apple Silicon
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
Run a 35B reasoning model locally on your Mac for $0/month. mac-code is a CLI AI coding agent (Claude Code alternative) that routes tasks — web search, shell commands, file edits, chat — through a local LLM. Supports llama.cpp (30 tok/s) and MLX (64K context, persistent KV cache) backends on Apple Silicon.
---
## What It Does
- **LLM-as-router**: The model classifies every prompt as `search`, `shell`, or `chat` and routes accordingly
- **35B MoE at 30 tok/s** via llama.cpp + IQ2_M quantization (fits in 16 GB RAM)
- **35B full Q4 on 16 GB** via custom MoE Expert Sniper (1.54 tok/s, only 1.42 GB RAM used)
- **9B at 64K context** via quantized KV cache (`q4_0` keys/values)
- **MLX backend** adds persistent KV cache save/load, context compression, R2 sync
- **Tools**: DuckDuckGo search, shell execution, file read/write
---
## Installation
### Prerequisites
```bash
brew install llama.cpp
pip3 install rich ddgs huggingface-hub mlx-lm --break-system-packages
```
### Clone the repo
```bash
git clone https://github.com/walter-grace/mac-code
cd mac-code
```
### Download models
**35B MoE — fast daily driver (10.6 GB, fits in 16 GB RAM):**
```bash
mkdir -p ~/models
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
'unsloth/Qwen3.5-35B-A3B-GGUF',
'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
local_dir='$HOME/models/'
)
"
```
**9B — 64K context, long documents (5.3 GB):**
```bash
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
'unsloth/Qwen3.5-9B-GGUF',
'Qwen3.5-9B-Q4_K_M.gguf',
local_dir='$HOME/models/'
)
"
```
---
## Starting the Backend
### Option A: llama.cpp + 35B MoE (recommended, 30 tok/s)
```bash
llama-server \
--model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
--port 8000 --host 127.0.0.1 \
--flash-attn on --ctx-size 12288 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 --reasoning off -np 1 -t 4
```
### Option B: llama.cpp + 9B (64K context)
```bash
llama-server \
--model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
--port 8000 --host 127.0.0.1 \
--flash-attn on --ctx-size 65536 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 --reasoning off -t 4
```
### Option C: MLX backend (persistent context, 9B)
```bash
# Starts server on port 8000, downloads model on first run
python3 mlx/mlx_engine.py
```
### Start the agent (all options)
```bash
python3 agent.py
```
---
## Agent CLI Commands
Inside the agent REPL, type `/` for all commands:
| Command | Action |
|---|---|
| `/agent` | Agent mode with tools (default) |
| `/raw` | Direct streaming, no tools |
| `/model 9b` | Switch to 9B model (64K context) |
| `/model 35b` | Switch to 35B MoE |
| `/search <query>` | Quick DuckDuckGo search |
| `/bench` | Run speed benchmark |
| `/stats` | Session statistics |
| `/cost` | Show cost savings vs cloud |
| `/good` / `/bad` | Grade the last response |
| `/improve` | View response grading stats |
| `/clear` | Reset conversation |
| `/quit` | Exit |
### Example prompts
```
> find all Python files modified in the last 7 days
→ routes to "shell", generates: find . -name "*.py" -mtime -7
> who won the NBA finals
→ routes to "search", queries DuckDuckGo, summarizes
> explain how attention works
→ routes to "chat", streams directly
```
---
## MLX Backend — Persistent KV Cache API
The MLX engine exposes a REST API on `localhost:8000`.
### Save context after processing a large codebase
```bash
curl -X POST localhost:8000/v1/context/save \
-H "Content-Type: application/json" \
-d '{"name": "my-project", "prompt": "$(cat README.md)"}'
```
### Load saved context instantly (0.0003s)
```bash
curl -X POST localhost:8000/v1/context/load \
-H "Content-Type: application/json" \
-d '{"name": "my-project"}'
```
### Download context from Cloudflare R2 (cross-Mac sync)
```bash
# Requires R2 credentials in environment
export R2_ACCOUNT_ID=your_account_id
export R2_ACCESS_KEY_ID=your_key_id
export R2_SECRET_ACCESS_KEY=your_secret
export R2_BUCKET=your_bucket_name
curl -X POST localhost:8000/v1/context/download \
-H "Content-Type: application/json" \
-d '{"name": "my-project"}'
```
### Standard OpenAI-compatible chat
```python
import requests
response = requests.post("http://localhost:8000/v1/chat/completions", json={
"model": "local",
"messages": [{"role": "user", "content": "Write a Python quicksort"}],
"stream": False
})
print(response.json()["choices"][0]["message"]["content"])
```
### Streaming chat
```python
import requests, json
with requests.post("http://localhost:8000/v1/chat/completions", json={
"model": "local",
"messages": [{"role": "user", "content": "Explain transformers"}],
"stream": True
}, stream=True) as r:
for line in r.iter_lines():
if line.startswith(b"data: "):
chunk = json.loads(line[6:])
delta = chunk["choices"][0]["delta"].get("content", "")
print(delta, end="", flush=True)
```
---
## KV Cache Compression (MLX)
Compress context 4x with 99.3% similarity:
```python
from mlx.turboquant import compress_kv_cache
from mlx.kv_cache import save_kv_cache, load_kv_cache
# After building a KV cache from a long document
compressed = compress_kv_cache(kv_cache, bits=4) # 26.6 MB → 6.7 MB
save_kv_cache(compressed, "my-project-compressed")
# Load later
kv = load_kv_cache("my-project-compressed")
```
---
## Flash Streaming — Out-of-Core Inference
For models larger than your RAM (research mode):
```bash
cd research/flash-streaming
# Run 35B MoE Expert Sniper (22 GB model, 1.42 GB RAM)
python3 moe_expert_sniper.py
# Run 32B dense flash stream (18.4 GB model, 4.5 GB RAM)
python3 flash_stream_v2.py
```
### How F_NOCACHE direct I/O works
```python
import os, fcntl
# Open model file bypassing macOS Unified Buffer Cache
fd = os.open("model.bin", os.O_RDONLY)
fcntl.fcntl(fd, fcntl.F_NOCACHE, 1) # bypass page cache
# Aligned read (16KB boundary for DART IOMMU)
ALIGN = 16384
offset = (layer_offset // ALIGN) * ALIGN
data = os.pread(fd, layer_size + ALIGN, offset)
weights = data[layer_offset - offset : layer_offset - offset + layer_size]
```
### MoE Expert Sniper pattern
```python
# Router predicts which 8 of 256 experts activate per token
active_experts = router_forward(hidden_state) # returns [8] indices
# Load only those experts from SSD (8 threads, parallel pread)
from concurrent.futures import ThreadPoolExecutor
def load_expert(expert_idx):
offset = expert_offsets[expert_idx]
return os.pread(fd, expert_size, offset)
with ThreadPoolExecutor(max_workers=8) as pool:
expert_weights = list(pool.map(load_expert, active_experts))
# ~14 MB loaded per layer instead of 221 MB (dense)
```
---
## Common Patterns
### Use as a Python library (direct API calls)
```python
import requests
BASE = "http://localhost:8000/v1"
def ask(prompt: str, system: str = "You are a helpful coding assistant.") -> str:
r = requests.post(f"{BASE}/chat/completions", json={
"model": "local",
"messages": [
{"role": "system", "content": system},
{"role": "user", "content": prompt}
]
})
return r.json()["choices"][0]["message"]["content"]
# Examples
print(ask("Write a Python function to parse JSON safely"))
print(ask("Explain this error: AttributeError: NoneType has no attribute split"))
```
### Process a large file with paged inference
```python
from mlx.paged_inference import PagedInference
engine = PagedInference(model="mlx-community/Qwen3.5-9B-4bit")
with open("large_codebase.txt") as f:
content = f.read() # beyond single context window
# Automatically pages through content
result = engine.summarize(content, question="What does this codebase do?")
print(result)
```
### Monitor server performance
```bash
python3 dashboard.py
```
---
## Model Selection Guide
| Your Mac RAM | Best Option | Command |
|---|---|---|
| 8 GB | 9B Q4_K_M | `--model ~/models/Qwen3.5-9B-Q4_K_M.gguf --ctx-size 4096` |
| 16 GB | 35B IQ2_M (30 tok/s) | Default Option A above |
| 16 GB (quality) | 35B Q4 Expert Sniper | `python3 research/flash-streaming/moe_expert_sniper.py` |
| 48 GB | 35B Q4_K_M native | Download full Q4, `--n-gpu-layers 99` |
| 192 GB | 397B frontier | Any large GGUF, full offload |
---
## Troubleshooting
### Server not responding on port 8000
```bash
# Check if server is running
curl http://localhost:8000/health
# Check what's on port 8000
lsof -i :8000
# Restart llama-server with verbose logging
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
--port 8000 --verbose
```
### Model download fails / incomplete
```bash
# Resume interrupted download
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
'unsloth/Qwen3.5-35B-A3B-GGUF',
'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
local_dir='$HOME/models/',
resume_download=True
)
"
```
### Slow inference / RAM pressure on 16 GB Mac
```bash
# Reduce context size to free RAM
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
--port 8000 --ctx-size 4096 \ # reduced from 12288
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 -t 4
# Or switch to 9B for lower RAM usage
python3 agent.py
# Then: /model 9b
```
### MLX engine crashes with memory error
```bash
# MLX uses unified memory — check pressure
vm_stat | grep "Pages free"
# Reduce batch size in mlx_engine.py
# Edit: max_batch_size = 512 → max_batch_size = 128
```
### F_NOCACHE not bypassing page cache (macOS Sonoma+)
```python
# Verify F_NOCACHE is active
import fcntl, os
fd = os.open(model_path, os.O_RDONLY)
result = fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)
assert result == 0, "F_NOCACHE failed — check macOS version and SIP status"
```
### `ddgs` search fails
```bash
pip3 install --upgrade ddgs --break-system-packages
# ddgs uses DuckDuckGo — no API key required, but may rate-limit
# Retry after 60 seconds if you get a 202 response
```
### Wrong reshape on GGUF dequantization
```python
# GGUF tensors are column-major — correct reshape:
weights = dequantized_flat.reshape(ne[1], ne[0]) # CORRECT
# NOT: dequantized_flat.reshape(ne[0], ne[1]).T # WRONG
```
---
## Architecture Summary
```
agent.py
├── Intent classification → "search" | "shell" | "chat"
├── search → ddgs.DDGS().text() → summarize
├── shell → generate command → subprocess.run()
└── chat → stream directly
Backends (both expose OpenAI-compatible API on :8000)
├── llama.cpp → fast, standard, no persistence
└── mlx/ → KV cache save/load/compress/sync
Flash Streaming (research/)
├── moe_expert_sniper.py → 35B Q4, 1.42 GB RAM
└── flash_stream_v2.py → 32B dense, 4.5 GB RAM
└── F_NOCACHE + pread + 16KB alignment
```Related Skills
ocr-local
Extract text from images using Tesseract.js OCR (100% local, no API key required). Supports Chinese (simplified/traditional) and English.
local-researcher
完全本地的深度研究助手 Skill。使用 Ollama 或 LMStudio 本地 LLM 进行迭代式网络研究,生成带引用来源的 Markdown 报告。当用户需要进行隐私优先的研究、本地文档分析或生成结构化研究报告时触发。
release-note-localizer
将发布说明转换为中文、英文、客户版和技术版,同时保持术语一致。;use for localization, release-notes, translation workflows;do not use for 机翻敏感合同条款, 替代专业法律翻译.
mlx-local-inference
Use when calling local AI on this Mac — text generation, embeddings, speech-to-text, OCR, or image understanding. LLM/VLM via oMLX gateway at localhost:8000/v1. Embedding/ASR/OCR via Python libraries (mlx-lm, mlx-vlm, mlx-audio). Works offline. Use instead of cloud APIs for privacy or low latency.
local-whisper
Local speech-to-text using OpenAI Whisper. Runs fully offline after model download. High quality transcription with multiple model sizes.
local-stt
Local STT with selectable backends - Parakeet (best accuracy) or Whisper (fastest, multilingual).
local-qrcode
Generate QR codes locally from text/URL to PNG image or ASCII art. Pure local generation using qrcode library. No API key required. Use when users need to create QR codes for links, text, or any content.
local-password
Generate secure random passwords and check password strength. Supports custom length and character types (uppercase, lowercase, numbers, symbols). Pure local operation, no external dependencies. Use when users need to generate new secure passwords or check password strength.
agent-browser-local
Headless browser automation CLI optimized for AI agents with accessibility tree snapshots and ref-based element selection
cloud-local-bridge
实现云端 OpenClaw 与本地 OpenClaw 之间的双向通信桥接。支持自然语言配对、命令执行、文件同步。
local-bookmark-librarian
去重和再分类本地导出的书签或链接清单,生成主题索引和维护建议。;use for bookmarks, links, knowledge workflows;do not use for 直接修改浏览器配置, 删除用户未确认链接.
local-rag-index-planner
规划本地知识库的目录、分片粒度、命名、更新时间与访问边界,而不是直接堆 RAG。;use for rag, indexing, knowledge workflows;do not use for 直接部署向量数据库, 忽略权限隔离.