mlx-audio-server
A fast, accurate, and fully local OpenAI-compatible API server for speech-to-text and text-to-speech, powered by MLX on Apple Silicon and open-source models.
Best use case
mlx-audio-server is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using mlx-audio-server should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/mlx-audio-server/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How mlx-audio-server Compares
| Feature / Agent | mlx-audio-server | Standard Approach |
|---|---|---|
| Platform Support | macOS on Apple Silicon | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
A fast, accurate, and fully local OpenAI-compatible API server for speech-to-text and text-to-speech, powered by MLX on Apple Silicon and open-source models.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# MLX Audio Server
`mlx-audio`: The best audio processing library built on Apple's MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.
This skill runs it as an OpenAI-compatible API server in the background on macOS, and provides scripts and examples that AI agents can use to call the API.
Default Models:
- Speech-To-Text: `mlx-community/glm-asr-nano-2512-8bit`
- Text-To-Speech: `mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16`
The server downloads these models when they are first needed, so the first run will be slower.
More choices here: https://github.com/Blaizzy/mlx-audio?tab=readme-ov-file#supported-models
## Requirements
- `mlx`: macOS with Apple Silicon
- `brew`: used to install any missing dependencies
## Installation
```bash
bash ${baseDir}/install.sh
```
This script will:
- clone a fork of the mlx-audio repo into `~/opt/mlx-audio`
- use `uv` to create a venv at `~/opt/mlx-audio/.venv` and install the dependencies in it
- create a plist file that runs the mlx-audio server as a background launchd service in the user domain
- run it as an OpenAI-compatible API server, on port 8899 by default
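After installation, you may want to confirm that something is answering on the default port before calling the API. The following is a minimal sketch, not part of the skill: `MLX_AUDIO_PORT`, `base_url`, and `server_is_up` are hypothetical names, and it assumes the server is reachable on `127.0.0.1`.

```bash
#!/usr/bin/env bash
# Hypothetical override; 8899 is the default port stated above.
MLX_AUDIO_PORT="${MLX_AUDIO_PORT:-8899}"

# Print the base URL (assumes the server listens on the loopback interface).
base_url() {
  printf 'http://127.0.0.1:%s' "${MLX_AUDIO_PORT}"
}

# Succeed if anything answers HTTP on that port within 3 seconds.
# Without --fail, curl exits 0 on any HTTP response, including 404,
# so this checks reachability only, not that the API is healthy.
server_is_up() {
  curl --silent --output /dev/null --max-time 3 "$(base_url)/"
}
```

For example, `server_is_up && echo "ready at $(base_url)"` prints the URL once the launchd service is accepting connections. On first start the check may fail for a while if the service is still starting up.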
## Usage
STT/Speech-To-Text:
```bash
# The input is converted to WAV with ffmpeg if it is not already WAV.
# The output is the transcript text only.
bash ${baseDir}/run_stt.sh <audio_or_video_path>
```
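If you would rather call the API directly than go through `run_stt.sh`, the flow might look like the sketch below. This is not the script's actual contents: the `/v1/audio/transcriptions` path and the `file` and `model` form fields follow the OpenAI audio API convention and are assumed to apply to this server, and `wav_path_for`, `ensure_wav`, and `transcribe` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Assumptions: OpenAI-style endpoint and form fields; default port 8899.
STT_URL="http://127.0.0.1:8899/v1/audio/transcriptions"
STT_MODEL="mlx-community/glm-asr-nano-2512-8bit"

# Derive a sibling .wav path, e.g. /tmp/a.mp4 -> /tmp/a.wav
wav_path_for() {
  local dir base
  dir="$(dirname -- "$1")"
  base="$(basename -- "$1")"
  printf '%s/%s.wav' "$dir" "${base%.*}"
}

# Print a WAV path for the input, converting with ffmpeg only when needed.
ensure_wav() {
  local src="$1" dst
  case "$src" in
    *.wav|*.WAV) printf '%s\n' "$src"; return 0 ;;
  esac
  dst="$(wav_path_for "$src")"
  ffmpeg -y -loglevel error -i "$src" "$dst" && printf '%s\n' "$dst"
}

# Upload the audio as multipart form data and print the raw response.
transcribe() {
  local wav
  wav="$(ensure_wav "$1")" || return 1
  curl --silent --fail "$STT_URL" -F "file=@${wav}" -F "model=${STT_MODEL}"
}
```

Calling `transcribe ./meeting.mp4` would write `./meeting.wav` next to the input and print whatever the server returns. The sketch does not parse the response, so unlike `run_stt.sh` the output may be a JSON document rather than plain transcript text.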
TTS/Text-To-Speech:
```bash
# The audio is saved to a temp dir as `speech.wav` by default, and its path is printed to stdout.
bash ${baseDir}/run_tts.sh "Hello, Human!"
# Or specify an output dir:
bash ${baseDir}/run_tts.sh "Hello, Human!" ./output
# The output is the audio path only.
```
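A direct TTS request could be sketched in the same way. As before, this is an illustration under assumptions rather than what `run_tts.sh` does: the `/v1/audio/speech` path and the `model`, `input`, and `response_format` JSON fields are taken from the OpenAI audio API convention, and `json_escape`, `tts_payload`, and `synthesize` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Assumptions: OpenAI-style endpoint and JSON fields; default port 8899.
TTS_URL="http://127.0.0.1:8899/v1/audio/speech"
TTS_MODEL="mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16"

# Escape backslashes and double quotes for use inside a JSON string.
# Newlines and other control characters are NOT handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the JSON request body for the given text.
tts_payload() {
  printf '{"model":"%s","input":"%s","response_format":"wav"}' \
    "$TTS_MODEL" "$(json_escape "$1")"
}

# Request speech, save it as speech.wav, and print the saved path.
# The second argument is an optional output dir; a temp dir is used otherwise.
synthesize() {
  local text="$1" out_dir="${2:-$(mktemp -d)}"
  mkdir -p "$out_dir" || return 1
  curl --silent --fail "$TTS_URL" \
    -H 'Content-Type: application/json' \
    -d "$(tts_payload "$text")" \
    --output "$out_dir/speech.wav" && printf '%s\n' "$out_dir/speech.wav"
}
```

With these helpers, `synthesize "Hello, Human!" ./output` would print `./output/speech.wav` on success. Building the payload in a separate function keeps the quoting logic easy to inspect before any request is sent.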
You can use both scripts directly, or as example/reference.