mlx-audio-server

A fast, accurate, and fully local OpenAI-compatible API server for speech-to-text and text-to-speech, powered by MLX on Apple Silicon and open-source models.

by Demerzels-lab

Best use case

mlx-audio-server is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using mlx-audio-server can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/mlx-audio-server/SKILL.md --create-dirs "https://raw.githubusercontent.com/Demerzels-lab/elsamultiskillagent/main/public/skills/guoqiao/mlx-audio-server/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/mlx-audio-server/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How mlx-audio-server Compares

Feature / Agent	mlx-audio-server	Standard Approach
Platform Support	macOS on Apple Silicon	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

A fast, accurate, and fully local OpenAI-compatible API server for speech-to-text and text-to-speech, powered by MLX on Apple Silicon and open-source models.

Where can I find the source code?

The source code is on GitHub in the Demerzels-lab/elsamultiskillagent repository, under public/skills/guoqiao/mlx-audio-server.

SKILL.md Source

# MLX Audio Server

`mlx-audio` is an audio processing library built on Apple's MLX framework. It provides fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.

This skill runs it as an OpenAI-compatible API server in the background on macOS, and provides scripts and examples that AI agents can use to call the API.

Default Models:

- Speech-To-Text: `mlx-community/glm-asr-nano-2512-8bit`
- Text-To-Speech: `mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16`

The server downloads these models on demand, so the first run will be slower than later ones.

More choices here: https://github.com/Blaizzy/mlx-audio?tab=readme-ov-file#supported-models
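
As a rough illustration of how the default models might be named in a request, the sketch below builds an OpenAI-style JSON body for text-to-speech. The `model` and `input` field names follow OpenAI's speech API and are an assumption about what this server accepts; `json_escape` and `tts_payload` are hypothetical helper names, not part of the skill.

```bash
# Default model IDs, copied from the list above.
STT_MODEL="mlx-community/glm-asr-nano-2512-8bit"
TTS_MODEL="mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16"

# Escape backslashes and double quotes so text can sit inside a JSON string.
# Sketch only: control characters such as newlines are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build an OpenAI-style speech request body (hypothetical helper).
# $1: text to speak, $2: optional model ID (defaults to TTS_MODEL).
# Assumption: the server accepts the `model` and `input` fields.
tts_payload() {
  printf '{"model": "%s", "input": "%s"}\n' \
    "${2:-$TTS_MODEL}" "$(json_escape "$1")"
}
```

Passing a different model ID as the second argument is how one of the other supported models could be selected, again assuming the server honors the `model` field.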

## Requirements

- `mlx`: requires macOS on Apple Silicon
- `brew`: used to install any missing dependencies

## Installation

```bash
bash ${baseDir}/install.sh
```
This script will:
- clone the (forked) mlx-audio repo into `~/opt/mlx-audio`
- use `uv` to create a venv at `~/opt/mlx-audio/.venv` and install dependencies into it
- create a plist file that runs the mlx-audio server as a background launchd service in the user domain
- start it as an OpenAI-compatible API server, on port 8899 by default
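
After installation, it can help to confirm that something is answering on the port. The sketch below is one way to do that, not part of the skill itself: `MLX_AUDIO_PORT`, `endpoint_url`, and `server_up` are hypothetical names, and the `/v1/audio/...` paths are assumed from OpenAI's API rather than confirmed for this server.

```bash
# Build a URL for an OpenAI-style audio endpoint on the local server.
# Port 8899 is the default stated above; MLX_AUDIO_PORT is a hypothetical override.
# Assumption: the server uses OpenAI's /v1/audio/<name> paths.
endpoint_url() {
  printf 'http://localhost:%s/v1/audio/%s\n' "${MLX_AUDIO_PORT:-8899}" "$1"
}

# Succeed if anything answers HTTP on the server port within 3 seconds.
# curl exits 0 for any HTTP response, including 404, so this only checks
# that a server is listening, not that a specific route exists.
server_up() {
  curl --silent --output /dev/null --max-time 3 \
    "http://localhost:${MLX_AUDIO_PORT:-8899}/"
}
```

For example, `server_up && echo "listening"` gives a quick check. `launchctl list | grep -i mlx` may also show the service, assuming its label contains that string.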

## Usage

STT/Speech-To-Text:
```bash
# Input is converted to WAV with ffmpeg if it is not WAV already.
# Output is the transcript text only.
bash ${baseDir}/run_stt.sh <audio_or_video_path>
```
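
The two steps described in the comments above (convert with ffmpeg, then send to the server) can be sketched roughly as follows. This is not the contents of `run_stt.sh`: the function names are hypothetical, the 16 kHz mono settings are an assumed choice, and the endpoint path and form fields are taken from OpenAI's transcription API rather than confirmed for this server.

```bash
# Derive the WAV path a conversion step would write (hypothetical helper).
# $1: input media path, $2: optional output dir (defaults to /tmp).
wav_target() {
  name=$(basename "$1")
  printf '%s/%s.wav\n' "${2:-/tmp}" "${name%.*}"
}

# Convert any ffmpeg-readable audio or video file to WAV.
# -vn drops video; 16 kHz mono is an assumed setting, not taken from the skill.
to_wav() {
  ffmpeg -y -loglevel error -i "$1" -vn -ar 16000 -ac 1 "$2"
}

# Upload a WAV file to the OpenAI-style transcription endpoint.
# Assumptions: the path, the `file` and `model` form fields, default port 8899.
transcribe() {
  curl --silent --fail "http://localhost:8899/v1/audio/transcriptions" \
    -F "file=@$1" \
    -F "model=mlx-community/glm-asr-nano-2512-8bit"
}
```

A call chain such as `out=$(wav_target talk.mp4) && to_wav talk.mp4 "$out" && transcribe "$out"` would print the server's raw response. With OpenAI's API that is JSON containing a `text` field, whereas `run_stt.sh` prints only the transcript.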

TTS/Text-To-Speech:
```bash
# Audio is saved to a tmp dir as `speech.wav` by default, and its path is printed to stdout.
bash ${baseDir}/run_tts.sh "Hello, Human!"
# Or specify an output dir:
bash ${baseDir}/run_tts.sh "Hello, Human!" ./output
# Output is the audio path only.
```
You can run both scripts directly or use them as reference examples.
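
Because the TTS script prints only the audio path, a caller can capture that path and hand it to another tool. A minimal sketch, assuming the script prints exactly one path on stdout; `tts_path` is a hypothetical wrapper, not part of the skill:

```bash
# Run a TTS script and print the audio path it reports, failing if no file exists there.
# $1: path to the TTS script, $2: text to speak, $3: optional output dir.
# Assumption: the script prints exactly one line, the path of the audio file.
tts_path() {
  script=$1
  text=$2
  outdir=${3:-}
  if [ -n "$outdir" ]; then
    path=$(bash "$script" "$text" "$outdir") || return 1
  else
    path=$(bash "$script" "$text") || return 1
  fi
  if [ ! -f "$path" ]; then
    echo "no audio file at: $path" >&2
    return 1
  fi
  printf '%s\n' "$path"
}
```

On macOS, `afplay "$(tts_path "${baseDir}/run_tts.sh" "Hello, Human!")"` would then play the result, since `afplay` ships with the system.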
