mlx-audio-server

A fast, accurate, and fully local OpenAI-compatible API server for speech-to-text and text-to-speech, powered by MLX on Apple Silicon and open-source models.

7 stars

Best use case

mlx-audio-server is best used when an AI agent needs repeatable, fully local speech-to-text or text-to-speech on an Apple Silicon Mac, rather than a one-off prompt.

Teams using mlx-audio-server should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/mlx-audio-server/SKILL.md --create-dirs "https://raw.githubusercontent.com/Demerzels-lab/elsamultiskillagent/main/public/skills/guoqiao/mlx-audio-server/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/mlx-audio-server/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How mlx-audio-server Compares

| Feature / Agent | mlx-audio-server | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

A fast, accurate, and fully local OpenAI-compatible API server for speech-to-text and text-to-speech, powered by MLX on Apple Silicon and open-source models.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# MLX Audio Server

`mlx-audio` is an audio processing library built on Apple's MLX framework. It provides fast, efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.

This skill runs it as an OpenAI-compatible API server in the background on macOS, and provides scripts and examples that AI agents can use to call the API.

Default Models:

- Speech-To-Text: `mlx-community/glm-asr-nano-2512-8bit`
- Text-To-Speech: `mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16`

The server downloads these models on demand, so the first run will be slower than later ones.

More choices here: https://github.com/Blaizzy/mlx-audio?tab=readme-ov-file#supported-models

## Requirements

- `mlx`: macOS with Apple Silicon
- `brew`: used to install any missing dependencies

## Installation

```bash
bash ${baseDir}/install.sh
```
This script will:
- clone the (forked) mlx-audio repo into `~/opt/mlx-audio`
- use `uv` to create a venv at `~/opt/mlx-audio/.venv` and install the dependencies into it
- create a plist file that runs the mlx-audio server in the background as a launchd service in the user domain
- run it as an OpenAI-compatible API server, on port 8899 by default
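
After the install script finishes, it can help to confirm that something is answering on the port. The following is a minimal sketch, not part of the skill: it assumes the server listens on `localhost` and exposes the OpenAI-style `/v1/models` route (the source only states the port and OpenAI compatibility). The `MLX_AUDIO_HOST` and `MLX_AUDIO_PORT` variables and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; not part of the skill's own scripts.

# Build the base URL, defaulting to the port the install script uses (8899).
mlx_audio_base_url() {
  printf 'http://%s:%s\n' "${MLX_AUDIO_HOST:-localhost}" "${MLX_AUDIO_PORT:-8899}"
}

# Return 0 if the server answers successfully on /v1/models.
# The route is an assumption based on OpenAI compatibility.
mlx_audio_is_up() {
  curl --silent --fail --max-time 5 --output /dev/null \
    "$(mlx_audio_base_url)/v1/models"
}
```

For example: `mlx_audio_is_up && echo "server is up" || echo "server is not reachable"`. Because models are downloaded on demand, a reachable server may still be slow on its first real request.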

## Usage

STT/Speech-To-Text:
```bash
# Input is converted to WAV with ffmpeg if it is not already WAV.
# Output is the transcript text only.
bash ${baseDir}/run_stt.sh <audio_or_video_path>
```
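
For agents that prefer to call the API directly, the flow described in the comments above (convert to WAV if needed, then transcribe) might look roughly like this. It is a sketch, not the contents of `run_stt.sh`: it assumes the standard OpenAI `/v1/audio/transcriptions` route with multipart `file` and `model` fields, and a JSON response with a `text` field. The function names and the `STT_MODEL`/`STT_URL` variables are hypothetical, and `jq` is an extra dependency the skill does not list.

```bash
#!/usr/bin/env bash
# Sketch only; names are hypothetical and run_stt.sh may work differently.

STT_MODEL="${STT_MODEL:-mlx-community/glm-asr-nano-2512-8bit}"
STT_URL="${STT_URL:-http://localhost:8899/v1/audio/transcriptions}"

# Decide where the WAV version of an input lives (pure string logic).
# A .wav input is used as-is; anything else maps to a file in the temp dir.
wav_path_for() {
  local input="$1" name
  case "$input" in
    *.wav|*.WAV) printf '%s\n' "$input" ;;
    *)
      name="$(basename "$input")"
      printf '%s/%s.wav\n' "${TMPDIR:-/tmp}" "${name%.*}"
      ;;
  esac
}

# Convert with ffmpeg only when needed, then POST the WAV to the server
# and print the transcript text (assumes a {"text": ...} response).
transcribe() {
  local input="$1" wav
  wav="$(wav_path_for "$input")"
  if [ "$wav" != "$input" ]; then
    ffmpeg -y -loglevel error -i "$input" "$wav" || return 1
  fi
  curl --silent --fail "$STT_URL" \
    -F "file=@${wav}" \
    -F "model=${STT_MODEL}" | jq -r '.text'
}
```

Calling `transcribe ./meeting.mp4` would then print the transcript, provided the server is running and the assumed route and response shape hold.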

TTS/Text-To-Speech:
```bash
# Audio is saved into a tmp dir with the default name `speech.wav`; the path is printed to stdout.
bash ${baseDir}/run_tts.sh "Hello, Human!"
# Or specify an output dir:
bash ${baseDir}/run_tts.sh "Hello, Human!" ./output
# Output is the audio path only.
```
You can run both scripts directly, or use them as a reference for your own API calls.
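
As a reference for calling the text-to-speech side without `run_tts.sh`, here is a hedged sketch. It assumes the standard OpenAI `/v1/audio/speech` route with a JSON body containing `model` and `input`, and that the response body is the raw audio. The default VoiceDesign model may accept or require further fields that are not shown here. All function names and the `TTS_MODEL`/`TTS_URL` variables are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only; names are hypothetical and run_tts.sh may work differently.

TTS_MODEL="${TTS_MODEL:-mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16}"
TTS_URL="${TTS_URL:-http://localhost:8899/v1/audio/speech}"

# Escape backslashes and double quotes for use inside a JSON string.
# Newlines and other control characters are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the JSON request body for the given text.
tts_payload() {
  printf '{"model":"%s","input":"%s"}\n' \
    "$(json_escape "$TTS_MODEL")" "$(json_escape "$1")"
}

# Write the audio to <out_dir>/speech.wav and print that path.
# Without a second argument, a fresh temp dir is used.
synthesize() {
  local text="$1" out_dir="${2:-$(mktemp -d)}"
  mkdir -p "$out_dir" || return 1
  curl --silent --fail "$TTS_URL" \
    -H 'Content-Type: application/json' \
    -d "$(tts_payload "$text")" \
    --output "$out_dir/speech.wav" || return 1
  printf '%s\n' "$out_dir/speech.wav"
}
```

For example, `synthesize "Hello, Human!" ./output` mirrors the second `run_tts.sh` call above and prints only the audio path on success.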
