deapi-audio

Text-to-speech, voice cloning, voice design, and audio file transcription via the deAPI GPU network. Trigger on 'text to speech', 'TTS', 'generate voice', 'read aloud', 'voice clone', 'clone voice', 'voice design', 'design voice', 'custom voice', 'transcribe audio', 'STT'. For video/YouTube transcription, use deapi-video instead.

3,891 stars

by openclaw

Best use case

deapi-audio is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using deapi-audio should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/deapi-audio/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/aleglowa/deapi-audio/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/deapi-audio/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How deapi-audio Compares

Feature / Agent	deapi-audio	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

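What does this skill do?

It provides text-to-speech, voice cloning, voice design, and audio transcription through the deAPI decentralized GPU network.
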
Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# deAPI Audio

Text-to-speech, voice cloning, voice design, and audio transcription via deAPI decentralized GPU network.

## Scripts

| Script | Use when... |
|--------|-------------|
| `scripts/text-to-speech.sh` | User wants to convert text to spoken audio |
| `scripts/voice-clone.sh` | User wants to clone/replicate a voice from a sample audio file |
| `scripts/voice-design.sh` | User wants to generate speech with a voice described in natural language |
| `scripts/speech-to-text.sh` | User wants to transcribe an audio file (AAC, MP3, OGG, WAV, WebM, FLAC, max 10MB) |
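
The transcription limits in the last row can be checked locally before a file is sent. The following is a minimal sketch, not part of the skill's scripts: `check_audio_file` is a hypothetical helper, and it assumes "10MB" means 10 MiB and that the format can be inferred from the file extension.

```bash
# Sketch: pre-flight check for a local file before calling speech-to-text.sh.
# check_audio_file is a hypothetical helper; the limits come from the table above.
check_audio_file() {
  local file="$1"
  local max_bytes=$((10 * 1024 * 1024))  # assumption: "10MB" means 10 MiB

  [ -f "$file" ] || { echo "not found: $file" >&2; return 1; }

  # Compare the lowercased extension against the accepted formats.
  local ext
  ext=$(printf '%s' "${file##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    aac|mp3|ogg|wav|webm|flac) ;;
    *) echo "unsupported format: .$ext" >&2; return 1 ;;
  esac

  # wc -c reads the byte count portably on GNU and BSD systems.
  local size
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -gt "$max_bytes" ]; then
    echo "file too large: $size bytes" >&2
    return 1
  fi
}
```

A caller might run `check_audio_file /path/to/recording.mp3 && bash scripts/speech-to-text.sh --audio /path/to/recording.mp3`. The check applies to local paths only; a URL passed to `--audio` would need a different check.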

## Your config
! cat ${CLAUDE_SKILL_DIR}/config.json 2>/dev/null || echo "NOT_CONFIGURED"

If the config above is NOT_CONFIGURED, ask the user:
- What is your deAPI API key? (get one at https://deapi.ai, free $5 credit)

Then write the answer to ${CLAUDE_SKILL_DIR}/config.json as `{ "api_key": "their_key" }`.

Alternatively, the user can set the `DEAPI_API_KEY` environment variable directly, which takes priority over config.json.
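
The priority order described above (environment variable first, then config.json) can be sketched as follows. This is an illustration under stated assumptions, not the skill's actual implementation: `resolve_api_key` is a hypothetical helper, and the `sed` expression is a naive parse that expects the `api_key` field on a single line.

```bash
# Sketch: resolve the deAPI key, preferring the environment variable.
# resolve_api_key is a hypothetical helper, not part of the skill's scripts.
resolve_api_key() {
  local config_file="${CLAUDE_SKILL_DIR:-.}/config.json"

  # DEAPI_API_KEY takes priority over config.json.
  if [ -n "${DEAPI_API_KEY:-}" ]; then
    printf '%s\n' "$DEAPI_API_KEY"
    return 0
  fi

  # Fall back to the "api_key" field in config.json (naive single-line parse).
  if [ -f "$config_file" ]; then
    local key
    key=$(sed -n 's/.*"api_key"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$config_file" | head -n 1)
    if [ -n "$key" ]; then
      printf '%s\n' "$key"
      return 0
    fi
  fi

  # Same sentinel the config check above prints when nothing is set up.
  echo "NOT_CONFIGURED"
  return 1
}
```

A JSON-aware tool such as `jq` would be more robust than `sed` if it is available on the machine.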

## Gotchas

- For YouTube/video transcription, use the `deapi-video` skill instead. This skill handles audio-only files (AAC, MP3, OGG, WAV, WebM, FLAC).
- Three TTS models: `Kokoro` (default), `Chatterbox`, `Qwen3`. Use `--model Chatterbox` or `--model Qwen3` to switch.
- Kokoro: Voice ID format is `{lang}{gender}_{name}`. Language is auto-detected from voice prefix if `--lang` is omitted.
- Chatterbox: voice is always `default`, speed is fixed at `1`, supports 22 languages. Text limit 10-2000 chars.
- Kokoro: text limit 3-10001 chars. Long text may time out; split it into segments and generate each one separately.
- TTS output format defaults to mp3. WAV files are much larger but lossless.
- Kokoro: `speed` range is 0.5-2.0. Values outside this range cause errors.
- Qwen3 Voice Clone (`voice-clone.sh`): ref audio must be 5-15 seconds. Too short or too long degrades quality. Formats: MP3, WAV, FLAC, OGG, M4A. URLs are downloaded automatically.
- Qwen3 Voice Design (`voice-design.sh`): quality depends on the `--instruct` description. Encourage specific details: gender, age, accent, speaking style, emotion.
- Qwen3 models use full language names (`English`, `French`, etc.) NOT language codes. 10 supported languages: English, Italian, Spanish, Portuguese, Russian, French, German, Korean, Japanese, Chinese.
- Qwen3 TTS (`--model Qwen3`): 9 voices available, default `Vivian`. The `Ryan` voice is not available for Chinese.
- Qwen3 text limit is 10-5000 chars. Speed is fixed at 1. Voice Clone and Voice Design use voice=`default`.
- Audio transcription accepts a local file path or URL (`--audio`). Formats: AAC, MP3, OGG, WAV, WebM, FLAC. Max 10 MB.
- Result URLs expire in 24 hours. Download promptly.
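
Several of the limits above can be checked before a request is made. The sketch below is illustrative only: `check_text_length` and `check_kokoro_speed` are hypothetical helpers, the numeric limits are copied from the list above, and character counting via `${#text}` depends on the shell's locale.

```bash
# Sketch: validate TTS inputs against the limits listed above.
# Both functions are hypothetical helpers, not part of the skill's scripts.

# check_text_length MODEL TEXT: succeeds when TEXT fits MODEL's limits.
check_text_length() {
  local model="$1" text="$2" min max
  case "$model" in
    Kokoro)     min=3;  max=10001 ;;
    Chatterbox) min=10; max=2000 ;;
    Qwen3)      min=10; max=5000 ;;
    *) echo "unknown model: $model" >&2; return 1 ;;
  esac
  local len=${#text}
  [ "$len" -ge "$min" ] && [ "$len" -le "$max" ]
}

# check_kokoro_speed SPEED: succeeds when 0.5 <= SPEED <= 2.0.
# awk does the comparison because bash arithmetic is integer-only.
check_kokoro_speed() {
  awk -v s="$1" 'BEGIN { exit !(s + 0 >= 0.5 && s + 0 <= 2.0) }'
}
```

For example, `check_text_length Chatterbox "Hi"` fails because the text is under the 10-character minimum, while `check_kokoro_speed 1.5` succeeds.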

## Quick examples

```bash
# Basic TTS
bash scripts/text-to-speech.sh --text "Hello world"

# British voice
bash scripts/text-to-speech.sh --text "Good morning" --voice bf_emma

# Chatterbox model (multilingual)
bash scripts/text-to-speech.sh --model Chatterbox --text "Bonjour le monde" --lang fr

# Qwen3 model
bash scripts/text-to-speech.sh --model Qwen3 --text "Hello world" --voice Serena --lang English

# Clone a voice from a sample
bash scripts/voice-clone.sh --text "Hello, this is my cloned voice" --ref-audio /path/to/sample.mp3

# Clone with reference transcript for better accuracy
bash scripts/voice-clone.sh --text "Welcome to the show" --ref-audio /path/to/sample.wav --ref-text "This is the original transcript"

# Design a custom voice from description
bash scripts/voice-design.sh --text "Good morning everyone" --instruct "A warm, deep male voice with a slight British accent"

# Voice design in another language
bash scripts/voice-design.sh --text "Bonjour tout le monde" --instruct "A cheerful young female voice" --lang French

# Transcribe audio file (local or URL)
bash scripts/speech-to-text.sh --audio /path/to/recording.mp3
bash scripts/speech-to-text.sh --audio "https://example.com/podcast.mp3"
```
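
For text that exceeds a model's limit or risks a timeout, the Gotchas suggest splitting it into segments. One possible way to do that is sketched below, assuming the standard `fold` utility is available. `split_text` and `tts_long_text` are hypothetical helpers, the 2000 default is only an illustrative segment size, and only the `--text` flag shown in the examples above is used. How the script names each segment's output is not covered here.

```bash
# Sketch: split long text at spaces into segments of at most MAX columns.
# split_text and tts_long_text are hypothetical helpers.
split_text() {
  local max="${2:-2000}"
  # Flatten newlines to spaces, then let fold break lines at blanks.
  { printf '%s' "$1" | tr '\n' ' '; echo; } | fold -s -w "$max"
}

# Generate speech for each segment in turn (one script call per segment).
tts_long_text() {
  local seg
  split_text "$1" "${2:-2000}" | while IFS= read -r seg; do
    bash scripts/text-to-speech.sh --text "$seg"
  done
}
```

`fold` breaks at spaces rather than sentence boundaries, so segments may end mid-sentence, and a final segment can fall below a model's minimum length. Splitting on punctuation would give more natural audio at the cost of a longer script.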

For the full voice list and language codes, see [references/voices.md](references/voices.md).
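
The Kokoro voice ID format noted in the Gotchas, `{lang}{gender}_{name}`, can be taken apart with shell parameter expansion. This is a sketch under one assumption: the language and gender parts are a single character each, as in the `bf_emma` example. `parse_kokoro_voice` is a hypothetical helper, and the meaning of each prefix letter should be taken from references/voices.md rather than from this code.

```bash
# Sketch: split a Kokoro voice ID of the form {lang}{gender}_{name}.
# parse_kokoro_voice is a hypothetical helper; it assumes lang and gender
# are one character each, as in the bf_emma example.
parse_kokoro_voice() {
  local id="$1"
  local prefix="${id%%_*}"   # text before the first underscore, e.g. "bf"
  local name="${id#*_}"      # text after the first underscore, e.g. "emma"

  # Reject IDs with no underscore, a prefix that is not two characters, or no name.
  if [ "${#prefix}" -ne 2 ] || [ "$prefix" = "$id" ] || [ -z "$name" ]; then
    echo "not a Kokoro-style voice ID: $id" >&2
    return 1
  fi

  printf 'lang=%s gender=%s name=%s\n' "${prefix:0:1}" "${prefix:1:1}" "$name"
}
```

Calling `parse_kokoro_voice bf_emma` prints `lang=b gender=f name=emma`, while a Qwen3 voice name such as `Vivian` is rejected because it has no prefix.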
