qwen-audio-lab

Hybrid text-to-speech, reusable voice cloning, and narrated audio generation for macOS plus Aliyun Qwen. Use when the user wants to convert text into speech, clone and reuse a voice from a reference recording, generate narration files from plain text or text files, or create PPT speaker-note voiceovers.

3,891 stars

by openclaw


Best use case

qwen-audio-lab is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using qwen-audio-lab can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/qwen-audio-lab/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/aliyx/qwen-audio-lab/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/qwen-audio-lab/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How qwen-audio-lab Compares

Feature / Agent	qwen-audio-lab	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

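What does this skill do?

It provides text-to-speech, reusable voice cloning, and narrated audio generation on macOS and with Aliyun Qwen, including narration from plain text, text files, and PPT speaker notes.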

Where can I find the source code?

The source code is on GitHub in the openclaw/skills repository, under skills/aliyx/qwen-audio-lab.

SKILL.md Source

# Qwen Audio Lab

Use this skill for text-to-speech on macOS or with Aliyun Qwen.

## Choose the backend

- Use `mac-say` for fast local playback, notifications, and low-friction speech on a Mac.
- Use `qwen-tts` when the user wants better naturalness, reusable output files, custom voices, or voice cloning.
- If `DASHSCOPE_API_KEY` is missing, fall back to `mac-say` for local playback.
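
The selection rules above can be sketched as a small function. This is a minimal illustration, not the skill's actual code: `choose_backend` and its `needs_file_or_custom_voice` parameter are hypothetical names, and only `DASHSCOPE_API_KEY`, `mac-say`, and `qwen-tts` come from the text.

```python
import os


def choose_backend(needs_file_or_custom_voice: bool, env=None) -> str:
    """Pick a backend name following the bullets above (hypothetical helper)."""
    env = os.environ if env is None else env
    # Without an API key, Qwen synthesis is unavailable, so fall back to mac-say.
    if not env.get("DASHSCOPE_API_KEY"):
        return "mac-say"
    # With a key, prefer qwen-tts when the request needs natural speech,
    # reusable output files, custom voices, or cloning; otherwise stay local.
    return "qwen-tts" if needs_file_or_custom_voice else "mac-say"
```

Passing `env` explicitly keeps the decision easy to check without touching the real environment.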

## Environment

- `DASHSCOPE_API_KEY`: required for Qwen synthesis and voice cloning.
- `QWEN_AUDIO_REGION`: optional, `cn` (default) or `intl`.
- `QWEN_AUDIO_OUTPUT_DIR`: optional directory for generated audio files. Defaults to `~/.openclaw/data/qwen-audio-lab/output`.
- `QWEN_AUDIO_STATE_DIR`: optional directory for local state such as remembered voices. Defaults to `~/.openclaw/data/qwen-audio-lab/state`.
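
As a rough sketch of how these variables and their documented defaults could be resolved, the hypothetical helper below reads them from a mapping. The function name, the returned dictionary layout, and the decision to reject unknown regions with `ValueError` are assumptions for illustration; the script itself may behave differently.

```python
import os
from pathlib import Path

# Default data root taken from the variable descriptions above.
DEFAULT_DATA_DIR = "~/.openclaw/data/qwen-audio-lab"


def resolve_settings(env=None) -> dict:
    """Resolve settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    region = env.get("QWEN_AUDIO_REGION", "cn")
    if region not in ("cn", "intl"):
        # Assumption: treat any other value as a configuration error.
        raise ValueError(f"QWEN_AUDIO_REGION must be 'cn' or 'intl', got {region!r}")
    output_dir = Path(env.get("QWEN_AUDIO_OUTPUT_DIR", f"{DEFAULT_DATA_DIR}/output"))
    state_dir = Path(env.get("QWEN_AUDIO_STATE_DIR", f"{DEFAULT_DATA_DIR}/state"))
    return {
        "api_key": env.get("DASHSCOPE_API_KEY"),  # None means Qwen is unavailable
        "region": region,
        "output_dir": output_dir.expanduser(),
        "state_dir": state_dir.expanduser(),
    }
```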

## Commands

Run all commands through:

```bash
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py <command> [...]
```
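
If the commands are driven from Python rather than a shell, the same invocation can be wrapped as below. This is a sketch under assumptions: `build_argv` and `run_qwen_audio` are hypothetical helper names, and the script path is simply the one shown above.

```python
import os
import subprocess

# Script location as given in the command template above.
SCRIPT = os.path.expanduser("~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py")


def build_argv(command: str, *args: str) -> list:
    """Build the argument vector for one qwen_audio.py subcommand."""
    return ["python3", SCRIPT, command, *args]


def run_qwen_audio(command: str, *args: str) -> subprocess.CompletedProcess:
    """Run a subcommand and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(
        build_argv(command, *args), check=True, capture_output=True, text=True
    )
```

For example, `run_qwen_audio("list-voices")` would execute the `list-voices` command and return its captured output. Passing arguments as a list avoids shell quoting problems with text that contains spaces or punctuation.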


## Preferred high-level commands

Use these first for most user-facing narration tasks:

```bash
# The sample --text means "This is the body text to convert to speech."
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py narrate-text --text "这是要转成语音的正文"
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py narrate-file --text-file /path/to/script.txt
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py narrate-ppt --ppt /path/to/file.pptx
```

Use the legacy commands (`speak-last-cloned`, `ppt-own-voice`) only when you specifically need the older workflow names.
Generated audio and remembered voice state now default to `~/.openclaw/data/qwen-audio-lab/` instead of the skill folder.

### Local macOS speech

```bash
# The sample --text means "The meeting is starting, don't forget to bring your laptop."
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py mac-say \
  --text "开会了，别忘了带电脑" \
  --voice Tingting
```
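
The source does not say how `mac-say` is implemented. Assuming it wraps the built-in macOS `say` command, the equivalent direct call could be built as in this sketch; `build_say_argv` is a hypothetical name, while `-v` (voice) and `-o` (output file) are standard `say` options.

```python
def build_say_argv(text, voice=None, output_file=None):
    """Build an argument list for the macOS `say` command (illustrative only)."""
    argv = ["say"]
    if voice:
        argv += ["-v", voice]        # for example "Tingting"
    if output_file:
        argv += ["-o", output_file]  # write audio to a file instead of playing it
    argv.append(text)
    return argv
```

On a Mac, `subprocess.run(build_say_argv("hello", voice="Tingting"), check=True)` would speak the text aloud.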

### Qwen TTS from inline text

```bash
# The sample --text means "Hello, I am your voice assistant."
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py qwen-tts \
  --text "你好，我是你的语音助手。" \
  --voice Cherry \
  --model qwen3-tts-flash \
  --language-type Chinese \
  --download
```

### Qwen TTS from a text file

```bash
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py qwen-tts \
  --text-file /path/to/script.txt \
  --voice Cherry \
  --download
```

### Qwen TTS from stdin

```bash
cat /path/to/script.txt | python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py qwen-tts \
  --stdin \
  --voice Cherry \
  --download
```

### Clone a voice

```bash
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py clone-voice \
  --audio /path/to/reference.mp3 \
  --name claw-voice-01 \
  --target-model qwen3-tts-vc-2026-01-22
```

- Keep the cloning `--target-model` aligned with the synthesis model family.
- Use a clean speech sample with minimal background noise.
- Ask before cloning a third party's voice when consent is unclear.

### Design a voice from a text prompt

```bash
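# The sample --prompt means "A composed middle-aged male announcer with a deep, full voice, suited to documentary narration."
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py design-voice \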
  --prompt "沉稳的中年男性播音员，音色低沉浑厚，适合纪录片旁白。" \
  --name doc-voice-01 \
  --target-model qwen3-tts-vd-2026-01-26 \
  --preview-format wav
```

### Legacy command: reuse the latest cloned voice

```bash
# The sample --text means "Hello, this is a test of my voice."
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py speak-last-cloned \
  --text "你好，这是我的声音测试。" \
  --download
```

### High-level narration from any text source

```bash
# The sample --text means "This is the body text to convert to speech."
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py narrate-text \
  --text "这是要转成语音的正文" \
  --output narration.wav

python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py narrate-file \
  --text-file /path/to/script.txt
```

- Default voice source is `last-cloned`.
- Use `--voice-source last-designed` to use the latest designed voice instead.
- Use `--voice` and optionally `--model` to force a specific voice id and synthesis model.
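
The precedence described in these bullets can be sketched as follows. The state layout (a dict keyed by `last-cloned` and `last-designed`), the function name, and the error types are all assumptions made for illustration; the source does not describe how remembered voices are stored.

```python
VOICE_SOURCES = ("last-cloned", "last-designed")


def resolve_voice(state, voice=None, voice_source="last-cloned"):
    """Return the voice id to synthesize with (hypothetical helper)."""
    # An explicit --voice always wins over any remembered voice.
    if voice:
        return voice
    if voice_source not in VOICE_SOURCES:
        raise ValueError(f"unknown voice source: {voice_source!r}")
    remembered = state.get(voice_source)
    if not remembered:
        # Nothing remembered yet: the caller should clone or design a voice first.
        raise LookupError(f"no remembered voice for {voice_source}")
    return remembered
```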

### Legacy command: narrate PPT speaker notes with the latest cloned voice

```bash
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py ppt-own-voice --ppt "/path/to/file.pptx"
```

### High-level PPT narration

```bash
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py narrate-ppt --ppt "/path/to/file.pptx"
```

- Default voice source is `last-cloned`.
- Use `--voice-source last-designed` to switch to the latest designed voice.
- Use `--voice` and optionally `--model` to force a specific voice id and synthesis model.
- Keep `ppt-own-voice` as the backward-compatible alias for the original workflow.

### Inspect or manage remembered voices

```bash
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py list-voices
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py show-last-voice --kind cloned
python3 ~/.openclaw/skills/qwen-audio-lab/scripts/qwen_audio.py delete-voice --voice claw-voice-01
```

## Workflow rules

- Reuse an existing cloned voice before asking for a new sample.
- Ask for a reference recording if the user wants their own voice and no cloned voice exists yet.
- Prefer the `narrate-*` commands as the primary high-level interface for narration tasks.
- Keep `speak-last-cloned` and `ppt-own-voice` for backward compatibility with older workflows.
- After segmented synthesis, keep only the final output by default; keep the intermediate fragments only if the user explicitly asks for them.
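
Two of the rules above, reusing a cloned voice before requesting a new sample and discarding fragments by default, can be expressed as simple decisions. Both helpers are hypothetical sketches; the names, the return values, and the assumption that `cloned_voices` is ordered oldest to newest are illustrative only.

```python
def plan_own_voice_step(cloned_voices):
    """Decide the next step when the user wants narration in their own voice."""
    # Assumption: cloned_voices lists voice ids from oldest to newest.
    if cloned_voices:
        return ("reuse", cloned_voices[-1])
    return ("ask-for-reference-recording", None)


def files_to_keep(final_output, fragments, keep_fragments=False):
    """List the files that should remain after segmented synthesis."""
    kept = [final_output]
    if keep_fragments:
        kept.extend(fragments)  # only when the user explicitly asks
    return kept
```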
