voice-stt-tts

Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS

3,891 stars

Best use case

voice-stt-tts is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS

Teams using voice-stt-tts should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/voice-stt-tts/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/aksenkin/voice-stt-tts/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/voice-stt-tts/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How voice-stt-tts Compares

Feature / Agentvoice-stt-ttsStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Voice Messages (STT + TTS) for OpenClaw 🎙️

Complete voice message setup using **faster-whisper** for transcription and **Edge TTS** for voice replies.

## What we configure

- ✅ **STT** (Speech-to-Text) — transcribe voice messages via faster-whisper
- ✅ **TTS** (Text-to-Speech) — voice replies via Edge TTS
- 🎯 **Result:** voice → text → reply with voice

---

## Installation

### 1. Create virtual environment (venv)

For Ubuntu create an isolated venv:

```bash
python3 -m venv ~/.openclaw/workspace/voice-messages
```

### 2. Install faster-whisper

Install packages in venv:

```bash
~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper
```

**What gets installed:**
- `faster-whisper` — Python library for transcription
- Dependencies: `ctranslate2`, `onnxruntime`, `huggingface-hub`, `av`, `numpy`, and others.
- Size: ~250 MB

---

## Transcription Script

### Path and content

**File:** `~/.openclaw/workspace/voice-messages/transcribe.py`

```python
#!/usr/bin/env python3
import argparse
from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small", lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()

    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()
```

**What the script does:**
1. Accepts audio file path (`--audio`)
2. Loads Whisper model (`--model`): `small` by default
3. Sets language (`--lang`): `en` for English
4. Transcribes with VAD filter (Voice Activity Detection)
5. Outputs clean text to stdout

### Make file executable:

```bash
chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
```

---

## OpenClaw Configuration

### 1. Configure STT (`tools.media.audio`)

Add to `~/.openclaw/openclaw.json`:

```json5
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}
```

**Parameters:**

| Parameter | Value | Description |
|-----------|----------|-----------|
| `enabled` | `true` | Enable audio transcription |
| `maxBytes` | `20971520` | Max file size (20 MB) |
| `type` | `"cli"` | Model type: CLI command |
| `command` | Python path | Path to python in venv |
| `args` | argument array | Arguments for script |
| `{{MediaPath}}` | placeholder | Replaced with audio file path |
| `timeoutSeconds` | `120` | Transcription timeout (2 minutes) |

### 2. Configure TTS (`messages.tts`)

Add to `~/.openclaw/openclaw.json`:

```json5
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}
```

**Parameters:**

| Parameter | Value | Description |
|-----------|----------|-----------|
| `auto` | `"inbound"` | **Key mode!** — reply with voice only on incoming voice messages |
| `provider` | `"edge"` | TTS provider (free, no API key) |
| `voice` | `"en-US-JennyNeural"` | Voice (see available below) |
| `lang` | `"en-US"` | Locale (en-US for US english) |

### 3. Full configuration example

```json5
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    },
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
```

---

## Apply Changes

### Restart Gateway

```bash
# Method 1: via openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)
```

---

## Testing

### Test STT (transcription)

**Action:** Send a voice message to your Telegram bot

**Expected result:**
```
[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>
```

**Example response:**
```
[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio> Transcript: Hello. How are you?
```

### Test TTS (voice replies)

**Action:** After successful transcription, bot should send a voice reply

**Expected result:**
- Voice file arrives in Telegram
- Voice note (round bubble)

**Expected behavior:**
- Incoming voice → bot replies with voice
- Text messages → bot replies with text (this is normal!)

---

## Available Edge TTS Voices

### Female voices

| Voice | ID | Usage example |
|--------|-----|------------------|
| Jenny | `en-US-JennyNeural` | ← current |
| Ana | `en-US-AnaNeural` | Softer |

### Male voices

| Voice | ID | Usage example |
|--------|-----|------------------|
| Dmitry | `en-US-RogerNeural` | More bass |

**How to change voice:**
```bash
cat ~/.openclaw/openclaw.json | \
  jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' > ~/.openclaw/openclaw.json.tmp
mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
```

---

## Additional Edge TTS Parameters

### Adjusting speed, pitch, volume

```json5
{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",      // Speed: -50% to +100%
        "pitch": "-5%",     // Pitch: -50% to +50%
        "volume": "+5%"     // Volume: -100% to +100%
      }
    }
  }
}
```

---

## Troubleshooting

### Problem: Voice not transcribed

**Logs show:**
```
[ERROR] Transcription failed
```

**Possible causes:**
1. **File too large** — > 20 MB
   ```bash
   # Solution: Increase maxBytes in config
   maxBytes: 52428800  # 50 MB
   ```

2. **Timeout** — transcription took > 2 minutes
   ```bash
   # Solution: Increase timeoutSeconds
   timeoutSeconds: 180  # 3 minutes
   ```

3. **Model not downloaded** — first run
   ```bash
   # Solution: Wait while it downloads (1-2 minutes)
   # Models are cached in ~/.cache/huggingface/
   ```

### Problem: No voice reply

**Possible causes:**
1. **Reply too short** (< 10 characters)
   - TTS skips very short replies
   - Solution: this is expected behavior

2. **auto: "inbound"** but text message
   - TTS in `inbound` mode replies with voice only on **voice messages**
   - Text messages get text replies — this is correct!

3. **Edge TTS unavailable**
   ```bash
   # Check
   curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
   # If error — temporarily unavailable
   ```

---

## Performance

### Transcription time (Raspberry Pi 4/ARM)

| Whisper Model | Est. time | Quality |
|---------------|--------------|---------|
| `tiny` | ~5-10 sec | Low |
| `base` | ~10-20 sec | Medium |
| `small` | ~20-40 sec | High ← current |
| `medium` | ~40-80 sec | Very high |
| `large` | ~80-160 sec | Maximum |

**Recommendation:** For Raspberry Pi use `small` or `base`. `medium`/`large` will be very slow.

### Where Whisper models are stored

```bash
~/.cache/huggingface/
```

Models download automatically on first run.

## Done! 🎉

After completing these steps:

1. ✅ faster-whisper installed in venv
2. ✅ `transcribe.py` script created
3. ✅ OpenClaw configured (STT + TTS)
4. ✅ Gateway restarted
5. ✅ Voice messages working

Now your Telegram bot:
- 🎙️ **Accepts voice** → transcribes via faster-whisper
- 🎤 **Replies with voice** → generates via Edge TTS
- 💬 **Accepts text** → replies with text (as usual)

---

**Useful links:**
- OpenClaw docs: https://docs.openclaw.ai
- TTS docs: https://docs.openclaw.ai/tts
- Audio docs: https://docs.openclaw.ai/nodes/audio
- Install skills: `npx clawhub search voice`

---

*Created: 2026-03-01 for OpenClaw 2026.2.26*

Related Skills

Invoice Generator

3891
from openclaw/skills

Creates professional invoices in markdown and HTML

Workflow & Productivity

brand-voice-generator

3891
from openclaw/skills

Creates consistent brand voice guidelines and content. Generates copy that matches your brand personality across all channels. Perfect for startups building their identity.

Content & Documentation

invoice-ocr

3891
from openclaw/skills

发票 OCR 识别技能。扫描文件夹中的发票文件(PDF/图片),调用阿里云 OCR API 识别发票信息并导出到 Excel 表格。支持 17+ 种发票类型(增值税发票、火车票、出租车票、机票行程单、定额发票、机动车销售发票、过路过桥费发票等)。使用场景:(1) 用户提到"发票识别"、"发票统计"、"发票整理"、"发票汇总" (2) 用户需要批量处理发票 (3) 用户提到阿里云 OCR 识别发票。**重要:首次使用必须先配置阿里云凭证,主动向用户索要 AccessKey ID 和 AccessKey Secret,或引导用户运行 --config 命令自行配置。**

Workflow & Productivity

Bland AI — Voice Calling Skill

3891
from openclaw/skills

Make and manage AI-powered phone calls via the Bland AI API.

Workflow & Productivity

afrexai-invoice-engine

3880
from openclaw/skills

Generate, manage, and track professional invoices with payment terms, recurring billing, overdue automation, and financial reporting. Use when creating invoices, tracking payments, managing clients, or reviewing revenue.

Workflow & Productivity

voice-tts

3891
from openclaw/skills

语音输入(Whisper ASR)+ 语音输出(Edge TTS)技能,支持 agent 专属音色,可调用 send_voice_reply.mjs 发送 Telegram 语音消息。

amber-voice-assistant

3891
from openclaw/skills

AI phone assistant and virtual receptionist for OpenClaw. Answers inbound phone calls, screens callers, makes outbound phone calls, and books appointments — all over Twilio + OpenAI Realtime voice. Full telephone workflow: phone call screening, live call transcripts, CRM contact memory, calendar integration. Ideal for anyone who wants an AI to answer their phone, handle call screening, or make phone calls autonomously. Includes interactive setup wizard, live call dashboard, and human-in-the-loop escalation. Also ships as a Claude Desktop MCP plugin — dial phone numbers, check call history, query CRM, and manage calendar directly from Claude Desktop.

discord-voice

3891
from openclaw/skills

Real-time voice conversations in Discord voice channels with Claude AI

feishu-voice-assistant

3891
from openclaw/skills

Sends voice messages (audio) to Feishu chats using Duby TTS.

invoice-chaser

3891
from openclaw/skills

Automated invoice follow-up sequences that escalate from friendly to firm. Track unpaid invoices, send timed reminder emails with escalating tone, log payment interactions, and generate AR aging reports. Your agent handles the awkward conversations so you don't have to — preserving cash flow and client relationships while you focus on actual work. Configure invoice tracking, email templates per stage (friendly → firm → final notice), timing rules, and let your agent chase payments 24/7. Use when adding invoices, running payment chases, checking status, or generating accounts receivable reports.

voiceclaw

3891
from openclaw/skills

Local voice I/O for OpenClaw agents. Transcribe inbound audio/voice messages using local Whisper (whisper.cpp) and generate voice replies using local Piper TTS. Requires whisper, piper, and ffmpeg pre-installed on the system. All inference runs on-device — no network calls, no cloud APIs, no API keys. Use when an agent receives a voice/audio message and should respond in both voice and text, or when any text response should be synthesized and sent as audio. Triggers on: voice messages, audio attachments, respond in voice, send as audio, speak this, voiceclaw.

anvevoice

3891
from openclaw/skills

Add AI voice assistants to your website. Engage visitors with natural voice conversations, capture leads, automate support, and boost conversions.