edge-tts
Text-to-speech conversion using node-edge-tts npm package for generating audio from text. Supports multiple voices, languages, speed adjustment, pitch control, and subtitle generation. Use when: (1) User requests audio/voice output with the "tts" trigger or keyword. (2) Content needs to be spoken rather than read (multitasking, accessibility, driving, cooking). (3) User wants a specific voice, speed, pitch, or format for TTS output.
About this skill
The Edge-TTS skill empowers AI agents to convert any given text into high-quality speech audio. It integrates with Microsoft Edge's powerful neural text-to-speech capabilities, ensuring natural-sounding output across various languages and voices. Agents can easily trigger this functionality to provide auditory responses, enhance content accessibility, or deliver information in a spoken format. Beyond basic text-to-speech, this skill offers robust customization options. Users can specify a preferred voice, language, and adjust parameters like speech rate, pitch, and volume to fine-tune the audio output. Additionally, it has the unique capability to generate subtitles alongside the audio, providing a comprehensive solution for multimedia content creation. This skill is ideal for AI applications requiring dynamic audio generation, such as voice assistants, accessibility tools, educational platforms, or any scenario where information needs to be consumed audibly rather than visually. It simplifies the process of creating engaging and accessible spoken content directly within the agent's workflow.
Best use case
The primary use case for the Edge-TTS skill is to provide dynamic, customizable text-to-speech output to users, enhancing accessibility and user experience by converting textual information into spoken words. It particularly benefits users who prefer or require auditory information, such as those multitasking, with visual impairments, or in contexts like driving or cooking where reading is impractical, making digital content more versatile.
Text-to-speech conversion using node-edge-tts npm package for generating audio from text. Supports multiple voices, languages, speed adjustment, pitch control, and subtitle generation. Use when: (1) User requests audio/voice output with the "tts" trigger or keyword. (2) Content needs to be spoken rather than read (multitasking, accessibility, driving, cooking). (3) User wants a specific voice, speed, pitch, or format for TTS output.
The user should expect to receive an audio file (e.g., MP3) of the provided text, potentially with accompanying subtitles, delivered via the current communication channel.
Practical example
Example input
Can you convert 'The quick brown fox jumps over the lazy dog.' to speech using a neutral US English voice and save the subtitles?
Example output
MEDIA: /path/to/the_quick_brown_fox.mp3 (subtitles saved to /path/to/the_quick_brown_fox.json)
When to use this skill
- When the user explicitly requests audio or voice output using triggers like 'tts'.
- When content needs to be consumed audibly for multitasking, accessibility, or specific scenarios (e.g., driving, cooking).
- When the user desires specific voice characteristics, speed, pitch, or subtitles for the TTS output.
- When an AI agent needs to provide spoken feedback or read out lengthy textual information.
When not to use this skill
- When the user specifically requests text-based output and no audio is desired.
- If the agent's operating environment does not support audio playback or file handling.
- For extremely short texts where the overhead of TTS conversion might not be justified.
- When local, offline TTS processing is required, as this skill relies on an external service.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/tts-1/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How edge-tts Compares
| Feature / Agent | edge-tts | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | easy | N/A |
Frequently Asked Questions
What does this skill do?
Text-to-speech conversion using node-edge-tts npm package for generating audio from text. Supports multiple voices, languages, speed adjustment, pitch control, and subtitle generation. Use when: (1) User requests audio/voice output with the "tts" trigger or keyword. (2) Content needs to be spoken rather than read (multitasking, accessibility, driving, cooking). (3) User wants a specific voice, speed, pitch, or format for TTS output.
How difficult is it to install?
The installation complexity is rated as easy. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
Top AI Agents for Productivity
See the top AI agent skills for productivity, workflow automation, operational systems, documentation, and everyday task execution.
Best AI Skills for ChatGPT
Find the best AI skills to adapt into ChatGPT workflows for research, writing, summarization, planning, and repeatable assistant tasks.
AI Agents for Freelancers
Browse AI agent skills for freelancers handling client research, proposals, outreach, delivery systems, documentation, and repeatable admin work.
SKILL.md Source
# Edge-TTS Skill
## Overview
Generate high-quality text-to-speech audio using Microsoft Edge's neural TTS service via the node-edge-tts npm package. Supports multiple languages, voices, adjustable speed/pitch, and subtitle generation.
## Quick Start
When you detect TTS intent from triggers or user request:
1. **Call the tts tool** (Clawdbot built-in) to convert text to speech
2. The tool returns a MEDIA: path
3. Clawdbot routes the audio to the current channel
```javascript
// Example: Built-in tts tool usage
tts("Your text to convert to speech")
// Returns: MEDIA: /path/to/audio.mp3
```
## Trigger Detection
Recognize "tts" keyword as TTS requests. The skill automatically filters out TTS-related keywords from text before conversion to avoid converting the trigger words themselves to audio.
## Advanced Customization
### Using the Node.js Scripts
For more control, use the bundled scripts directly:
#### TTS Converter
```bash
cd scripts
npm install
node tts-converter.js "Your text" --voice en-US-AriaNeural --rate +10% --output output.mp3
```
**Options:**
- `--voice, -v`: Voice name (default: en-US-AriaNeural)
- `--lang, -l`: Language code (e.g., en-US, es-ES)
- `--format, -o`: Output format (default: audio-24khz-48kbitrate-mono-mp3)
- `--pitch`: Pitch adjustment (e.g., +10%, -20%, default)
- `--rate, -r`: Rate adjustment (e.g., +10%, -20%, default)
- `--volume`: Volume adjustment (e.g., +0%, -10%, default)
- `--save-subtitles, -s`: Save subtitles as JSON file
- `--output, -f`: Output file path (default: tts_output.mp3)
- `--proxy, -p`: Proxy URL (e.g., http://localhost:7890)
- `--timeout`: Request timeout in milliseconds (default: 10000)
- `--list-voices, -L`: List available voices
#### Configuration Manager
```bash
cd scripts
npm install
node config-manager.js --set-voice en-US-AriaNeural
node config-manager.js --set-rate +10%
node config-manager.js --get
node config-manager.js --reset
```
### Voice Selection
Common voices (use `--list-voices` for full list):
**English:**
- `en-US-MichelleNeural` (female, natural, **default**)
- `en-US-AriaNeural` (female, natural)
- `en-US-GuyNeural` (male, natural)
- `en-GB-SoniaNeural` (female, British)
- `en-GB-RyanNeural` (male, British)
**Other Languages:**
- `es-ES-ElviraNeural` (Spanish, Spain)
- `fr-FR-DeniseNeural` (French)
- `de-DE-KatjaNeural` (German)
- `ja-JP-NanamiNeural` (Japanese)
- `zh-CN-XiaoxiaoNeural` (Chinese)
- `ar-SA-ZariyahNeural` (Arabic)
### Rate Guidelines
Rate values use percentage format:
- `"default"`: Normal speed
- `"-20%"` to `"-10%"`: Slow, clear (tutorials, stories, accessibility)
- `"+10%"` to `"+20%"`: Slightly fast (summaries)
- `"+30%"` to `"+50%"`: Fast (news, efficiency)
### Output Formats
Choose audio quality based on use case:
- `audio-24khz-48kbitrate-mono-mp3`: Standard quality (voice notes, messages)
- `audio-24khz-96kbitrate-mono-mp3`: High quality (presentations, content)
- `audio-48khz-96kbitrate-stereo-mp3`: Highest quality (professional audio, music)
## Resources
### scripts/tts-converter.js
Main TTS conversion script using node-edge-tts. Generates audio files with customizable voice, rate, volume, pitch, and format. Supports subtitle generation and voice listing.
### scripts/config-manager.js
Manages persistent user preferences for TTS settings (voice, language, format, pitch, rate, volume). Stores config in `~/.tts-config.json`.
### scripts/package.json
NPM package configuration with node-edge-tts dependency.
### references/node_edge_tts_guide.md
Complete documentation for node-edge-tts npm package including:
- Full voice list by language
- Prosody options (rate, pitch, volume)
- Usage examples (CLI and Module)
- Subtitle generation
- Output formats
- Best practices and limitations
### Voice Testing
Test different voices and preview audio quality at: https://tts.travisvn.com/
Refer to this when you need specific voice details or advanced features.
## Installation
To use the bundled scripts:
```bash
cd /home/user/clawd/skills/public/tts-skill/scripts
npm install
```
This installs:
- `node-edge-tts` - TTS library
- `commander` - CLI argument parsing
## Workflow
1. **Detect intent**: Check for "tts" trigger or keyword in user message
2. **Choose method**: Use built-in `tts` tool for simple requests, or `scripts/tts-converter.js` for customization
3. **Generate audio**: Convert the target text (message, search results, summary)
4. **Return to user**: The tts tool returns a MEDIA: path; Clawdbot handles delivery
## Testing
### Basic Test
Run the test script to verify TTS functionality:
```bash
cd /home/user/clawd/skills/public/edge-tts/scripts
npm test
```
This generates a test audio file and verifies the TTS service is working.
### Voice Testing
Test different voices and preview audio quality at: https://tts.travisvn.com/
### Integration Test
Use the built-in `tts` tool for quick testing:
```javascript
// Example: Test TTS with default settings
tts("This is a test of the TTS functionality.")
```
### Configuration Test
Verify configuration persistence:
```bash
cd /home/user/clawd/skills/public/edge-tts/scripts
node config-manager.js --get
node config-manager.js --set-voice en-US-GuyNeural
node config-manager.js --get
```
## Troubleshooting
- **Test connectivity**: Run `npm test` to check if TTS service is accessible
- **Check voice availability**: Use `node tts-converter.js --list-voices` to see available voices
- **Verify proxy settings**: If using proxy, test with `node tts-converter.js "test" --proxy http://localhost:7890`
- **Check audio output**: The test should generate `test-output.mp3` in the scripts directory
## Notes
- node-edge-tts uses Microsoft Edge's online TTS service (updated, working authentication)
- No API key needed (free service)
- Output is MP3 format by default
- Requires internet connection
- Supports subtitle generation (JSON format with word-level timing)
- **Temporary File Handling**: By default, audio files are saved to the system's temporary directory (`/tmp/edge-tts-temp/` on Unix, `C:\Users\<user>\AppData\Local\Temp\edge-tts-temp\` on Windows) with unique filenames (e.g., `tts_1234567890_abc123.mp3`). Files are not automatically deleted - the calling application (Clawdbot) should handle cleanup after use. You can specify a custom output path with the `--output` option if permanent storage is needed.
- **TTS keyword filtering**: The skill automatically filters out TTS-related keywords (tts, TTS, text-to-speech) from text before conversion to avoid converting the trigger words themselves to audio
- For repeated preferences, use `config-manager.js` to set defaults
- **Default voice**: `en-US-MichelleNeural` (female, natural)
- Neural voices (ending in `Neural`) provide higher quality than Standard voicesRelated Skills
find-skills
Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.
filesystem
Advanced filesystem operations for listing files, searching content, batch processing, and directory analysis. Supports recursive search, file type filtering, size analysis, and batch operations like copy/move/delete. Use when you need to: list directory contents, search for files by name or content, analyze directory structures, perform batch file operations, or analyze file sizes and distribution.
Budget & Expense Tracker — AI Agent Financial Command Center
Track every dollar, enforce budgets, spot spending patterns, and build wealth — all through natural conversation with your AI agent.
yt-dlp
A robust CLI wrapper for yt-dlp to download videos, playlists, and audio from YouTube and thousands of other sites. Supports format selection, quality control, metadata embedding, and cookie authentication.
time-checker
Check accurate current time, date, and timezone information for any location worldwide using time.is. Use when the user asks "what time is it in X", "current time in Y", or needs to verify timezone offsets.
pihole-ctl
Manage and monitor local Pi-hole instance. Query FTL database for statistics (blocked ads, top clients) and control service via CLI. Use when user asks "how many ads blocked", "pihole status", or "update gravity".
mermaid-architect
Generate beautiful, hand-drawn Mermaid diagrams with robust syntax (quoted labels, ELK layout). Use this skill when the user asks for "diagram", "flowchart", "sequence diagram", or "visualize this process".
memory-cache
High-performance temporary storage system using Redis. Supports namespaced keys (mema:*), TTL management, and session context caching. Use for: (1) Saving agent state, (2) Caching API results, (3) Sharing data between sub-agents.
mema
Mema's personal brain - SQLite metadata index for documents and Redis short-term context buffer. Use for organizing workspace knowledge paths and managing ephemeral session state.
file-organizer-skill
Organize files in directories by grouping them into folders based on their extensions or date. Includes Dry-Run, Recursive, and Undo capabilities.
media-compress
Compress and convert images and videos using ffmpeg. Use when the user wants to reduce file size, change format, resize, or optimize media files. Handles common formats like JPG, PNG, WebP, MP4, MOV, WebM. Triggers on phrases like "compress image", "compress video", "reduce file size", "convert to webp/mp4", "resize image", "make image smaller", "batch compress", "optimize media".
ocr-local
Extract text from images using Tesseract.js OCR (100% local, no API key required). Supports Chinese (simplified/traditional) and English.