text-to-voice
Convert text to speech using Kyutai's Pocket TTS. Use when the user asks to "generate speech", "text to speech", "TTS", "convert text to audio", "voice synthesis", "generate voice", "read aloud", or "create audio from text". Supports voice cloning from audio samples and multiple pre-made voices (alba, marius, javert, jean, fantine, cosette, eponine, azelma).
Best use case
text-to-voice is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using text-to-voice can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/text-to-voice/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How text-to-voice Compares
| Feature / Agent | text-to-voice | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Text-to-Voice with Kyutai Pocket TTS
Convert text to natural speech using Kyutai's Pocket TTS - a lightweight 100M parameter model that runs efficiently on CPU.
## Installation
```bash
pip install pocket-tts
# or use uvx to run without installing:
uvx pocket-tts generate
```
Requires Python 3.10+ and PyTorch 2.5+. GPU not required.
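A quick preflight check can confirm these minimums before any audio is generated. The following is a minimal stdlib-only sketch; the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and it assumes PyTorch is installed under the distribution name `torch`.

```python
import re
import sys
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version, minimum):
    """Compare two version strings component-wise, e.g. '2.5.1+cpu' against '2.5'."""
    return parse_version(version) >= parse_version(minimum)


def check_environment():
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    try:
        # Assumption: PyTorch is installed under the distribution name "torch".
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch_version, "2.5"):
            problems.append(f"PyTorch 2.5+ is required, found {torch_version}")
    return problems
```

Calling `check_environment()` before a batch job gives a readable list of what is missing instead of an import error halfway through.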
## CLI Usage
### Basic Generation
```bash
# Generate with defaults (saves to ./tts_output.wav)
uvx pocket-tts generate
# Specify text
pocket-tts generate --text "Hello, this is my message."
# Specify output file location
pocket-tts generate --text "Hello" --output-path ./audio/greeting.wav
# Full example with all common options
pocket-tts generate \
  --text "Welcome to the demo." \
  --voice alba \
  --output-path ./output/welcome.wav
```
### CLI Options
| Option | Default | Description |
|--------|---------|-------------|
| `--text` | "Hello world..." | Text to convert to speech |
| `--voice` | alba | Voice name, local file path, or HuggingFace URL |
| `--output-path` | `./tts_output.wav` | **Where to save the generated audio file** |
| `--temperature` | 0.7 | Generation temperature (higher = more expressive) |
| `--lsd-decode-steps` | 1 | Quality steps (higher = better quality, slower) |
| `--eos-threshold` | -4.0 | End detection threshold (lower = finish earlier) |
| `--frames-after-eos` | auto | Extra frames after end (each frame = 80ms) |
| `--device` | cpu | Device to use (cpu/cuda) |
| `-q, --quiet` | false | Disable logging output |
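When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only flags listed in the table above; the function names `build_generate_command` and `run_generate` are hypothetical, and it assumes `pocket-tts` is on the `PATH`.

```python
import subprocess


def build_generate_command(text, voice="alba", output_path="./tts_output.wav",
                           temperature=None, lsd_decode_steps=None, device=None):
    """Build the argument list for `pocket-tts generate` from the documented options."""
    command = ["pocket-tts", "generate", "--text", text,
               "--voice", voice, "--output-path", output_path]
    # Optional flags are added only when set, so the CLI defaults apply otherwise.
    optional = {
        "--temperature": temperature,
        "--lsd-decode-steps": lsd_decode_steps,
        "--device": device,
    }
    for flag, value in optional.items():
        if value is not None:
            command += [flag, str(value)]
    return command


def run_generate(text, **options):
    """Run the CLI and raise CalledProcessError if generation fails."""
    subprocess.run(build_generate_command(text, **options), check=True)
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems when the text contains quotes or other shell metacharacters.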
### Voice Selection (CLI)
```bash
# Use a pre-made voice by name
pocket-tts generate --voice alba --text "Hello"
pocket-tts generate --voice javert --text "Hello"
# Use a local audio file for voice cloning
pocket-tts generate --voice ./my_voice.wav --text "Hello"
# Use a voice from HuggingFace
pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/merchant.wav" --text "Hello"
```
### Quality Tuning (CLI)
```bash
# Higher quality (more generation steps)
pocket-tts generate --lsd-decode-steps 5 --temperature 0.5 --output-path high_quality.wav
# More expressive/varied output
pocket-tts generate --temperature 1.0 --output-path expressive.wav
# Shorter output (finishes speaking earlier; use a value below the -4.0 default)
pocket-tts generate --eos-threshold -5.0 --output-path shorter.wav
```
### Local Web Server
For quick iteration with multiple voices/texts:
```bash
uvx pocket-tts serve
# Open http://localhost:8000
```
## Available Voices
Pre-made voices (use name directly with `--voice`):
| Voice | Gender | License | Description |
|-------|--------|---------|-------------|
| `alba` | Female | CC BY 4.0 | Casual voice |
| `marius` | Male | CC0 | Voice donation |
| `javert` | Male | CC0 | Voice donation |
| `jean` | Male | CC-NC | EARS dataset |
| `fantine` | Female | CC BY 4.0 | VCTK dataset |
| `cosette` | Female | CC-NC | Expresso dataset |
| `eponine` | Female | CC BY 4.0 | VCTK dataset |
| `azelma` | Female | CC BY 4.0 | VCTK dataset |
Full voice catalog: https://huggingface.co/kyutai/tts-voices
For detailed voice information, see [references/voices.md](references/voices.md).
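The license column matters when the generated audio is used commercially. As a small illustration, the sketch below copies the licenses from the table above into a dictionary and filters out the voices marked non-commercial (CC-NC). The names `VOICE_LICENSES` and `voices_allowing_commercial_use` are hypothetical, and the result is only as accurate as the table; note that CC BY 4.0 still requires attribution.

```python
# Licenses copied from the pre-made voice table above.
VOICE_LICENSES = {
    "alba": "CC BY 4.0",
    "marius": "CC0",
    "javert": "CC0",
    "jean": "CC-NC",
    "fantine": "CC BY 4.0",
    "cosette": "CC-NC",
    "eponine": "CC BY 4.0",
    "azelma": "CC BY 4.0",
}


def voices_allowing_commercial_use(licenses=VOICE_LICENSES):
    """Return the names of voices whose license is not marked non-commercial."""
    return sorted(name for name, lic in licenses.items() if "NC" not in lic)
```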
## Voice Cloning
Clone any voice from an audio sample. For best results:
- Use clean audio (minimal background noise)
- 10+ seconds recommended
- Consider [Adobe Podcast Enhance](https://podcast.adobe.com/en/enhance) to clean samples
```bash
pocket-tts generate --voice ./my_recording.wav --text "Hello" --output-path cloned.wav
```
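The length recommendation can be checked before a clip is used for cloning. This is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files only; the helper names `clip_duration_seconds` and `is_long_enough` are hypothetical.

```python
import wave


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def is_long_enough(path, minimum_seconds=10.0):
    """True if the clip meets the recommended 10+ second length."""
    return clip_duration_seconds(path) >= minimum_seconds
```

A check like this only covers length; background noise still has to be judged by ear or cleaned with a separate tool.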
## Output Format
- Sample Rate: 24kHz
- Channels: Mono
- Format: 16-bit PCM WAV
- Default location: `./tts_output.wav`
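These format values determine how large the output files are. The arithmetic below is a sketch of that relationship for the audio data only (it ignores the small WAV header); the constant and function names are hypothetical.

```python
SAMPLE_RATE = 24_000    # samples per second (24kHz)
BYTES_PER_SAMPLE = 2    # 16-bit PCM
CHANNELS = 1            # mono


def pcm_bytes(seconds):
    """Size in bytes of the raw audio data for a clip of the given length."""
    return round(seconds * SAMPLE_RATE) * BYTES_PER_SAMPLE * CHANNELS


def duration_from_bytes(num_bytes):
    """Length in seconds of a clip whose raw audio data has the given size."""
    return num_bytes / (SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS)
```

For example, `pcm_bytes(60)` is 2,880,000, so one minute of speech takes roughly 2.9 MB on disk.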
## Python API
For programmatic use:
```python
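from pathlib import Path

import scipy.io.wavfile
from pocket_tts import TTSModel

# Load the model and build a voice state from a pre-made voice
tts_model = TTSModel.load_model()
voice_state = tts_model.get_state_for_audio_prompt("alba")
audio = tts_model.generate_audio(voice_state, "Hello world!")

# Save to a specific location, creating the directory if needed
output_path = Path("./audio/output.wav")
output_path.parent.mkdir(parents=True, exist_ok=True)
scipy.io.wavfile.write(str(output_path), tts_model.sample_rate, audio.numpy())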
```
### TTSModel.load_model()
```python
model = TTSModel.load_model(
    variant="b6369a24",    # Model variant
    temp=0.7,              # Temperature (0.0-1.0)
    lsd_decode_steps=1,    # Generation steps
    noise_clamp=None,      # Max noise value
    eos_threshold=-4.0     # End-of-sequence threshold
)
```
### Voice State
```python
# Pre-made voice
voice_state = model.get_state_for_audio_prompt("alba")
# Local file
voice_state = model.get_state_for_audio_prompt("./my_voice.wav")
# HuggingFace
voice_state = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")
```
### Generate Audio
```python
audio = model.generate_audio(voice_state, "Text to speak")
# Returns: torch.Tensor (1D)
```
### Streaming
```python
for chunk in model.generate_audio_stream(voice_state, "Long text..."):
    # Process each chunk as it's generated
    pass
```
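One way to process the chunks is to append each one to a WAV file as it arrives, so the full audio never has to be held in memory. The sketch below is stdlib-only and works on plain sequences of floats. It assumes, without confirmation from the library, that each streamed chunk is a 1D float tensor with values in [-1.0, 1.0] that can be converted with `.tolist()`. The names `floats_to_pcm16` and `write_stream_to_wav` are hypothetical.

```python
import struct
import wave


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes, clipping outliers."""
    ints = [int(max(-1.0, min(1.0, sample)) * 32767) for sample in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def write_stream_to_wav(chunks, path, sample_rate=24000):
    """Write each chunk to a mono 16-bit WAV file as it arrives; return samples written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(floats_to_pcm16(chunk))
            total += len(chunk)
    return total
```

Under the assumption above, usage would look like `write_stream_to_wav((c.tolist() for c in model.generate_audio_stream(voice_state, text)), "out.wav")`.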
### Properties
- `model.sample_rate` - 24000 Hz
- `model.device` - "cpu" or "cuda"
## Performance
- ~200ms latency to first audio chunk
- ~6x real-time on MacBook Air M4 CPU
- Uses only 2 CPU cores
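"6x real-time" means audio is produced about six times faster than it plays back. The sketch below turns that into a rough planning estimate; the function name `generation_seconds` is hypothetical, the default factor is the MacBook Air M4 figure quoted above, and the result excludes model loading time. Other hardware will differ.

```python
def generation_seconds(audio_seconds, realtime_factor=6.0):
    """Rough time needed to generate a clip, given a speed factor relative to playback."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor
```

At the quoted speed, `generation_seconds(60)` is 10.0, so a one-minute clip takes about ten seconds to generate.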
## Limitations
- English only
- No built-in pause/silence control
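One possible workaround for the missing pause control, which is not a Pocket TTS feature, is to generate each phrase separately and splice silence between the results. The sketch below works on plain Python lists of samples; the names `silence` and `join_with_pauses` are hypothetical. With tensors from `generate_audio`, the same idea could be expressed with `torch.zeros` and `torch.cat`.

```python
def silence(seconds, sample_rate=24000):
    """Return zero-valued samples lasting the given number of seconds."""
    return [0.0] * round(seconds * sample_rate)


def join_with_pauses(segments, pause_seconds=0.5, sample_rate=24000):
    """Concatenate sample sequences, inserting silence between consecutive segments."""
    joined = []
    for index, segment in enumerate(segments):
        if index > 0:
            joined.extend(silence(pause_seconds, sample_rate))
        joined.extend(segment)
    return joined
```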