text-to-voice

Convert text to speech using Kyutai's Pocket TTS. Use when the user asks to "generate speech", "text to speech", "TTS", "convert text to audio", "voice synthesis", "generate voice", "read aloud", or "create audio from text". Supports voice cloning from audio samples and multiple pre-made voices (alba, marius, javert, jean, fantine, cosette, eponine, azelma).

by diegosouzapw

Best use case

text-to-voice is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using text-to-voice should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/text-to-voice/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/text-to-voice/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/text-to-voice/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How text-to-voice Compares

Feature / Agent	text-to-voice	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

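What does this skill do?

It converts text to speech using Kyutai's Pocket TTS, either with one of the pre-made voices or with a voice cloned from an audio sample.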

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Text-to-Voice with Kyutai Pocket TTS

Convert text to natural speech using Kyutai's Pocket TTS, a lightweight 100M-parameter model that runs efficiently on CPU.

## Installation

```bash
pip install pocket-tts
# or use uvx to run without installing:
uvx pocket-tts generate
```

Requires Python 3.10+ and PyTorch 2.5+. GPU not required.

## CLI Usage

### Basic Generation

```bash
# Generate with defaults (saves to ./tts_output.wav)
uvx pocket-tts generate

# Specify text
pocket-tts generate --text "Hello, this is my message."

# Specify output file location
pocket-tts generate --text "Hello" --output-path ./audio/greeting.wav

# Full example with all common options
pocket-tts generate \
  --text "Welcome to the demo." \
  --voice alba \
  --output-path ./output/welcome.wav
```

### CLI Options

| Option | Default | Description |
|--------|---------|-------------|
| `--text` | "Hello world..." | Text to convert to speech |
| `--voice` | alba | Voice name, local file path, or HuggingFace URL |
| `--output-path` | `./tts_output.wav` | **Where to save the generated audio file** |
| `--temperature` | 0.7 | Generation temperature (higher = more expressive) |
| `--lsd-decode-steps` | 1 | Quality steps (higher = better quality, slower) |
| `--eos-threshold` | -4.0 | End detection threshold (lower = finish earlier) |
| `--frames-after-eos` | auto | Extra frames after end (each frame = 80ms) |
| `--device` | cpu | Device to use (cpu/cuda) |
| `-q, --quiet` | false | Disable logging output |

### Voice Selection (CLI)

```bash
# Use a pre-made voice by name
pocket-tts generate --voice alba --text "Hello"
pocket-tts generate --voice javert --text "Hello"

# Use a local audio file for voice cloning
pocket-tts generate --voice ./my_voice.wav --text "Hello"

# Use a voice from HuggingFace
pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/merchant.wav" --text "Hello"
```

### Quality Tuning (CLI)

```bash
# Higher quality (more generation steps)
pocket-tts generate --lsd-decode-steps 5 --temperature 0.5 --output-path high_quality.wav

# More expressive/varied output
pocket-tts generate --temperature 1.0 --output-path expressive.wav

# Shorter output (a threshold lower than the -4.0 default finishes speaking earlier)
pocket-tts generate --eos-threshold -5.0 --output-path shorter.wav
```
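
When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the flags listed in the options table; `build_generate_command` is a hypothetical helper name, not part of pocket-tts, and the block only prints the command rather than running it.

```python
import shlex


def build_generate_command(text, voice="alba", output_path="./tts_output.wav",
                           temperature=None, lsd_decode_steps=None,
                           eos_threshold=None):
    """Build an argument list for `pocket-tts generate`.

    Hypothetical helper, not part of pocket-tts. Tuning flags are added only
    when a value is given, so the CLI defaults apply otherwise.
    """
    cmd = ["pocket-tts", "generate",
           "--text", text,
           "--voice", voice,
           "--output-path", output_path]
    optional_flags = {
        "--temperature": temperature,
        "--lsd-decode-steps": lsd_decode_steps,
        "--eos-threshold": eos_threshold,
    }
    for flag, value in optional_flags.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_generate_command("Welcome to the demo.", voice="javert",
                             output_path="./output/welcome.wav",
                             lsd_decode_steps=5, temperature=0.5)

# Print a shell-quoted version for inspection.
print(shlex.join(cmd))
```

To execute the command, pass the list to `subprocess.run(cmd, check=True)`. Passing a list instead of a single string avoids shell-quoting problems with text that contains quotes or spaces.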

### Local Web Server

For quick iteration with multiple voices/texts:

```bash
uvx pocket-tts serve
# Open http://localhost:8000
```

## Available Voices

Pre-made voices (use name directly with `--voice`):

| Voice | Gender | License | Description |
|-------|--------|---------|-------------|
| `alba` | Female | CC BY 4.0 | Casual voice |
| `marius` | Male | CC0 | Voice donation |
| `javert` | Male | CC0 | Voice donation |
| `jean` | Male | CC-NC | EARS dataset |
| `fantine` | Female | CC BY 4.0 | VCTK dataset |
| `cosette` | Female | CC-NC | Expresso dataset |
| `eponine` | Female | CC BY 4.0 | VCTK dataset |
| `azelma` | Female | CC BY 4.0 | VCTK dataset |

Full voice catalog: https://huggingface.co/kyutai/tts-voices

For detailed voice information, see [references/voices.md](references/voices.md).

## Voice Cloning

Clone any voice from an audio sample. For best results:
- Use clean audio (minimal background noise)
- 10+ seconds recommended
- Consider [Adobe Podcast Enhance](https://podcast.adobe.com/en/enhance) to clean samples

```bash
pocket-tts generate --voice ./my_recording.wav --text "Hello" --output-path cloned.wav
```
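
Before cloning, the length recommendation above can be checked with a few lines of standard-library Python. This is a minimal sketch that assumes the sample is a PCM WAV file (the `wave` module does not read other formats); the helper names are hypothetical, and the silent file written here is only stand-in data.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path, min_seconds=10.0):
    """Check a sample against the recommended minimum of 10 seconds."""
    return wav_duration_seconds(path) >= min_seconds


def write_silence(path, seconds, sample_rate=24000):
    """Write a silent 16-bit mono WAV file, used here only as stand-in data."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "my_recording.wav")
    write_silence(sample_path, seconds=12)
    duration = wav_duration_seconds(sample_path)
    sample_ok = is_long_enough(sample_path)

print(f"{duration:.1f} s, long enough: {sample_ok}")
```

With a real recording, call `is_long_enough("./my_recording.wav")` directly. The check covers length only; it says nothing about background noise.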

## Output Format

- Sample Rate: 24kHz
- Channels: Mono
- Format: 16-bit PCM WAV
- Default location: `./tts_output.wav`
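
A generated file can be checked against these values with the standard-library `wave` module. The sketch below is an illustration under stated assumptions: `read_wav_format` and `has_expected_format` are hypothetical helper names, and an in-memory silent WAV stands in for a real `tts_output.wav`.

```python
import io
import wave

# Values taken from the Output Format list
EXPECTED_FORMAT = {"sample_rate": 24000, "channels": 1, "bits_per_sample": 16}


def read_wav_format(source):
    """Read format fields from a WAV path or a binary file object."""
    with wave.open(source, "rb") as wav_file:
        return {
            "sample_rate": wav_file.getframerate(),
            "channels": wav_file.getnchannels(),
            "bits_per_sample": wav_file.getsampwidth() * 8,
        }


def has_expected_format(source):
    """True if the file is 24 kHz, mono, 16-bit PCM."""
    return read_wav_format(source) == EXPECTED_FORMAT


# Stand-in data: 0.1 s of silence written to an in-memory WAV.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(24000)
    wav_file.writeframes(b"\x00\x00" * 2400)
buffer.seek(0)

format_ok = has_expected_format(buffer)
print(format_ok)
```

After running the CLI, the same check would be `has_expected_format("./tts_output.wav")`.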

## Python API

For programmatic use:

```python
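from pathlib import Path

import scipy.io.wavfile
from pocket_tts import TTSModel

# Load the model and build a voice state from a pre-made voice
tts_model = TTSModel.load_model()
voice_state = tts_model.get_state_for_audio_prompt("alba")
audio = tts_model.generate_audio(voice_state, "Hello world!")

# Save to a specific location, creating the directory if it is missing
output_path = Path("./audio/output.wav")
output_path.parent.mkdir(parents=True, exist_ok=True)
scipy.io.wavfile.write(str(output_path), tts_model.sample_rate, audio.numpy())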
```

### TTSModel.load_model()

```python
model = TTSModel.load_model(
    variant="b6369a24",      # Model variant
    temp=0.7,                # Temperature (0.0-1.0)
    lsd_decode_steps=1,      # Generation steps
    noise_clamp=None,        # Max noise value
    eos_threshold=-4.0       # End-of-sequence threshold
)
```

### Voice State

```python
# Pre-made voice
voice_state = model.get_state_for_audio_prompt("alba")

# Local file
voice_state = model.get_state_for_audio_prompt("./my_voice.wav")

# HuggingFace
voice_state = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")
```

### Generate Audio

```python
audio = model.generate_audio(voice_state, "Text to speak")
# Returns: torch.Tensor (1D)
```

### Streaming

```python
for chunk in model.generate_audio_stream(voice_state, "Long text..."):
    # Process each chunk as it's generated
    pass
```
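
One way to process chunks as they arrive is to append each one to a WAV file, so memory use stays flat for long texts. The sketch below is not part of pocket-tts: `write_stream_to_wav` is a hypothetical helper, and it assumes each chunk is a flat sequence of float samples in the range -1.0 to 1.0. A generated tone stands in for the real stream so the block runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def write_stream_to_wav(chunks, path, sample_rate=24000):
    """Append each chunk to a 16-bit mono WAV file as it arrives.

    Assumes each chunk is a flat sequence of floats in [-1.0, 1.0].
    Returns the total number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Clamp to the valid range, then scale to signed 16-bit integers.
            samples = [max(-1.0, min(1.0, float(s))) for s in chunk]
            pcm = struct.pack(f"<{len(samples)}h",
                              *(int(s * 32767) for s in samples))
            wav_file.writeframes(pcm)
            total += len(samples)
    return total


def fake_stream(num_chunks=3, chunk_size=1920, sample_rate=24000):
    """Stand-in for a real audio stream: yields chunks of a 440 Hz tone."""
    for chunk_index in range(num_chunks):
        start = chunk_index * chunk_size
        yield [0.5 * math.sin(2 * math.pi * 440 * (start + i) / sample_rate)
               for i in range(chunk_size)]


with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "streamed.wav")
    written = write_stream_to_wav(fake_stream(), out_path)
    with wave.open(out_path, "rb") as wav_file:
        frames_on_disk = wav_file.getnframes()

print(written, frames_on_disk)
```

With the real model, and assuming each streamed chunk is a 1D `torch.Tensor` like the return value of `generate_audio`, the input could be `(chunk.tolist() for chunk in model.generate_audio_stream(voice_state, "Long text..."))`.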

### Properties

- `model.sample_rate` - 24000 Hz
- `model.device` - "cpu" or "cuda"

## Performance

- ~200ms latency to first audio chunk
- ~6x real-time on MacBook Air M4 CPU
- Uses only 2 CPU cores
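
As a rough guide, the real-time factor translates into generation time by simple division. The sketch below takes the ~6x figure above as its default; `estimate_generation_seconds` is a hypothetical helper, and actual speed will vary with hardware and settings such as `--lsd-decode-steps`.

```python
def estimate_generation_seconds(audio_seconds, realtime_factor=6.0):
    """Rough time needed to synthesize `audio_seconds` of speech."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


for audio_seconds in (10, 60, 300):
    estimate = estimate_generation_seconds(audio_seconds)
    print(f"{audio_seconds:>4} s of audio -> about {estimate:.1f} s to generate")
```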

## Limitations

- English only
- No built-in pause/silence control
