text-to-voice

Convert text to speech using Kyutai's Pocket TTS. Use when the user asks to "generate speech", "text to speech", "TTS", "convert text to audio", "voice synthesis", "generate voice", "read aloud", or "create audio from text". Supports voice cloning from audio samples and multiple pre-made voices (alba, marius, javert, jean, fantine, cosette, eponine, azelma).

Best use case

text-to-voice is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using text-to-voice should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/text-to-voice/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/text-to-voice/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/text-to-voice/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How text-to-voice Compares

| Feature / Agent | text-to-voice | Standard Approach |
|-----------------|---------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

Where can I find the source code?

The source is on GitHub in the diegosouzapw/awesome-omni-skill repository, at skills/data-ai/text-to-voice/SKILL.md.

SKILL.md Source

# Text-to-Voice with Kyutai Pocket TTS

Convert text to natural speech using Kyutai's Pocket TTS - a lightweight 100M parameter model that runs efficiently on CPU.

## Installation

```bash
pip install pocket-tts
# or use uvx to run without installing:
uvx pocket-tts generate
```

Requires Python 3.10+ and PyTorch 2.5+. GPU not required.

## CLI Usage

### Basic Generation

```bash
# Generate with defaults (saves to ./tts_output.wav)
uvx pocket-tts generate

# Specify text
pocket-tts generate --text "Hello, this is my message."

# Specify output file location
pocket-tts generate --text "Hello" --output-path ./audio/greeting.wav

# Full example with all common options
pocket-tts generate \
  --text "Welcome to the demo." \
  --voice alba \
  --output-path ./output/welcome.wav
```

### CLI Options

| Option | Default | Description |
|--------|---------|-------------|
| `--text` | "Hello world..." | Text to convert to speech |
| `--voice` | alba | Voice name, local file path, or HuggingFace URL |
| `--output-path` | `./tts_output.wav` | **Where to save the generated audio file** |
| `--temperature` | 0.7 | Generation temperature (higher = more expressive) |
| `--lsd-decode-steps` | 1 | Quality steps (higher = better quality, slower) |
| `--eos-threshold` | -4.0 | End detection threshold (lower = finish earlier) |
| `--frames-after-eos` | auto | Extra frames after end (each frame = 80ms) |
| `--device` | cpu | Device to use (cpu/cuda) |
| `-q, --quiet` | false | Disable logging output |
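
When the CLI is driven from a script, the options in the table can be assembled into an argument list. The following is a minimal sketch rather than part of Pocket TTS: `build_generate_command` is a hypothetical helper, and it uses only flags listed in the table above.

```python
import shlex


def build_generate_command(text, voice="alba", output_path="./tts_output.wav",
                           temperature=None, lsd_decode_steps=None, quiet=False):
    """Build an argument list for `pocket-tts generate` (hypothetical helper).

    Optional settings left as None are omitted so the CLI defaults apply.
    """
    cmd = ["pocket-tts", "generate", "--text", text,
           "--voice", voice, "--output-path", output_path]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if lsd_decode_steps is not None:
        cmd += ["--lsd-decode-steps", str(lsd_decode_steps)]
    if quiet:
        cmd.append("--quiet")
    return cmd


cmd = build_generate_command("Welcome to the demo.", temperature=0.5, lsd_decode_steps=5)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To execute: subprocess.run(cmd, check=True), which needs pocket-tts on PATH.
```

Passing a list to `subprocess.run` avoids shell quoting problems when the text contains quotes or other special characters.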

### Voice Selection (CLI)

```bash
# Use a pre-made voice by name
pocket-tts generate --voice alba --text "Hello"
pocket-tts generate --voice javert --text "Hello"

# Use a local audio file for voice cloning
pocket-tts generate --voice ./my_voice.wav --text "Hello"

# Use a voice from HuggingFace
pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/merchant.wav" --text "Hello"
```

### Quality Tuning (CLI)

```bash
# Higher quality (more generation steps)
pocket-tts generate --lsd-decode-steps 5 --temperature 0.5 --output-path high_quality.wav

# More expressive/varied output
pocket-tts generate --temperature 1.0 --output-path expressive.wav

# Shorter output (a threshold below the -4.0 default finishes earlier, per the options table)
pocket-tts generate --eos-threshold -5.0 --output-path shorter.wav
```

### Local Web Server

For quick iteration with multiple voices/texts:

```bash
uvx pocket-tts serve
# Open http://localhost:8000
```

## Available Voices

Pre-made voices (use name directly with `--voice`):

| Voice | Gender | License | Description |
|-------|--------|---------|-------------|
| `alba` | Female | CC BY 4.0 | Casual voice |
| `marius` | Male | CC0 | Voice donation |
| `javert` | Male | CC0 | Voice donation |
| `jean` | Male | CC-NC | EARS dataset |
| `fantine` | Female | CC BY 4.0 | VCTK dataset |
| `cosette` | Female | CC-NC | Expresso dataset |
| `eponine` | Female | CC BY 4.0 | VCTK dataset |
| `azelma` | Female | CC BY 4.0 | VCTK dataset |

Full voice catalog: https://huggingface.co/kyutai/tts-voices

For detailed voice information, see [references/voices.md](references/voices.md).
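
Because the pre-made voices carry different licenses, it can help to filter them before choosing one. This is a small sketch that copies the license column from the table above; the helper name is hypothetical, and treating any license containing "NC" as non-commercial is an assumption of this example.

```python
# License per pre-made voice, copied from the table above.
VOICE_LICENSES = {
    "alba": "CC BY 4.0",
    "marius": "CC0",
    "javert": "CC0",
    "jean": "CC-NC",
    "fantine": "CC BY 4.0",
    "cosette": "CC-NC",
    "eponine": "CC BY 4.0",
    "azelma": "CC BY 4.0",
}


def voices_without_nc_restriction(licenses=VOICE_LICENSES):
    """Return voice names whose license is not marked non-commercial (NC)."""
    return sorted(name for name, lic in licenses.items() if "NC" not in lic)


print(voices_without_nc_restriction())
```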

## Voice Cloning

Clone any voice from an audio sample. For best results:
- Use clean audio (minimal background noise)
- 10+ seconds recommended
- Consider [Adobe Podcast Enhance](https://podcast.adobe.com/en/enhance) to clean samples

```bash
pocket-tts generate --voice ./my_recording.wav --text "Hello" --output-path cloned.wav
```
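
Before cloning, a sample can be checked against the 10+ second recommendation. The sketch below uses Python's standard `wave` module, so it assumes the sample is an uncompressed PCM WAV file; both function names are hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path, minimum_seconds=10.0):
    """Check a sample against the 10+ second recommendation."""
    return wav_duration_seconds(path) >= minimum_seconds
```

For example, `is_long_enough("./my_recording.wav")` returns `False` for a five-second clip.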

## Output Format

- Sample Rate: 24kHz
- Channels: Mono
- Format: 16-bit PCM WAV
- Default location: `./tts_output.wav`
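
The format above fixes the size of the audio data: 24,000 samples per second, one channel, and 2 bytes per sample give 48,000 bytes per second of speech. A short worked calculation, with a hypothetical helper name, and excluding the WAV header:

```python
def pcm_data_bytes(duration_seconds, sample_rate=24000, channels=1, bits_per_sample=16):
    """Size of the raw sample data for PCM audio, excluding the WAV header."""
    frames = round(duration_seconds * sample_rate)
    return frames * channels * (bits_per_sample // 8)


print(pcm_data_bytes(10))  # sample data for 10 seconds of speech
```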

## Python API

For programmatic use:

```python
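from pathlib import Path

import scipy.io.wavfile
from pocket_tts import TTSModel

# Load the model and build the voice state for a pre-made voice
tts_model = TTSModel.load_model()
voice_state = tts_model.get_state_for_audio_prompt("alba")
audio = tts_model.generate_audio(voice_state, "Hello world!")

# Save to a specific location (create the directory first; the write fails if it is missing)
output_path = Path("./audio/output.wav")
output_path.parent.mkdir(parents=True, exist_ok=True)
scipy.io.wavfile.write(str(output_path), tts_model.sample_rate, audio.numpy())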
```

### TTSModel.load_model()

```python
model = TTSModel.load_model(
    variant="b6369a24",      # Model variant
    temp=0.7,                # Temperature (0.0-1.0)
    lsd_decode_steps=1,      # Generation steps
    noise_clamp=None,        # Max noise value
    eos_threshold=-4.0       # End-of-sequence threshold
)
```

### Voice State

```python
# Pre-made voice
voice_state = model.get_state_for_audio_prompt("alba")

# Local file
voice_state = model.get_state_for_audio_prompt("./my_voice.wav")

# HuggingFace
voice_state = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")
```

### Generate Audio

```python
audio = model.generate_audio(voice_state, "Text to speak")
# Returns: torch.Tensor (1D)
```

### Streaming

```python
for chunk in model.generate_audio_stream(voice_state, "Long text..."):
    # Process each chunk as it's generated
    pass
```
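
One way to use the stream is to track how much audio has arrived so far. This sketch assumes each chunk is a 1D sequence of samples that supports `len()` (true of a 1D `torch.Tensor`) and that the sample rate is 24000 Hz; `track_stream` is a hypothetical helper, and the stand-in chunks let it run without the model.

```python
SAMPLE_RATE = 24000  # Hz; assumed to match model.sample_rate


def track_stream(chunks, sample_rate=SAMPLE_RATE):
    """Yield (chunk, seconds_so_far) for each chunk of a stream.

    Assumes each chunk is a 1D sequence of samples that supports len().
    """
    total_samples = 0
    for chunk in chunks:
        total_samples += len(chunk)
        yield chunk, total_samples / sample_rate


# Stand-in chunks so the sketch runs without the model. With Pocket TTS,
# pass model.generate_audio_stream(voice_state, "Long text...") instead.
fake_stream = [[0.0] * 1920, [0.0] * 1920, [0.0] * 960]
for _, seconds in track_stream(fake_stream):
    print(f"{seconds:.2f} s of audio received")
```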

### Properties

- `model.sample_rate` - 24000 Hz
- `model.device` - "cpu" or "cuda"

## Performance

- ~200ms latency to first audio chunk
- ~6x real-time on MacBook Air M4 CPU
- Uses only 2 CPU cores
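
As a rough guide, the real-time figure can be turned into a time estimate. The sketch reads "~6x real-time" as generating audio about six times faster than playback, which is an interpretation, and actual speed depends on the hardware; the helper name is hypothetical.

```python
def estimated_generation_seconds(audio_seconds, realtime_factor=6.0):
    """Estimate wall-clock generation time for a given length of audio."""
    return audio_seconds / realtime_factor


# One minute of speech at roughly 6x real-time
print(estimated_generation_seconds(60))
```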

## Limitations

- English only
- No built-in pause/silence control
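
A possible workaround for the missing pause control is to generate each sentence as its own file and join the files with silence in between. The sketch below is not a Pocket TTS feature: the function names are hypothetical, it uses only the standard `wave` module, and it assumes every input is 24 kHz mono 16-bit PCM as listed under Output Format.

```python
import wave

SAMPLE_RATE = 24000  # Hz, from the Output Format section
SAMPLE_WIDTH = 2     # bytes per sample (16-bit PCM)


def silence_bytes(milliseconds, sample_rate=SAMPLE_RATE, sample_width=SAMPLE_WIDTH):
    """Return raw mono PCM silence of the requested length."""
    n_samples = sample_rate * milliseconds // 1000
    return b"\x00" * (n_samples * sample_width)


def join_with_pauses(input_paths, output_path, pause_ms=500):
    """Concatenate mono 16-bit WAV files, inserting silence between them.

    Assumes all inputs share the 24 kHz mono 16-bit format; no resampling is done.
    """
    with wave.open(output_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(SAMPLE_WIDTH)
        out.setframerate(SAMPLE_RATE)
        for index, path in enumerate(input_paths):
            if index > 0:
                out.writeframes(silence_bytes(pause_ms))
            with wave.open(path, "rb") as segment:
                out.writeframes(segment.readframes(segment.getnframes()))
```

For example, `join_with_pauses(["part1.wav", "part2.wav"], "joined.wav", pause_ms=800)` places 0.8 seconds of silence between two generated segments.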
