voice-audio-engineer

Expert in voice synthesis, TTS, voice cloning, podcast production, speech processing, and voice UI design via ElevenLabs integration. Specializes in vocal clarity, loudness standards (LUFS), de-essing, dialogue mixing, and voice transformation. Activate on 'TTS', 'text-to-speech', 'voice clone', 'voice synthesis', 'ElevenLabs', 'podcast', 'voice recording', 'speech-to-speech', 'voice UI', 'audiobook', 'dialogue'. NOT for spatial audio (use sound-engineer), music production (use DAW tools), game audio middleware (use sound-engineer), sound effects generation (use sound-engineer with ElevenLabs SFX), or live concert audio.

by curiositech

View on GitHub

Best use case

voice-audio-engineer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using voice-audio-engineer can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/voice-audio-engineer/SKILL.md --create-dirs "https://raw.githubusercontent.com/curiositech/some_claude_skills/main/.claude/skills/voice-audio-engineer/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/voice-audio-engineer/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How voice-audio-engineer Compares

| Feature / Agent | voice-audio-engineer | Standard Approach |
|-----------------|----------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

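What does this skill do?

It gives an AI agent expertise in voice synthesis, TTS, voice cloning, podcast production, speech processing, and voice UI design through ElevenLabs integration.
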
Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Voice & Audio Engineer: Voice Synthesis, TTS & Speech Processing

Expert in voice synthesis, speech processing, and vocal production using ElevenLabs and professional audio techniques. Specializes in TTS, voice cloning, podcast production, and voice UI design.

## When to Use This Skill

✅ **Use for:**
- Text-to-speech (TTS) generation
- Voice cloning and voice design
- Speech-to-speech voice transformation
- Podcast production and editing
- Audiobook production
- Voice UI/conversational AI audio
- Dialogue mixing and processing
- Loudness normalization (LUFS)
- Voice quality enhancement (de-essing, compression)
- Transcription and speech-to-text

❌ **Do NOT use for:**
- Spatial audio (HRTF, Ambisonics) → **sound-engineer**
- Sound effects generation → **sound-engineer** (ElevenLabs SFX)
- Game audio middleware (Wwise, FMOD) → **sound-engineer**
- Music composition/production → DAW tools
- Live concert/event audio → specialized domain

## MCP Integrations

| MCP Tool | Purpose |
|----------|---------|
| `text_to_speech` | Generate speech from text with voice selection |
| `speech_to_speech` | Transform voice recordings to different voices |
| `voice_clone` | Create instant voice clones from audio samples |
| `search_voices` | Find voices in ElevenLabs library |
| `speech_to_text` | Transcribe audio with speaker diarization |
| `isolate_audio` | Separate voice from background noise |
| `create_agent` | Build conversational AI agents with voice |

## Expert vs Novice Shibboleths

| Topic | Novice | Expert |
|-------|--------|--------|
| **TTS quality** | "Any voice works" | Matches voice to brand; considers emotion, pace, style |
| **Voice cloning** | "Upload any audio" | Knows 1-3 min of clean, varied, single-speaker speech works best (30s is the bare minimum) |
| **Loudness** | "Make it loud" | Targets -16 to -19 LUFS for podcasts; -14 for streaming |
| **De-essing** | "Doesn't matter" | Knows sibilance lives at 5-8kHz; frequency-selective compression |
| **Compression** | "Squash it" | Uses 3:1-4:1 for dialogue; slow attack (10-20ms) to preserve transients |
| **High-pass** | "Never use it" | Always HPF at 80-100Hz for voice; removes rumble, plosives |
| **True peak** | "Peak is peak" | Knows intersample peaks can exceed 0 dBFS even when sample peaks do not; targets -1 dBTP |
| **ElevenLabs models** | "Use default" | `eleven_multilingual_v2` for quality; `eleven_flash_v2_5` for speed |

## Common Anti-Patterns

### Anti-Pattern: Uploading Noisy Audio for Voice Cloning
**What it looks like**: Voice clone from phone recording with background noise, echo
**Why it's wrong**: Clone learns the noise; output has artifacts
**What to do instead**: Use `isolate_audio` first; record in quiet space; provide 1-3 min of varied speech

### Anti-Pattern: Ignoring Loudness Standards
**What it looks like**: Podcast at -6 LUFS, then normalized by platform → crushed dynamics
**Why it's wrong**: Each platform normalizes differently; too loud = distortion, too quiet = inaudible
**What to do instead**: Master to -16 LUFS for podcasts; -14 LUFS for streaming; always check true peak < -1 dBTP

### Anti-Pattern: TTS Without Voice Matching
**What it looks like**: Using default robotic voice for premium product
**Why it's wrong**: Voice IS brand; wrong voice = wrong emotional connection
**What to do instead**: `search_voices` to find matching tone; consider custom clone for brand consistency

### Anti-Pattern: No De-essing on Processed Voice
**What it looks like**: "SSSSibilant" speech after compression and EQ boost
**Why it's wrong**: Compression brings up sibilance; EQ boost at 3-5kHz makes it worse
**What to do instead**: De-ess at 5-8kHz before compression; use frequency-selective compression

### Anti-Pattern: Single Take, No Editing
**What it looks like**: Podcast with 20 "ums", breath sounds, long pauses
**Why it's wrong**: Listeners fatigue; unprofessional; reduces engagement
**What to do instead**: Edit out filler words; gate or manually cut breaths; tighten pacing

## Evolution Timeline

### Pre-2020: Robotic TTS
- Concatenative synthesis (spliced recordings)
- Obvious robotic quality
- Limited voice options

### 2020-2022: Neural TTS Emerges
- Tacotron- and WaveNet-style neural models (first published 2016-2017) reach mainstream products and improve naturalness
- Still detectable as synthetic
- Voice cloning requires hours of data

### 2023-2024: AI Voice Revolution
- ElevenLabs instant voice cloning (30 seconds)
- Near-human quality in TTS
- Real-time voice transformation
- Voice agents for customer service

### 2025+: Current Best Practices
- Emotional TTS (control tone, pace, emotion)
- Cross-lingual voice cloning
- Real-time voice transformation in apps
- Personalized voice agents
- Voice authentication integration

## Core Concepts

### ElevenLabs Voice Selection

**Model comparison:**
| Model | Quality | Latency | Languages | Use Case |
|-------|---------|---------|-----------|----------|
| `eleven_multilingual_v2` | Best | Higher | 29 | Production, quality-critical |
| `eleven_flash_v2_5` | Good | Lowest | 32 | Real-time, voice UI |
| `eleven_turbo_v2_5` | Better | Low | 32 | Balanced |

**Voice parameters:**
```python
# Stability: 0-1 (lower = more expressive, higher = more consistent)
# Similarity boost: 0-1 (higher = closer to original voice)
# Style: 0-1 (higher = more exaggerated style)

# For natural speech:
stability = 0.5       # Balanced expression
similarity = 0.75     # Close to voice but natural
style = 0.0           # Neutral (increase for dramatic)
```
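The three parameters above can be bundled and range-checked before they are passed to a TTS call. This is a minimal sketch: the `VoiceSettings` class and both presets are hypothetical helpers, the field names are assumed to mirror the parameter names used here, and the `DRAMATIC` values are illustrative rather than recommended.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container for the three voice parameters (each 0-1)."""
    stability: float = 0.5          # balanced expression
    similarity_boost: float = 0.75  # close to the voice but natural
    style: float = 0.0              # neutral

    def __post_init__(self):
        # Reject anything outside the documented 0-1 range.
        for name, value in asdict(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")


NATURAL = VoiceSettings()
# Illustrative values only: lower stability plus some style for a dramatic read.
DRAMATIC = VoiceSettings(stability=0.3, style=0.4)

print(asdict(NATURAL))
```

Validating up front turns an out-of-range value into an immediate local error instead of a rejected request later.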

### Voice Cloning Best Practices

**Audio requirements:**
- Duration: 1-3 minutes (more = better, diminishing returns after 3min)
- Quality: Clean, no background noise, no reverb
- Content: Varied speech (questions, statements, emotions)
- Format: WAV/MP3, 44.1kHz or higher

**Cloning workflow:**
1. `isolate_audio` to clean source material
2. `voice_clone` with cleaned audio
3. Test with varied prompts
4. Adjust stability/similarity for output quality
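Before step 2, it can help to check a sample against the audio requirements listed above. The sketch below uses only the standard-library `wave` module, so it assumes an uncompressed PCM WAV file; `wav_info` and `clone_sample_issues` are hypothetical helper names, and the thresholds are the 1-3 minute and 44.1 kHz figures from this section.

```python
import wave


def wav_info(path):
    """Return (duration_seconds, sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return w.getnframes() / rate, rate, w.getnchannels()


def clone_sample_issues(duration_s, sample_rate):
    """List the ways a sample misses the requirements above (empty = OK)."""
    issues = []
    if duration_s < 60:
        issues.append("shorter than 1 minute")
    elif duration_s > 180:
        issues.append("longer than 3 minutes (diminishing returns)")
    if sample_rate < 44100:
        issues.append("sample rate below 44.1 kHz")
    return issues


print(clone_sample_issues(45.0, 22050))
print(clone_sample_issues(120.0, 44100))
```

Noise and reverb cannot be judged from file metadata, so this check complements `isolate_audio` and a listening pass; it does not replace them.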

### Voice Processing Chain

**Standard voice chain (order matters!):**
```
[Raw Recording]
    ↓
[High-Pass Filter @ 80Hz]  ← Remove rumble, plosives
    ↓
[De-esser @ 5-8kHz]        ← Before compression!
    ↓
[Compressor 3:1, 10ms/100ms] ← Smooth dynamics
    ↓
[EQ: +2dB @ 3kHz presence] ← Clarity boost
    ↓
[Limiter -1 dBTP]          ← Prevent clipping
    ↓
[Loudness Norm -16 LUFS]   ← Target loudness; re-check true peak after gain
```
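To make the first stage of the chain concrete, here is a minimal sketch of a high-pass filter in pure Python. It is a first-order RC design (6 dB/oct), which is gentler than the filter a DAW plugin would normally use, and `high_pass` is a hypothetical helper rather than part of any tool named in this document.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz=80.0):
    """First-order (6 dB/oct) RC high-pass filter over a list of floats."""
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for i in range(1, len(samples)):
        # Each output follows changes in the input and decays otherwise.
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out


sr = 48000
# One second of pure DC offset: the filter should drain it away.
dc = high_pass([1.0] * sr, sr)
# One second of a 1 kHz tone: well above the cutoff, so it should pass.
tone = high_pass([math.sin(2 * math.pi * 1000 * n / sr) for n in range(sr)], sr)

print("DC residue:", abs(dc[-1]))
print("1 kHz peak:", max(abs(v) for v in tone[-480:]))
```

The same behaviour is why the filter sits first in the chain: rumble and plosive energy below the cutoff is removed before it can trigger the compressor.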

### Loudness Standards

| Platform/Format | Target Loudness | Peak Ceiling |
|-----------------|-----------------|--------------|
| Podcast | -16 to -19 LUFS | -1 dBTP |
| Audiobook (ACX) | -18 to -23 dB RMS (not LUFS) | -3 dBFS |
| YouTube | -14 LUFS | -1 dBTP |
| Spotify | -14 LUFS | -1 dBTP |
| Apple Music | about -16 LUFS | -1 dBTP |
| Broadcast (EBU R128) | -23 LUFS (±0.5 LU; ±1 LU for live) | -1 dBTP |

**Measurement:**
- LUFS = Loudness Units relative to Full Scale (integrated over the whole program)
- True Peak = Maximum level including intersample peaks
- Always measure with K-weighting (ITU-R BS.1770)
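The arithmetic behind normalization is simple once a meter has reported integrated loudness and true peak: a static gain shifts both by the same number of dB. The sketch below works one example; `normalization_plan` is a hypothetical helper, and the measured values are invented inputs, not measurements of real audio.

```python
def normalization_plan(measured_lufs, measured_peak_dbtp,
                       target_lufs=-16.0, ceiling_dbtp=-1.0):
    """Compute the static gain to hit a loudness target and flag peak overshoot."""
    gain_db = target_lufs - measured_lufs
    projected_peak = measured_peak_dbtp + gain_db
    return {
        "gain_db": round(gain_db, 2),
        "linear_gain": round(10 ** (gain_db / 20), 4),
        "projected_peak_dbtp": round(projected_peak, 2),
        "needs_limiting": projected_peak > ceiling_dbtp,
    }


# A quiet recording: -21.3 LUFS integrated, true peak at -4.0 dBTP.
plan = normalization_plan(-21.3, -4.0)
print(plan)
```

In this example the required +5.3 dB would push the peak to +1.3 dBTP, so the gain cannot be applied on its own; the material needs limiting to the -1 dBTP ceiling as well.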

### Conversational AI Agents

**ElevenLabs agent configuration:**
```python
create_agent(
    name="Support Agent",
    first_message="Hi, how can I help you today?",
    system_prompt="You are a helpful customer support agent...",
    voice_id="your_voice_id",
    language="en",
    llm="gemini-2.0-flash-001",  # Fast for conversation
    temperature=0.5,
    asr_quality="high",          # Speech recognition quality
    turn_timeout=7,              # Seconds before agent responds
    max_duration_seconds=300     # 5 minute call limit
)
```

**Voice UI considerations:**
- Use fast model (`eleven_flash_v2_5`) for real-time
- Keep responses concise (< 30 seconds)
- Add pauses for natural conversation flow
- Handle interruptions gracefully
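The 30-second guideline can be checked before synthesis by estimating spoken duration from word count. This is a rough sketch: the 150 words-per-minute rate is an assumption about conversational pace (real output depends on the voice and settings), and both function names are hypothetical.

```python
def estimated_seconds(text, words_per_minute=150):
    """Estimate spoken duration from word count at an assumed speaking rate."""
    return len(text.split()) * 60 / words_per_minute


def fits_voice_ui_budget(text, max_seconds=30, words_per_minute=150):
    """True if the reply should finish in under max_seconds when spoken."""
    return estimated_seconds(text, words_per_minute) < max_seconds


reply = "Your order shipped this morning and should arrive on Thursday."
print(estimated_seconds(reply), fits_voice_ui_budget(reply))
```

A reply that fails the check is a candidate for shortening or for splitting into turns, so the listener gets a chance to interrupt.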

## Quick Reference

### Voice Selection Decision Tree
- **Brand/professional content?** → Custom clone or curated voice
- **Real-time/interactive?** → `eleven_flash_v2_5` model
- **Quality-critical?** → `eleven_multilingual_v2` model
- **Multiple languages?** → Check language support per voice
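The model branches of the tree above can be written as a small function. This is only a sketch: `choose_model` is a hypothetical helper, and giving real-time needs priority over quality when both are set is an assumption, not something this document specifies.

```python
def choose_model(real_time=False, quality_critical=False):
    """Map the decision tree's model branches to a model ID."""
    if real_time:
        # Assumption: latency wins when both flags are set.
        return "eleven_flash_v2_5"
    if quality_critical:
        return "eleven_multilingual_v2"
    # Neither constraint dominates: the "Balanced" row of the model table.
    return "eleven_turbo_v2_5"


print(choose_model(real_time=True))
print(choose_model(quality_critical=True))
print(choose_model())
```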

### Processing Decision Tree
- **Voice sounds muddy?** → HPF at 80Hz, boost 3kHz
- **Sibilance harsh?** → De-ess at 5-8kHz
- **Inconsistent volume?** → Compress 3:1, then limit
- **Too quiet?** → Normalize to target LUFS
- **Background noise?** → Use `isolate_audio` first

### Common Settings
```
De-esser: 5-8kHz, -6dB reduction, Q=2
Compressor: 3:1 ratio, -20dB threshold, 10ms attack, 100ms release
EQ presence: +2-3dB shelf at 3kHz
HPF: 80-100Hz, 12dB/oct
Limiter: -1 dBTP ceiling
```
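The compressor line above is easier to reason about as a worked example. The sketch below computes only the static (steady-state) curve for a hard-knee compressor at the listed -20 dB threshold and 3:1 ratio; it ignores attack and release, and `compressed_level_db` is a hypothetical helper.

```python
def compressed_level_db(level_db, threshold_db=-20.0, ratio=3.0):
    """Static output level of a hard-knee compressor for a steady input level."""
    if level_db <= threshold_db:
        return level_db  # below threshold: unchanged
    # Above threshold: every `ratio` dB of input yields 1 dB of output.
    return threshold_db + (level_db - threshold_db) / ratio


for level in (-30.0, -20.0, -8.0):
    out = compressed_level_db(level)
    print(f"{level:6.1f} dB in -> {out:6.1f} dB out ({level - out:.1f} dB reduction)")
```

A phrase peaking at -8 dB sits 12 dB over the threshold, so it comes out 4 dB over, at -16 dB: 8 dB of gain reduction, while quiet passages at -30 dB are untouched.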

## Working With Speech Disfluencies

### Cluttering vs Stuttering

| Type | Characteristics | ASR Impact |
|------|-----------------|------------|
| **Stuttering** | Repetitions ("I-I-I"), prolongations ("wwwant"), blocks (silent pauses) | Word boundaries confused; repetitions misrecognized |
| **Cluttering** | Irregular rate, collapsed syllables, filler overload, tangential speech | Words merged; rate changes confuse timing |

### ASR Challenges with Disfluent Speech

Most ASR models are trained on fluent speech. Disfluencies cause:
- Word boundary detection errors
- Repetitions transcribed literally ("I I I want" vs "I want")
- Collapsed syllables missed entirely
- Timing models confused by irregular pace

### Solutions & Workarounds

**1. Model selection (rough ordering, best to worst for disfluencies; verify on your own audio):**
- **Whisper large-v3** - Most robust to disfluencies
- **ElevenLabs speech_to_text** - Good with varied speech
- **Google Speech-to-Text** - Decent with enhanced models
- **Fast/lightweight models** - Usually worst

**2. Pre-processing:**
```python
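# Slow the whole clip slightly before ASR (a crude, uniform rate normalization).
# Stretching only the irregular segments would require segment boundaries first.
import librosa
import soundfile as sf

y, sr = librosa.load("disfluent.wav", sr=16000)  # 16 kHz mono suits most ASR models
y_stretched = librosa.effects.time_stretch(y, rate=0.9)  # rate < 1 slows speech down
sf.write("disfluent_slow.wav", y_stretched, sr)  # save the result for the ASR step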
```

**3. Post-processing:**
- Remove duplicate words: "I I I want" → "I want"
- Filter common fillers: "um", "uh", "like", "you know"
- Use LLM to clean transcripts while preserving meaning
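The first two post-processing steps can be sketched with regular expressions. This is a minimal example, not a complete cleaner: `clean_transcript` is a hypothetical helper, the default filler list is deliberately limited to "um" and "uh" because "like" and "you know" often carry meaning, and collapsing repeats will also merge legitimate doubles such as "had had".

```python
import re

FILLERS = ("um", "uh")  # conservative default; extend with care


def clean_transcript(text, fillers=FILLERS):
    """Remove standalone fillers and collapse immediate word repetitions."""
    filler_re = re.compile(
        r"\b(?:" + "|".join(re.escape(f) for f in fillers) + r")\b,?\s*",
        re.IGNORECASE,
    )
    text = filler_re.sub("", text)
    # Collapse "I I I" and hyphenated stutters like "I-I-I" into one word.
    text = re.sub(r"\b(\w+)(?:[\s-]+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Tidy any double spaces left behind.
    return re.sub(r"\s{2,}", " ", text).strip()


print(clean_transcript("I I I want um, to to go"))
print(clean_transcript("Um, I-I-I want that"))
```

Keeping the raw transcript alongside the cleaned one lets a human reviewer restore anything the rules removed by mistake.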

**4. Fine-tuning Whisper (advanced):**
```python
# Fine-tune on a disfluent speech dataset
# Datasets: FluencyBank, UCLASS, SEP-28k (stuttering)
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_name = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(model_name)  # feature extractor + tokenizer
model = WhisperForConditionalGeneration.from_pretrained(model_name)
# Fine-tune on your speech samples with corrected transcripts:
# each training pair maps disfluent audio → fluent transcript
```

**5. ElevenLabs voice cloning approach:**
- Clone your voice from fluent segments
- Use TTS for fluent output with your voice
- Great for pre-recorded content, not live

### Accessibility Considerations

- Always provide manual transcript correction option
- Consider hybrid: ASR + human review
- For voice UI: longer timeout, confirmation prompts
- Test with actual users from target population

## Performance Targets

| Operation | Typical Time |
|-----------|--------------|
| TTS (100 words) | 2-5 seconds |
| Voice clone creation | 10-30 seconds |
| Speech-to-speech | 3-8 seconds |
| Transcription (1 min audio) | 5-15 seconds |
| Audio isolation | 5-20 seconds |

## Integrates With

- **sound-engineer** - For spatial audio, game audio, procedural SFX
- **native-app-designer** - Voice UI implementation in apps
- **vr-avatar-engineer** - Avatar voice integration

---

**For detailed implementations**: See `/references/implementations.md`

**Remember**: Voice is intimate—it speaks directly to the listener's brain. Match voice to brand, process for clarity not loudness, and always respect the platform's loudness standards. With ElevenLabs, you have instant access to professional voice synthesis; use it thoughtfully.

Related Skills

win31-audio-design

from curiositech/some_claude_skills

Expert in Windows 3.1 era sound vocabulary for modern web/mobile apps. Creates satisfying retro UI sounds using CC-licensed 8-bit audio, Web Audio API, and haptic coordination. Activate on 'win31 sounds', 'retro audio', '90s sound effects', 'chimes', 'tada', 'ding', 'satisfying UI sounds'. NOT for modern flat UI sounds, voice synthesis, or music composition.

vr-avatar-engineer

from curiositech/some_claude_skills

Expert in photorealistic and stylized VR avatar systems for Apple Vision Pro, Meta Quest, and cross-platform metaverse. Specializes in facial tracking (52+ blend shapes), subsurface scattering, Persona-style generation, Photon networking, and real-time LOD. Activate on 'VR avatar', 'Vision Pro Persona', 'Meta avatar', 'facial tracking', 'blend shapes', 'avatar networking', 'photorealistic avatar'. NOT for 2D profile pictures (use image generation), non-VR game characters (use game engine tools), static 3D models (use modeling tools), or motion capture hardware setup.

sound-engineer

from curiositech/some_claude_skills

Expert in spatial audio, procedural sound design, game audio middleware, and app UX sound design. Specializes in HRTF/Ambisonics, Wwise/FMOD integration, UI sound design, and adaptive music systems. Activate on 'spatial audio', 'HRTF', 'binaural', 'Wwise', 'FMOD', 'procedural sound', 'footstep system', 'adaptive music', 'UI sounds', 'notification audio', 'sonic branding'. NOT for music composition/production (use DAW), audio post-production for film (linear media), voice cloning/TTS (use voice-audio-engineer), podcast editing (use standard audio editors), or hardware design.

site-reliability-engineer

from curiositech/some_claude_skills

Docusaurus build health validation and deployment safety for Claude Skills showcase. Pre-commit MDX validation (Liquid syntax, angle brackets, prop mismatches), pre-build link checking, post-build health reports. Activate on 'build errors', 'commit hooks', 'deployment safety', 'site health', 'MDX validation'. NOT for general DevOps (use deployment-engineer), Kubernetes/cloud infrastructure (use kubernetes-architect), runtime monitoring (use observability-engineer), or non-Docusaurus projects.

prompt-engineer

from curiositech/some_claude_skills

Expert prompt optimization for LLMs and AI systems. Use PROACTIVELY when building AI features, improving agent performance, or crafting system prompts. Masters prompt patterns and techniques.

data-pipeline-engineer

from curiositech/some_claude_skills

Expert data engineer for ETL/ELT pipelines, streaming, data warehousing. Activate on: data pipeline, ETL, ELT, data warehouse, Spark, Kafka, Airflow, dbt, data modeling, star schema, streaming data, batch processing, data quality. NOT for: API design (use api-architect), ML training (use ML skills), dashboards (use design skills).

ai-engineer

from curiositech/some_claude_skills

Build production-ready LLM applications, advanced RAG systems, and intelligent agents. Implements vector search, multimodal AI, agent orchestration, and enterprise AI integrations. Use PROACTIVELY for LLM features, chatbots, AI agents, or AI-powered applications.

skill-coach

from curiositech/some_claude_skills

Guides creation of high-quality Agent Skills with domain expertise, anti-pattern detection, and progressive disclosure best practices. Use when creating skills, reviewing existing skills, or when users mention improving skill quality, encoding expertise, or avoiding common AI tooling mistakes. Activate on keywords: create skill, review skill, skill quality, skill best practices, skill anti-patterns. NOT for general coding advice or non-skill Claude Code features.

3d-cv-labeling-2026

from curiositech/some_claude_skills

Expert in 3D computer vision labeling tools, workflows, and AI-assisted annotation for LiDAR, point clouds, and sensor fusion. Covers SAM4D/Point-SAM, human-in-the-loop architectures, and vertical-specific training strategies. Activate on '3D labeling', 'point cloud annotation', 'LiDAR labeling', 'SAM 3D', 'SAM4D', 'sensor fusion annotation', '3D bounding box', 'semantic segmentation point cloud'. NOT for 2D image labeling (use clip-aware-embeddings), general ML training (use ml-engineer), video annotation without 3D (use computer-vision-pipeline), or VLM prompt engineering (use prompt-engineer).

wisdom-accountability-coach

from curiositech/some_claude_skills

Longitudinal memory tracking, philosophy teaching, and personal accountability with compassion. Expert in pattern recognition, Stoicism/Buddhism, and growth guidance. Activate on 'accountability', 'philosophy', 'Stoicism', 'Buddhism', 'personal growth', 'commitment tracking', 'wisdom teaching'. NOT for therapy or mental health treatment (refer to professionals), crisis intervention, or replacing professional coaching credentials.

windows-95-web-designer

from curiositech/some_claude_skills

Modern web applications with authentic Windows 95 aesthetic. Gradient title bars, Start menu paradigm, taskbar patterns, 3D beveled chrome. Extrapolates Win95 to AI chatbots, mobile UIs, responsive layouts. Activate on 'windows 95', 'win95', 'start menu', 'taskbar', 'retro desktop', '95 aesthetic', 'clippy'. NOT for Windows 3.1 (use windows-3-1-web-designer), vaporwave/synthwave, macOS, flat design.

windows-3-1-web-designer

from curiositech/some_claude_skills

Modern web applications with authentic Windows 3.1 aesthetic. Solid navy title bars, Program Manager navigation, beveled borders, single window controls. Extrapolates Win31 to AI chatbots (Cue Card paradigm), mobile UIs (pocket computing). Activate on 'windows 3.1', 'win31', 'program manager', 'retro desktop', '90s aesthetic', 'beveled'. NOT for Windows 95 (use windows-95-web-designer - has gradients, Start menu), vaporwave/synthwave, macOS, flat design.