discord-voice

Real-time voice conversations in Discord voice channels with Claude AI

by Demerzels-lab

Best use case

discord-voice is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using discord-voice should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/discord-voice/SKILL.md --create-dirs "https://raw.githubusercontent.com/Demerzels-lab/elsamultiskillagent/main/public/skills/avatarneil/discord-voice/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/discord-voice/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How discord-voice Compares

| Feature / Agent | discord-voice | Standard Approach |
|-----------------|---------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Real-time voice conversations in Discord voice channels with Claude AI

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Discord Voice Plugin for Clawdbot

Real-time voice conversations in Discord voice channels. Join a voice channel, speak, and have your words transcribed, processed by Claude, and spoken back.

## Features

- **Join/Leave Voice Channels**: Via slash commands, CLI, or agent tool
- **Voice Activity Detection (VAD)**: Automatically detects when users are speaking
- **Speech-to-Text**: Whisper API (OpenAI) or Deepgram
- **Streaming STT**: Real-time transcription with Deepgram WebSocket (~1s latency reduction)
- **Agent Integration**: Transcribed speech is routed through the Clawdbot agent
- **Text-to-Speech**: OpenAI TTS or ElevenLabs
- **Audio Playback**: Responses are spoken back in the voice channel
- **Barge-in Support**: Stops speaking immediately when user starts talking
- **Auto-reconnect**: Automatic heartbeat monitoring and reconnection on disconnect

## Requirements

- Discord bot with voice permissions (Connect, Speak, Use Voice Activity)
- API keys for STT and TTS providers
- System dependencies for voice:
  - `ffmpeg` (audio processing)
  - Native build tools for `@discordjs/opus` and `sodium-native`

## Installation

### 1. Install System Dependencies

```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg build-essential python3

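# Fedora/RHEL (ffmpeg may require the RPM Fusion repository)
sudo dnf install ffmpeg gcc-c++ make python3

# macOS (native modules may also need the Xcode Command Line Tools: xcode-select --install)
brew install ffmpeg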
```

### 2. Install via ClawdHub

```bash
clawdhub install discord-voice
```

Or manually:

```bash
cd ~/.clawdbot/extensions
git clone <repository-url> discord-voice
cd discord-voice
npm install
```

### 3. Configure in clawdbot.json

```json5
{
  "plugins": {
    "entries": {
      "discord-voice": {
        "enabled": true,
        "config": {
          "sttProvider": "whisper",
          "ttsProvider": "openai",
          "ttsVoice": "nova",
          "vadSensitivity": "medium",
          "allowedUsers": [],  // Empty = allow all users
          "silenceThresholdMs": 1500,
          "maxRecordingMs": 30000,
          "openai": {
            "apiKey": "sk-..."  // Or use OPENAI_API_KEY env var
          }
        }
      }
    }
  }
}
```

### 4. Discord Bot Setup

Ensure your Discord bot has these permissions:
- **Connect** - Join voice channels
- **Speak** - Play audio
- **Use Voice Activity** - Transmit audio without push-to-talk

Add these permissions to your bot's OAuth2 URL, or configure them in the Discord Developer Portal.

## Configuration

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `enabled` | boolean | `true` | Enable/disable the plugin |
| `sttProvider` | string | `"whisper"` | `"whisper"` or `"deepgram"` |
| `streamingSTT` | boolean | `true` | Use streaming STT (Deepgram only, ~1s faster) |
| `ttsProvider` | string | `"openai"` | `"openai"` or `"elevenlabs"` |
| `ttsVoice` | string | `"nova"` | Voice ID for TTS |
| `vadSensitivity` | string | `"medium"` | `"low"`, `"medium"`, or `"high"` |
| `bargeIn` | boolean | `true` | Stop speaking when user talks |
| `allowedUsers` | string[] | `[]` | User IDs allowed (empty = all) |
| `silenceThresholdMs` | number | `1500` | Silence before processing (ms) |
| `maxRecordingMs` | number | `30000` | Max recording length (ms) |
| `heartbeatIntervalMs` | number | `30000` | Connection health check interval |
| `autoJoinChannel` | string | `undefined` | Channel ID to auto-join on startup |
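
The defaults in the table above can be captured in a typed object. The following is a minimal sketch, not the plugin's source: the `VoiceConfig` interface and the `resolveConfig` and `isUserAllowed` helpers are hypothetical names introduced for illustration, and the provider sub-objects are omitted.

```typescript
// Hypothetical typed view of the options table; not the plugin's actual code.
interface VoiceConfig {
  enabled: boolean;
  sttProvider: "whisper" | "deepgram";
  streamingSTT: boolean;
  ttsProvider: "openai" | "elevenlabs";
  ttsVoice: string;
  vadSensitivity: "low" | "medium" | "high";
  bargeIn: boolean;
  allowedUsers: string[];
  silenceThresholdMs: number;
  maxRecordingMs: number;
  heartbeatIntervalMs: number;
  autoJoinChannel?: string;
}

// Default values copied from the table above.
const DEFAULTS: VoiceConfig = {
  enabled: true,
  sttProvider: "whisper",
  streamingSTT: true,
  ttsProvider: "openai",
  ttsVoice: "nova",
  vadSensitivity: "medium",
  bargeIn: true,
  allowedUsers: [],
  silenceThresholdMs: 1500,
  maxRecordingMs: 30000,
  heartbeatIntervalMs: 30000,
};

// Overlay user-supplied options on top of the defaults.
function resolveConfig(user: Partial<VoiceConfig>): VoiceConfig {
  return { ...DEFAULTS, ...user };
}

// An empty allowedUsers list means every user is allowed.
function isUserAllowed(config: VoiceConfig, userId: string): boolean {
  return config.allowedUsers.length === 0 || config.allowedUsers.indexOf(userId) !== -1;
}
```

For example, `resolveConfig({ sttProvider: "deepgram" })` would keep every other option at its documented default.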

### Provider Configuration

#### OpenAI (Whisper + TTS)
```json5
{
  "openai": {
    "apiKey": "sk-...",
    "whisperModel": "whisper-1",
    "ttsModel": "tts-1"
  }
}
```

#### ElevenLabs (TTS only)
```json5
{
  "elevenlabs": {
    "apiKey": "...",
    "voiceId": "21m00Tcm4TlvDq8ikWAM",  // Rachel
    "modelId": "eleven_multilingual_v2"
  }
}
```

#### Deepgram (STT only)
```json5
{
  "deepgram": {
    "apiKey": "...",
    "model": "nova-2"
  }
}
```

## Usage

### Slash Commands (Discord)

Once registered with Discord, use these commands:
- `/voice join <channel>` - Join a voice channel
- `/voice leave` - Leave the current voice channel
- `/voice status` - Show voice connection status

### CLI Commands

```bash
# Join a voice channel
clawdbot voice join <channelId>

# Leave voice
clawdbot voice leave --guild <guildId>

# Check status
clawdbot voice status
```

### Agent Tool

The agent can use the `discord_voice` tool, for example in response to a request such as:
```
Join voice channel 1234567890
```

The tool supports these actions:
- `join` - Join a voice channel (requires channelId)
- `leave` - Leave voice channel
- `speak` - Speak text in the voice channel
- `status` - Get current voice status

## How It Works

1. **Join**: Bot joins the specified voice channel
2. **Listen**: VAD detects when users start/stop speaking
3. **Record**: Audio is buffered while user speaks
4. **Transcribe**: On silence, audio is sent to STT provider
5. **Process**: Transcribed text is sent to Clawdbot agent
6. **Synthesize**: Agent response is converted to audio via TTS
7. **Play**: Audio is played back in the voice channel
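
Steps 4 to 7 above could be wired together roughly as follows. This is a sketch under stated assumptions: the `Pipeline` interface and `handleUtterance` function are hypothetical, and each stage is an injected function rather than a real Whisper, Deepgram, Clawdbot, or TTS call.

```typescript
// Hypothetical stage signatures; real implementations would call the
// configured STT provider, the Clawdbot agent, and the TTS provider.
interface Pipeline {
  transcribe: (audio: Uint8Array) => Promise<string>;
  askAgent: (text: string) => Promise<string>;
  synthesize: (text: string) => Promise<Uint8Array>;
  play: (audio: Uint8Array) => Promise<void>;
}

// Handle one utterance once silence has been detected.
// Returns the agent's reply, or null if nothing was transcribed.
async function handleUtterance(
  p: Pipeline,
  recorded: Uint8Array,
): Promise<string | null> {
  const text = (await p.transcribe(recorded)).trim(); // 4. Transcribe
  if (text.length === 0) return null;                 // nothing intelligible
  const reply = await p.askAgent(text);               // 5. Process
  const speech = await p.synthesize(reply);           // 6. Synthesize
  await p.play(speech);                               // 7. Play
  return reply;
}
```

Injecting the stages keeps the flow testable without network access or a live voice connection.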

## Streaming STT (Deepgram)

When using Deepgram as your STT provider, streaming mode is enabled by default. This provides:

- **~1 second lower** end-to-end latency
- **Real-time feedback** with interim transcription results
- **Automatic keep-alive** to prevent connection timeouts
- **Fallback** to batch transcription if streaming fails

To use streaming STT:
```json5
{
  "sttProvider": "deepgram",
  "streamingSTT": true,  // default
  "deepgram": {
    "apiKey": "...",
    "model": "nova-2"
  }
}
```
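
The fallback to batch transcription described above might look like the sketch below. The `SttFn` type and `transcribeWithFallback` function are hypothetical, and neither transcriber is a real Deepgram SDK call.

```typescript
// Hypothetical transcriber signature shared by streaming and batch modes.
type SttFn = (audio: Uint8Array) => Promise<string>;

// Try streaming first; if it throws, retry the same audio in batch mode.
async function transcribeWithFallback(
  audio: Uint8Array,
  streaming: SttFn,
  batch: SttFn,
): Promise<{ text: string; mode: "streaming" | "batch" }> {
  try {
    return { text: await streaming(audio), mode: "streaming" };
  } catch {
    return { text: await batch(audio), mode: "batch" };
  }
}
```

Returning the mode alongside the text makes it easy to log how often the fallback path is taken.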

## Barge-in Support

When enabled (default), the bot will immediately stop speaking if a user starts talking. This creates a more natural conversational flow where you can interrupt the bot.

To disable (let the bot finish speaking):
```json5
{
  "bargeIn": false
}
```
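
The barge-in behaviour can be sketched as a small controller. The names here (`Playback`, `BargeInController`) are hypothetical and stand in for whatever audio player the plugin actually uses.

```typescript
// Hypothetical playback handle; a real bot would wrap its audio player.
interface Playback {
  stop(): void;
}

class BargeInController {
  private readonly bargeIn: boolean;
  private current: Playback | null = null;

  constructor(bargeIn: boolean) {
    this.bargeIn = bargeIn;
  }

  // Call when the bot starts speaking a response.
  startPlayback(playback: Playback): void {
    this.current = playback;
  }

  // Call when playback ends on its own.
  finishPlayback(): void {
    this.current = null;
  }

  // Call when VAD reports that a user started talking.
  // Returns true if the bot's speech was interrupted.
  onUserSpeaking(): boolean {
    if (!this.bargeIn || this.current === null) return false;
    this.current.stop();
    this.current = null;
    return true;
  }
}
```

With `bargeIn` set to `false`, `onUserSpeaking()` leaves playback untouched, so the bot finishes its sentence.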

## Auto-reconnect

The plugin includes automatic connection health monitoring:

- **Heartbeat checks** every 30 seconds (configurable)
- **Auto-reconnect** on disconnect with exponential backoff
- **Max 3 attempts** before giving up

If the connection drops, you'll see logs like:
```
[discord-voice] Disconnected from voice channel
[discord-voice] Reconnection attempt 1/3
[discord-voice] Reconnected successfully
```
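
A reconnect loop with exponential backoff and a three-attempt limit could be sketched as follows. The 1000 ms base delay and the `backoffDelayMs` and `reconnectWithBackoff` names are assumptions for illustration; the plugin's actual delays are not documented here. The `connect` and `sleep` functions are injected so the sketch does not depend on any Discord library.

```typescript
// Delay before a given attempt (1-based): baseMs, 2*baseMs, 4*baseMs, ...
function backoffDelayMs(attempt: number, baseMs: number): number {
  return baseMs * Math.pow(2, attempt - 1);
}

// Try to reconnect up to maxAttempts times, waiting longer each time.
// Resolves to true on success, false after giving up.
async function reconnectWithBackoff(
  connect: () => Promise<boolean>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts: number = 3,
  baseMs: number = 1000, // assumed base delay
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    console.log(`[discord-voice] Reconnection attempt ${attempt}/${maxAttempts}`);
    await sleep(backoffDelayMs(attempt, baseMs));
    if (await connect()) {
      console.log("[discord-voice] Reconnected successfully");
      return true;
    }
  }
  return false; // all attempts failed
}
```

In a real process, `sleep` could be `(ms) => new Promise((resolve) => setTimeout(resolve, ms))`.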

## VAD Sensitivity

- **low**: Picks up quiet speech, may trigger on background noise
- **medium**: Balanced (recommended)
- **high**: Requires louder, clearer speech
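
One way to picture these levels is as energy thresholds, combined with the `silenceThresholdMs` and `maxRecordingMs` options to decide when a recording ends. The sketch below is illustrative only: the threshold numbers and the `rms`, `isSpeech`, and `shouldStopRecording` helpers are assumptions, not the plugin's actual VAD.

```typescript
type VadLevel = "low" | "medium" | "high";

// Illustrative RMS thresholds for samples normalized to [-1, 1].
// Following the descriptions above, "low" accepts the quietest speech
// and "high" requires the loudest. The numbers are made up for the sketch.
const RMS_THRESHOLD: Record<VadLevel, number> = {
  low: 0.01,
  medium: 0.03,
  high: 0.08,
};

// Root-mean-square energy of one audio frame.
function rms(frame: number[]): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (const s of frame) sumSquares += s * s;
  return Math.sqrt(sumSquares / frame.length);
}

// A frame counts as speech when its energy reaches the level's threshold.
function isSpeech(frame: number[], level: VadLevel): boolean {
  return rms(frame) >= RMS_THRESHOLD[level];
}

// Stop recording after enough trailing silence, or at the maximum length.
function shouldStopRecording(
  silenceMs: number,
  recordedMs: number,
  silenceThresholdMs: number = 1500,
  maxRecordingMs: number = 30000,
): boolean {
  return silenceMs >= silenceThresholdMs || recordedMs >= maxRecordingMs;
}
```

Under these assumed thresholds, a quiet frame with RMS around 0.02 would count as speech at `low` but not at `medium` or `high`.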

## Troubleshooting

### "Discord client not available"
Ensure the Discord channel is configured and the bot is connected before using voice.

### Opus/Sodium build errors
Install build tools:
```bash
npm install -g node-gyp
npm rebuild @discordjs/opus sodium-native
```

### No audio heard
1. Check bot has Connect + Speak permissions
2. Check bot isn't server muted
3. Verify TTS API key is valid

### Transcription not working
1. Check STT API key is valid
2. Check audio is being recorded (see debug logs)
3. Try adjusting VAD sensitivity

### Enable debug logging
```bash
DEBUG=discord-voice clawdbot gateway start
```

## Environment Variables

| Variable | Description |
|----------|-------------|
| `DISCORD_TOKEN` | Discord bot token (required) |
| `OPENAI_API_KEY` | OpenAI API key (Whisper + TTS) |
| `ELEVENLABS_API_KEY` | ElevenLabs API key |
| `DEEPGRAM_API_KEY` | Deepgram API key |
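
Since an API key can come from either the config file or an environment variable, key lookup might be sketched like this. The `resolveApiKey` helper is hypothetical, and the precedence shown (config first, then environment) is an assumption rather than confirmed plugin behaviour.

```typescript
type Provider = "openai" | "elevenlabs" | "deepgram";

// Environment variable names from the table above.
const ENV_VAR: Record<Provider, string> = {
  openai: "OPENAI_API_KEY",
  elevenlabs: "ELEVENLABS_API_KEY",
  deepgram: "DEEPGRAM_API_KEY",
};

// Prefer an explicit key from the config, else the environment variable.
// The environment is passed in so the function is easy to test.
function resolveApiKey(
  provider: Provider,
  configKey: string | undefined,
  env: Record<string, string | undefined>,
): string {
  const key = configKey || env[ENV_VAR[provider]];
  if (!key) {
    throw new Error(`Missing API key for ${provider}: set ${ENV_VAR[provider]}`);
  }
  return key;
}
```

In Node.js, `process.env` can be passed as the `env` argument.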

## Limitations

- Only one voice channel per guild at a time
- Maximum recording length: 30 seconds (configurable)
- Requires stable network for real-time audio
- TTS output may have slight delay due to synthesis

## License

MIT
