
azure-ai-voicelive-py

Build real-time voice AI applications with bidirectional WebSocket communication.

31,392 stars
Complexity: medium

About this skill

This skill provides Python bindings for the Azure AI Voice Live SDK, allowing AI agents to programmatically create and manage real-time voice AI applications. It uses bidirectional WebSocket communication for efficient, low-latency audio streaming and processing. Agents can use this skill for tasks such as real-time speech-to-text, text-to-speech, and custom voice interactions by integrating with Azure Cognitive Services endpoints. It supports robust authentication via `DefaultAzureCredential`, making it suitable for secure production environments, and also offers API key authentication for development.

Best use case

  • Developing interactive voice assistants
  • Enabling real-time transcription for live events
  • Creating dynamic voice control interfaces
  • Building conversational AI agents that require instant audio feedback
  • Integrating live speech processing into agent workflows
  • Enhancing applications with advanced real-time voice capabilities


The AI agent will be able to establish a real-time, bidirectional voice communication channel with Azure Cognitive Services, enabling seamless speech-to-text, text-to-speech, or other custom voice AI interactions. Users can expect low-latency, high-quality audio processing and generation, allowing for highly responsive voice applications.

Practical example

Example input

Connect to the Azure AI Voice Live endpoint `https://eastus.api.cognitive.microsoft.com` using `DefaultAzureCredential` to enable real-time speech transcription. Once connected, continuously stream audio from a microphone, print recognized text, and synthesize a greeting 'Hello, how can I help you?' to be played back.

Example output

Established real-time voice AI connection to Azure. Streaming audio input enabled.
(Agent processes incoming audio stream)
Detected speech: 'What's the weather like in New York?'
Synthesized and played back: 'The current weather in New York is 25 degrees Celsius and sunny. How else may I assist you?'
(Continuous real-time interaction continues with subsequent recognized speech and synthesized responses.)

When to use this skill

  • When an AI agent needs to process or generate speech in real-time.
  • When low-latency, bidirectional audio streaming is critical for an application.
  • When building conversational AI interfaces that require immediate voice input/output.
  • When integrating with Azure Cognitive Services for advanced voice capabilities.

When not to use this skill

  • For batch processing of audio files where real-time interaction is not required.
  • When offline speech processing is sufficient and continuous streaming is overkill.
  • If the primary requirement is simple, non-interactive text-to-speech or speech-to-text without a need for continuous streaming.
  • When working with cloud platforms or services other than Azure AI Voice Live, unless an abstraction layer is used.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/azure-ai-voicelive-py/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/azure-ai-voicelive-py/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/azure-ai-voicelive-py/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How azure-ai-voicelive-py Compares

| Feature / Agent | azure-ai-voicelive-py | Standard Approach |
|-------------------------|----------------------------------------------------------------------------|-------------------|
| Platform Support | Claude, ChatGPT, Gemini, DeepSeek, Cursor, GitHub Copilot, Aider, Continue | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | medium | N/A |

Frequently Asked Questions

What does this skill do?

Build real-time voice AI applications with bidirectional WebSocket communication.

Which AI agents support this skill?

This skill is designed for Claude, ChatGPT, Gemini, DeepSeek, Cursor, GitHub Copilot, Aider, Continue.

How difficult is it to install?

The installation complexity is rated as medium. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Azure AI Voice Live SDK

Build real-time voice AI applications with bidirectional WebSocket communication.

## Installation

```bash
pip install azure-ai-voicelive aiohttp azure-identity
```

## Environment Variables

```bash
AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.com
# For API key auth (not recommended for production)
AZURE_COGNITIVE_SERVICES_KEY=<api-key>
```

## Authentication

**DefaultAzureCredential (preferred)**:
```python
import os

from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential

async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=DefaultAzureCredential(),
    model="gpt-4o-realtime-preview",
    credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
    ...
```

**API Key**:
```python
import os

from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential

async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
    model="gpt-4o-realtime-preview"
) as conn:
    ...
```

## Quick Start

```python
import asyncio
import os
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential

async def main():
    async with connect(
        endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
        credential=DefaultAzureCredential(),
        model="gpt-4o-realtime-preview",
        credential_scopes=["https://cognitiveservices.azure.com/.default"]
    ) as conn:
        # Update session with instructions
        await conn.session.update(session={
            "instructions": "You are a helpful assistant.",
            "modalities": ["text", "audio"],
            "voice": "alloy"
        })
        
        # Listen for events
        async for event in conn:
            print(f"Event: {event.type}")
            if event.type == "response.audio_transcript.done":
                print(f"Transcript: {event.transcript}")
            elif event.type == "response.done":
                break

asyncio.run(main())
```

## Core Architecture

### Connection Resources

The `VoiceLiveConnection` exposes these resources:

| Resource | Purpose | Key Methods |
|----------|---------|-------------|
| `conn.session` | Session configuration | `update(session=...)` |
| `conn.response` | Model responses | `create()`, `cancel()` |
| `conn.input_audio_buffer` | Audio input | `append()`, `commit()`, `clear()` |
| `conn.output_audio_buffer` | Audio output | `clear()` |
| `conn.conversation` | Conversation state | `item.create()`, `item.delete()`, `item.truncate()` |
| `conn.transcription_session` | Transcription config | `update(session=...)` |

## Session Configuration

```python
from azure.ai.voicelive.models import RequestSession, FunctionTool

await conn.session.update(session=RequestSession(
    instructions="You are a helpful voice assistant.",
    modalities=["text", "audio"],
    voice="alloy",  # or "echo", "shimmer", "sage", etc.
    input_audio_format="pcm16",
    output_audio_format="pcm16",
    turn_detection={
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500
    },
    tools=[
        FunctionTool(
            type="function",
            name="get_weather",
            description="Get current weather",
            parameters={
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        )
    ]
))
```

## Audio Streaming

### Send Audio (Base64 PCM16)

```python
import base64

# Read audio chunk (16-bit PCM, 24kHz mono)
audio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()

await conn.input_audio_buffer.append(audio=b64_audio)
```
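
The `read_audio_from_microphone()` helper above is a placeholder, not part of the SDK. A minimal sketch of one possible implementation, assuming the third-party `sounddevice` package (`pip install sounddevice`), might look like this:

```python
import asyncio

import sounddevice as sd  # assumed third-party dependency, not required by this skill

SAMPLE_RATE = 24000  # matches the default pcm16 format (24 kHz mono)
CHUNK_MS = 100       # capture in 100 ms chunks

async def read_audio_from_microphone() -> bytes:
    """Capture one chunk of 16-bit PCM mono audio from the default microphone."""
    frames = int(SAMPLE_RATE * CHUNK_MS / 1000)

    def record() -> bytes:
        # sd.rec() starts recording; sd.wait() blocks until the chunk is full
        data = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="int16")
        sd.wait()
        return data.tobytes()

    # Run the blocking capture in a worker thread to keep the event loop responsive
    return await asyncio.get_running_loop().run_in_executor(None, record)
```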

### Receive Audio

```python
import base64

async for event in conn:
    if event.type == "response.audio.delta":
        audio_bytes = base64.b64decode(event.delta)
        await play_audio(audio_bytes)
    elif event.type == "response.audio.done":
        print("Audio complete")
```
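
`play_audio()` is likewise user-supplied. A sketch using `sounddevice` (an assumption, as above), with one long-lived output stream so consecutive deltas play back without gaps:

```python
import numpy as np
import sounddevice as sd  # assumed third-party dependency

# A single long-lived stream avoids clicks and gaps between consecutive audio deltas
stream = sd.OutputStream(samplerate=24000, channels=1, dtype="int16")
stream.start()

async def play_audio(audio_bytes: bytes) -> None:
    """Write a chunk of 24 kHz 16-bit PCM mono audio to the output stream."""
    samples = np.frombuffer(audio_bytes, dtype=np.int16)
    # Note: stream.write() blocks briefly; offload to a thread if that matters
    stream.write(samples)
```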

## Event Handling

```python
import base64
import json

async for event in conn:
    match event.type:
        # Session events
        case "session.created":
            print(f"Session: {event.session}")
        case "session.updated":
            print("Session updated")
        
        # Audio input events
        case "input_audio_buffer.speech_started":
            print(f"Speech started at {event.audio_start_ms}ms")
        case "input_audio_buffer.speech_stopped":
            print(f"Speech stopped at {event.audio_end_ms}ms")
        
        # Transcription events
        case "conversation.item.input_audio_transcription.completed":
            print(f"User said: {event.transcript}")
        case "conversation.item.input_audio_transcription.delta":
            print(f"Partial: {event.delta}")
        
        # Response events
        case "response.created":
            print(f"Response started: {event.response.id}")
        case "response.audio_transcript.delta":
            print(event.delta, end="", flush=True)
        case "response.audio.delta":
            audio = base64.b64decode(event.delta)
        case "response.done":
            print(f"Response complete: {event.response.status}")
        
        # Function calls
        case "response.function_call_arguments.done":
            result = handle_function(event.name, event.arguments)
            await conn.conversation.item.create(item={
                "type": "function_call_output",
                "call_id": event.call_id,
                "output": json.dumps(result)
            })
            await conn.response.create()
        
        # Errors
        case "error":
            print(f"Error: {event.error.message}")
```
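
The `handle_function` dispatcher above is user-supplied, not an SDK call. A minimal sketch matching the `get_weather` tool declared under Session Configuration (the returned weather values are placeholders):

```python
import json

def handle_function(name: str, arguments: str) -> dict:
    """Dispatch a tool call by name; `arguments` arrives as a JSON string."""
    args = json.loads(arguments)
    if name == "get_weather":
        # Placeholder result; call a real weather API here
        return {"location": args["location"], "temperature_c": 25, "condition": "sunny"}
    return {"error": f"unknown tool: {name}"}
```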

## Common Patterns

### Manual Turn Mode (No VAD)

```python
await conn.session.update(session={"turn_detection": None})

# Manually control turns
await conn.input_audio_buffer.append(audio=b64_audio)
await conn.input_audio_buffer.commit()  # End of user turn
await conn.response.create()  # Trigger response
```

### Interrupt Handling

```python
async for event in conn:
    if event.type == "input_audio_buffer.speech_started":
        # User interrupted - cancel current response
        await conn.response.cancel()
        await conn.output_audio_buffer.clear()
```

### Conversation History

```python
# Add system message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "system",
    "content": [{"type": "input_text", "text": "Be concise."}]
})

# Add user message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "user", 
    "content": [{"type": "input_text", "text": "Hello!"}]
})

await conn.response.create()
```

## Voice Options

| Voice | Description |
|-------|-------------|
| `alloy` | Neutral, balanced |
| `echo` | Warm, conversational |
| `shimmer` | Clear, professional |
| `sage` | Calm, authoritative |
| `coral` | Friendly, upbeat |
| `ash` | Deep, measured |
| `ballad` | Expressive |
| `verse` | Storytelling |

Azure voices: Use `AzureStandardVoice`, `AzureCustomVoice`, or `AzurePersonalVoice` models.
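
For example, a minimal sketch selecting an Azure neural voice; the voice name `en-US-AvaNeural` and the exact constructor arguments are assumptions, so check references/models.md for the authoritative signature:

```python
from azure.ai.voicelive.models import AzureStandardVoice, RequestSession

# en-US-AvaNeural is an assumed example; pick any voice from Azure's voice gallery
await conn.session.update(session=RequestSession(
    voice=AzureStandardVoice(name="en-US-AvaNeural", type="azure-standard")
))
```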

## Audio Formats

| Format | Sample Rate | Use Case |
|--------|-------------|----------|
| `pcm16` | 24kHz | Default, high quality |
| `pcm16-8000hz` | 8kHz | Telephony |
| `pcm16-16000hz` | 16kHz | Voice assistants |
| `g711_ulaw` | 8kHz | Telephony (US) |
| `g711_alaw` | 8kHz | Telephony (EU) |
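
For a telephony scenario, both directions can be switched to 8 kHz G.711 μ-law with a session update; a sketch, with format names taken from the table above:

```python
# Telephony: 8 kHz G.711 μ-law for both input and output
await conn.session.update(session={
    "input_audio_format": "g711_ulaw",
    "output_audio_format": "g711_ulaw"
})
```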

## Turn Detection Options

```python
# Server VAD (default)
{"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}

# Azure Semantic VAD (smarter detection)
{"type": "azure_semantic_vad"}
{"type": "azure_semantic_vad_en"}  # English optimized
{"type": "azure_semantic_vad_multilingual"}
```
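
Switching detectors is just another session update; a sketch using the option names listed above:

```python
# Opt into Azure's multilingual semantic VAD instead of plain server VAD
await conn.session.update(session={
    "turn_detection": {"type": "azure_semantic_vad_multilingual"}
})
```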

## Error Handling

```python
from azure.ai.voicelive.aio import ConnectionError, ConnectionClosed

try:
    async with connect(...) as conn:
        async for event in conn:
            if event.type == "error":
                print(f"API Error: {event.error.code} - {event.error.message}")
except ConnectionClosed as e:
    print(f"Connection closed: {e.code} - {e.reason}")
except ConnectionError as e:
    print(f"Connection error: {e}")
```
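
WebSocket sessions drop in practice, so one common pattern is to wrap the whole session in a retry loop with exponential backoff. A sketch, assuming the exception types imported above and the `main()` coroutine from the Quick Start:

```python
import asyncio

async def run_with_reconnect(max_retries: int = 5) -> None:
    """Re-establish the session with exponential backoff after a dropped connection."""
    for attempt in range(max_retries):
        try:
            await main()  # the Quick Start coroutine defined earlier
            return
        except (ConnectionClosed, ConnectionError):
            delay = 2 ** attempt
            print(f"Connection lost; retrying in {delay}s")
            await asyncio.sleep(delay)

asyncio.run(run_with_reconnect())
```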

## References

- **Detailed API Reference**: See references/api-reference.md
- **Complete Examples**: See references/examples.md
- **All Models & Types**: See references/models.md

## When to Use

Use this skill to execute the workflows and actions described in the overview.

Related Skills

All of the skills below are from sickn33/antigravity-awesome-skills (31,392 stars).

  • azure-ai-voicelive-ts (Voice AI & Speech Processing · Claude, ChatGPT, Gemini): Azure AI Voice Live SDK for JavaScript/TypeScript. Build real-time voice AI applications with bidirectional WebSocket communication.
  • azure-ai-voicelive-java (Voice AI & Speech Processing · Claude, GitHub Copilot, Aider): Azure AI VoiceLive SDK for Java. Real-time bidirectional voice conversations with AI assistants using WebSocket.
  • azure-ai-voicelive-dotnet (Voice AI & Speech Processing · Claude, ChatGPT, Gemini): Azure AI Voice Live SDK for .NET. Build real-time voice AI applications with bidirectional WebSocket communication.
  • microsoft-azure-webjobs-extensions-authentication-events-dotnet (Identity Management / Authentication & Authorization · Claude): Microsoft Entra Authentication Events SDK for .NET. Azure Functions triggers for custom authentication extensions.
  • azure-web-pubsub-ts (Messaging & Communication · Claude): Real-time messaging with WebSocket connections and pub/sub patterns.
  • azure-storage-queue-ts (Cloud Integration · Claude): Azure Queue Storage JavaScript/TypeScript SDK (@azure/storage-queue) for message queue operations. Use for sending, receiving, peeking, and deleting messages in queues.
  • azure-storage-queue-py (Cloud Integration · Claude): Azure Queue Storage SDK for Python. Use for reliable message queuing, task distribution, and asynchronous processing.
  • azure-storage-file-share-ts (Cloud Storage Management · Claude): Azure File Share JavaScript/TypeScript SDK (@azure/storage-file-share) for SMB file share operations.
  • azure-storage-file-share-py (Cloud Storage Management · Claude): Azure Storage File Share SDK for Python. Use for SMB file shares, directories, and file operations in the cloud.
  • azure-storage-file-datalake-py (Cloud Storage Management · Claude): Azure Data Lake Storage Gen2 SDK for Python. Use for hierarchical file systems, big data analytics, and file/directory operations.
  • azure-storage-blob-ts (Cloud Storage Management · Claude): Azure Blob Storage JavaScript/TypeScript SDK (@azure/storage-blob) for blob operations. Use for uploading, downloading, listing, and managing blobs and containers.
  • azure-storage-blob-rust (Cloud Storage Management · Claude): Azure Blob Storage SDK for Rust. Use for uploading, downloading, and managing blobs and containers.