azure-ai-voicelive-py
Build real-time voice AI applications with bidirectional WebSocket communication.
About this skill
This skill provides Python bindings for the Azure AI Voice Live SDK, allowing AI agents to programmatically create and manage real-time voice artificial intelligence applications. It leverages bidirectional WebSocket communication for efficient, low-latency audio streaming and processing. Agents can use this skill to perform tasks like real-time speech-to-text, text-to-speech, and custom voice interactions, by integrating with Azure Cognitive Services endpoints. It supports robust authentication via `DefaultAzureCredential`, making it suitable for secure production environments, while also offering API key authentication for development purposes.
Best use case
Developing interactive voice assistants, enabling real-time transcription for live events, creating dynamic voice control interfaces, building conversational AI agents that require instant audio feedback, integrating live speech processing into agent workflows, and enhancing applications with advanced real-time voice capabilities.
Build real-time voice AI applications with bidirectional WebSocket communication.
The AI agent will be able to establish a real-time, bidirectional voice communication channel with Azure Cognitive Services, enabling seamless speech-to-text, text-to-speech, or other custom voice AI interactions. Users can expect low-latency, high-quality audio processing and generation, allowing for highly responsive voice applications.
Practical example
Example input
Connect to the Azure AI Voice Live endpoint `https://eastus.api.cognitive.microsoft.com` using `DefaultAzureCredential` to enable real-time speech transcription. Once connected, continuously stream audio from a microphone, print recognized text, and synthesize a greeting 'Hello, how can I help you?' to be played back.
Example output
Established real-time voice AI connection to Azure. Streaming audio input enabled. (Agent processes incoming audio stream) Detected speech: 'What's the weather like in New York?' Synthesized and played back: 'The current weather in New York is 25 degrees Celsius and sunny. How else may I assist you?' (Continuous real-time interaction continues with subsequent recognized speech and synthesized responses.)
When to use this skill
- When an AI agent needs to process or generate speech in real-time.
- When low-latency, bidirectional audio streaming is critical for an application.
- When building conversational AI interfaces that require immediate voice input/output.
- When integrating with Azure Cognitive Services for advanced voice capabilities.
When not to use this skill
- For batch processing of audio files where real-time interaction is not required.
- When offline speech processing is sufficient and continuous streaming is overkill.
- If the primary requirement is simple, non-interactive text-to-speech or speech-to-text without a need for continuous streaming.
- When working with cloud platforms or services other than Azure AI Voice Live, unless an abstraction layer is used.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/azure-ai-voicelive-py/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How azure-ai-voicelive-py Compares
| Feature / Agent | azure-ai-voicelive-py | Standard Approach |
|---|---|---|
| Platform Support | Claude, ChatGPT, Gemini, DeepSeek, Cursor, GitHub Copilot, Aider, Continue | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | medium | N/A |
Frequently Asked Questions
What does this skill do?
Build real-time voice AI applications with bidirectional WebSocket communication.
Which AI agents support this skill?
This skill is designed for Claude, ChatGPT, Gemini, DeepSeek, Cursor, GitHub Copilot, Aider, Continue.
How difficult is it to install?
The installation complexity is rated as medium. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
SKILL.md Source
# Azure AI Voice Live SDK
Build real-time voice AI applications with bidirectional WebSocket communication.
## Installation
```bash
pip install azure-ai-voicelive aiohttp azure-identity
```
## Environment Variables
```bash
AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.com
# For API key auth (not recommended for production)
AZURE_COGNITIVE_SERVICES_KEY=<api-key>
```
## Authentication
**DefaultAzureCredential (preferred)**:
```python
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential
async with connect(
endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
credential=DefaultAzureCredential(),
model="gpt-4o-realtime-preview",
credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
...
```
**API Key**:
```python
from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential
async with connect(
endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
model="gpt-4o-realtime-preview"
) as conn:
...
```
## Quick Start
```python
import asyncio
import os
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential
async def main():
async with connect(
endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
credential=DefaultAzureCredential(),
model="gpt-4o-realtime-preview",
credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
# Update session with instructions
await conn.session.update(session={
"instructions": "You are a helpful assistant.",
"modalities": ["text", "audio"],
"voice": "alloy"
})
# Listen for events
async for event in conn:
print(f"Event: {event.type}")
if event.type == "response.audio_transcript.done":
print(f"Transcript: {event.transcript}")
elif event.type == "response.done":
break
asyncio.run(main())
```
## Core Architecture
### Connection Resources
The `VoiceLiveConnection` exposes these resources:
| Resource | Purpose | Key Methods |
|----------|---------|-------------|
| `conn.session` | Session configuration | `update(session=...)` |
| `conn.response` | Model responses | `create()`, `cancel()` |
| `conn.input_audio_buffer` | Audio input | `append()`, `commit()`, `clear()` |
| `conn.output_audio_buffer` | Audio output | `clear()` |
| `conn.conversation` | Conversation state | `item.create()`, `item.delete()`, `item.truncate()` |
| `conn.transcription_session` | Transcription config | `update(session=...)` |
## Session Configuration
```python
from azure.ai.voicelive.models import RequestSession, FunctionTool
await conn.session.update(session=RequestSession(
instructions="You are a helpful voice assistant.",
modalities=["text", "audio"],
voice="alloy", # or "echo", "shimmer", "sage", etc.
input_audio_format="pcm16",
output_audio_format="pcm16",
turn_detection={
"type": "server_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 500
},
tools=[
FunctionTool(
type="function",
name="get_weather",
description="Get current weather",
parameters={
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
)
]
))
```
## Audio Streaming
### Send Audio (Base64 PCM16)
```python
import base64
# Read audio chunk (16-bit PCM, 24kHz mono)
audio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()
await conn.input_audio_buffer.append(audio=b64_audio)
```
### Receive Audio
```python
async for event in conn:
if event.type == "response.audio.delta":
audio_bytes = base64.b64decode(event.delta)
await play_audio(audio_bytes)
elif event.type == "response.audio.done":
print("Audio complete")
```
## Event Handling
```python
async for event in conn:
match event.type:
# Session events
case "session.created":
print(f"Session: {event.session}")
case "session.updated":
print("Session updated")
# Audio input events
case "input_audio_buffer.speech_started":
print(f"Speech started at {event.audio_start_ms}ms")
case "input_audio_buffer.speech_stopped":
print(f"Speech stopped at {event.audio_end_ms}ms")
# Transcription events
case "conversation.item.input_audio_transcription.completed":
print(f"User said: {event.transcript}")
case "conversation.item.input_audio_transcription.delta":
print(f"Partial: {event.delta}")
# Response events
case "response.created":
print(f"Response started: {event.response.id}")
case "response.audio_transcript.delta":
print(event.delta, end="", flush=True)
case "response.audio.delta":
audio = base64.b64decode(event.delta)
case "response.done":
print(f"Response complete: {event.response.status}")
# Function calls
case "response.function_call_arguments.done":
result = handle_function(event.name, event.arguments)
await conn.conversation.item.create(item={
"type": "function_call_output",
"call_id": event.call_id,
"output": json.dumps(result)
})
await conn.response.create()
# Errors
case "error":
print(f"Error: {event.error.message}")
```
## Common Patterns
### Manual Turn Mode (No VAD)
```python
await conn.session.update(session={"turn_detection": None})
# Manually control turns
await conn.input_audio_buffer.append(audio=b64_audio)
await conn.input_audio_buffer.commit() # End of user turn
await conn.response.create() # Trigger response
```
### Interrupt Handling
```python
async for event in conn:
if event.type == "input_audio_buffer.speech_started":
# User interrupted - cancel current response
await conn.response.cancel()
await conn.output_audio_buffer.clear()
```
### Conversation History
```python
# Add system message
await conn.conversation.item.create(item={
"type": "message",
"role": "system",
"content": [{"type": "input_text", "text": "Be concise."}]
})
# Add user message
await conn.conversation.item.create(item={
"type": "message",
"role": "user",
"content": [{"type": "input_text", "text": "Hello!"}]
})
await conn.response.create()
```
## Voice Options
| Voice | Description |
|-------|-------------|
| `alloy` | Neutral, balanced |
| `echo` | Warm, conversational |
| `shimmer` | Clear, professional |
| `sage` | Calm, authoritative |
| `coral` | Friendly, upbeat |
| `ash` | Deep, measured |
| `ballad` | Expressive |
| `verse` | Storytelling |
Azure voices: Use `AzureStandardVoice`, `AzureCustomVoice`, or `AzurePersonalVoice` models.
## Audio Formats
| Format | Sample Rate | Use Case |
|--------|-------------|----------|
| `pcm16` | 24kHz | Default, high quality |
| `pcm16-8000hz` | 8kHz | Telephony |
| `pcm16-16000hz` | 16kHz | Voice assistants |
| `g711_ulaw` | 8kHz | Telephony (US) |
| `g711_alaw` | 8kHz | Telephony (EU) |
## Turn Detection Options
```python
# Server VAD (default)
{"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}
# Azure Semantic VAD (smarter detection)
{"type": "azure_semantic_vad"}
{"type": "azure_semantic_vad_en"} # English optimized
{"type": "azure_semantic_vad_multilingual"}
```
## Error Handling
```python
from azure.ai.voicelive.aio import ConnectionError, ConnectionClosed
try:
async with connect(...) as conn:
async for event in conn:
if event.type == "error":
print(f"API Error: {event.error.code} - {event.error.message}")
except ConnectionClosed as e:
print(f"Connection closed: {e.code} - {e.reason}")
except ConnectionError as e:
print(f"Connection error: {e}")
```
## References
- **Detailed API Reference**: See references/api-reference.md
- **Complete Examples**: See references/examples.md
- **All Models & Types**: See references/models.md
## When to Use
This skill is applicable to execute the workflow or actions described in the overview.Related Skills
azure-ai-voicelive-ts
Azure AI Voice Live SDK for JavaScript/TypeScript. Build real-time voice AI applications with bidirectional WebSocket communication.
azure-ai-voicelive-java
Azure AI VoiceLive SDK for Java. Real-time bidirectional voice conversations with AI assistants using WebSocket.
azure-ai-voicelive-dotnet
Azure AI Voice Live SDK for .NET. Build real-time voice AI applications with bidirectional WebSocket communication.
microsoft-azure-webjobs-extensions-authentication-events-dotnet
Microsoft Entra Authentication Events SDK for .NET. Azure Functions triggers for custom authentication extensions.
azure-web-pubsub-ts
Real-time messaging with WebSocket connections and pub/sub patterns.
azure-storage-queue-ts
Azure Queue Storage JavaScript/TypeScript SDK (@azure/storage-queue) for message queue operations. Use for sending, receiving, peeking, and deleting messages in queues.
azure-storage-queue-py
Azure Queue Storage SDK for Python. Use for reliable message queuing, task distribution, and asynchronous processing.
azure-storage-file-share-ts
Azure File Share JavaScript/TypeScript SDK (@azure/storage-file-share) for SMB file share operations.
azure-storage-file-share-py
Azure Storage File Share SDK for Python. Use for SMB file shares, directories, and file operations in the cloud.
azure-storage-file-datalake-py
Azure Data Lake Storage Gen2 SDK for Python. Use for hierarchical file systems, big data analytics, and file/directory operations.
azure-storage-blob-ts
Azure Blob Storage JavaScript/TypeScript SDK (@azure/storage-blob) for blob operations. Use for uploading, downloading, listing, and managing blobs and containers.
azure-storage-blob-rust
Azure Blob Storage SDK for Rust. Use for uploading, downloading, and managing blobs and containers.