azure-speech-to-text-rest-py
Azure Speech to Text REST API for short audio (Python). Use for simple speech recognition of audio files up to 60 seconds without the Speech SDK.
About this skill
This skill lets an AI agent transcribe speech in audio files of up to 60 seconds. It calls the Azure Cognitive Services Speech to Text REST API directly, bypassing the full Azure Speech SDK, which simplifies integration and reduces dependencies. Optimized for quick processing of brief audio inputs, it suits agents that handle voice commands, short dictations, or concise audio messages. You need an active Azure subscription, a provisioned Azure Speech resource, and the corresponding API key and region (or endpoint) configured as environment variables for authentication.
Best use case
- Converting short voice commands or queries into text for an AI agent to process.
- Transcribing brief audio notes or messages for storage or further text-based analysis.
- Enabling voice-controlled interactions in applications where user input is typically short and direct.
- Processing short audio recordings as part of a larger workflow that requires text conversion before tasks like sentiment analysis or entity recognition.
The AI agent will receive an accurate textual transcription of the provided short audio input.
Practical example
Example input
An AI agent receives an audio recording containing the spoken phrase 'Schedule a meeting for tomorrow at 2 PM.' and invokes this skill with the audio data.
Example output
{"transcription": "Schedule a meeting for tomorrow at 2 PM."}When to use this skill
- When you need to accurately transcribe audio files that are 60 seconds or less in duration.
- When your AI agent requires speech recognition capabilities and you prefer a direct REST API approach over the Azure Speech SDK.
- When integrating speech-to-text into a Claude agent that processes short verbal inputs or user commands.
- When an Azure-based speech recognition solution is preferred for its enterprise-grade accuracy and integration capabilities.
When not to use this skill
- When transcribing audio files that exceed the 60-second duration limit, as this skill is not designed for longer content.
- When advanced speech recognition features are required, such as real-time continuous transcription, custom acoustic models, speaker diarization, or language identification, which are typically found in the full Azure Speech SDK.
- When offline speech recognition is a necessity, as this skill relies on the Azure cloud service for processing.
- If you do not have an Azure subscription or prefer not to use Azure Cognitive Services for speech transcription.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/azure-speech-to-text-rest-py/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How azure-speech-to-text-rest-py Compares
| Feature / Agent | azure-speech-to-text-rest-py | Standard Approach |
|---|---|---|
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Medium | N/A |
Frequently Asked Questions
What does this skill do?
Azure Speech to Text REST API for short audio (Python). Use for simple speech recognition of audio files up to 60 seconds without the Speech SDK.
Which AI agents support this skill?
This skill is designed for Claude.
How difficult is it to install?
The installation complexity is rated as medium. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Azure Speech to Text REST API for Short Audio
Simple REST API for speech-to-text transcription of short audio files (up to 60 seconds). No SDK required - just HTTP requests.
## Prerequisites
1. **Azure subscription** - [Create one free](https://azure.microsoft.com/free/)
2. **Speech resource** - Create in [Azure Portal](https://portal.azure.com/#create/Microsoft.CognitiveServicesSpeechServices)
3. **Get credentials** - After deployment, go to resource > Keys and Endpoint
## Environment Variables
```bash
# Required
AZURE_SPEECH_KEY=<your-speech-resource-key>
AZURE_SPEECH_REGION=<region> # e.g., eastus, westus2, westeurope
# Alternative: Use endpoint directly
AZURE_SPEECH_ENDPOINT=https://<region>.stt.speech.microsoft.com
```
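The samples in this document build the request URL from `AZURE_SPEECH_REGION`. If you set `AZURE_SPEECH_ENDPOINT` instead, a small helper can resolve the base URL from whichever variable is present (a sketch; the helper name is ours, not part of the API):

```python
import os

def get_stt_base_url() -> str:
    """Resolve the short-audio STT base URL from environment variables."""
    endpoint = os.environ.get("AZURE_SPEECH_ENDPOINT")
    if endpoint:
        return endpoint.rstrip("/")  # e.g. https://eastus.stt.speech.microsoft.com
    region = os.environ.get("AZURE_SPEECH_REGION")
    if region:
        return f"https://{region}.stt.speech.microsoft.com"
    raise RuntimeError("Set AZURE_SPEECH_ENDPOINT or AZURE_SPEECH_REGION")
```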
## Installation
```bash
pip install requests
```
## Quick Start
```python
import os
import requests
def transcribe_audio(audio_file_path: str, language: str = "en-US") -> dict:
"""Transcribe short audio file (max 60 seconds) using REST API."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
headers = {
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json"
}
params = {
"language": language,
"format": "detailed" # or "simple"
}
with open(audio_file_path, "rb") as audio_file:
response = requests.post(url, headers=headers, params=params, data=audio_file)
response.raise_for_status()
return response.json()
# Usage
result = transcribe_audio("audio.wav", "en-US")
print(result["NBest"][0]["Display"])  # detailed format: the text is in NBest
```
## Audio Requirements
| Format | Codec | Sample Rate | Notes |
|--------|-------|-------------|-------|
| WAV | PCM | 16 kHz, mono | **Recommended** |
| OGG | OPUS | 16 kHz, mono | Smaller file size |
**Limitations:**
- Maximum 60 seconds of audio
- For pronunciation assessment: maximum 30 seconds
- No partial/interim results (final only)
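If your source audio is not already WAV PCM 16 kHz mono, convert it before upload. A minimal sketch that shells out to the ffmpeg CLI (assumes ffmpeg is installed and on your PATH):

```python
import subprocess

def convert_to_wav_16k_mono(src_path: str, dst_path: str) -> None:
    """Convert any ffmpeg-readable audio file to WAV PCM 16 kHz mono."""
    subprocess.run(
        [
            "ffmpeg",
            "-y",                    # overwrite the output file if it exists
            "-i", src_path,          # input file
            "-ac", "1",              # downmix to mono
            "-ar", "16000",          # resample to 16 kHz
            "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
            dst_path,
        ],
        check=True,
    )
```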
## Content-Type Headers
```python
# WAV PCM 16kHz
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000"
# OGG OPUS
"Content-Type": "audio/ogg; codecs=opus"
```
## Response Formats
### Simple Format (default)
```python
params = {"language": "en-US", "format": "simple"}
```
```json
{
"RecognitionStatus": "Success",
"DisplayText": "Remind me to buy 5 pencils.",
"Offset": "1236645672289",
"Duration": "1236645672289"
}
```
### Detailed Format
```python
params = {"language": "en-US", "format": "detailed"}
```
```json
{
"RecognitionStatus": "Success",
"Offset": "1236645672289",
"Duration": "1236645672289",
"NBest": [
{
"Confidence": 0.9052885,
"Display": "What's the weather like?",
"ITN": "what's the weather like",
"Lexical": "what's the weather like",
"MaskedITN": "what's the weather like"
}
]
}
```
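In the detailed format, the recognized text lives in the `NBest` alternatives. A small helper (illustrative, based on the schema above) that picks the highest-confidence one:

```python
def best_hypothesis(result: dict) -> tuple[str, float]:
    """Return the display text and confidence of the top NBest alternative."""
    if result.get("RecognitionStatus") != "Success":
        raise ValueError(f"Recognition failed: {result.get('RecognitionStatus')}")
    top = max(result["NBest"], key=lambda alt: alt["Confidence"])
    return top["Display"], top["Confidence"]

# Usage with the Quick Start result
text, confidence = best_hypothesis(result)
print(f"{text} (confidence {confidence:.2f})")
```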
## Chunked Transfer (Recommended)
For lower latency, stream audio in chunks:
```python
import os
import requests
def transcribe_chunked(audio_file_path: str, language: str = "en-US") -> dict:
"""Stream audio in chunks for lower latency."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
headers = {
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json",
"Transfer-Encoding": "chunked",
"Expect": "100-continue"
}
params = {"language": language, "format": "detailed"}
def generate_chunks(file_path: str, chunk_size: int = 1024):
with open(file_path, "rb") as f:
while chunk := f.read(chunk_size):
yield chunk
response = requests.post(
url,
headers=headers,
params=params,
data=generate_chunks(audio_file_path)
)
response.raise_for_status()
return response.json()
```
## Authentication Options
### Option 1: Subscription Key (Simple)
```python
headers = {
"Ocp-Apim-Subscription-Key": os.environ["AZURE_SPEECH_KEY"]
}
```
### Option 2: Bearer Token
```python
import requests
import os
def get_access_token() -> str:
"""Get access token from the token endpoint."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
token_url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
response = requests.post(
token_url,
headers={
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "application/x-www-form-urlencoded",
"Content-Length": "0"
}
)
response.raise_for_status()
return response.text
# Use token in requests (valid for 10 minutes)
token = get_access_token()
headers = {
"Authorization": f"Bearer {token}",
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json"
}
```
## Query Parameters
| Parameter | Required | Values | Description |
|-----------|----------|--------|-------------|
| `language` | **Yes** | `en-US`, `de-DE`, etc. | Language of speech |
| `format` | No | `simple`, `detailed` | Result format (default: simple) |
| `profanity` | No | `masked`, `removed`, `raw` | Profanity handling (default: masked) |
## Recognition Status Values
| Status | Description |
|--------|-------------|
| `Success` | Recognition succeeded |
| `NoMatch` | Speech detected but no words matched |
| `InitialSilenceTimeout` | Only silence detected |
| `BabbleTimeout` | Only noise detected |
| `Error` | Internal service error |
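An HTTP 200 response can still carry a non-`Success` status, so branch on `RecognitionStatus` before trusting the text. A minimal sketch (the error messages are ours, paraphrasing the table above):

```python
def check_recognition_status(result: dict) -> str:
    """Return the recognized text on success; raise a descriptive error otherwise."""
    status = result.get("RecognitionStatus")
    if status == "Success":
        # Simple format carries DisplayText; detailed format carries NBest
        return result.get("DisplayText") or result["NBest"][0]["Display"]
    messages = {
        "NoMatch": "Speech was detected, but no words were recognized.",
        "InitialSilenceTimeout": "The audio contained only silence.",
        "BabbleTimeout": "The audio contained only noise.",
        "Error": "The service reported an internal error.",
    }
    raise RuntimeError(messages.get(status, f"Unexpected status: {status}"))
```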
## Profanity Handling
```python
# Mask profanity with asterisks (default)
params = {"language": "en-US", "profanity": "masked"}
# Remove profanity entirely
params = {"language": "en-US", "profanity": "removed"}
# Include profanity as-is
params = {"language": "en-US", "profanity": "raw"}
```
## Error Handling
```python
import os
import requests
def transcribe_with_error_handling(audio_path: str, language: str = "en-US") -> dict | None:
"""Transcribe with proper error handling."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
try:
with open(audio_path, "rb") as audio_file:
response = requests.post(
url,
headers={
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json"
},
params={"language": language, "format": "detailed"},
data=audio_file
)
if response.status_code == 200:
result = response.json()
if result.get("RecognitionStatus") == "Success":
return result
else:
print(f"Recognition failed: {result.get('RecognitionStatus')}")
return None
        elif response.status_code == 400:
            print("Bad request: check the language code or audio format")
        elif response.status_code == 401:
            print("Unauthorized: check the API key or token")
        elif response.status_code == 403:
            print("Forbidden: missing authorization header")
else:
print(f"Error {response.status_code}: {response.text}")
return None
except requests.exceptions.RequestException as e:
print(f"Request failed: {e}")
return None
```
## Async Version
```python
import os
import aiohttp
import asyncio
async def transcribe_async(audio_file_path: str, language: str = "en-US") -> dict:
"""Async version using aiohttp."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
headers = {
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json"
}
params = {"language": language, "format": "detailed"}
async with aiohttp.ClientSession() as session:
with open(audio_file_path, "rb") as f:
audio_data = f.read()
async with session.post(url, headers=headers, params=params, data=audio_data) as response:
response.raise_for_status()
return await response.json()
# Usage
result = asyncio.run(transcribe_async("audio.wav", "en-US"))
print(result["NBest"][0]["Display"])  # detailed format: the text is in NBest
```
## Supported Languages
Common language codes (see [full list](https://learn.microsoft.com/azure/ai-services/speech-service/language-support)):
| Code | Language |
|------|----------|
| `en-US` | English (US) |
| `en-GB` | English (UK) |
| `de-DE` | German |
| `fr-FR` | French |
| `es-ES` | Spanish (Spain) |
| `es-MX` | Spanish (Mexico) |
| `zh-CN` | Chinese (Mandarin) |
| `ja-JP` | Japanese |
| `ko-KR` | Korean |
| `pt-BR` | Portuguese (Brazil) |
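To transcribe non-English audio, pass the matching code to the helpers above (the file name here is illustrative):

```python
# German transcription, reusing transcribe_audio from the Quick Start
result = transcribe_audio("notiz.wav", language="de-DE")
```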
## Best Practices
1. **Use WAV PCM 16kHz mono** for best compatibility
2. **Enable chunked transfer** for lower latency
3. **Cache access tokens** for 9 minutes (valid for 10); see the sketch after this list
4. **Specify the correct language** for accurate recognition
5. **Use detailed format** when you need confidence scores
6. **Handle all RecognitionStatus values** in production code
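For practice 3, a minimal caching wrapper around `get_access_token` from the Bearer Token section (the 9-minute refresh margin is our choice to stay safely under the 10-minute validity):

```python
import time

_token_cache = {"token": None, "expires_at": 0.0}

def get_cached_token() -> str:
    """Return a cached access token, refreshing it roughly every 9 minutes."""
    now = time.monotonic()
    if _token_cache["token"] is None or now >= _token_cache["expires_at"]:
        _token_cache["token"] = get_access_token()  # defined in the Bearer Token section
        _token_cache["expires_at"] = now + 9 * 60   # refresh one minute early
    return _token_cache["token"]
```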
## When NOT to Use This API
Use the Speech SDK or Batch Transcription API instead when you need:
- Audio longer than 60 seconds
- Real-time streaming transcription
- Partial/interim results
- Speech translation
- Custom speech models
- Batch transcription of many files
## Reference Files
| File | Contents |
|------|----------|
| references/pronunciation-assessment.md | Pronunciation assessment parameters and scoring |
## When to Use
Use this skill to execute the workflow and actions described in the overview.