azure-speech-to-text-rest-py
Azure Speech to Text REST API for short audio (Python). Use for simple speech recognition of audio files up to 60 seconds without the Speech SDK.
Best use case
azure-speech-to-text-rest-py is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Azure Speech to Text REST API for short audio (Python). Use for simple speech recognition of audio files up to 60 seconds without the Speech SDK.
Teams using azure-speech-to-text-rest-py should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/azure-speech-to-text-rest-py/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How azure-speech-to-text-rest-py Compares
| Feature / Agent | azure-speech-to-text-rest-py | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Azure Speech to Text REST API for short audio (Python). Use for simple speech recognition of audio files up to 60 seconds without the Speech SDK.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Azure Speech to Text REST API for Short Audio
Simple REST API for speech-to-text transcription of short audio files (up to 60 seconds). No SDK required - just HTTP requests.
## Prerequisites
1. **Azure subscription** - [Create one free](https://azure.microsoft.com/free/)
2. **Speech resource** - Create in [Azure Portal](https://portal.azure.com/#create/Microsoft.CognitiveServicesSpeechServices)
3. **Get credentials** - After deployment, go to resource > Keys and Endpoint
## Environment Variables
```bash
# Required
AZURE_SPEECH_KEY=<your-speech-resource-key>
AZURE_SPEECH_REGION=<region> # e.g., eastus, westus2, westeurope
# Alternative: Use endpoint directly
AZURE_SPEECH_ENDPOINT=https://<region>.stt.speech.microsoft.com
```
## Installation
```bash
pip install requests
```
## Quick Start
```python
import os
import requests
def transcribe_audio(audio_file_path: str, language: str = "en-US") -> dict:
"""Transcribe short audio file (max 60 seconds) using REST API."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
headers = {
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json"
}
params = {
"language": language,
"format": "detailed" # or "simple"
}
with open(audio_file_path, "rb") as audio_file:
response = requests.post(url, headers=headers, params=params, data=audio_file)
response.raise_for_status()
return response.json()
# Usage
result = transcribe_audio("audio.wav", "en-US")
print(result["DisplayText"])
```
## Audio Requirements
| Format | Codec | Sample Rate | Notes |
|--------|-------|-------------|-------|
| WAV | PCM | 16 kHz, mono | **Recommended** |
| OGG | OPUS | 16 kHz, mono | Smaller file size |
**Limitations:**
- Maximum 60 seconds of audio
- For pronunciation assessment: maximum 30 seconds
- No partial/interim results (final only)
## Content-Type Headers
```python
# WAV PCM 16kHz
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000"
# OGG OPUS
"Content-Type": "audio/ogg; codecs=opus"
```
## Response Formats
### Simple Format (default)
```python
params = {"language": "en-US", "format": "simple"}
```
```json
{
"RecognitionStatus": "Success",
"DisplayText": "Remind me to buy 5 pencils.",
"Offset": "1236645672289",
"Duration": "1236645672289"
}
```
### Detailed Format
```python
params = {"language": "en-US", "format": "detailed"}
```
```json
{
"RecognitionStatus": "Success",
"Offset": "1236645672289",
"Duration": "1236645672289",
"NBest": [
{
"Confidence": 0.9052885,
"Display": "What's the weather like?",
"ITN": "what's the weather like",
"Lexical": "what's the weather like",
"MaskedITN": "what's the weather like"
}
]
}
```
## Chunked Transfer (Recommended)
For lower latency, stream audio in chunks:
```python
import os
import requests
def transcribe_chunked(audio_file_path: str, language: str = "en-US") -> dict:
"""Stream audio in chunks for lower latency."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
headers = {
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json",
"Transfer-Encoding": "chunked",
"Expect": "100-continue"
}
params = {"language": language, "format": "detailed"}
def generate_chunks(file_path: str, chunk_size: int = 1024):
with open(file_path, "rb") as f:
while chunk := f.read(chunk_size):
yield chunk
response = requests.post(
url,
headers=headers,
params=params,
data=generate_chunks(audio_file_path)
)
response.raise_for_status()
return response.json()
```
## Authentication Options
### Option 1: Subscription Key (Simple)
```python
headers = {
"Ocp-Apim-Subscription-Key": os.environ["AZURE_SPEECH_KEY"]
}
```
### Option 2: Bearer Token
```python
import requests
import os
def get_access_token() -> str:
"""Get access token from the token endpoint."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
token_url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
response = requests.post(
token_url,
headers={
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "application/x-www-form-urlencoded",
"Content-Length": "0"
}
)
response.raise_for_status()
return response.text
# Use token in requests (valid for 10 minutes)
token = get_access_token()
headers = {
"Authorization": f"Bearer {token}",
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json"
}
```
## Query Parameters
| Parameter | Required | Values | Description |
|-----------|----------|--------|-------------|
| `language` | **Yes** | `en-US`, `de-DE`, etc. | Language of speech |
| `format` | No | `simple`, `detailed` | Result format (default: simple) |
| `profanity` | No | `masked`, `removed`, `raw` | Profanity handling (default: masked) |
## Recognition Status Values
| Status | Description |
|--------|-------------|
| `Success` | Recognition succeeded |
| `NoMatch` | Speech detected but no words matched |
| `InitialSilenceTimeout` | Only silence detected |
| `BabbleTimeout` | Only noise detected |
| `Error` | Internal service error |
## Profanity Handling
```python
# Mask profanity with asterisks (default)
params = {"language": "en-US", "profanity": "masked"}
# Remove profanity entirely
params = {"language": "en-US", "profanity": "removed"}
# Include profanity as-is
params = {"language": "en-US", "profanity": "raw"}
```
## Error Handling
```python
import requests
def transcribe_with_error_handling(audio_path: str, language: str = "en-US") -> dict | None:
"""Transcribe with proper error handling."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
try:
with open(audio_path, "rb") as audio_file:
response = requests.post(
url,
headers={
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json"
},
params={"language": language, "format": "detailed"},
data=audio_file
)
if response.status_code == 200:
result = response.json()
if result.get("RecognitionStatus") == "Success":
return result
else:
print(f"Recognition failed: {result.get('RecognitionStatus')}")
return None
elif response.status_code == 400:
print(f"Bad request: Check language code or audio format")
elif response.status_code == 401:
print(f"Unauthorized: Check API key or token")
elif response.status_code == 403:
print(f"Forbidden: Missing authorization header")
else:
print(f"Error {response.status_code}: {response.text}")
return None
except requests.exceptions.RequestException as e:
print(f"Request failed: {e}")
return None
```
## Async Version
```python
import os
import aiohttp
import asyncio
async def transcribe_async(audio_file_path: str, language: str = "en-US") -> dict:
"""Async version using aiohttp."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
headers = {
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json"
}
params = {"language": language, "format": "detailed"}
async with aiohttp.ClientSession() as session:
with open(audio_file_path, "rb") as f:
audio_data = f.read()
async with session.post(url, headers=headers, params=params, data=audio_data) as response:
response.raise_for_status()
return await response.json()
# Usage
result = asyncio.run(transcribe_async("audio.wav", "en-US"))
print(result["DisplayText"])
```
## Supported Languages
Common language codes (see [full list](https://learn.microsoft.com/azure/ai-services/speech-service/language-support)):
| Code | Language |
|------|----------|
| `en-US` | English (US) |
| `en-GB` | English (UK) |
| `de-DE` | German |
| `fr-FR` | French |
| `es-ES` | Spanish (Spain) |
| `es-MX` | Spanish (Mexico) |
| `zh-CN` | Chinese (Mandarin) |
| `ja-JP` | Japanese |
| `ko-KR` | Korean |
| `pt-BR` | Portuguese (Brazil) |
## Best Practices
1. **Use WAV PCM 16kHz mono** for best compatibility
2. **Enable chunked transfer** for lower latency
3. **Cache access tokens** for 9 minutes (valid for 10)
4. **Specify the correct language** for accurate recognition
5. **Use detailed format** when you need confidence scores
6. **Handle all RecognitionStatus values** in production code
## When NOT to Use This API
Use the Speech SDK or Batch Transcription API instead when you need:
- Audio longer than 60 seconds
- Real-time streaming transcription
- Partial/interim results
- Speech translation
- Custom speech models
- Batch transcription of many files
## Reference Files
| File | Contents |
|------|----------|
| references/pronunciation-assessment.md | Pronunciation assessment parameters and scoring |
## When to Use
This skill is applicable to execute the workflow or actions described in the overview.Related Skills
extracting-ai-context
Extracts and manages AI context (skills, AGENTS.md) from workflow-kotlin library JARs. Use when setting up AI tooling for a workflow-kotlin project, updating skills after a library version change, or configuring agent-specific directories.
create-agent-with-sanity-context
Build AI agents with structured access to Sanity content via Context MCP. Covers Studio setup, agent implementation, and advanced patterns like client-side tools and custom rendering.
context-optimizer
Analyzes Copilot Chat debug logs, agent definitions, skills, and instruction files to audit context window utilization. Provides log parsing, turn-cost profiling, redundancy detection, hand-off gap analysis, and optimization recommendations. Use when optimizing agent context efficiency, identifying where to add subagent hand-offs, or reducing token waste across agent systems.
context-fundamentals
Understand the components, mechanics, and constraints of context in agent systems. Use when designing agent architectures, debugging context-related failures, or optimizing context usage.
context-engineering
Use when designing agent system prompts, optimizing RAG retrieval, or when context is too expensive or slow. Reduces tokens while maintaining quality through strategic positioning and attention-aware design.
context-degradation
Recognize patterns of context failure: lost-in-middle, poisoning, distraction, and clash
context-assembler
Assembles relevant context for agent spawns with prioritized ranking. Ranks packages by relevance, enforces token budgets with graduated zones, captures error patterns for learning, and supports configurable per-agent retrieval limits.
Codebase context
Create a lightweight codebase_context.md that anchors the idea in the existing repo (modules, constraints, extension points). Generic framework prompt.
azure-storage-file-datalake-py
Azure Data Lake Storage Gen2 SDK for Python. Use for hierarchical file systems, big data analytics, and file/directory operations.
alttext-ai-automation
Automate Alttext AI tasks via Rube MCP (Composio). Always search tools first for current schemas.
agent-context-system
A persistent local-only memory system for AI coding agents. Two files, one idea — AGENTS.md (committed, shared) + .agents.local.md (gitignored, personal). Agents read both at session start, update the scratchpad at session end, and promote stable patterns over time. Works across Claude Code, Cursor, Copilot, Windsurf. Subagent-ready. No plugins, no infrastructure, no background processes.
add-route-context
为Flutter页面添加路由上下文记录功能,支持日期等参数的AI上下文识别。当需要让AI助手通过"询问当前上下文"功能获取页面状态(如日期、ID等参数)时使用。适用场景:(1) 日期驱动的页面(日记、活动、日历等),(2) ID驱动的页面(用户详情、订单详情等),(3) 任何需要AI理解当前页面参数的场景