azure-ai-transcription-py

Azure AI Transcription SDK for Python. Use for real-time and batch speech-to-text transcription with timestamps and diarization.

31,392 stars
Complexity: easy

About this skill

This skill integrates the Azure AI Transcription SDK for Python, empowering AI agents to accurately convert spoken language from audio into written text. It supports both real-time audio streams and batch processing of pre-recorded audio files, making it versatile for a wide array of applications. Key features include precise word-level timestamps, enabling detailed analysis of spoken content, and speaker diarization, which identifies and separates individual speakers in a conversation. This allows AI agents to process complex audio dialogues and extract structured information from speech efficiently and reliably, harnessing Microsoft's advanced AI capabilities.

Best use case

Transcribing meeting recordings, interviews, or lectures to generate searchable text and identify speakers. Enabling voice control interfaces or conversational AI agents to understand user commands and queries from live audio. Analyzing customer service calls for sentiment, key topics, and agent-customer interaction patterns. Processing multimedia content (videos, podcasts) to create accurate captions, subtitles, or comprehensive searchable transcripts. Providing accessibility features by converting spoken content into text for individuals with hearing impairments.

Expected output

The AI agent will receive an accurate textual transcription of the provided audio input, typically including word-level timestamps and speaker segmentation (diarization) if enabled. The output is often structured data (e.g., JSON) representing the spoken content, ready for further natural language processing, analysis, or display.

Practical example

Example input

Transcribe the audio located at 'https://<storage>.blob.core.windows.net/recordings/customer_call_20230101.wav' in English, identifying different speakers and providing timestamps for each spoken word.

Example output

```json
{
  "status": "success",
  "transcript": "Hello, my name is John. (Speaker 1) How can I help you? (Speaker 2) I'm having trouble with my account. (Speaker 1)",
  "words": [
    {"word": "Hello", "start": "0.10s", "end": "0.50s", "speaker": 1},
    {"word": "my", "start": "0.55s", "end": "0.65s", "speaker": 1},
    {"word": "name", "start": "0.70s", "end": "0.90s", "speaker": 1},
    {"word": "is", "start": "0.95s", "end": "1.05s", "speaker": 1},
    {"word": "John.", "start": "1.10s", "end": "1.50s", "speaker": 1},
    {"word": "How", "start": "2.00s", "end": "2.20s", "speaker": 2},
    {"word": "can", "start": "2.25s", "end": "2.35s", "speaker": 2},
    {"word": "I", "start": "2.40s", "end": "2.50s", "speaker": 2},
    {"word": "help", "start": "2.55s", "end": "2.75s", "speaker": 2},
    {"word": "you?", "start": "2.80s", "end": "3.00s", "speaker": 2},
    {"word": "I'm", "start": "3.50s", "end": "3.70s", "speaker": 1},
    {"word": "having", "start": "3.75s", "end": "4.00s", "speaker": 1},
    {"word": "trouble", "start": "4.05s", "end": "4.35s", "speaker": 1},
    {"word": "with", "start": "4.40s", "end": "4.55s", "speaker": 1},
    {"word": "my", "start": "4.60s", "end": "4.70s", "speaker": 1},
    {"word": "account.", "start": "4.75s", "end": "5.20s", "speaker": 1}
  ],
  "language": "en-US",
  "diarization_enabled": true
}
```
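
Downstream code usually wants speaker turns rather than raw word entries. A minimal sketch in plain Python (independent of any SDK; the helper name `group_turns` is illustrative) that folds a `words` array like the one above into attributed utterances:

```python
def group_turns(words):
    """Collapse consecutive same-speaker words into (speaker, text) turns."""
    turns = []
    for w in words:
        if turns and turns[-1][0] == w["speaker"]:
            # Same speaker is still talking: append to the current turn.
            turns[-1] = (w["speaker"], turns[-1][1] + " " + w["word"])
        else:
            # Speaker changed: start a new turn.
            turns.append((w["speaker"], w["word"]))
    return turns
```

Applied to the example output, this yields one turn per speaker change, e.g. `(1, "Hello my name is John.")`, `(2, "How can I help you?")`, `(1, "I'm having trouble with my account.")`.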

When to use this skill

  • When an AI agent needs to process audio input (live streams or recorded files) and convert it into textual data.
  • When high accuracy speech-to-text transcription is required, potentially with support for various languages and custom vocabulary.
  • When identifying different speakers in a multi-participant conversation is crucial for understanding context and attributing statements.
  • When precise timing of spoken words (timestamps) is necessary for further analysis, synchronization with video, or content editing.
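
The subtitle use case above needs only the word-level timestamps, not the SDK itself. A minimal sketch (plain Python; `words_to_srt` and the 0.7 s pause threshold are illustrative assumptions) that converts a `words` array in the format shown earlier into SRT cues:

```python
def _srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_gap=0.7):
    """Group word entries into subtitle cues, splitting on pauses > max_gap seconds."""
    cues, current = [], []
    for w in words:
        start = float(w["start"].rstrip("s"))
        end = float(w["end"].rstrip("s"))
        if current and start - current[-1][1] > max_gap:
            cues.append(current)
            current = []
        current.append((start, end, w["word"]))
    if current:
        cues.append(current)
    lines = []
    for i, cue in enumerate(cues, 1):
        text = " ".join(t for _, _, t in cue)
        lines.append(f"{i}\n{_srt_time(cue[0][0])} --> {_srt_time(cue[-1][1])}\n{text}\n")
    return "\n".join(lines)
```

Splitting on silence gaps is a simple heuristic; production captioning would also cap cue length and line width.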

When not to use this skill

  • When transcription needs to be performed entirely offline or on-device without any dependency on a cloud service.
  • When strict privacy regulations or security policies prohibit sending audio data to a third-party cloud service.
  • When only basic, small-scale transcription is needed and integrating a full SDK would be overkill compared to simpler local alternatives.
  • If the primary requirement is text-to-speech (TTS) synthesis, rather than speech-to-text transcription.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/azure-ai-transcription-py/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/azure-ai-transcription-py/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/azure-ai-transcription-py/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How azure-ai-transcription-py Compares

| Feature / Agent | azure-ai-transcription-py | Standard Approach |
| --- | --- | --- |
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | easy | N/A |

Frequently Asked Questions

What does this skill do?

Azure AI Transcription SDK for Python. Use for real-time and batch speech-to-text transcription with timestamps and diarization.

Which AI agents support this skill?

This skill is designed for Claude.

How difficult is it to install?

The installation complexity is rated as easy. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Azure AI Transcription SDK for Python

Client library for Azure AI Transcription (speech-to-text) with real-time and batch transcription.

## Installation

```bash
pip install azure-ai-transcription
```

## Environment Variables

```bash
TRANSCRIPTION_ENDPOINT=https://<resource>.cognitiveservices.azure.com
TRANSCRIPTION_KEY=<your-key>
```

## Authentication

Use subscription key authentication (DefaultAzureCredential is not supported for this client):

```python
import os
from azure.ai.transcription import TranscriptionClient

client = TranscriptionClient(
    endpoint=os.environ["TRANSCRIPTION_ENDPOINT"],
    credential=os.environ["TRANSCRIPTION_KEY"]
)
```

## Transcription (Batch)

```python
job = client.begin_transcription(
    name="meeting-transcription",
    locale="en-US",
    content_urls=["https://<storage>/audio.wav"],
    diarization_enabled=True
)
result = job.result()
print(result.status)
```

## Transcription (Real-time)

```python
stream = client.begin_stream_transcription(locale="en-US")
stream.send_audio_file("audio.wav")
for event in stream:
    print(event.text)
stream.close()  # release the session when done
```

## Best Practices

1. **Enable diarization** when multiple speakers are present
2. **Use batch transcription** for long files stored in blob storage
3. **Capture timestamps** for subtitle generation
4. **Specify language** to improve recognition accuracy
5. **Handle streaming backpressure** for real-time transcription
6. **Close transcription sessions** when complete
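
Practice 5 depends on feeding audio to the session at a controlled rate rather than all at once. One way to sketch that in plain Python, with no SDK dependency (`chunk_size` and the pacing scheme are illustrative assumptions):

```python
import time

def iter_audio_chunks(path, chunk_size=4096, bytes_per_second=None):
    """Yield fixed-size byte chunks from an audio file.

    If bytes_per_second is set, sleep between chunks so the stream is
    paced at roughly that rate instead of flooding the transcription
    session with the whole file at once.
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file
            yield chunk
            if bytes_per_second:
                time.sleep(chunk_size / bytes_per_second)
```

For 16 kHz 16-bit mono PCM, `bytes_per_second=32000` would pace the chunks at real-time speed; each chunk would then be handed to the streaming session's send method.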

## When to Use
Use this skill whenever an agent needs to carry out the transcription workflows described in the overview above.

Related Skills

All from sickn33/antigravity-awesome-skills:

  • fal-audio: Text-to-speech and speech-to-text using fal.ai audio models. (Audio Processing, Claude)
  • microsoft-azure-webjobs-extensions-authentication-events-dotnet: Microsoft Entra Authentication Events SDK for .NET. Azure Functions triggers for custom authentication extensions. (Identity Management / Authentication & Authorization, Claude)
  • azure-web-pubsub-ts: Real-time messaging with WebSocket connections and pub/sub patterns. (Messaging & Communication, Claude)
  • azure-storage-queue-ts: Azure Queue Storage JavaScript/TypeScript SDK (@azure/storage-queue) for message queue operations. Use for sending, receiving, peeking, and deleting messages in queues. (Cloud Integration, Claude)
  • azure-storage-queue-py: Azure Queue Storage SDK for Python. Use for reliable message queuing, task distribution, and asynchronous processing. (Cloud Integration, Claude)
  • azure-storage-file-share-ts: Azure File Share JavaScript/TypeScript SDK (@azure/storage-file-share) for SMB file share operations. (Cloud Storage Management, Claude)
  • azure-storage-file-share-py: Azure Storage File Share SDK for Python. Use for SMB file shares, directories, and file operations in the cloud. (Cloud Storage Management, Claude)
  • azure-storage-file-datalake-py: Azure Data Lake Storage Gen2 SDK for Python. Use for hierarchical file systems, big data analytics, and file/directory operations. (Cloud Storage Management, Claude)
  • azure-storage-blob-ts: Azure Blob Storage JavaScript/TypeScript SDK (@azure/storage-blob) for blob operations. Use for uploading, downloading, listing, and managing blobs and containers. (Cloud Storage Management, Claude)
  • azure-storage-blob-rust: Azure Blob Storage SDK for Rust. Use for uploading, downloading, and managing blobs and containers. (Cloud Storage Management, Claude)
  • azure-storage-blob-py: Azure Blob Storage SDK for Python. Use for uploading, downloading, listing blobs, managing containers, and blob lifecycle. (Cloud Storage Management, Claude)
  • azure-speech-to-text-rest-py: Azure Speech to Text REST API for short audio (Python). Use for simple speech recognition of audio files up to 60 seconds without the Speech SDK. (Speech-to-Text, Claude)