audio-to-text-and-video-to-text

Transcribe audio and video files into text using OpenAI's Whisper API. Use this skill whenever a user wants to convert any audio or video file to text — including MP3, MP4, WAV, M4A, OGG, WEBM, MOV, AVI, FLAC, and more. Trigger this skill for any request involving: "transcribe", "convert audio to text", "speech to text", "get transcript of", "extract audio from video", "meeting notes from recording", "subtitles", "captions", or similar. Also trigger when the user uploads or references a media file and asks what was said, discussed, or mentioned in it. If unsure whether audio/video transcription is involved, use this skill.

3,891 stars

Best use case

audio-to-text-and-video-to-text is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using audio-to-text-and-video-to-text should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/transcription/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/ahqazi-dev/audio-to-text-and-video-to-text/transcription/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/transcription/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How audio-to-text-and-video-to-text Compares

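| Feature / Agent | audio-to-text-and-video-to-text | Standard Approach |
|-----------------|---------------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |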

Frequently Asked Questions

What does this skill do?

It transcribes audio and video files into text using OpenAI's Whisper API, with ffmpeg handling media extraction. See the skill description at the top of this page for the supported formats and trigger phrases.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Transcription Skill

Converts audio and video files into clean, readable text using OpenAI's Whisper API and ffmpeg for media handling.

## Overview

This skill handles the full pipeline:
1. **Media extraction** — use ffmpeg to strip audio from video files and convert to a Whisper-compatible format
2. **Chunking** — split files larger than the chunk size (20 MB by default) into overlapping segments to stay within the API's 25 MB upload limit
3. **Transcription** — send each chunk to OpenAI's Whisper API
4. **Assembly** — merge chunk transcripts, adjusting timestamps, into a single clean output
5. **Post-processing** — optionally clean up with Claude (punctuation, speaker labels, summaries)
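As a rough illustration of step 3, the sketch below sends one chunk to the transcription endpoint. It assumes the official `openai` Python SDK (v1 or later), where `client.audio.transcriptions.create` is the transcription call. The helper name `transcribe_chunk` and its parameters are hypothetical and are not taken from `scripts/transcribe.py`.

```python
from pathlib import Path
from typing import Any, Optional


def transcribe_chunk(
    client: Any,
    chunk_path: Path,
    model: str = "whisper-1",
    language: Optional[str] = None,
    prompt: Optional[str] = None,
) -> str:
    """Send one audio chunk to the transcription endpoint and return its text.

    `client` is expected to be an `openai.OpenAI()` instance (assumption:
    openai SDK v1+). Optional arguments are only sent when they are set.
    """
    kwargs = {"model": model, "response_format": "text"}
    if language:
        kwargs["language"] = language
    if prompt:
        kwargs["prompt"] = prompt
    with open(chunk_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(file=audio_file, **kwargs)
    # With response_format="text" the SDK returns a plain string; other
    # formats return an object exposing a `.text` attribute.
    return result if isinstance(result, str) else result.text
```

Passing the client in as an argument keeps the helper easy to exercise with a stub, so the request-building logic can be checked without network access or an API key.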

## Requirements

- **ffmpeg** must be installed (`which ffmpeg` to verify — it's usually pre-installed in claude.ai's environment)
- **OpenAI API key** stored in the environment as `OPENAI_API_KEY` — the user must provide this
- Python packages: `openai`, `pydub` (install via pip if needed)

## Quick Start

When a user provides a media file, run the transcription script:

```bash
# Install dependencies if missing
pip install openai pydub --break-system-packages -q

# Run transcription
python /home/claude/transcription/scripts/transcribe.py \
  --input "/path/to/media/file" \
  --output "/mnt/user-data/outputs/transcript.txt" \
  --api-key "$OPENAI_API_KEY"
```

See `scripts/transcribe.py` for the full implementation.

## Supported Formats

| Category | Formats |
|----------|---------|
| Audio | mp3, wav, m4a, ogg, flac, aac, opus, wma |
| Video | mp4, mov, avi, mkv, webm, wmv, m4v |

ffmpeg handles extraction from any of these.
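A minimal sketch of how the extraction step could be wired up is shown below. The extension sets mirror the table above. The ffmpeg options used (`-vn` to drop video, `-ac 1` for mono, `-ar 16000` for a 16 kHz sample rate, `-b:a 64k` for the audio bitrate) are standard ffmpeg flags, but the specific output settings and the function names are assumptions, not the script's actual behavior.

```python
import subprocess
from pathlib import Path
from typing import List

AUDIO_EXTS = {"mp3", "wav", "m4a", "ogg", "flac", "aac", "opus", "wma"}
VIDEO_EXTS = {"mp4", "mov", "avi", "mkv", "webm", "wmv", "m4v"}


def media_kind(path: Path) -> str:
    """Classify a file as 'audio' or 'video' from its extension."""
    ext = path.suffix.lower().lstrip(".")
    if ext in AUDIO_EXTS:
        return "audio"
    if ext in VIDEO_EXTS:
        return "video"
    raise ValueError(f"Unsupported media format: {path.suffix or '(none)'}")


def build_ffmpeg_command(src: Path, dst: Path) -> List[str]:
    """Build an ffmpeg command that writes a mono, 16 kHz, 64 kbps audio file."""
    return [
        "ffmpeg", "-y",        # overwrite the destination if it exists
        "-i", str(src),
        "-vn",                 # ignore any video stream
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz
        "-b:a", "64k",         # audio bitrate (assumed value)
        str(dst),
    ]


def extract_audio(src: Path, dst: Path) -> Path:
    """Validate the extension, then run ffmpeg; raises CalledProcessError on failure."""
    media_kind(src)
    subprocess.run(build_ffmpeg_command(src, dst), check=True, capture_output=True)
    return dst
```

Keeping the command builder separate from the `subprocess.run` call makes it possible to inspect or log the exact command before anything is executed.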

## Options & Flags

| Flag | Default | Description |
|------|---------|-------------|
| `--model` | `whisper-1` | Transcription model to use (`whisper-1`, `gpt-4o-transcribe`) |
| `--language` | auto-detect | ISO 639-1 language code (e.g. `en`, `ar`, `fr`) |
| `--format` | `txt` | Output format: `txt`, `srt`, `vtt`, `json` |
| `--timestamps` | off | Include timestamps in output |
| `--chunk-size` | `20` | Max chunk size in MB (must be ≤ 25) |
| `--prompt` | none | Context hint to improve accuracy (e.g. domain vocab) |

## Output Formats

- **txt** — plain text, ideal for most uses
- **srt** — SubRip subtitle format (for video players)
- **vtt** — WebVTT format (for web video)
- **json** — full Whisper JSON with segments and timestamps
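To make the `srt` format concrete, here is a small sketch that renders timestamped segments as SubRip text. It assumes each segment is a dict with `start`, `end` (seconds), and `text` keys, which matches the segment fields in Whisper's verbose JSON output; the function names are illustrative.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SubRip uses a comma before milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

WebVTT output is similar: the file starts with a `WEBVTT` header line and timestamps use a period instead of a comma before the milliseconds.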

## Step-by-Step Workflow

### 1. Check for the file

Ask the user to upload the file or provide a local path. Check:
```bash
ls /mnt/user-data/uploads/
```

### 2. Check ffmpeg and install deps

```bash
which ffmpeg && ffmpeg -version 2>&1 | head -1
pip install openai pydub --break-system-packages -q 2>&1 | tail -3
```
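The same check can be done from Python before the script starts any work. This is a sketch using only the standard library; the function name `missing_dependencies` is hypothetical.

```python
import importlib.util
import shutil
from typing import List, Sequence


def missing_dependencies(
    binaries: Sequence[str] = ("ffmpeg",),
    modules: Sequence[str] = ("openai", "pydub"),
) -> List[str]:
    """Return the names of required executables and packages that are not available."""
    # shutil.which mirrors the shell's `which`: None means not on PATH.
    missing = [name for name in binaries if shutil.which(name) is None]
    # find_spec returns None when a top-level module cannot be imported.
    missing += [name for name in modules if importlib.util.find_spec(name) is None]
    return missing
```

An empty list means everything was found; otherwise the returned names can be shown to the user along with the install commands from the Error Handling table.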

### 3. Get the API key

If `OPENAI_API_KEY` is not set in the environment, ask the user:
> "Please provide your OpenAI API key — it starts with `sk-`. You can get one at https://platform.openai.com/api-keys"
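One way to resolve the key inside the script is sketched below, assuming an explicit `--api-key` value takes precedence over the `OPENAI_API_KEY` environment variable. That precedence and the helper name `resolve_api_key` are assumptions.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the API key from the CLI flag or the environment, or raise if absent."""
    env = os.environ if env is None else env
    key = (cli_value or env.get("OPENAI_API_KEY") or "").strip()
    if not key:
        # The caller should catch this and prompt the user for a key.
        raise RuntimeError("OPENAI_API_KEY is not set; ask the user to provide a key")
    return key
```

Reading the key from the environment rather than the command line also keeps it out of shell history and process listings.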

### 4. Run the script

```bash
python /home/claude/transcription/scripts/transcribe.py \
  --input "<file_path>" \
  --output "/mnt/user-data/outputs/transcript.txt"
```

### 5. Post-process (optional but recommended)

After transcription, offer to:
- **Clean up** punctuation/formatting with Claude
- **Summarize** the content
- **Extract** action items, speakers, or key topics
- **Translate** to another language

Use the transcript text directly in the conversation for these steps.

## Handling Large Files

The script automatically splits files larger than the chunk size (20 MB by default) into overlapping chunks, with a 1-second overlap for continuity. Each chunk is transcribed separately and the results are merged.

For very long recordings (> 1 hour), warn the user it may take a few minutes and show progress.
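The chunking and merging logic can be sketched as follows. The sketch assumes constant-bitrate audio, so a size limit in MB maps to a fixed duration, and it uses a deliberately naive rule for the overlap: a segment is dropped if it ends before audio that is already covered. All function names are hypothetical and the real script may work differently.

```python
from typing import Dict, List, Tuple


def seconds_per_chunk(chunk_mb: float, bitrate_bps: int) -> float:
    """Duration that fits in chunk_mb at a constant bitrate (bits per second)."""
    return chunk_mb * 1024 * 1024 * 8 / bitrate_bps


def plan_chunks(
    duration_s: float, chunk_s: float, overlap_s: float = 1.0
) -> List[Tuple[float, float]]:
    """Return (start, end) windows covering the recording, each overlapping the last."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk length must exceed the overlap")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # step back so consecutive chunks share audio
    return windows


def merge_segments(chunks: List[Tuple[float, List[Dict]]]) -> List[Dict]:
    """Shift chunk-relative segment times by each chunk's offset and concatenate.

    `chunks` is a list of (chunk_start_seconds, segments) pairs.
    """
    merged: List[Dict] = []
    for offset, segments in chunks:
        for seg in segments:
            start, end = seg["start"] + offset, seg["end"] + offset
            # Naive overlap handling: skip segments already covered.
            if merged and end <= merged[-1]["end"]:
                continue
            merged.append({"start": start, "end": end, "text": seg["text"]})
    return merged
```

For example, a 25-second file with 10-second chunks and a 1-second overlap yields the windows (0, 10), (9, 19), and (18, 25).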

## Error Handling

| Error | Fix |
|-------|-----|
| `AuthenticationError` | Invalid API key — ask user to verify |
| `RateLimitError` | Wait 60s and retry, or use `--chunk-size 10` |
| `InvalidRequestError: file too large` | Reduce `--chunk-size` below 25 |
| `ffmpeg not found` | `sudo apt install ffmpeg` or `brew install ffmpeg` |
| `No audio stream found` | File may be corrupt or wrong format |
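The "wait and retry" advice for rate limits could be implemented with a small wrapper like the one below. It is a generic sketch: the exception types to retry on are passed in by the caller (with the `openai` SDK v1+ that would typically be `openai.RateLimitError`), and the helper name, attempt count, and linear backoff are assumptions.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on the given exceptions with a linearly growing delay."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay_s * attempt)
    raise RuntimeError("attempts must be at least 1")
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` waits 60 seconds after the first failure and 120 seconds after the second.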

## Example Interaction

```
User: "Can you transcribe this meeting recording?"
[uploads meeting.mp4]

→ Check file exists in /mnt/user-data/uploads/
→ Run transcribe.py on it
→ Save transcript to /mnt/user-data/outputs/
→ present_files() to the user
→ Offer to summarize or extract action items
```

## Notes for openclaw.ai

- Always save output to `/mnt/user-data/outputs/` so users can download it
- Use `present_files()` to share the transcript file with the user after saving
- For business users, suggest the `srt` or `vtt` format if they're adding captions to video
- The `--prompt` flag is useful for technical/domain-specific content: pass a few domain keywords to improve accuracy
