whisper-transcribe
Transcribes audio and video files to text using OpenAI's Whisper CLI, enhanced with contextual grounding from local markdown files for improved accuracy.
About this skill
This skill enables AI agents to transcribe audio and video files into text using the OpenAI Whisper command-line interface. Its standout feature is "contextual grounding": markdown files located in the same directory as the media guide Whisper to correctly identify and render technical terms, proper nouns, and domain-specific vocabulary that might otherwise be misrecognized. Users can employ it for a wide range of tasks, from converting recordings of meetings and interviews into precise transcripts to processing media full of specialized jargon. That ability to supply context makes it valuable in fields that demand high-fidelity documentation, such as scientific research, legal proceedings, and technical discussions. The skill manages the necessary installation and passes the appropriate media and context files to the underlying Whisper CLI to generate reliable text output.
Best use case
The primary use case is generating highly accurate text transcripts from audio and video, particularly when the content includes specialized terminology, unique names, or complex jargon. This skill is most beneficial for professionals like researchers, journalists, content creators, and corporate teams who frequently work with interviews, meetings, lectures, or other media containing domain-specific language, where standard transcription might falter.
The output is a highly accurate text transcript of the provided audio or video file, significantly improved by local contextual markdown covering specialized terminology.
Practical example
Example input
Transcribe the `quarterly_review.mp4` video. Use `product_names.md` and `team_members.md` for context to ensure proper naming of our new product 'Project Orion' and 'Sarah Chen'.
Example output
The transcript for `quarterly_review.mp4` is: '...our new product, Project Orion, saw significant traction this quarter. Sarah Chen highlighted the improvements in the Andromeda module's user interface during her presentation...'
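Under the hood, a request like this plausibly maps to a Whisper call whose decoding is primed with terms from the context files. A minimal sketch in bash, assuming the file names from the example; `--initial_prompt` is a real Whisper CLI flag, but the exact command the skill composes may differ:

```bash
# Prime Whisper's decoder with terms from the context files so names like
# "Project Orion" and "Sarah Chen" are rendered correctly.
CONTEXT="$(cat product_names.md team_members.md | tr '\n' ' ')"
whisper "quarterly_review.mp4" \
  --model base \
  --initial_prompt "$CONTEXT" \
  --output_format txt \
  --output_dir .
```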
When to use this skill
- User asks to transcribe an audio or video file.
- User wants to convert a recording to text, especially with domain-specific terms.
- User needs meeting notes or interview transcripts.
- User mentions "whisper," "speech to text," or specific media file types in relation to transcription.
When not to use this skill
- User needs real-time, live transcription.
- User requires complex audio editing or manipulation beyond transcription.
- User needs to transcribe in a language not well-supported by OpenAI Whisper.
- User is processing very short, simple phrases where advanced contextual grounding is unnecessary.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it at `.claude/skills/whisper-transcribe/SKILL.md` inside your project
- Restart your AI agent so it auto-discovers the skill
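In shell terms, the manual steps above amount to something like the sketch below; the repository URL is a placeholder, since the actual GitHub link is the one at the top of this page:

```bash
# Create the skill directory and fetch SKILL.md.
# The URL is a placeholder -- substitute the repository's raw GitHub link.
mkdir -p .claude/skills/whisper-transcribe
curl -fsSL -o .claude/skills/whisper-transcribe/SKILL.md \
  "https://raw.githubusercontent.com/OWNER/REPO/main/SKILL.md"
```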
How whisper-transcribe Compares
| Feature / Agent | whisper-transcribe | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Medium | N/A |
Frequently Asked Questions
What does this skill do?
Transcribes audio and video files to text using OpenAI's Whisper CLI, enhanced with contextual grounding from local markdown files for improved accuracy.
How difficult is it to install?
The installation complexity is rated as medium. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
Top AI Agents for Productivity
See the top AI agent skills for productivity, workflow automation, operational systems, documentation, and everyday task execution.
AI Agent for YouTube Script Writing
Find AI agent skills for YouTube script writing, video research, content outlining, and repeatable channel production workflows.
SKILL.md Source
# Whisper Transcribe Skill

Transcribe audio and video files to text using OpenAI's Whisper with contextual grounding from markdown files.

## Purpose

Intelligent audio/video transcription that:

1. Converts media files to accurate text transcripts
2. Uses markdown context files to correct technical terms, names, and jargon
3. Handles various audio/video formats (mp3, wav, m4a, mp4, webm, etc.)

## When to Use

- User asks to transcribe an audio or video file
- User wants to convert a recording to text
- User mentions "whisper" in context of transcription
- User needs meeting notes or interview transcripts
- User has media files with domain-specific terminology

## Installation

### macOS (Recommended for MacBook Pro)

```bash
# Install via Homebrew (recommended)
brew install ffmpeg openai-whisper

# Verify installation
whisper --version
```

### Linux/pip Installation

```bash
# Install ffmpeg first
sudo apt install ffmpeg   # Debian/Ubuntu
# or: sudo dnf install ffmpeg   # Fedora

# Install Whisper
pip install openai-whisper
```

### Verify Installation

```bash
whisper --version
ffmpeg -version
```

## Transcription Workflow

### Step 1: Identify Media File and Context

1. Locate the audio/video file to transcribe
2. Check for markdown files in the same directory (context files)
3. If no context files exist, optionally create one using `assets/context-template.md`

### Step 2: Run Whisper Transcription

Basic transcription:

```bash
whisper "/path/to/audio.mp3" --output_dir "/path/to/output"
```

With model selection (trade-off: speed vs accuracy):

```bash
# Fast (less accurate)
whisper "audio.mp3" --model tiny

# Balanced (recommended)
whisper "audio.mp3" --model base

# High quality
whisper "audio.mp3" --model small

# Best quality (slower, requires more RAM)
whisper "audio.mp3" --model medium
whisper "audio.mp3" --model large
```

With language specification:

```bash
whisper "audio.mp3" --language en
```

Output format options:

```bash
whisper "audio.mp3" --output_format txt   # Plain text
whisper "audio.mp3" --output_format srt   # Subtitles
whisper "audio.mp3" --output_format vtt   # Web subtitles
whisper "audio.mp3" --output_format json  # Detailed JSON
whisper "audio.mp3" --output_format all   # All formats
```

### Step 3: Apply Context Grounding

Use the `scripts/transcribe_with_context.py` script for automated grounding, or manually apply corrections:

```bash
# Automated approach (recommended)
python scripts/transcribe_with_context.py /path/to/audio.mp3
```

For manual grounding:

1. Read the transcript output
2. Read all `.md` files in the media file's directory
3. Extract terminology, names, and technical terms from context files
4. Search transcript for likely misrecognitions
5. Apply corrections based on context

**Common corrections:**

- "cooler net ease" -> "Kubernetes"
- "sequel" -> "SQL"
- "post gress" -> "Postgres"
- Names: Match phonetic variations to names in context files

### Step 4: Save Corrected Transcript

Save the grounded transcript with a clear filename:

```
original_filename_transcript.txt
original_filename_transcript.md
```

## Context Files

Context files are markdown files in the same directory as the media file. They provide grounding information to improve transcription accuracy.
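Besides the post-hoc corrections in Step 3, context can also be fed to Whisper at decode time through its `--initial_prompt` flag. A minimal sketch, assuming every markdown file next to the media is usable as priming text (Whisper keeps only roughly the last 224 tokens of the prompt, hence the byte truncation):

```bash
# Concatenate all context markdown next to the media file and pass it as the
# initial prompt, biasing Whisper's decoding toward those terms.
MEDIA="/path/to/audio.mp3"
DIR="$(dirname "$MEDIA")"
CONTEXT="$(cat "$DIR"/*.md 2>/dev/null | tr '\n' ' ' | tail -c 800)"
whisper "$MEDIA" --model base --initial_prompt "$CONTEXT" --output_format txt
```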
### What to Include in Context Files

- **People**: Names of speakers, team members, interviewees
- **Technical Terms**: Domain-specific vocabulary, product names
- **Acronyms**: Abbreviations and their expansions
- **Organizations**: Company names, department names
- **Projects**: Project codenames, feature names

### Context File Example

See `assets/context-template.md` for a complete template.

```markdown
# Meeting Context

## Speakers
- Richard Hightower (host)
- Jane Smith (engineering lead)

## Technical Terms
- Kubernetes (container orchestration)
- FastAPI (Python web framework)
- AlloyDB (Google Cloud database)

## Acronyms
- CI/CD - Continuous Integration/Continuous Deployment
- PR - Pull Request
```

## Model Selection Guide

Use `base` for general use, `medium` for important recordings. See `references/whisper-options.md` for the full model comparison and all available options.

**Quick reference:** `tiny` (fastest) < `base` (balanced) < `small` (better) < `medium` (high) < `large` (best accuracy)

For MacBook Pro with Apple Silicon: `small` or `medium` models are recommended for the best speed/accuracy balance.

## Troubleshooting

### "whisper: command not found"

```bash
# macOS
brew install openai-whisper

# Linux
pip install openai-whisper
export PATH="$HOME/.local/bin:$PATH"
```

### "ffmpeg not found"

```bash
# macOS
brew install ffmpeg

# Linux
sudo apt install ffmpeg
```

### Out of memory errors

Use a smaller model:

```bash
whisper "audio.mp3" --model tiny
```

### Slow transcription

- Use the `tiny` or `base` model for faster results
- Ensure the correct architecture is being used (Apple Silicon vs Intel)

## Resources

### scripts/

The `scripts/transcribe_with_context.py` script automates the full workflow:

- Finds context files automatically
- Runs Whisper transcription
- Applies context-based corrections
- Saves the final transcript

Usage:

```bash
python scripts/transcribe_with_context.py /path/to/audio.mp3
```

### references/

See `references/whisper-options.md` for the complete CLI reference and advanced options.

### assets/

The `assets/context-template.md` file provides a template for creating context files to improve transcription accuracy.
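As a rough illustration of the manual corrections in Step 3, the example substitutions can be applied with sed. The phrase pairs come straight from the Common corrections list; the filenames are placeholders, and real grounding should match phonetic variants more carefully than a blind global replace:

```bash
# Apply the example substitutions from "Common corrections" to a raw transcript.
# Naive global replacement -- it would also rewrite legitimate uses of "sequel".
sed -e 's/cooler net ease/Kubernetes/g' \
    -e 's/sequel/SQL/g' \
    -e 's/post gress/Postgres/g' \
    raw_transcript.txt > grounded_transcript.txt
```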
Related Skills
videodb
Watch, understand, and act on video and audio. Watch: ingest content from local files, URLs, RTSP/live sources, or real-time desktop recording; returns live context and playable stream links. Understand: extract frames, build visual/semantic/temporal indexes, and search for clips via timestamps and automatic excerpts. Act: transcode and normalize (codec, frame rate, resolution, aspect ratio), perform timeline edits (subtitles, text/image overlays, branding, audio overlays, dubbing, translation), generate media assets (image, audio, video), and create real-time alerts for events in live streams or desktop captures.
lets-go-rss
A lightweight, full-platform RSS subscription manager that aggregates content from YouTube, Vimeo, Behance, Twitter/X, and Chinese platforms like Bilibili, Weibo, and Douyin, featuring deduplication and AI smart classification.
thor-skills
An entry point and router for AI agents to manage various THOR-related cybersecurity tasks, including running scans, analyzing logs, troubleshooting, and maintenance.
ux
This AI agent skill provides comprehensive guidance for creating professional and insightful User Experience (UX) designs, covering user research, information architecture, interaction design, visual guidance, and usability evaluation. It aims to produce actionable, user-centered solutions that avoid generic AI aesthetics.
tech-blog
Generates comprehensive technical blog posts, offering detailed explanations of system internals, architecture, and implementation, either through source code analysis or document-driven research.
modal-deployment
Run Python code in the cloud with serverless containers, GPUs, and autoscaling using Modal. This skill enables agents to generate code for deploying ML models, running batch jobs, serving APIs, and scaling compute-intensive workloads.
vly-money
Generate crypto payment links for supported tokens and networks, manage access to X402 payment-protected content, and provide direct access to the vly.money wallet interface.
astro
This skill provides essential Astro framework patterns, focusing on server-side rendering (SSR), static site generation (SSG), middleware, and TypeScript best practices. It helps AI agents implement secure authentication, manage API routes, and debug rendering behaviors within Astro projects.
grail-miner
This skill assists in setting up, managing, and optimizing Grail miners on Bittensor Subnet 81, handling tasks like environment configuration, R2 storage, model checkpoint management, and performance tuning.
ontopo
An AI agent skill to search for Israeli restaurants, check table availability, view menus, and retrieve booking links via the Ontopo platform, acting as an unofficial interface to its data.
chrome-debug
This skill empowers AI agents to debug web applications and inspect browser behavior using the Chrome DevTools Protocol (CDP), offering both collaborative (headful) and automated (headless) modes.
advanced-skill-creator
Meta-skill that generates domain-specific skills using advanced reasoning techniques. PROACTIVELY activate for: (1) Create/build/make skills, (2) Generate expert panels for any domain, (3) Design evaluation frameworks, (4) Create research workflows, (5) Structure complex multi-step processes, (6) Instantiate templates with parameters. Triggers: "create a skill for", "build evaluation for", "design workflow for", "generate expert panel for", "how should I approach [complex task]", "create skill", "new skill for", "skill template", "generate skill"