whisper-transcribe

Transcribes audio and video files to text using OpenAI's Whisper CLI, enhanced with contextual grounding from local markdown files for improved accuracy.

159 stars
Complexity: medium

About this skill

This skill lets AI agents transcribe audio and video files to text through the OpenAI Whisper command-line interface. Its distinguishing feature is "contextual grounding": markdown files located in the same directory as the media guide Whisper toward the correct rendering of technical terms, proper nouns, and domain-specific vocabulary that would otherwise be misrecognized. Typical tasks range from converting meeting and interview recordings into precise transcripts to processing media full of specialized jargon, which makes the skill useful wherever documentation fidelity matters, such as scientific research, legal proceedings, or technical discussions. The skill manages the necessary installations and passes the appropriate media and context files to the underlying Whisper CLI to produce reliable text output.

Best use case

The primary use case is generating highly accurate text transcripts from audio and video, particularly when the content includes specialized terminology, unique names, or complex jargon. This skill is most beneficial for professionals like researchers, journalists, content creators, and corporate teams who frequently work with interviews, meetings, lectures, or other media containing domain-specific language, where standard transcription might falter.


Output: a highly accurate text transcript of the provided audio or video file, significantly improved by local contextual markdown for specialized terminology.

Practical example

Example input

Transcribe the `quarterly_review.mp4` video. Use `product_names.md` and `team_members.md` for context to ensure proper naming of our new product 'Project Orion' and 'Sarah Chen'.

Example output

The transcript for `quarterly_review.mp4` is: '...our new product, Project Orion, saw significant traction this quarter. Sarah Chen highlighted the improvements in the Andromeda module's user interface during her presentation...'

When to use this skill

  • User asks to transcribe an audio or video file.
  • User wants to convert a recording to text, especially with domain-specific terms.
  • User needs meeting notes or interview transcripts.
  • User mentions "whisper," "speech to text," or specific media file types in relation to transcription.

When not to use this skill

  • User needs real-time, live transcription.
  • User requires complex audio editing or manipulation beyond transcription.
  • User needs to transcribe in a language not well-supported by OpenAI Whisper.
  • User is processing very short, simple phrases where advanced contextual grounding is unnecessary.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/whisper-transcribe/SKILL.md --create-dirs "https://raw.githubusercontent.com/majiayu000/claude-skill-registry/main/skills/data/whisper-transcribe/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/whisper-transcribe/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How whisper-transcribe Compares

Feature / Agent           whisper-transcribe   Standard Approach
Platform Support          Not specified        Limited / Varies
Context Awareness         High                 Baseline
Installation Complexity   Medium               N/A

Frequently Asked Questions

What does this skill do?

Transcribes audio and video files to text using OpenAI's Whisper CLI, enhanced with contextual grounding from local markdown files for improved accuracy.

How difficult is it to install?

The installation complexity is rated as medium. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Whisper Transcribe Skill

Transcribe audio and video files to text using OpenAI's Whisper with contextual grounding from markdown files.

## Purpose

Intelligent audio/video transcription that:
1. Converts media files to accurate text transcripts
2. Uses markdown context files to correct technical terms, names, and jargon
3. Handles various audio/video formats (mp3, wav, m4a, mp4, webm, etc.)

## When to Use

- User asks to transcribe an audio or video file
- User wants to convert a recording to text
- User mentions "whisper" in context of transcription
- User needs meeting notes or interview transcripts
- User has media files with domain-specific terminology

## Installation

### macOS (Recommended for MacBook Pro)

```bash
# Install via Homebrew (recommended)
brew install ffmpeg openai-whisper

# Verify installation
whisper --version
```

### Linux/pip Installation

```bash
# Install ffmpeg first
sudo apt install ffmpeg  # Debian/Ubuntu
# or: sudo dnf install ffmpeg  # Fedora

# Install Whisper
pip install openai-whisper
```

### Verify Installation

```bash
whisper --version
ffmpeg -version
```

## Transcription Workflow

### Step 1: Identify Media File and Context

1. Locate the audio/video file to transcribe
2. Check for markdown files in the same directory (context files)
3. If no context files exist, optionally create one using `assets/context-template.md`
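The context-file discovery in steps 1–2 can be sketched in Python. `find_context_files` is a hypothetical helper for illustration, not part of the skill's bundled scripts:

```python
from pathlib import Path

def find_context_files(media_path: str) -> list[Path]:
    """Return the markdown files sitting alongside the media file.

    These .md files serve as context sources for grounding the transcript.
    """
    media = Path(media_path)
    return sorted(media.parent.glob("*.md"))
```

An empty result means step 3 applies: consider creating a context file from `assets/context-template.md`.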

### Step 2: Run Whisper Transcription

Basic transcription:
```bash
whisper "/path/to/audio.mp3" --output_dir "/path/to/output"
```

With model selection (trade-off: speed vs accuracy):
```bash
# Fast (less accurate)
whisper "audio.mp3" --model tiny

# Balanced (recommended)
whisper "audio.mp3" --model base

# High quality
whisper "audio.mp3" --model small

# Best quality (slower, requires more RAM)
whisper "audio.mp3" --model medium
whisper "audio.mp3" --model large
```

With language specification:
```bash
whisper "audio.mp3" --language en
```

Output format options:
```bash
whisper "audio.mp3" --output_format txt    # Plain text
whisper "audio.mp3" --output_format srt    # Subtitles
whisper "audio.mp3" --output_format vtt    # Web subtitles
whisper "audio.mp3" --output_format json   # Detailed JSON
whisper "audio.mp3" --output_format all    # All formats
```

### Step 3: Apply Context Grounding

Use the `scripts/transcribe_with_context.py` script for automated grounding, or manually apply corrections:

```bash
# Automated approach (recommended)
python scripts/transcribe_with_context.py /path/to/audio.mp3
```

For manual grounding:
1. Read the transcript output
2. Read all `.md` files in the media file's directory
3. Extract terminology, names, and technical terms from context files
4. Search transcript for likely misrecognitions
5. Apply corrections based on context

**Common corrections:**
- "cooler net ease" -> "Kubernetes"
- "sequel" -> "SQL"
- "post gress" -> "Postgres"
- Names: Match phonetic variations to names in context files
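The manual correction pass above can be sketched as a small substitution step. The correction map here is illustrative only; in practice the pairs would be derived from terms found in the context markdown files:

```python
import re

# Illustrative map of common Whisper misrecognitions -> canonical terms.
CORRECTIONS = {
    "cooler net ease": "Kubernetes",
    "sequel": "SQL",
    "post gress": "Postgres",
}

def apply_corrections(transcript: str, corrections: dict[str, str]) -> str:
    """Replace likely misrecognitions, case-insensitively.

    Word boundaries keep substrings intact (e.g. "sequels" is untouched).
    """
    for wrong, right in corrections.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        transcript = pattern.sub(right, transcript)
    return transcript
```

Phonetic name matching (the last bullet) is fuzzier and is best left to the agent reading both the transcript and the context files.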

### Step 4: Save Corrected Transcript

Save the grounded transcript with a clear filename:
```
original_filename_transcript.txt
original_filename_transcript.md
```

## Context Files

Context files are markdown files in the same directory as the media file. They provide grounding information to improve transcription accuracy.

### What to Include in Context Files

- **People**: Names of speakers, team members, interviewees
- **Technical Terms**: Domain-specific vocabulary, product names
- **Acronyms**: Abbreviations and their expansions
- **Organizations**: Company names, department names
- **Projects**: Project codenames, feature names

### Context File Example

See `assets/context-template.md` for a complete template.

```markdown
# Meeting Context

## Speakers
- Richard Hightower (host)
- Jane Smith (engineering lead)

## Technical Terms
- Kubernetes (container orchestration)
- FastAPI (Python web framework)
- AlloyDB (Google Cloud database)

## Acronyms
- CI/CD - Continuous Integration/Continuous Deployment
- PR - Pull Request
```

## Model Selection Guide

Use `base` for general use, `medium` for important recordings. See `references/whisper-options.md` for full model comparison and all available options.

**Quick reference:** `tiny` (fastest) < `base` (balanced) < `small` (better) < `medium` (high) < `large` (best accuracy)

For MacBook Pro with Apple Silicon: `small` or `medium` models recommended for best speed/accuracy balance.

## Troubleshooting

### "whisper: command not found"
```bash
# macOS
brew install openai-whisper

# Linux
pip install openai-whisper
export PATH="$HOME/.local/bin:$PATH"
```

### "ffmpeg not found"
```bash
# macOS
brew install ffmpeg

# Linux
sudo apt install ffmpeg
```

### Out of memory errors
Use a smaller model:
```bash
whisper "audio.mp3" --model tiny
```

### Slow transcription
- Use `tiny` or `base` model for faster results
- Ensure correct architecture is being used (Apple Silicon vs Intel)

## Resources

### scripts/

The `scripts/transcribe_with_context.py` script automates the full workflow:
- Finds context files automatically
- Runs Whisper transcription
- Applies context-based corrections
- Saves the final transcript

Usage:
```bash
python scripts/transcribe_with_context.py /path/to/audio.mp3
```
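For reference, a minimal sketch of what such an automation script might do: build the Whisper invocation, run it, and return the transcript path for the grounding pass. The function names are hypothetical, and only the CLI flags documented above are used:

```python
import subprocess
from pathlib import Path

def build_whisper_command(media: Path, output_dir: Path, model: str = "base") -> list[str]:
    """Assemble the Whisper CLI invocation (plain-text output only)."""
    return [
        "whisper", str(media),
        "--model", model,
        "--output_dir", str(output_dir),
        "--output_format", "txt",
    ]

def transcribe_media(media_path: str, model: str = "base") -> Path:
    """Run Whisper and return the raw transcript path, ready for grounding."""
    media = Path(media_path)
    out_dir = media.parent
    subprocess.run(build_whisper_command(media, out_dir, model), check=True)
    # Whisper names the transcript after the media file's stem.
    return out_dir / f"{media.stem}.txt"
```

The real script additionally discovers context files and applies corrections before saving the final transcript.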

### references/

See `references/whisper-options.md` for complete CLI reference and advanced options.

### assets/

The `assets/context-template.md` provides a template for creating context files to improve transcription accuracy.

Related Skills

videodb

144923
from affaan-m/everything-claude-code

View, understand, and act on video and audio. View: fetch content from local files, URLs, RTSP/live sources, or by recording the desktop in real time; returns live context and playable stream links. Understand: extract frames, build visual/semantic/temporal indexes, and search for clips by timestamp and auto-generated cuts. Act: transcode and normalize (codec, frame rate, resolution, aspect ratio), perform timeline edits (subtitles, text/image overlays, branding, audio overlays, dubbing, translation), generate media assets (images, audio, video), and create real-time alerts for events in live streams or desktop captures.

Media Processing · Claude

grail-miner

159
from majiayu000/claude-skill-registry

This skill assists in setting up, managing, and optimizing Grail miners on Bittensor Subnet 81, handling tasks like environment configuration, R2 storage, model checkpoint management, and performance tuning.

DevOps & Infrastructure

ux

159
from majiayu000/claude-skill-registry

This AI agent skill provides comprehensive guidance for creating professional and insightful User Experience (UX) designs, covering user research, information architecture, interaction design, visual guidance, and usability evaluation. It aims to produce actionable, user-centered solutions that avoid generic AI aesthetics.

UX Design & Strategy · Claude

astro

159
from majiayu000/claude-skill-registry

This skill provides essential Astro framework patterns, focusing on server-side rendering (SSR), static site generation (SSG), middleware, and TypeScript best practices. It helps AI agents implement secure authentication, manage API routes, and debug rendering behaviors within Astro projects.

Coding & Development

modal-deployment

159
from majiayu000/claude-skill-registry

Run Python code in the cloud with serverless containers, GPUs, and autoscaling using Modal. This skill enables agents to generate code for deploying ML models, running batch jobs, serving APIs, and scaling compute-intensive workloads.

DevOps & Infrastructure

thor-skills

159
from majiayu000/claude-skill-registry

An entry point and router for AI agents to manage various THOR-related cybersecurity tasks, including running scans, analyzing logs, troubleshooting, and maintenance.

Security · Claude

tech-blog

159
from majiayu000/claude-skill-registry

Generates comprehensive technical blog posts, offering detailed explanations of system internals, architecture, and implementation, either through source code analysis or document-driven research.

Content & Documentation · Claude

chrome-debug

159
from majiayu000/claude-skill-registry

This skill empowers AI agents to debug web applications and inspect browser behavior using the Chrome DevTools Protocol (CDP), offering both collaborative (headful) and automated (headless) modes.

Coding & Development · Claude

ontopo

159
from majiayu000/claude-skill-registry

An AI agent skill to search for Israeli restaurants, check table availability, view menus, and retrieve booking links via the Ontopo platform, acting as an unofficial interface to its data.

General Utilities

vly-money

159
from majiayu000/claude-skill-registry

Generate crypto payment links for supported tokens and networks, manage access to X402 payment-protected content, and provide direct access to the vly.money wallet interface.

Fintech & Crypto · Claude

lets-go-rss

159
from majiayu000/claude-skill-registry

A lightweight, full-platform RSS subscription manager that aggregates content from YouTube, Vimeo, Behance, Twitter/X, and Chinese platforms like Bilibili, Weibo, and Douyin, featuring deduplication and AI smart classification.

Content & Documentation

advanced-skill-creator

181
from majiayu000/claude-skill-registry

Meta-skill that generates domain-specific skills using advanced reasoning techniques. PROACTIVELY activate for: (1) Create/build/make skills, (2) Generate expert panels for any domain, (3) Design evaluation frameworks, (4) Create research workflows, (5) Structure complex multi-step processes, (6) Instantiate templates with parameters. Triggers: "create a skill for", "build evaluation for", "design workflow for", "generate expert panel for", "how should I approach [complex task]", "create skill", "new skill for", "skill template", "generate skill"