YouTube Video Transcription
Transcribe YouTube videos to text using OpenAI Whisper and yt-dlp.
Best use case
YouTube Video Transcription is best used when you need a repeatable AI agent workflow instead of a one-off prompt. Teams using it should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/youtube-transcription/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
Frequently Asked Questions
What does this skill do?
Transcribe YouTube videos to text using OpenAI Whisper and yt-dlp.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# YouTube Video Transcription
Transcribe YouTube videos to text using OpenAI Whisper and yt-dlp.
## Overview
This skill downloads audio from YouTube videos using yt-dlp and transcribes it using OpenAI's Whisper model. Supports multiple output formats (txt, srt, vtt, json) and various model sizes for different accuracy/speed tradeoffs.
## Instructions
### 1. Install dependencies
```bash
# Install whisper and yt-dlp
pip install openai-whisper yt-dlp
# Verify ffmpeg is installed (required for audio processing)
ffmpeg -version
```
If ffmpeg is missing:
- macOS: `brew install ffmpeg`
- Ubuntu/Debian: `sudo apt install ffmpeg`
- Windows: Download from https://ffmpeg.org/download.html
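The dependency check above can also be done programmatically before kicking off a long job. A minimal sketch using only the standard library (the helper name `tool_available` is illustrative, not part of any package here):

```python
import shutil

def tool_available(name: str) -> bool:
    """Return True if an executable with this name is on PATH."""
    return shutil.which(name) is not None

if __name__ == "__main__":
    for tool in ("ffmpeg", "yt-dlp", "whisper"):
        status = "OK" if tool_available(tool) else "MISSING"
        print(f"{tool}: {status}")
```

Running this before step 2 surfaces a missing ffmpeg immediately instead of partway through a download.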
### 2. Download audio from YouTube
```bash
# Download best audio quality as WAV
yt-dlp -x --audio-format wav -o "%(title)s.%(ext)s" "YOUTUBE_URL"
# Download as MP3 (smaller file)
yt-dlp -x --audio-format mp3 -o "%(title)s.%(ext)s" "YOUTUBE_URL"
# Download with video ID as filename (safer for special characters)
yt-dlp -x --audio-format wav -o "%(id)s.%(ext)s" "YOUTUBE_URL"
```
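If you prefer to stay in Python, yt-dlp exposes the same functionality as a library. The sketch below mirrors the WAV download above; the option keys follow yt-dlp's documented embedding API, but verify them against the version you have installed (the helper `audio_download_opts` is illustrative):

```python
def audio_download_opts(codec: str = "wav", template: str = "%(id)s.%(ext)s") -> dict:
    """Build a yt-dlp options dict that extracts audio, mirroring `yt-dlp -x`."""
    return {
        "format": "bestaudio/best",
        "outtmpl": template,
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": codec,
        }],
    }

# Usage (requires network access and `pip install yt-dlp`):
# from yt_dlp import YoutubeDL
# with YoutubeDL(audio_download_opts("wav")) as ydl:
#     ydl.download(["YOUTUBE_URL"])
```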
### 3. Choose Whisper model
| Model | Parameters | VRAM | Relative Speed | Use Case |
|-------|------------|------|----------------|----------|
| tiny | 39M | ~1 GB | ~32x | Quick drafts, testing |
| base | 74M | ~1 GB | ~16x | Fast transcription |
| small | 244M | ~2 GB | ~6x | Good balance |
| medium | 769M | ~5 GB | ~2x | High accuracy |
| large | 1550M | ~10 GB | 1x | Best accuracy |
English-only models (`tiny.en`, `base.en`, `small.en`, `medium.en`) are faster for English content.
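The table can be encoded as a small helper that picks the most accurate model fitting a VRAM budget. The numbers are copied from the table above; the function itself is just an illustration:

```python
# (model, approx. VRAM in GB), ordered smallest to largest, from the table above
MODELS = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 10)]

def pick_model(vram_gb: float, english_only: bool = False) -> str:
    """Return the most accurate Whisper model that fits the given VRAM budget."""
    fitting = [name for name, need in MODELS if need <= vram_gb]
    if not fitting:
        raise ValueError(f"No model fits in {vram_gb} GB of VRAM")
    name = fitting[-1]
    # `large` has no English-only variant
    return f"{name}.en" if english_only and name != "large" else name
```

For example, a 6 GB GPU transcribing English content would get `medium.en`.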
### 4. Run transcription
**CLI approach:**
```bash
# Basic transcription (auto-detect language)
whisper audio.wav --model medium
# Specify language for better accuracy
whisper audio.wav --model medium --language en
# Output specific format
whisper audio.wav --model medium --output_format srt
# All formats at once
whisper audio.wav --model medium --output_format all
# Specify output directory
whisper audio.wav --model medium --output_dir ./transcripts
```
**Python approach:**
```python
import whisper
# Load model (downloads on first run)
model = whisper.load_model("medium")
# Transcribe
result = model.transcribe("audio.wav", language="en")
# Get plain text
print(result["text"])
# Get segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
```
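The segment dictionaries above are also all you need to build subtitle output yourself. This sketch converts them to SRT; it is plain timestamp math, so no Whisper install is required to use it:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render Whisper segments (dicts with start/end/text) as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

This duplicates what `--output_format srt` produces, but is handy when you want to post-process segments (merge, filter, re-time) before writing subtitles.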
### 5. One-liner pipeline
Combine download and transcription:
```bash
# Download and transcribe in one command
yt-dlp -x --audio-format wav -o "audio.wav" "YOUTUBE_URL" && whisper audio.wav --model medium --output_format all
```
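The same pipeline can be driven from Python with `subprocess`, which makes looping over many URLs easier. A sketch under the assumption that `yt-dlp` and `whisper` are on PATH (the helper names are illustrative):

```python
import subprocess

def pipeline_commands(url: str, model: str = "medium", audio_file: str = "audio.wav"):
    """Return the yt-dlp and whisper commands for one URL, as argv lists."""
    download = ["yt-dlp", "-x", "--audio-format", "wav", "-o", audio_file, url]
    transcribe = ["whisper", audio_file, "--model", model, "--output_format", "all"]
    return download, transcribe

def transcribe_url(url: str, model: str = "medium") -> None:
    """Download a video's audio, then transcribe it; stops on the first failure."""
    for cmd in pipeline_commands(url, model):
        subprocess.run(cmd, check=True)  # check=True raises if the step fails
```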
### 6. Alternative: yt-whisper tool
For a simpler workflow, use the dedicated yt-whisper package:
```bash
# Install
pip install git+https://github.com/m1guelpf/yt-whisper.git
# Transcribe directly from URL
yt_whisper "https://www.youtube.com/watch?v=VIDEO_ID"
# With options
yt_whisper "YOUTUBE_URL" --model medium --language en --output_format srt
```
## Output Formats
| Format | Extension | Description |
|--------|-----------|-------------|
| txt | .txt | Plain text transcript |
| srt | .srt | SubRip subtitle format (with timestamps) |
| vtt | .vtt | WebVTT subtitle format |
| tsv | .tsv | Tab-separated values |
| json | .json | Full data with word-level timestamps |
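When driving Whisper from the Python API you get the `result` dict rather than files, so writing these formats is up to you. A minimal sketch for the txt and json rows of the table (the function name and the hand-made `result` shape are illustrative):

```python
import json
from pathlib import Path

def save_txt_and_json(result: dict, stem: str, out_dir: str = ".") -> None:
    """Write result['text'] to <stem>.txt and the full dict to <stem>.json."""
    out = Path(out_dir)
    (out / f"{stem}.txt").write_text(result["text"].strip() + "\n", encoding="utf-8")
    (out / f"{stem}.json").write_text(json.dumps(result, indent=2), encoding="utf-8")
```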
## Examples
<example>
User: Transcribe this YouTube video to text
Steps:
1. yt-dlp -x --audio-format wav -o "video.wav" "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
2. whisper video.wav --model medium --language en --output_format txt
Output: video.txt with full transcript
</example>
<example>
User: Generate SRT subtitles for a YouTube lecture
Steps:
1. yt-dlp -x --audio-format wav -o "lecture.wav" "https://www.youtube.com/watch?v=LECTURE_ID"
2. whisper lecture.wav --model medium --output_format srt
Output: lecture.srt with timestamped subtitles
</example>
<example>
User: Transcribe a Spanish YouTube video
Steps:
1. yt-dlp -x --audio-format wav -o "spanish.wav" "https://www.youtube.com/watch?v=VIDEO_ID"
2. whisper spanish.wav --model medium --language es --output_format all
Output: spanish.txt, spanish.srt, spanish.vtt, spanish.json
</example>
<example>
User: Quick transcription of a short video (speed over accuracy)
Command: yt-dlp -x --audio-format mp3 -o "quick.mp3" "URL" && whisper quick.mp3 --model tiny.en
</example>
<example>
User: Get transcript with timestamps in Python
```python
import whisper
model = whisper.load_model("medium")
result = model.transcribe("audio.wav")
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s] {seg['text']}")
```
</example>
## Guidelines
- Use `--language` flag when you know the spoken language for significantly better accuracy
- For long videos (>1 hour), use `small` or `medium` model to balance speed and accuracy
- English-only models (`.en` suffix) are faster and more accurate for English content
- GPU with CUDA dramatically speeds up transcription; CPU works but is 5-10x slower
- If transcription fails, ensure ffmpeg is properly installed and in PATH
- For videos with background music, larger models (medium/large) handle it better
- Clean up audio files after transcription to save disk space
- Use `--output_format all` to get every format at once, then choose what you need

Related Skills
youtube-music
Search and play music tracks on YouTube Music through MCP integration. Use when user wants to search for songs, play music, or discover tracks on YouTube Music platform.
capy-video-gen-skill
Multi-shot AI video generation pipeline with face identity consistency. Converts scripts or ideas into complete videos using character extraction, storyboarding, frame generation, and video assembly. 300 experiments validated, 70% face distance improvement. Use when the user asks to create a video from a script, story, idea, or wants multi-shot video with consistent characters.
video-comparer
This skill should be used when comparing two videos to analyze compression results or quality differences. Generates interactive HTML reports with quality metrics (PSNR, SSIM) and frame-by-frame visual comparisons. Triggers when users mention "compare videos", "video quality", "compression analysis", "before/after compression", or request quality assessment of compressed videos.
video-enhancement
AI Video Enhancement - Upscale video resolution, improve quality, denoise, sharpen, enhance low-quality videos to HD/4K. Supports local video files, remote URLs (YouTube, Bilibili), auto-download, real-time progress tracking.
youtube-summarizer
Extract transcripts from YouTube videos and generate comprehensive, detailed summaries using intelligent analysis frameworks
youtube-automation
Automate YouTube tasks via Rube MCP (Composio): upload videos, manage playlists, search content, get analytics, and handle comments. Always search tools first for current schemas.
azure-ai-transcription-py
Azure AI Transcription SDK for Python. Use for real-time and batch speech-to-text transcription with timestamps and diarization. Triggers: "transcription", "speech to text", "Azure AI Transcription", "TranscriptionClient".
ai-avatar-video
Create AI avatar and talking head videos with OmniHuman, Fabric, PixVerse via inference.sh CLI. Models: OmniHuman 1.5, OmniHuman 1.0, Fabric 1.0, PixVerse Lipsync. Capabilities: audio-driven avatars, lipsync videos, talking head generation, virtual presenters. Use for: AI presenters, explainer videos, virtual influencers, dubbing, marketing videos. Triggers: ai avatar, talking head, lipsync, avatar video, virtual presenter, ai spokesperson, audio driven video, heygen alternative, synthesia alternative, talking avatar, lip sync, video avatar, ai presenter, digital human
video-prompting-guide
Best practices and techniques for writing effective AI video generation prompts. Covers: Veo, Seedance, Wan, Grok, Kling, Runway, Pika, Sora prompting strategies. Learn: shot types, camera movements, lighting, pacing, style keywords, negative prompts. Use for: improving video quality, getting consistent results, professional video prompts. Triggers: video prompt, how to prompt video, veo prompts, video generation tips, better ai video, video prompt engineering, video prompt guide, video prompt template, ai video tips, video prompt best practices, video prompt examples, cinematography prompts
image-to-video
Still-to-video conversion guide: model selection, motion prompting, and camera movement. Covers Wan 2.5 i2v, Seedance, Fabric, Grok Video with when to use each. Use for: animating images, creating video from stills, adding motion, product animations. Triggers: image to video, i2v, animate image, still to video, add motion to image, image animation, photo to video, animate still, wan i2v, image2video, bring image to life, animate photo, motion from image
ai-marketing-videos
Create AI marketing videos for ads, promos, product launches, and brand content. Models: Veo, Seedance, Wan, FLUX for visuals, Kokoro for voiceover. Types: product demos, testimonials, explainers, social ads, brand videos. Use for: Facebook ads, YouTube ads, product launches, brand awareness. Triggers: marketing video, ad video, promo video, commercial, brand video, product video, explainer video, ad creative, video ad, facebook ad video, youtube ad, instagram ad, tiktok ad, promotional video, launch video
p-video
Generate videos with Pruna P-Video and WAN models via inference.sh CLI. Models: P-Video, WAN-T2V, WAN-I2V. Capabilities: text-to-video, image-to-video, audio support, 720p/1080p, fast inference. Pruna optimizes models for speed without quality loss. Triggers: pruna video, p-video, pruna ai video, fast video generation, optimized video, wan t2v, wan i2v, economic video generation, cheap video generation, pruna text to video, pruna image to video