YouTube Video Transcription
Transcribe YouTube videos to text using OpenAI Whisper and yt-dlp.
Best use case
YouTube Video Transcription is best used when you need a repeatable AI agent workflow instead of a one-off prompt. Teams using it should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/youtube-transcription/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
Frequently Asked Questions
What does this skill do?
Transcribe YouTube videos to text using OpenAI Whisper and yt-dlp.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# YouTube Video Transcription
Transcribe YouTube videos to text using OpenAI Whisper and yt-dlp.
## Overview
This skill downloads audio from YouTube videos using yt-dlp and transcribes it using OpenAI's Whisper model. Supports multiple output formats (txt, srt, vtt, json) and various model sizes for different accuracy/speed tradeoffs.
## Instructions
### 1. Install dependencies
```bash
# Install whisper and yt-dlp
pip install openai-whisper yt-dlp
# Verify ffmpeg is installed (required for audio processing)
ffmpeg -version
```
If ffmpeg is missing:
- macOS: `brew install ffmpeg`
- Ubuntu/Debian: `sudo apt install ffmpeg`
- Windows: Download from https://ffmpeg.org/download.html
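The dependency check above can also be done programmatically before kicking off a long job. A minimal sketch using only the standard library (the helper name `tool_available` is illustrative, not part of any package here):

```python
import shutil

def tool_available(name: str) -> bool:
    """Return True if an executable with this name is on PATH."""
    return shutil.which(name) is not None

if __name__ == "__main__":
    for tool in ("ffmpeg", "yt-dlp", "whisper"):
        status = "OK" if tool_available(tool) else "MISSING"
        print(f"{tool}: {status}")
```

Running this before step 2 surfaces a missing ffmpeg immediately instead of partway through a download.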
### 2. Download audio from YouTube
```bash
# Download best audio quality as WAV
yt-dlp -x --audio-format wav -o "%(title)s.%(ext)s" "YOUTUBE_URL"
# Download as MP3 (smaller file)
yt-dlp -x --audio-format mp3 -o "%(title)s.%(ext)s" "YOUTUBE_URL"
# Download with video ID as filename (safer for special characters)
yt-dlp -x --audio-format wav -o "%(id)s.%(ext)s" "YOUTUBE_URL"
```
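If you prefer to stay in Python, yt-dlp exposes the same functionality as a library. The sketch below mirrors the WAV download above; the option keys follow yt-dlp's documented embedding API, but verify them against the version you have installed (the helper `audio_download_opts` is illustrative):

```python
def audio_download_opts(codec: str = "wav", template: str = "%(id)s.%(ext)s") -> dict:
    """Build a yt-dlp options dict that extracts audio, mirroring `yt-dlp -x`."""
    return {
        "format": "bestaudio/best",
        "outtmpl": template,
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": codec,
        }],
    }

# Usage (requires network access and `pip install yt-dlp`):
# from yt_dlp import YoutubeDL
# with YoutubeDL(audio_download_opts("wav")) as ydl:
#     ydl.download(["YOUTUBE_URL"])
```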
### 3. Choose Whisper model
| Model | Parameters | VRAM | Relative Speed | Use Case |
|-------|------------|------|----------------|----------|
| tiny | 39M | ~1 GB | ~32x | Quick drafts, testing |
| base | 74M | ~1 GB | ~16x | Fast transcription |
| small | 244M | ~2 GB | ~6x | Good balance |
| medium | 769M | ~5 GB | ~2x | High accuracy |
| large | 1550M | ~10 GB | 1x | Best accuracy |
English-only models (`tiny.en`, `base.en`, `small.en`, `medium.en`) are faster for English content.
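The table can be encoded as a small helper that picks the most accurate model fitting a VRAM budget. The numbers are copied from the table above; the function itself is just an illustration:

```python
# (model, approx. VRAM in GB), ordered smallest to largest, from the table above
MODELS = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 10)]

def pick_model(vram_gb: float, english_only: bool = False) -> str:
    """Return the most accurate Whisper model that fits the given VRAM budget."""
    fitting = [name for name, need in MODELS if need <= vram_gb]
    if not fitting:
        raise ValueError(f"No model fits in {vram_gb} GB of VRAM")
    name = fitting[-1]
    # `large` has no English-only variant
    return f"{name}.en" if english_only and name != "large" else name
```

For example, a 6 GB GPU transcribing English content would get `medium.en`.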
### 4. Run transcription
**CLI approach:**
```bash
# Basic transcription (auto-detect language)
whisper audio.wav --model medium
# Specify language for better accuracy
whisper audio.wav --model medium --language en
# Output specific format
whisper audio.wav --model medium --output_format srt
# All formats at once
whisper audio.wav --model medium --output_format all
# Specify output directory
whisper audio.wav --model medium --output_dir ./transcripts
```
**Python approach:**
```python
import whisper
# Load model (downloads on first run)
model = whisper.load_model("medium")
# Transcribe
result = model.transcribe("audio.wav", language="en")
# Get plain text
print(result["text"])
# Get segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
```
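The segment dictionaries above are also all you need to build subtitle output yourself. This sketch converts them to SRT; it is plain timestamp math, so no Whisper install is required to use it:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render Whisper segments (dicts with start/end/text) as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

This duplicates what `--output_format srt` produces, but is handy when you want to post-process segments (merge, filter, re-time) before writing subtitles.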
### 5. One-liner pipeline
Combine download and transcription:
```bash
# Download and transcribe in one command
yt-dlp -x --audio-format wav -o "audio.wav" "YOUTUBE_URL" && whisper audio.wav --model medium --output_format all
```
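The same pipeline can be driven from Python with `subprocess`, which makes looping over many URLs easier. A sketch under the assumption that `yt-dlp` and `whisper` are on PATH (the helper names are illustrative):

```python
import subprocess

def pipeline_commands(url: str, model: str = "medium", audio_file: str = "audio.wav"):
    """Return the yt-dlp and whisper commands for one URL, as argv lists."""
    download = ["yt-dlp", "-x", "--audio-format", "wav", "-o", audio_file, url]
    transcribe = ["whisper", audio_file, "--model", model, "--output_format", "all"]
    return download, transcribe

def transcribe_url(url: str, model: str = "medium") -> None:
    """Download a video's audio, then transcribe it; stops on the first failure."""
    for cmd in pipeline_commands(url, model):
        subprocess.run(cmd, check=True)  # check=True raises if the step fails
```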
### 6. Alternative: yt-whisper tool
For a simpler workflow, use the dedicated yt-whisper package:
```bash
# Install
pip install git+https://github.com/m1guelpf/yt-whisper.git
# Transcribe directly from URL
yt_whisper "https://www.youtube.com/watch?v=VIDEO_ID"
# With options
yt_whisper "YOUTUBE_URL" --model medium --language en --output_format srt
```
## Output Formats
| Format | Extension | Description |
|--------|-----------|-------------|
| txt | .txt | Plain text transcript |
| srt | .srt | SubRip subtitle format (with timestamps) |
| vtt | .vtt | WebVTT subtitle format |
| tsv | .tsv | Tab-separated values |
| json | .json | Full data with word-level timestamps |
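When driving Whisper from the Python API you get the `result` dict rather than files, so writing these formats is up to you. A minimal sketch for the txt and json rows of the table (the function name and the hand-made `result` shape are illustrative):

```python
import json
from pathlib import Path

def save_txt_and_json(result: dict, stem: str, out_dir: str = ".") -> None:
    """Write result['text'] to <stem>.txt and the full dict to <stem>.json."""
    out = Path(out_dir)
    (out / f"{stem}.txt").write_text(result["text"].strip() + "\n", encoding="utf-8")
    (out / f"{stem}.json").write_text(json.dumps(result, indent=2), encoding="utf-8")
```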
## Examples
<example>
User: Transcribe this YouTube video to text
Steps:
1. yt-dlp -x --audio-format wav -o "video.wav" "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
2. whisper video.wav --model medium --language en --output_format txt
Output: video.txt with full transcript
</example>
<example>
User: Generate SRT subtitles for a YouTube lecture
Steps:
1. yt-dlp -x --audio-format wav -o "lecture.wav" "https://www.youtube.com/watch?v=LECTURE_ID"
2. whisper lecture.wav --model medium --output_format srt
Output: lecture.srt with timestamped subtitles
</example>
<example>
User: Transcribe a Spanish YouTube video
Steps:
1. yt-dlp -x --audio-format wav -o "spanish.wav" "https://www.youtube.com/watch?v=VIDEO_ID"
2. whisper spanish.wav --model medium --language es --output_format all
Output: spanish.txt, spanish.srt, spanish.vtt, spanish.json
</example>
<example>
User: Quick transcription of a short video (speed over accuracy)
Command: yt-dlp -x --audio-format mp3 -o "quick.mp3" "URL" && whisper quick.mp3 --model tiny.en
</example>
<example>
User: Get transcript with timestamps in Python
```python
import whisper
model = whisper.load_model("medium")
result = model.transcribe("audio.wav")
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s] {seg['text']}")
```
</example>
## Guidelines
- Use `--language` flag when you know the spoken language for significantly better accuracy
- For long videos (>1 hour), use `small` or `medium` model to balance speed and accuracy
- English-only models (`.en` suffix) are faster and more accurate for English content
- GPU with CUDA dramatically speeds up transcription; CPU works but is 5-10x slower
- If transcription fails, ensure ffmpeg is properly installed and in PATH
- For videos with background music, larger models (medium/large) handle it better
- Clean up audio files after transcription to save disk space
- Use `--output_format all` to get every format at once, then choose what you need

Related Skills
youtube-music
Search and play music tracks on YouTube Music through MCP integration. Use when user wants to search for songs, play music, or discover tracks on YouTube Music platform.
capy-video-gen-skill
Multi-shot AI video generation pipeline with face identity consistency. Converts scripts or ideas into complete videos using character extraction, storyboarding, frame generation, and video assembly. 300 experiments validated, 70% face distance improvement. Use when the user asks to create a video from a script, story, idea, or wants multi-shot video with consistent characters.
video-comparer
This skill should be used when comparing two videos to analyze compression results or quality differences. Generates interactive HTML reports with quality metrics (PSNR, SSIM) and frame-by-frame visual comparisons. Triggers when users mention "compare videos", "video quality", "compression analysis", "before/after compression", or request quality assessment of compressed videos.
video-enhancement
AI Video Enhancement - Upscale video resolution, improve quality, denoise, sharpen, enhance low-quality videos to HD/4K. Supports local video files, remote URLs (YouTube, Bilibili), auto-download, real-time progress tracking.
youtube-summarizer
Extract transcripts from YouTube videos and generate comprehensive, detailed summaries using intelligent analysis frameworks
youtube-automation
Automate YouTube tasks via Rube MCP (Composio): upload videos, manage playlists, search content, get analytics, and handle comments. Always search tools first for current schemas.
azure-ai-transcription-py
Azure AI Transcription SDK for Python. Use for real-time and batch speech-to-text transcription with timestamps and diarization. Triggers: "transcription", "speech to text", "Azure AI Transcription", "TranscriptionClient".
ai-avatar-video
Create AI avatar and talking head videos with OmniHuman, Fabric, PixVerse via inference.sh CLI. Models: OmniHuman 1.5, OmniHuman 1.0, Fabric 1.0, PixVerse Lipsync. Capabilities: audio-driven avatars, lipsync videos, talking head generation, virtual presenters. Use for: AI presenters, explainer videos, virtual influencers, dubbing, marketing videos. Triggers: ai avatar, talking head, lipsync, avatar video, virtual presenter, ai spokesperson, audio driven video, heygen alternative, synthesia alternative, talking avatar, lip sync, video avatar, ai presenter, digital human
video-prompting-guide
Best practices and techniques for writing effective AI video generation prompts. Covers: Veo, Seedance, Wan, Grok, Kling, Runway, Pika, Sora prompting strategies. Learn: shot types, camera movements, lighting, pacing, style keywords, negative prompts. Use for: improving video quality, getting consistent results, professional video prompts. Triggers: video prompt, how to prompt video, veo prompts, video generation tips, better ai video, video prompt engineering, video prompt guide, video prompt template, ai video tips, video prompt best practices, video prompt examples, cinematography prompts
image-to-video
Still-to-video conversion guide: model selection, motion prompting, and camera movement. Covers Wan 2.5 i2v, Seedance, Fabric, Grok Video with when to use each. Use for: animating images, creating video from stills, adding motion, product animations. Triggers: image to video, i2v, animate image, still to video, add motion to image, image animation, photo to video, animate still, wan i2v, image2video, bring image to life, animate photo, motion from image
ai-marketing-videos
Create AI marketing videos for ads, promos, product launches, and brand content. Models: Veo, Seedance, Wan, FLUX for visuals, Kokoro for voiceover. Types: product demos, testimonials, explainers, social ads, brand videos. Use for: Facebook ads, YouTube ads, product launches, brand awareness. Triggers: marketing video, ad video, promo video, commercial, brand video, product video, explainer video, ad creative, video ad, facebook ad video, youtube ad, instagram ad, tiktok ad, promotional video, launch video
p-video
Generate videos with Pruna P-Video and WAN models via inference.sh CLI. Models: P-Video, WAN-T2V, WAN-I2V. Capabilities: text-to-video, image-to-video, audio support, 720p/1080p, fast inference. Pruna optimizes models for speed without quality loss. Triggers: pruna video, p-video, pruna ai video, fast video generation, optimized video, wan t2v, wan i2v, economic video generation, cheap video generation, pruna text to video, pruna image to video