transcription

Audio/video transcription using OpenAI Whisper. Covers installation, model selection, transcript formats (SRT, VTT, JSON), timing synchronization, and speaker diarization. Use when transcribing media or generating subtitles.

248 stars

Best use case

transcription is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Audio/video transcription using OpenAI Whisper. Covers installation, model selection, transcript formats (SRT, VTT, JSON), timing synchronization, and speaker diarization. Use when transcribing media or generating subtitles.

Teams using transcription should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/transcription/SKILL.md --create-dirs "https://raw.githubusercontent.com/MadAppGang/claude-code/main/plugins/video-editing/skills/transcription/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/transcription/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How transcription Compares

Feature / AgenttranscriptionStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Audio/video transcription using OpenAI Whisper. Covers installation, model selection, transcript formats (SRT, VTT, JSON), timing synchronization, and speaker diarization. Use when transcribing media or generating subtitles.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

plugin: video-editing
updated: 2026-01-20

# Transcription with Whisper

Production-ready patterns for audio/video transcription using OpenAI Whisper.

## System Requirements

### Installation Options

**Option 1: OpenAI Whisper (Python)**
```bash
# macOS/Linux/Windows
pip install openai-whisper

# Verify
whisper --help
```

**Option 2: whisper.cpp (C++ - faster)**
```bash
# macOS
brew install whisper-cpp

# Linux - build from source
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make

# Windows - use pre-built binaries or build with cmake
```

**Option 3: Insanely Fast Whisper (GPU accelerated)**
```bash
pip install insanely-fast-whisper
```

### Model Selection

| Model | Size | VRAM | Accuracy | Speed | Use Case |
|-------|------|------|----------|-------|----------|
| tiny | 39M | ~1GB | Low | Fastest | Quick previews |
| base | 74M | ~1GB | Medium | Fast | Draft transcripts |
| small | 244M | ~2GB | Good | Medium | General use |
| medium | 769M | ~5GB | Better | Slow | Quality transcripts |
| large-v3 | 1550M | ~10GB | Best | Slowest | Final production |

**Recommendation:** Start with `small` for speed/quality balance. Use `large-v3` for final delivery.

## Basic Transcription

### Using OpenAI Whisper

```bash
# Basic transcription (auto-detect language)
whisper audio.mp3 --model small

# Specify language and output format
whisper audio.mp3 --model medium --language en --output_format srt

# Multiple output formats
whisper audio.mp3 --model small --output_format all

# With timestamps and word-level timing
whisper audio.mp3 --model small --word_timestamps True
```

### Using whisper.cpp

```bash
# Download model first
./models/download-ggml-model.sh base.en

# Transcribe
./main -m models/ggml-base.en.bin -f audio.wav -osrt

# With timestamps
./main -m models/ggml-base.en.bin -f audio.wav -ocsv
```

## Output Formats

### SRT (SubRip Subtitle)
```
1
00:00:01,000 --> 00:00:04,500
Hello and welcome to this video.

2
00:00:05,000 --> 00:00:08,200
Today we'll discuss video editing.
```

### VTT (WebVTT)
```
WEBVTT

00:00:01.000 --> 00:00:04.500
Hello and welcome to this video.

00:00:05.000 --> 00:00:08.200
Today we'll discuss video editing.
```

### JSON (with word-level timing)
```json
{
  "text": "Hello and welcome to this video.",
  "segments": [
    {
      "id": 0,
      "start": 1.0,
      "end": 4.5,
      "text": " Hello and welcome to this video.",
      "words": [
        {"word": "Hello", "start": 1.0, "end": 1.3},
        {"word": "and", "start": 1.4, "end": 1.5},
        {"word": "welcome", "start": 1.6, "end": 2.0},
        {"word": "to", "start": 2.1, "end": 2.2},
        {"word": "this", "start": 2.3, "end": 2.5},
        {"word": "video", "start": 2.6, "end": 3.0}
      ]
    }
  ]
}
```

## Audio Extraction for Transcription

Before transcribing video, extract audio in optimal format:

```bash
# Extract audio as WAV (16kHz, mono - optimal for Whisper)
ffmpeg -i video.mp4 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

# Extract as high-quality WAV for archival
ffmpeg -i video.mp4 -vn -c:a pcm_s16le audio.wav

# Extract as compressed MP3 (smaller, still works)
ffmpeg -i video.mp4 -vn -c:a libmp3lame -q:a 2 audio.mp3
```

## Timing Synchronization

### Convert Whisper JSON to FCP Timing

```python
import json

def whisper_to_fcp_timing(whisper_json_path, fps=24):
    """Convert Whisper JSON output to FCP-compatible timing."""
    with open(whisper_json_path) as f:
        data = json.load(f)

    segments = []
    for seg in data.get("segments", []):
        segments.append({
            "start_time": seg["start"],
            "end_time": seg["end"],
            "start_frame": int(seg["start"] * fps),
            "end_frame": int(seg["end"] * fps),
            "text": seg["text"].strip(),
            "words": seg.get("words", [])
        })

    return segments
```

### Frame-Accurate Timing

```bash
# Get exact frame count and duration
ffprobe -v error -count_frames -select_streams v:0 \
  -show_entries stream=nb_read_frames,duration,r_frame_rate \
  -of json video.mp4
```

## Speaker Diarization

For multi-speaker content, use pyannote.audio:

```bash
pip install pyannote.audio
```

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1")
diarization = pipeline("audio.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```

## Batch Processing

```bash
#!/bin/bash
# Transcribe all videos in directory

MODEL="small"
OUTPUT_DIR="transcripts"
mkdir -p "$OUTPUT_DIR"

for video in *.mp4 *.mov *.avi; do
  [[ -f "$video" ]] || continue

  base="${video%.*}"

  # Extract audio
  ffmpeg -i "$video" -ar 16000 -ac 1 -c:a pcm_s16le "/tmp/${base}.wav" -y

  # Transcribe
  whisper "/tmp/${base}.wav" --model "$MODEL" \
    --output_format all \
    --output_dir "$OUTPUT_DIR"

  # Cleanup temp audio
  rm "/tmp/${base}.wav"

  echo "Transcribed: $video"
done
```

## Quality Optimization

### Improve Accuracy

1. **Noise reduction before transcription:**
```bash
ffmpeg -i noisy_audio.wav -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" clean_audio.wav
```

2. **Use language hint:**
```bash
whisper audio.mp3 --language en --model medium
```

3. **Provide initial prompt for context:**
```bash
whisper audio.mp3 --initial_prompt "Technical discussion about video editing software."
```

### Performance Tips

1. **GPU acceleration (if available):**
```bash
whisper audio.mp3 --model large-v3 --device cuda
```

2. **Process in chunks for long videos:**
```python
# Split audio into 10-minute chunks
# Transcribe each chunk
# Merge results with time offset adjustment
```

## Error Handling

```bash
# Validate audio file before transcription
validate_audio() {
  local file="$1"
  if ffprobe -v error -select_streams a:0 -show_entries stream=codec_type -of csv=p=0 "$file" 2>/dev/null | grep -q "audio"; then
    return 0
  else
    echo "Error: No audio stream found in $file"
    return 1
  fi
}

# Check Whisper installation
check_whisper() {
  if command -v whisper &> /dev/null; then
    echo "Whisper available"
    return 0
  else
    echo "Error: Whisper not installed. Run: pip install openai-whisper"
    return 1
  fi
}
```

## Related Skills

- **ffmpeg-core** - Audio extraction and preprocessing
- **final-cut-pro** - Import transcripts as titles/markers

Related Skills

test-skill

248
from MadAppGang/claude-code

A test skill for validation testing. Use when testing skill parsing and validation logic.

bad-skill

248
from MadAppGang/claude-code

This skill has invalid YAML in frontmatter

release

248
from MadAppGang/claude-code

Plugin release process for MAG Claude Plugins marketplace. Covers version bumping, marketplace.json updates, git tagging, and common mistakes. Use when releasing new plugin versions or troubleshooting update issues.

openrouter-trending-models

248
from MadAppGang/claude-code

Fetch trending programming models from OpenRouter rankings. Use when selecting models for multi-model review, updating model recommendations, or researching current AI coding trends. Provides model IDs, context windows, pricing, and usage statistics from the most recent week.

Claudish Integration Skill

248
from MadAppGang/claude-code

**Version:** 1.0.0

final-cut-pro

248
from MadAppGang/claude-code

Apple Final Cut Pro FCPXML format reference. Covers project structure, timeline creation, clip references, effects, and transitions. Use when generating FCP projects or understanding FCPXML structure.

ffmpeg-core

248
from MadAppGang/claude-code

FFmpeg fundamentals for video/audio manipulation. Covers common operations (trim, concat, convert, extract), codec selection, filter chains, and performance optimization. Use when planning or executing video processing tasks.

statusline-customization

248
from MadAppGang/claude-code

Configuration reference and troubleshooting for the statusline plugin — sections, themes, bar widths, and script architecture

technical-audit

248
from MadAppGang/claude-code

Technical SEO audit methodology including crawlability, indexability, and Core Web Vitals analysis. Use when auditing pages or sites for technical SEO issues.

serp-analysis

248
from MadAppGang/claude-code

SERP analysis techniques for intent classification, feature identification, and competitive intelligence. Use when analyzing search results for content strategy.

schema-markup

248
from MadAppGang/claude-code

Schema.org markup implementation patterns for rich results. Use when adding structured data to content for enhanced SERP appearances.

performance-correlation

248
from MadAppGang/claude-code

Correlate content attributes with GA4 and GSC metrics to identify performance drivers