historical-document-ocr

Transcribe scanned/photographed historical documents (PDFs or images) to text with Gemini vision, using accuracy-first practices: per-page high-res rendering, faded-scan image enhancement, strict verbatim prompting, and an optional multi-model consensus pass that reconciles disagreements by re-reading the page. Use this whenever the user wants to OCR, transcribe, or extract the text of scanned letters, manuscripts, typescripts, carbon copies, ledgers, archival records, genealogy documents, old correspondence, or any image-only PDF that has no real text layer — especially when the material is handwritten, typewritten, faded, rotated, or hard to read and accuracy matters. Trigger even if the user just says "read this old scan", "what does this letter say", "digitize these archive pages", or "transcribe this PDF" and the PDF turns out to be scanned images. Do NOT use this for audio/podcast transcription (that is gemini-podcast-transcribe) or for born-digital PDFs that already contain selectable text (a plain pdftotext is enough there).

Best use case

historical-document-ocr is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using historical-document-ocr should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/historical-document-ocr/SKILL.md --create-dirs "https://raw.githubusercontent.com/tdhopper/dotfiles2.0/main/.claude/skills/historical-document-ocr/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/historical-document-ocr/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How historical-document-ocr Compares

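| Feature / Agent | historical-document-ocr | Standard Approach |
|-----------------|-------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |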

Frequently Asked Questions

Where can I find the source code?

The source lives in the tdhopper/dotfiles2.0 repository on GitHub, under .claude/skills/historical-document-ocr/.

SKILL.md Source

# Historical Document OCR / Transcription

Transcribe scanned historical documents with Gemini, optimized for **accuracy on
degraded material** (faded carbons, handwriting, rotation, low contrast) rather
than speed. The bundled script `scripts/transcribe_documents.py` does the whole
pipeline; this file explains how to drive it well.

## When this is the right tool

This is for **image-only** documents — the pixels *are* the only copy of the
text. If `pdftotext file.pdf -` already returns the real text, the document is
born-digital and you don't need OCR at all. Confirm first:

```bash
pdftotext -f 1 -l 3 INPUT.pdf - | wc -c   # near-zero => scanned images => use this skill
```
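
The same check can be scripted when many PDFs need sorting. The following is a minimal sketch, not part of the bundled script: `extract_text`, `looks_scanned`, and the 50-character cutoff are hypothetical names and values chosen for illustration, and `pdftotext` is assumed to be on the PATH.

```python
import subprocess


def extract_text(pdf_path: str, first_page: int = 1, last_page: int = 3) -> str:
    """Return whatever text layer pdftotext finds in the given page range."""
    result = subprocess.run(
        ["pdftotext", "-f", str(first_page), "-l", str(last_page), pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def looks_scanned(extracted: str, min_chars: int = 50) -> bool:
    """True when the text layer is (nearly) empty, i.e. the PDF is image-only.

    min_chars is an assumed cutoff: pdftotext emits form feeds and stray
    whitespace even for pure scans, so only non-whitespace characters count.
    """
    visible = sum(1 for ch in extracted if not ch.isspace())
    return visible < min_chars
```

Typical use would be `looks_scanned(extract_text("INPUT.pdf"))`; a `True` result suggests the document needs this skill rather than plain `pdftotext`.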

## Prerequisites

- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`) in the environment.
- `poppler` for `pdftoppm`/`pdftotext` (`brew install poppler`).
- `imagemagick` for faded-scan enhancement (`brew install imagemagick`) —
  optional; the script degrades gracefully without it.
- Python deps (`google-genai`) are installed automatically by `uv run`.
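
A preflight check along these lines can surface missing prerequisites before any API spend. This is a sketch under assumptions: `preflight` is a hypothetical helper, not something the bundled script is known to expose, and it treats ImageMagick as optional to mirror the list above.

```python
import os
import shutil


def preflight(env=os.environ, which=shutil.which):
    """Return (errors, warnings) describing missing prerequisites."""
    errors, warnings = [], []
    if not (env.get("GEMINI_API_KEY") or env.get("GOOGLE_API_KEY")):
        errors.append("set GEMINI_API_KEY or GOOGLE_API_KEY")
    for tool in ("pdftoppm", "pdftotext"):
        if which(tool) is None:
            errors.append(f"{tool} not found (install poppler)")
    # ImageMagick 7 installs `magick`; ImageMagick 6 installs `convert`.
    if which("magick") is None and which("convert") is None:
        warnings.append("ImageMagick not found; enhancement will be skipped")
    return errors, warnings
```

Passing `env` and `which` as parameters keeps the check easy to exercise without touching the real environment.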

## The core workflow

**1. Look before you transcribe.** Render a few pages and actually read them, so
you know what you're dealing with and can write good context. This one step
drives every later decision (model, enhancement, context).

```bash
pdftoppm -png -r 150 -f 1 -l 4 INPUT.pdf /tmp/peek   # then Read the PNGs
```

Note: handwritten vs typed vs printed? faded? rotated? what's the subject, and
what proper nouns / names / places / dates / era-specific spellings appear? Those
last details become your `--context`.

**2. Write a context block.** Gemini uses it ONLY to disambiguate genuinely
ambiguous characters — it is told never to invent text from it — but it
meaningfully improves accuracy on names, places, and period spellings. Keep it
factual and specific. Example:

```
- Typewritten carbon copies, September 1956.
- Author: Joe B. Hopper, a missionary writing from Chunju, Korea.
- Recipient: Dr. L. Nelson Bell, Montreat, North Carolina.
- The Korean currency "hwan" is spelled "whan" here; keep it as written.
```

**3. Pick the model (accuracy vs cost).** See `references/practices.md` for the
full rationale and current model notes. Short version:
- Default `gemini-3.1-pro-preview` — most faithful on faded/rotated pages and it
  preserves the original's typos instead of silently "correcting" them.
- `gemini-3.5-flash` — near-equal on clean text, cheaper; good at scale.
- `gemini-2.5-flash` — cheapest, but **drops faint lines and scrambles rotated
  text**; avoid on degraded scans.
- At a few hundred pages the cost gap is pennies, so **buy accuracy** unless the
  job is genuinely large.

Model names change over time. If a call 404s with "no longer available", list
what's live and pick the current successor:

```bash
uv run scripts/transcribe_documents.py --help    # shows defaults
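# assumes uv can fetch google-genai; accepts either API key variable
uv run --with google-genai python -c "import os; from google import genai; key = os.environ.get('GEMINI_API_KEY') or os.environ['GOOGLE_API_KEY']; [print(m.name) for m in genai.Client(api_key=key).models.list() if 'gemini' in m.name]"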
```

**4. Smoke-test on the hardest pages first.** Find the most degraded pages from
step 1 and run just those, so you catch problems before paying for the whole
document.

```bash
uv run scripts/transcribe_documents.py INPUT.pdf --pages 5,7-9 \
  --context-file /tmp/context.txt --keep-images
```
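
For readers wondering how a spec like `5,7-9` expands, here is one plausible reading, sketched as a standalone function. `parse_pages` is a hypothetical name; the bundled script may parse the flag differently (for example, it may or may not tolerate spaces or duplicates).

```python
from __future__ import annotations


def parse_pages(spec: str) -> list[int]:
    """Expand a spec such as "5,7-9" into sorted, de-duplicated page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)
```

Under this reading, `--pages 5,7-9` would cover pages 5, 7, 8, and 9.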

**5. Run the full document.** For accuracy-critical work, use `--consensus`
(below). Otherwise a single-model pass is fine.

```bash
uv run scripts/transcribe_documents.py INPUT.pdf --context-file /tmp/context.txt --consensus
```

**6. Report results and the review list.** Tell the user the unanimous/reconciled
split (consensus mode) and which pages have the most disagreements or lowest
confidence — those are the targeted human-review list. Each page's front-matter
holds its `disagreements:` and `notes:`.
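
Building the review list can be automated from the per-page files. The sketch below rests on an assumed front-matter layout that the source does not spell out: the block sits between two `---` lines, `consensus:` is a scalar, and `disagreements:` is a block list whose items start with `- `. `summarize_page` and `review_list` are hypothetical helpers; adjust them to the real files.

```python
from __future__ import annotations


def summarize_page(markdown: str) -> dict:
    """Pull the consensus tag and disagreement count from one page's front-matter.

    Assumed layout (not confirmed by the source): front-matter sits between two
    '---' lines, `consensus:` is a scalar, and `disagreements:` is a block list
    whose items start with '- '.
    """
    lines = markdown.splitlines()
    summary = {"consensus": None, "disagreements": 0}
    if not lines or lines[0].strip() != "---":
        return summary
    in_disagreements = False
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if in_disagreements and line.lstrip().startswith("- "):
            summary["disagreements"] += 1
            continue
        in_disagreements = line.startswith("disagreements:")
        if line.startswith("consensus:"):
            summary["consensus"] = line.split(":", 1)[1].strip()
    return summary


def review_list(pages: dict[str, str], top: int = 10) -> list[tuple[str, int]]:
    """Rank pages by disagreement count, most contested first."""
    counts = {name: summarize_page(text)["disagreements"] for name, text in pages.items()}
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(name, n) for name, n in ranked[:top] if n > 0]
```

Feeding it a mapping of file name to file contents yields the pages most worth a human look, in priority order.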

## Consensus mode — the accuracy multiplier

`--consensus` transcribes each page with **two independent models**, then:
- if they **agree**, accepts it (tagged `unanimous`, no extra cost);
- if they **disagree**, sends the page image *plus* both candidate transcriptions
  to a judge model that **re-reads the pixels** to adjudicate, marking truly
  ambiguous spots `[illegible]` instead of guessing.

Independent models make independent errors, so agreement is a strong correctness
signal and disagreements pinpoint exactly where to look. Cost is ~2–3× a single
pass. Tune voters with `--voters "modelA,modelB,modelA:0.4"` (a `:TEMP` suffix
adds a temperature-jittered voter) and the adjudicator with `--judge-model`.
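
The agree-or-adjudicate logic can be pictured with a small sketch. It is an illustration under assumptions, not the script's code: `parse_voters`, `disagreements`, and `reconcile` are hypothetical names, agreement is judged on whitespace-separated words, and `judge` is a stand-in callable for the real adjudicator, which re-reads the page image.

```python
from __future__ import annotations

import difflib


def parse_voters(spec: str) -> list[tuple[str, float | None]]:
    """Split "modelA,modelB,modelA:0.4" into (model, temperature) pairs."""
    voters = []
    for item in spec.split(","):
        name, sep, temp = item.strip().partition(":")
        voters.append((name, float(temp) if sep else None))
    return voters


def disagreements(a: str, b: str) -> list[tuple[str, str]]:
    """Return the word spans on which two transcriptions differ."""
    words_a, words_b = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return spans


def reconcile(a: str, b: str, judge) -> tuple[str, str]:
    """Accept agreement as-is; otherwise defer to judge(a, b, spans)."""
    spans = disagreements(a, b)
    if not spans:
        return a, "unanimous"
    return judge(a, b, spans), "reconciled"
```

Because the judge is only called when `disagreements` returns something, unanimous pages cost nothing beyond the two voter calls.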

## Output layout

Writes to `transcriptions/<pdf-stem>/` (override with `-o`):
- `pages/page-NNN.md` — one page each, with YAML front-matter:
  `confidence`, `illegible_count`, `rotation_observed`, `notes`, and (consensus
  mode) `consensus:` + a `disagreements:` list of the exact words the models
  split on.
- `_combined.md` — the full document, page-delimited.
- `manifest.json` — machine-readable per-page status (drives review triage).
- `images/` — rendered PNGs, only with `--keep-images` (these get large at
  300 DPI; don't commit them — regenerate when needed).

The run is **resumable and idempotent**: finished pages are skipped unless
`--force`, so a failed/interrupted run just gets re-run. Failed pages are
retried with exponential backoff and listed at the end.
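
Resumability and backoff of this kind can be sketched in a few lines. The helpers below are hypothetical: they assume `page-NNN.md` means a three-digit, zero-padded page number, and the attempt count and base delay are illustrative values rather than the script's settings.

```python
from __future__ import annotations

import time
from pathlib import Path


def pending_pages(out_dir: Path, pages: list[int], force: bool = False) -> list[int]:
    """Skip pages whose output file already exists, unless force is set."""
    if force:
        return list(pages)
    return [p for p in pages if not (out_dir / "pages" / f"page-{p:03d}.md").exists()]


def with_backoff(task, attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**n before the next try."""
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller record the failed page
            sleep(base_delay * 2 ** attempt)
```

Since finished pages are filtered out up front, re-running after an interruption only pays for the pages that never completed.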

## Key flags

| Flag | Purpose |
|------|---------|
| `--consensus` | multi-model + image-grounded reconciliation (accuracy) |
| `--context` / `--context-file` | disambiguation context (names, dates, spellings) |
| `-m MODEL` | single-pass model (default `gemini-3.1-pro-preview`) |
| `--voters` / `--judge-model` | configure consensus models |
| `--pages 5,7-9` | transcribe a subset (smoke tests, re-runs) |
| `--no-enhance` | skip ImageMagick contrast/deskew (use on clean scans) |
| `--force` | re-transcribe pages that already exist |
| `--keep-images` | keep rendered PNGs for visual review |
| `--dpi` `--max-tokens` | render resolution / output cap (defaults 300 / 32768) |
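
To make the enhancement step that `--no-enhance` skips more concrete, here is a sketch of a contrast-and-deskew call. The exact ImageMagick options the bundled script uses are not stated in this document, so the option set below (`-colorspace Gray`, `-deskew 40%`, `-normalize`) is an assumption, and `enhance_command` and `enhance` are hypothetical helpers.

```python
from __future__ import annotations

import shutil
import subprocess


def enhance_command(src: str, dst: str, binary: str = "magick") -> list[str]:
    """Build an ImageMagick call: grayscale, straighten, then stretch contrast."""
    return [
        binary, src,
        "-colorspace", "Gray",   # drop colour casts from aged paper
        "-deskew", "40%",        # straighten small scan rotations
        "-normalize",            # stretch contrast so faint strokes darken
        dst,
    ]


def enhance(src: str, dst: str) -> bool:
    """Run the enhancement if ImageMagick is installed; report whether it ran."""
    binary = shutil.which("magick") or shutil.which("convert")
    if binary is None:
        return False  # degrade gracefully: caller keeps the original image
    subprocess.run(enhance_command(src, dst, binary), check=True)
    return True
```

On clean scans this kind of processing adds little and can occasionally hurt, which is the case the `--no-enhance` flag covers.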

## Quality expectations & verification

Even at state of the art, expect ~1–4% word error on faded/handwritten material;
the failure modes are **skipped lines** and **word substitutions**, not
fabrication (the prompt forbids inventing text). For archival-grade output,
always do a human pass on the flagged disagreement spans against the page images
— consensus mode is what makes that pass small and targeted. See
`references/practices.md` for the research these defaults are based on, the model
selection evidence, and troubleshooting.
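
To check the word-error figure on your own material, hand-correct a few pages and compare them with the model output. The sketch below computes word error rate as word-level edit distance divided by the number of reference words; `word_error_rate` is a hypothetical helper and it compares whitespace-separated tokens exactly, so punctuation and case differences count as errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference is empty")
    # previous[j] = edits needed to turn the reference words seen so far into hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1] / len(ref)
```

As a rough guide, on a page of 400 words (an illustrative figure) a 1–4% rate corresponds to about 4 to 16 wrong, missing, or extra words.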
