PDF OCR using Gemini LLM

Extract text from PDFs using Google Gemini OCR. Use when extracting text from PDFs, performing OCR on scanned documents, or processing image-based PDFs.

3,891 stars

Best use case

PDF OCR using Gemini LLM is best used when you need a repeatable AI agent workflow for PDF text extraction rather than a one-off prompt.

Teams using it should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/geminipdfocr/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/ashtonizmev/geminipdfocr/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/geminipdfocr/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill
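
For scripted setups, the download-and-place steps above can be sketched in Python. This is a minimal illustration, not part of the skill: the helper names `project_skill_path` and `install_skill` are hypothetical, and the URL is the one from the curl command.

```python
from pathlib import Path
from urllib.request import urlopen

# URL copied from the curl command in the installation instructions.
SKILL_URL = (
    "https://raw.githubusercontent.com/openclaw/skills/main/"
    "skills/ashtonizmev/geminipdfocr/SKILL.md"
)


def project_skill_path(project_root: Path) -> Path:
    """Return the project-level location named in step 2 (hypothetical helper)."""
    return project_root / ".claude" / "skills" / "geminipdfocr" / "SKILL.md"


def install_skill(url: str, destination: Path) -> Path:
    """Download SKILL.md from `url` and write it to `destination`."""
    # Create missing parent directories, like curl's --create-dirs.
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response:
        destination.write_bytes(response.read())
    return destination
```

Calling `install_skill(SKILL_URL, project_skill_path(Path.cwd()))` from the project root would cover steps 1 and 2; restarting the agent (step 3) remains a manual step.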

How PDF OCR using Gemini LLM Compares

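| Feature / Agent | PDF OCR using Gemini LLM | Standard Approach |
|-----------------|--------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |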

SKILL.md Source

## Purpose

Use geminipdfocr to extract text from PDF documents via OCR (Google Gemini).

## Data and privacy

**Full page images/files are sent to Google's API.** PDFs are split into single-page files and each page is uploaded to Google Gemini for OCR. There are no hidden exfiltration endpoints or other data collection. Do not use with highly sensitive documents unless you accept that content is sent to Google.

## Setup (venv installation)

Before first use, create and activate the virtual environment:

```bash
cd geminipdfocr && python -m venv venv && source venv/bin/activate && pip install -r requirements.txt
```

Set `GOOGLE_API_KEY` in your environment before running (e.g. `export GOOGLE_API_KEY=your-key`).
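
A wrapper script can check for the key up front, so a missing key fails with a clear message before any page is uploaded. The sketch below is an assumption about how one might do that; `require_api_key` is a hypothetical helper, not something the skill provides.

```python
import os


def require_api_key(env=None) -> str:
    """Return GOOGLE_API_KEY, or raise with a hint if it is missing or blank."""
    # Allow a custom mapping to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GOOGLE_API_KEY is not set; run `export GOOGLE_API_KEY=your-key` first."
        )
    return key
```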

## How to use

When requested to extract text or perform OCR on a PDF:

1. Run: `cd geminipdfocr && source venv/bin/activate && python -m geminipdfocr <path-to-pdf> [--json] [--output <file>]`
2. Use `--json` for structured data.
3. Use `--max-pages N` for testing or very long documents.
4. Use `--quiet` to suppress progress logs.
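
The steps above can also be driven from Python. The following is a rough sketch that only assembles the flags documented here and passes them to `subprocess.run`. The names `build_command` and `run_ocr` are hypothetical, and the sketch assumes it is executed with the venv's interpreter so that `sys.executable` can import `geminipdfocr`.

```python
import subprocess
import sys
from typing import Optional, Sequence


def build_command(
    pdf_paths: Sequence[str],
    json_output: bool = False,
    output: Optional[str] = None,
    max_pages: Optional[int] = None,
    quiet: bool = False,
) -> list:
    """Assemble the argv for `python -m geminipdfocr` from the documented flags."""
    if not pdf_paths:
        raise ValueError("at least one PDF path is required")
    command = [sys.executable, "-m", "geminipdfocr", *pdf_paths]
    if max_pages is not None:
        command += ["--max-pages", str(max_pages)]
    if json_output:
        command.append("--json")
    if output is not None:
        command += ["--output", output]
    if quiet:
        command.append("--quiet")
    return command


def run_ocr(pdf_paths, cwd="geminipdfocr", **options) -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        build_command(pdf_paths, **options),
        cwd=cwd,  # mirrors the `cd geminipdfocr` step
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Because `run_ocr` changes into the `geminipdfocr` directory, relative PDF paths resolve against that directory; absolute paths avoid the ambiguity.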

## Requirements

- A valid PDF file path.
- `GOOGLE_API_KEY` set in the process environment (e.g. `export GOOGLE_API_KEY=your-key`).

## CLI options

| Option | Description |
|--------|-------------|
| `pdf_path` | One or more PDF file paths (positional) |
| `--max-pages N` | Limit pages per PDF |
| `--json` | Output structured JSON instead of plain text |
| `--output FILE` | Write result to file (default: stdout) |
| `--quiet` | Suppress INFO/DEBUG logs |
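
To make the table concrete, here is a hypothetical `argparse` parser that accepts the same options. It is an illustration of the documented interface only and is not taken from the skill's source, which may define its options differently.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table; not the skill's own code."""
    parser = argparse.ArgumentParser(prog="geminipdfocr")
    parser.add_argument("pdf_path", nargs="+", help="One or more PDF file paths")
    parser.add_argument("--max-pages", type=int, metavar="N", help="Limit pages per PDF")
    parser.add_argument("--json", action="store_true", help="Output structured JSON")
    parser.add_argument("--output", metavar="FILE", help="Write result to file (default: stdout)")
    parser.add_argument("--quiet", action="store_true", help="Suppress INFO/DEBUG logs")
    return parser
```

With this sketch, `make_parser().parse_args(["a.pdf", "b.pdf", "--max-pages", "2", "--json"])` yields two paths, a page limit of 2, JSON output enabled, and no output file, so the result would go to stdout.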
