pdf-miner

Extract text and tables from PDF files with robust support for global market data formats (currencies, percentages, units). Use when: (1) User asks to read/extract content from a PDF file, (2) User needs text or tables from industry reports, research papers, or financial documents, (3) web_fetch or scrapling fail on a PDF. Supports: keyword search, metrics extraction, table of contents detection, PDF diff/comparison, LLM chunk splitting, batch processing, header/footer cleaning. NOT for: OCR on scanned image-based PDFs, editing/merging PDFs, or creating new PDFs.

3,891 stars

Best use case

pdf-miner is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using pdf-miner should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/pdf-miner/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/baichenwzj/pdf-miner/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/pdf-miner/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How pdf-miner Compares

| Feature | pdf-miner | Standard Approach |
|---------|-----------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

pdf-miner extracts text and tables from PDF files, with robust handling of global market data formats (currencies, percentages, units). Beyond basic extraction, it supports keyword search, metrics extraction, table-of-contents detection, PDF diffing, LLM-friendly chunking, batch processing, and header/footer cleaning. It is not intended for editing/merging PDFs or creating new ones.

Where can I find the source code?

You can find the source code in the openclaw/skills repository on GitHub, which the install command above points at.

SKILL.md Source

# PDF Miner Skill

Extract text and tables from PDF files using `pdfplumber` (global market formats).

## Prerequisites

```bash
python -m pip install pdfplumber
```

For OCR capabilities (scanned/image PDFs), also install:

```bash
python -m pip install pymupdf openai
```

## Initial Setup for OCR

Before using `--ocr`, you must provide a vision API credential. There are three ways:

1. **Environment variables** (recommended for temporary use):

   ```bash
   export OCR_API_KEY="your-openrouter-api-key"
   export OCR_MODEL="qwen/qwen3.6-plus:free"   # optional
   export OCR_BASE_URL="https://openrouter.ai/api/v1"   # optional
   ```

2. **Config file** (persistent, skill-specific):  
   Create `skills/skills/pdf-miner/config.json` with:

   ```json
   {
     "vision_api_key": "your-openrouter-api-key",
     "vision_model": "qwen/qwen3.6-plus:free",
     "vision_base_url": "https://openrouter.ai/api/v1"
   }
   ```

3. **Command-line arguments** (override per invocation):

   ```bash
   python scripts/extract_pdf.py scanned.pdf --ocr --ocr-api-key "sk-..." --ocr-model "stepfun/step-3.5-flash:free"
   ```

## Usage

Run commands from this skill directory.

### Basic Extraction

```bash
# Full extraction (text + tables)
python scripts/extract_pdf.py input.pdf

# Output to custom path
python scripts/extract_pdf.py input.pdf output.md

# Specific pages
python scripts/extract_pdf.py input.pdf --pages 1-5,10,15-20

# Text or tables only
python scripts/extract_pdf.py input.pdf --text-only
python scripts/extract_pdf.py input.pdf --tables-only
python scripts/extract_pdf.py input.pdf --tables-only --json
```
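The `--pages` syntax above accepts comma-separated page numbers and inclusive ranges. A minimal sketch of how such a spec can be parsed (the function name `parse_pages` is illustrative, not necessarily the script's internal name):

```python
def parse_pages(spec: str) -> list[int]:
    """Parse a page spec like '1-5,10,15-20' into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)

print(parse_pages("1-5,10,15-20"))
```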

### Advanced Modes

```bash
# Search: find pages containing keywords with context
python scripts/extract_pdf.py report.pdf --search "Vietnam export penetration"

# Metrics: extract lines with keywords + numeric values
python scripts/extract_pdf.py report.pdf --metrics "market size growth export penetration"

# TOC: extract table of contents / chapter structure (robust, multi-format)
python scripts/extract_pdf.py report.pdf --toc
# Optionally adjust sensitivity (default: 3 entries per page required)
python scripts/extract_pdf.py report.pdf --toc --toc-min-entries 2

# Diff: compare two PDFs, show pages unique to each
python scripts/extract_pdf.py old_version.pdf new_version.pdf --diff

# Chunk: split output into LLM-friendly chunks
python scripts/extract_pdf.py report.pdf --chunk             # single file, 8000 chars each
python scripts/extract_pdf.py report.pdf --chunk --max-chars 4000
python scripts/extract_pdf.py report.pdf --chunk --output-dir ./chunks   # separate files

# Clean headers/footers
python scripts/extract_pdf.py report.pdf --clean-headers

# Batch: process multiple PDFs
python scripts/extract_pdf.py file1.pdf file2.pdf file3.pdf --output-dir ./extracted
```

### OCR for Scanned/Image PDFs (Automatic by Default)

OCR is automatically triggered for pages with very little extractable text (default threshold: 100 characters). This helps handle scanned or image-based PDFs without requiring the `--ocr` flag.
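The auto-OCR trigger boils down to a per-page character count check. A minimal sketch of that decision, assuming page texts have already been extracted (names are illustrative, not the script's internals):

```python
def pages_needing_ocr(page_texts: list[str], threshold: int = 100) -> list[int]:
    """Return 1-based page numbers whose extractable text falls below the threshold."""
    return [
        i
        for i, text in enumerate(page_texts, start=1)
        if len(text.strip()) < threshold
    ]

# A scanned page typically yields little or no text layer:
print(pages_needing_ocr(["Full text page " * 20, "", "   \n  "]))  # pages 2 and 3
```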

#### Usage Examples

```bash
# Automatic OCR (default behavior)
python scripts/extract_pdf.py scanned.pdf

# Force OCR on all pages (ignore text length)
python scripts/extract_pdf.py scanned.pdf --ocr

# Force OCR only on specific pages
python scripts/extract_pdf.py scanned.pdf --ocr --ocr-pages 1-5,10

# Adjust OCR quality (DPI)
python scripts/extract_pdf.py scanned.pdf --ocr --ocr-dpi 300

# Use a different vision model
python scripts/extract_pdf.py scanned.pdf --ocr --ocr-model "stepfun/step-3.5-flash:free"

# Disable automatic OCR detection (if you want pure extraction only)
python scripts/extract_pdf.py file.pdf --no-auto-ocr

# Change the low-text threshold (default 100 chars)
python scripts/extract_pdf.py file.pdf --ocr-threshold 200
```

#### Configuration

OCR requires a vision API key. See [Initial Setup for OCR](#initial-setup-for-ocr).

| Option | Default | Description |
|--------|---------|-------------|
| `--ocr` | off | Force OCR on pages (with auto-detect or `--ocr-pages`) |
| `--auto-ocr` | on | Automatically OCR low-text pages (hidden; use `--no-auto-ocr` to disable) |
| `--no-auto-ocr` | - | Disable automatic OCR detection |
| `--ocr-pages` | - | Comma-separated pages/ranges to OCR (requires `--ocr`) |
| `--ocr-threshold` | 100 | Minimum text length to consider a page as "sufficient" (characters) |
| `--ocr-dpi` | 200 | Image DPI for OCR rendering |
| `--ocr-api-key` | from env/config | Override API key |
| `--ocr-base-url` | from env/config | Override API base URL |
| `--ocr-model` | from env/config | Override vision model |

#### Troubleshooting

**OCR failed with "No API key"**  
→ Configure your API key in `config.json` or via `OCR_API_KEY` env var.

**OCR model rejects images**  
→ The configured model might not support vision. Choose a vision-capable model (e.g., `qwen/qwen3.6-plus:free`, `stepfun/step-3.5-flash:free`). The script will attempt to auto-fallback to a known good model if the configured one lacks vision support.

**Too many pages being OCR'd**  
→ Increase the threshold: `--ocr-threshold 300` or `--no-auto-ocr` and selectively use `--ocr-pages`.

**Rate limit errors**  
→ Reduce concurrent OCR calls, switch to a paid model tier, or try a different provider.

## Configuration Reference

| Option | Default | Source |
|--------|---------|--------|
| `OCR_API_KEY` | (none) | env `OCR_API_KEY` or `config.json` `vision_api_key` |
| `OCR_MODEL` | `qwen/qwen3.6-plus:free` | env `OCR_MODEL` or `config.json` `vision_model` |
| `OCR_BASE_URL` | `https://openrouter.ai/api/v1` | env `OCR_BASE_URL` or `config.json` `vision_base_url` |

Precedence: CLI argument > environment variable > `config.json` > hardcoded default.
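That precedence order can be sketched as a first-defined-value lookup (a sketch of the rule, not the script's actual resolution code):

```python
import os

def resolve(cli_value, env_name, config, config_key, default):
    """Return the first defined value: CLI arg > env var > config.json > default."""
    if cli_value is not None:
        return cli_value
    if os.environ.get(env_name):
        return os.environ[env_name]
    if config.get(config_key):
        return config[config_key]
    return default

config = {"vision_model": "qwen/qwen3.6-plus:free"}
# With no CLI argument and the env var unset, the config value wins:
print(resolve(None, "OCR_MODEL", config, "vision_model", "fallback-model"))
```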

## Tool Comparison

| Tool   | PDF | Global text | Tables | Search | Metrics | Diff | Chunk |
|--------|-----|----------|--------|--------|---------|------|-------|
| web_fetch | ❌ | - | ❌ | ❌ | ❌ | ❌ | ❌ |
| scrapling | ❌ | - | ❌ | ❌ | ❌ | ❌ | ❌ |
| pypdf | ⚠️ garbled | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **pdfplumber** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

## Modes Reference

| Mode    | Flag                      | What it does                                                     |
|---------|---------------------------|------------------------------------------------------------------|
| Full    | (default)                 | Extract all text + tables, page by page                          |
| Search  | `--search "kw1 kw2"`      | Find pages with keywords, show ±N lines context (default 5)     |
| Metrics | `--metrics "kw1 kw2"`     | Extract lines with keywords AND numeric data                    |
| TOC     | `--toc`                   | Detect table of contents / chapter structure (robust multi-format) |
|         | `--toc-min-entries N`     | Minimum TOC entries per page to trust detection (default: 3)    |
| Diff    | `--diff`                  | Compare two PDFs, show matched vs unique pages                 |
| Chunk   | `--chunk`                 | Split into LLM-friendly pieces (`--max-chars N`)                |
| Clean   | `--clean-headers`         | Auto-detect and remove repeated header/footer lines             |
| Batch   | `file1 file2 ...`         | Process multiple PDFs, output to `--output-dir`                 |
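The chunk mode's job, splitting extracted text into pieces no longer than `--max-chars`, can be sketched as follows, preferring paragraph boundaries and hard-splitting only oversized paragraphs (an illustration, not the script's exact algorithm):

```python
def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        candidate = current + "\n\n" + para if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # An oversized single paragraph is split hard at max_chars.
            while len(para) > max_chars:
                chunks.append(para[:max_chars])
                para = para[max_chars:]
            current = para
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("alpha\n\nbeta\n\ngamma", max_chars=12))
```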

## Output Options

| Flag                      | Effect                                                           |
|---------------------------|------------------------------------------------------------------|
| `--output-dir ./dir`      | Output to specified directory                                    |
| `--chunk --output-dir`    | Each chunk as separate file                                      |
| `--context N`             | Context lines around search matches (default 5)                  |
| `--max-chars N`           | Chunk size (default 8000)                                        |
| `--header-lines "a" "b"`  | Manually specify header/footer lines to remove                  |

## Workflow

### 1. Download PDF (if URL)

```python
import urllib.request
urllib.request.urlretrieve(url, "report.pdf")
```

### 2. Extract

Run from this skill directory:

```bash
cd <skill-directory>
python scripts/extract_pdf.py /path/to/report.pdf [options]
```

### 3. Read & Answer

Read the output `.md` file and answer based on the extracted content.

### 4. Clean Up

Delete temporary PDF and `.md` files when done.

## Limitations

- **Scanned/image-based PDFs**: Cannot extract text without OCR. Install OCR dependencies and configure an API key.
- **Embedded charts/graphs**: Only text labels extracted, not chart data.
- **Multi-column layouts**: Use `--layout` flag for improved reading order via x_tolerance.
- **TOC detection**: Robust multi-format matching with validation. Very non-standard layouts may still require manual extraction.
- **Diff**: Uses text similarity (Jaccard on normalized lines), not page numbers. Threshold adjustable via `--diff-threshold N` (default 0.8).
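Jaccard similarity over normalized lines, as used by the diff threshold above, can be sketched like this (illustrative; the script's normalization rules may differ):

```python
def jaccard_pages_similar(page_a: str, page_b: str, threshold: float = 0.8) -> bool:
    """Compare two pages by Jaccard similarity over their normalized lines."""
    def norm(page: str) -> set[str]:
        # Collapse whitespace and lowercase so cosmetic differences don't count.
        return {" ".join(line.split()).lower() for line in page.splitlines() if line.strip()}

    a, b = norm(page_a), norm(page_b)
    if not a and not b:
        return True  # two empty pages match trivially
    similarity = len(a & b) / len(a | b)
    return similarity >= threshold

print(jaccard_pages_similar("Revenue: 10M\nCosts: 4M", "Revenue: 10M\nCosts:  4M"))
```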

## Troubleshooting

**OCR fails with "No API key"**  
→ Set `OCR_API_KEY` environment variable or fill `config.json`.

**OCR model rejects images**  
→ The configured model may not support vision; either choose a vision-capable model (e.g., `qwen/qwen3.6-plus:free`, `stepfun/step-3.5-flash:free`) or let the script auto-fallback by removing the model setting.

**Rate limit errors**  
→ Reduce concurrent calls, switch to a paid tier, or try a different model provider.

Related Skills

pdf-process-mineru


PDF document parsing tool based on local MinerU, supports converting PDF to Markdown, JSON, and other machine-readable formats.

unstructured-medical-text-miner


Mine unstructured clinical text from MIMIC-IV to extract diagnostic logic.

mineru-pdf-extractor


Extract PDF content to Markdown using MinerU API. Supports formulas, tables, OCR. Provides both local file and online URL parsing methods.

ts-interface-miner


An assistant specialized in analyzing TypeScript (.ts/.tsx) files. Given user-supplied keywords (feature descriptions, function names, API paths), it pinpoints the relevant interface definitions. The skill parses code structure and comments (JSDoc/single-line comments) in depth, extracts request methods, paths, parameter details, response structures, and status codes, and produces a clear, complete Markdown table document.

review-miner


Distill selling points, pain points, objections, and phrasing to drop from reviews, ratings, and feedback. Use for reviews, voice-of-customer, and marketing workflows; do not use for faking positive reviews or exposing user identities.

article-factory-wechat

Content & Documentation

humanizer


Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases.

Content & Documentation

find-skills


Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.

General Utilities

tavily-search


Use Tavily API for real-time web search and content extraction. Use when: user needs real-time web search results, research, or current information from the web. Requires Tavily API key.

Data & Research

baidu-search


Search the web using Baidu AI Search Engine (BDSE). Use for live information, documentation, or research topics.

Data & Research

agent-autonomy-kit


Stop waiting for prompts. Keep working.

Workflow & Productivity

Meeting Prep


Never walk into a meeting unprepared again. Your agent researches all attendees before calendar events—pulling LinkedIn profiles, recent company news, mutual connections, and conversation starters. Generates a briefing doc with talking points, icebreakers, and context so you show up informed and confident. Triggered automatically before meetings or on-demand. Configure research depth, advance timing, and output format. Walking into meetings blind is amateur hour—missed connections, generic small talk, zero leverage. Use when setting up meeting intelligence, researching specific attendees, generating pre-meeting briefs, or automating your prep workflow.

Workflow & Productivity