pdf-text-extractor

Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.

1,864 stars

by LeoYeAI

Best use case

pdf-text-extractor is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using pdf-text-extractor should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/pdf-text-extractor/SKILL.md --create-dirs "https://raw.githubusercontent.com/LeoYeAI/openclaw-master-skills/main/skills/pdf-text-extractor/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/pdf-text-extractor/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How pdf-text-extractor Compares

Feature / Agent	pdf-text-extractor	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# PDF-Text-Extractor - Extract Text from PDFs

**Vernox Utility Skill - Perfect for document digitization.**

## Overview

PDF-Text-Extractor extracts text content from PDF files without requiring any separately installed dependencies. It supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).

## Features

### ✅ Text Extraction
- Extract text from PDFs without external tools
- Support for both text-based and scanned PDFs
- Preserve document structure and formatting
- Fast extraction (milliseconds for text-based)

### ✅ OCR Support
- Use Tesseract.js for scanned documents
- Support multiple languages (English, Spanish, French, and German are listed in the default config)
- Configurable OCR quality/speed
- Fallback to text extraction when possible

### ✅ Batch Processing
- Process multiple PDFs at once
- Batch extraction for document workflows
- Progress tracking for large files
- Error handling and retry logic

### ✅ Output Options
- Plain text output
- JSON output with metadata
- Markdown conversion
- HTML output (preserving links)

### ✅ Utility Features
- Page-by-page extraction
- Character/word counting
- Language detection
- Metadata extraction (author, title, creation date)

## Installation

```bash
clawhub install pdf-text-extractor
```

## Quick Start

### Extract Text from PDF

```javascript
const result = await extractText({
  pdfPath: './document.pdf',
  options: {
    outputFormat: 'text',
    ocr: true,
    language: 'eng'
  }
});

console.log(result.text);
console.log(`Pages: ${result.pages}`);
console.log(`Words: ${result.wordCount}`);
```

### Batch Extract Multiple PDFs

```javascript
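const batch = await extractBatch({
  pdfFiles: [
    './document1.pdf',
    './document2.pdf',
    './document3.pdf'
  ],
  options: {
    outputFormat: 'json',
    ocr: true
  }
});

// extractBatch returns a summary object; the per-file results are in `results`
console.log(`Extracted ${batch.successCount} of ${batch.results.length} PDFs`);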
```

### Extract with OCR

```javascript
const result = await extractText({
  pdfPath: './scanned-document.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});

// OCR will be used (scanned document detected)
```

## Tool Functions

### `extractText`
Extract text content from a single PDF file.

**Parameters:**
- `pdfPath` (string, required): Path to PDF file
- `options` (object, optional): Extraction options
  - `outputFormat` (string): 'text' | 'json' | 'markdown' | 'html'
  - `ocr` (boolean): Enable OCR for scanned docs
  - `language` (string): OCR language code ('eng', 'spa', 'fra', 'deu')
  - `preserveFormatting` (boolean): Keep headings/structure
  - `minConfidence` (number): Minimum OCR confidence score (0-100)
  - `ocrQuality` (string): 'low' | 'medium' | 'high' (OCR quality/speed tradeoff)

**Returns:**
- `text` (string): Extracted text content
- `pages` (number): Number of pages processed
- `wordCount` (number): Total word count
- `charCount` (number): Total character count
- `language` (string): Detected language
- `metadata` (object): PDF metadata (title, author, creation date)
- `method` (string): 'text' or 'ocr' (extraction method)
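
The option values listed above can be checked before a call is made. The following is a minimal sketch, assuming a hypothetical helper named `validateExtractOptions` that is not part of the skill. It only encodes the values documented in the parameter list, and it treats the four listed language codes as the full allowed set, which is an assumption.

```javascript
// Hypothetical helper (not exported by the skill): checks an options
// object against the values documented for extractText.
const OUTPUT_FORMATS = ['text', 'json', 'markdown', 'html'];
const OCR_LANGUAGES = ['eng', 'spa', 'fra', 'deu']; // assumption: only the documented codes

function validateExtractOptions(options = {}) {
  const problems = [];

  if (options.outputFormat !== undefined && !OUTPUT_FORMATS.includes(options.outputFormat)) {
    problems.push(`outputFormat must be one of: ${OUTPUT_FORMATS.join(', ')}`);
  }
  if (options.language !== undefined && !OCR_LANGUAGES.includes(options.language)) {
    problems.push(`language must be one of: ${OCR_LANGUAGES.join(', ')}`);
  }
  if (options.minConfidence !== undefined) {
    const c = options.minConfidence;
    if (typeof c !== 'number' || Number.isNaN(c) || c < 0 || c > 100) {
      problems.push('minConfidence must be a number between 0 and 100');
    }
  }

  // An empty array means no problems were found.
  return problems;
}
```

Returning a list of problems rather than throwing lets a caller report every issue at once.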

### `extractBatch`
Extract text from multiple PDF files at once.

**Parameters:**
- `pdfFiles` (array, required): Array of PDF file paths
- `options` (object, optional): Same as extractText

**Returns:**
- `results` (array): Array of extraction results
- `totalPages` (number): Total pages across all PDFs
- `successCount` (number): Successfully extracted
- `failureCount` (number): Failed extractions
- `errors` (array): Error details for failures
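
To illustrate how a summary with these fields could be assembled, here is a sketch built on `Promise.allSettled`. The function `summarizeBatch` and its `extractOne` argument are hypothetical stand-ins, not the skill's internals, and the shape of each `errors` entry is an assumption.

```javascript
// Hypothetical sketch: run a single-file extractor over several paths and
// build a summary in the shape documented for extractBatch.
// `extractOne` is an injected stand-in for any async single-file extractor.
async function summarizeBatch(pdfFiles, extractOne) {
  // The async wrapper turns synchronous throws into rejections as well.
  const settled = await Promise.allSettled(
    pdfFiles.map(async (pdfPath) => extractOne(pdfPath))
  );

  const summary = { results: [], totalPages: 0, successCount: 0, failureCount: 0, errors: [] };

  settled.forEach((outcome, i) => {
    if (outcome.status === 'fulfilled') {
      summary.results.push(outcome.value);
      summary.totalPages += outcome.value.pages;
      summary.successCount += 1;
    } else {
      const reason = outcome.reason;
      summary.failureCount += 1;
      summary.errors.push({
        pdfPath: pdfFiles[i], // assumed error shape: path plus message
        message: reason instanceof Error ? reason.message : String(reason)
      });
    }
  });

  return summary;
}
```

Because `Promise.allSettled` never rejects, one invalid PDF does not prevent the remaining files from being processed.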

### `countWords`
Count words in extracted text.

**Parameters:**
- `text` (string, required): Text to count
- `options` (object, optional):
  - `minWordLength` (number): Minimum characters per word (default: 3)
  - `excludeNumbers` (boolean): Don't count numbers as words
  - `countByPage` (boolean): Return word count per page

**Returns:**
- `wordCount` (number): Total word count
- `charCount` (number): Total character count
- `pageCounts` (array): Word count per page
- `averageWordsPerPage` (number): Average words per page
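
The interaction between `minWordLength`, `excludeNumbers`, and `countByPage` can be sketched as follows. This is an illustrative re-implementation named `countWordsSketch`, not the skill's own code: the skill may tokenize differently, and the use of form-feed characters (`\f`) as page separators is an assumption of this sketch.

```javascript
// Hypothetical re-implementation for illustration only.
// Assumption: pages in the input text are separated by '\f' (form feed).
function countWordsSketch(text, options = {}) {
  const { minWordLength = 3, excludeNumbers = false, countByPage = false } = options;

  // Count the tokens in one page that pass both filters.
  const countIn = (page) =>
    page
      .split(/\s+/)
      .filter((token) => token.length >= minWordLength)
      .filter((token) => !(excludeNumbers && /^\d+([.,]\d+)*$/.test(token)))
      .length;

  const pages = text.split('\f');
  const pageCounts = pages.map(countIn);
  const wordCount = pageCounts.reduce((sum, n) => sum + n, 0);

  const result = { wordCount, charCount: text.length };
  if (countByPage) {
    result.pageCounts = pageCounts;
    result.averageWordsPerPage = wordCount / pages.length;
  }
  return result;
}
```

With the documented default of `minWordLength: 3`, short words such as "by" or "of" are not counted, so totals can be lower than those of a typical word counter.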

### `detectLanguage`
Detect the language of extracted text.

**Parameters:**
- `text` (string, required): Text to analyze
- `minConfidence` (number, optional): Minimum confidence for detection (0-100)

**Returns:**
- `language` (string): Detected language code
- `languageName` (string): Full language name
- `confidence` (number): Confidence score (0-100)

## Use Cases

### Document Digitization
- Convert paper documents to digital text
- Process invoices and receipts
- Digitize contracts and agreements
- Archive physical documents

### Content Analysis
- Extract text for analysis tools
- Prepare content for LLM processing
- Clean up scanned documents
- Parse PDF-based reports

### Data Extraction
- Extract data from PDF reports
- Parse tables from PDFs
- Pull structured data
- Automate document workflows

### Text Processing
- Prepare content for translation
- Clean up OCR output
- Extract specific sections
- Search within PDF content

## Performance

### Text-Based PDFs
- **Speed:** ~100ms for 10-page PDF
- **Accuracy:** 100% (exact text)
- **Memory:** ~10MB for typical document

### OCR Processing
- **Speed:** ~1-3s per page (high quality)
- **Accuracy:** 85-95% (depends on scan quality)
- **Memory:** ~50-100MB peak during OCR
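
The per-page figure above translates into a rough time budget for a scanned document. The helper below is a hypothetical sketch that simply multiplies the quoted 1-3 s range by the page count; actual timings depend on hardware and scan quality.

```javascript
// Rough estimate based on the ~1-3 s per page figure quoted above.
// Hypothetical helper; it assumes pages are processed one after another.
function estimateOcrSeconds(pageCount) {
  const SECONDS_PER_PAGE = { min: 1, max: 3 };
  return {
    min: pageCount * SECONDS_PER_PAGE.min,
    max: pageCount * SECONDS_PER_PAGE.max
  };
}

console.log(estimateOcrSeconds(40)); // { min: 40, max: 120 }
```

A 40-page scan would therefore take somewhere between 40 seconds and 2 minutes at high quality.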

## Technical Details

### PDF Parsing
- Uses the bundled PDF.js library
- Extracts text layer directly (no OCR needed)
- Preserves document structure
- Handles password-protected PDFs

### OCR Engine
- Tesseract.js under the hood
- Tesseract supports 100+ languages; the default `config.json` lists `eng`, `spa`, `fra`, and `deu`
- Adjustable quality/speed tradeoff
- Confidence scoring for accuracy

### Dependencies
- **No separately installed dependencies**
- PDF.js and Tesseract.js are bundled with the skill
- Everything else uses Node.js built-in modules

## Error Handling

### Invalid PDF
- Clear error message
- Suggest fix (check file format)
- Skip to next file in batch

### OCR Failure
- Report confidence score
- Suggest rescan at higher quality
- Fallback to basic extraction
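
One way to picture the relationship between text-layer extraction, OCR, and the confidence check is the control flow below. It is a sketch under stated assumptions: `extractWithFallback`, `readTextLayer`, and `runOcr` are hypothetical names, the two readers are injected by the caller, and the default threshold of 60 is chosen only for illustration.

```javascript
// Hypothetical control flow, not the skill's implementation.
// `readTextLayer(pdfPath)` resolves to a string; `runOcr(pdfPath)` resolves
// to { text, confidence }. Both are injected stand-ins.
async function extractWithFallback(pdfPath, { readTextLayer, runOcr, minConfidence = 60 }) {
  // Prefer the embedded text layer: it is faster and needs no recognition.
  const embedded = await readTextLayer(pdfPath);
  if (embedded.trim().length > 0) {
    return { text: embedded, method: 'text' };
  }

  // No usable text layer, so treat the file as a scan.
  const ocr = await runOcr(pdfPath);
  if (ocr.confidence < minConfidence) {
    // Report the score and suggest a fix, as described above.
    throw new Error(
      `OCR confidence ${ocr.confidence} is below ${minConfidence}; consider rescanning at higher quality`
    );
  }
  return { text: ocr.text, method: 'ocr', confidence: ocr.confidence };
}
```

Injecting the two readers keeps the decision logic testable without a real PDF or OCR engine.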

### Memory Issues
- Stream processing for large files
- Progress reporting
- Graceful degradation

## Configuration

### Edit `config.json`:
```json
{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium",
    "languages": ["eng", "spa", "fra", "deu"]
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true,
    "includeMetadata": true
  },
  "batch": {
    "maxConcurrent": 3,
    "timeoutSeconds": 30
  }
}
```
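
The defaults in `config.json` and the per-call `options` object overlap. The sketch below shows one plausible way to combine them, with explicit options taking precedence. The helper `resolveOptions` and the mapping from config keys to option names are assumptions, not documented behavior of the skill.

```javascript
// Hypothetical helper: derive per-call options from config defaults.
// The key mapping (e.g. ocr.quality -> ocrQuality) is an assumption.
function resolveOptions(config, overrides = {}) {
  const defaults = {
    ocr: config.ocr.enabled,
    language: config.ocr.defaultLanguage,
    ocrQuality: config.ocr.quality,
    outputFormat: config.output.defaultFormat,
    preserveFormatting: config.output.preserveFormatting
  };
  // Later spreads win, so explicit overrides replace the defaults.
  return { ...defaults, ...overrides };
}

const config = {
  ocr: { enabled: true, defaultLanguage: 'eng', quality: 'medium' },
  output: { defaultFormat: 'text', preserveFormatting: true }
};

console.log(resolveOptions(config, { language: 'deu' }).language); // prints deu
```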

## Examples

### Extract from Invoice
```javascript
const invoice = await extractText({ pdfPath: './invoice.pdf' });
console.log(invoice.text);
// "INVOICE #12345 Date: 2026-02-04..."
```

### Extract from Scanned Contract
```javascript
const contract = await extractText({
  pdfPath: './scanned-contract.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
console.log(contract.text);
// "AGREEMENT This contract between..."
```

### Batch Process Documents
```javascript
const docs = await extractBatch({
  pdfFiles: [
    './doc1.pdf',
    './doc2.pdf',
    './doc3.pdf',
    './doc4.pdf'
  ]
});
console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);
```

## Troubleshooting

### OCR Not Working
- Check if PDF is truly scanned (not text-based)
- Try different quality settings (low/medium/high)
- Ensure language matches document
- Check image quality of scan

### Extraction Returns Empty
- PDF may be image-only
- OCR failed with low confidence
- Try different language setting

### Slow Processing
- Large PDF takes longer
- Reduce quality for speed
- Process in smaller batches
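
Processing in smaller batches amounts to splitting the file list before handing it over. A minimal sketch, assuming a hypothetical `chunk` helper; the batch size of 3 mirrors `maxConcurrent` from the sample config, but that pairing is only a suggestion.

```javascript
// Split a list of paths into batches of at most `size` items.
// Hypothetical helper, not part of the skill.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const batches = chunk(['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf'], 3);
console.log(batches.length); // two batches: three files, then one
```

Each batch could then be passed as `pdfFiles` to a separate `extractBatch` call, one after another.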

## Tips

### Best Results
- Use text-based PDFs when possible (faster, 100% accurate)
- High-quality scans for OCR (300 DPI+)
- Clean background before scanning
- Use correct language setting

### Performance Optimization
- Batch processing for multiple files
- Disable OCR for text-based PDFs
- Lower OCR quality for speed when acceptable

## Roadmap

- [ ] PDF/A support
- [ ] Advanced OCR pre-processing
- [ ] Table extraction from OCR
- [ ] Handwriting OCR
- [ ] PDF form field extraction
- [ ] Batch language detection
- [ ] Confidence scoring visualization

## License

MIT

---

**Extract text from PDFs. Fast, accurate, zero dependencies.** 🔮
