kreuzberg
Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.
Best use case
kreuzberg is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.
Teams using kreuzberg should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/kreuzberg/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How kreuzberg Compares
| Feature / Agent | kreuzberg | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
AI Agents for Marketing
Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.
SKILL.md Source
# Kreuzberg Document Extraction
Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.
Use this skill when writing code that:
- Extracts text or metadata from documents
- Performs OCR on scanned documents or images
- Batch-processes multiple files
- Configures extraction options (output format, chunking, OCR, language detection)
- Implements custom plugins (post-processors, validators, OCR backends)
## Installation
### Python
```bash
pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr] # EasyOCR
pip install kreuzberg[paddleocr] # PaddleOCR
```
### Node.js
```bash
npm install @kreuzberg/node
```
### Rust
```toml
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
# embeddings, language-detection, keywords-yake, keywords-rake
```
### CLI
```bash
# Download from GitHub releases, or:
cargo install kreuzberg-cli
```
## Quick Start
### Python (Async)
```python
from kreuzberg import extract_file
result = await extract_file("document.pdf")
print(result.content) # extracted text
print(result.metadata) # document metadata
print(result.tables) # extracted tables
```
### Python (Sync)
```python
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
print(result.content)
```
### Node.js
```typescript
import { extractFile } from '@kreuzberg/node';
const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
```
### Node.js (Sync)
```typescript
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf');
```
### Rust (Async)
```rust
use kreuzberg::{extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig::default();
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);
Ok(())
}
```
### Rust (Sync) — requires `tokio-runtime` feature
```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig::default();
let result = extract_file_sync("document.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
```
### CLI
```bash
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
```
## Configuration
All languages use the same configuration structure with language-appropriate naming conventions.
### Python (snake_case)
```python
from kreuzberg import (
ExtractionConfig, OcrConfig, TesseractConfig,
PdfConfig, ChunkingConfig,
)
config = ExtractionConfig(
ocr=OcrConfig(
backend="tesseract",
language="eng",
tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
),
pdf_options=PdfConfig(passwords=["secret123"]),
chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
output_format="markdown",
)
result = await extract_file("document.pdf", config=config)
```
### Node.js (camelCase)
```typescript
import { extractFile, type ExtractionConfig } from '@kreuzberg/node';
const config: ExtractionConfig = {
ocr: { backend: 'tesseract', language: 'eng' },
pdfOptions: { passwords: ['secret123'] },
chunking: { maxChars: 1000, maxOverlap: 200 },
outputFormat: 'markdown',
};
const result = await extractFile('document.pdf', null, config);
```
### Rust (snake_case)
```rust
use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".into(),
language: "eng".into(),
..Default::default()
}),
chunking: Some(ChunkingConfig {
max_characters: 1000,
overlap: 200,
..Default::default()
}),
output_format: OutputFormat::Markdown,
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
```
### Config File (TOML)
```toml
output_format = "markdown"
[ocr]
backend = "tesseract"
language = "eng"
[chunking]
max_chars = 1000
max_overlap = 200
[pdf_options]
passwords = ["secret123"]
```
```bash
# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
```
## Batch Processing
### Python
```python
from kreuzberg import batch_extract_files, batch_extract_files_sync
# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])
# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])
for result in results:
print(f"{len(result.content)} chars extracted")
```
### Node.js
```typescript
import { batchExtractFiles } from '@kreuzberg/node';
const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);
```
### Rust — requires `tokio-runtime` feature
```rust
use kreuzberg::{batch_extract_file, ExtractionConfig};
let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
```
### CLI
```bash
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
```
## OCR
OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).
### Backends
- **Tesseract** (default): Built-in native binding. All Tesseract languages supported.
- **EasyOCR** (Python only): `pip install kreuzberg[easyocr]`. Pass `easyocr_kwargs={"gpu": True}`.
- **PaddleOCR** (Python only): `pip install kreuzberg[paddleocr]`. Pass `paddleocr_kwargs={"use_angle_cls": True}`.
- **Guten** (Node.js only): Built-in OCR backend via `GutenOcrBackend`.
### Language Codes
```python
config = ExtractionConfig(ocr=OcrConfig(language="eng")) # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu")) # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all")) # All installed
```
### Force OCR
```python
config = ExtractionConfig(force_ocr=True) # OCR even if text is extractable
```
## ExtractionResult Fields
| Field | Python | Node.js | Rust | Description |
|-------|--------|---------|------|-------------|
| Text content | `result.content` | `result.content` | `result.content` | Extracted text (str/String) |
| MIME type | `result.mime_type` | `result.mimeType` | `result.mime_type` | Input document MIME type |
| Metadata | `result.metadata` | `result.metadata` | `result.metadata` | Document metadata (dict/object/HashMap) |
| Tables | `result.tables` | `result.tables` | `result.tables` | Extracted tables with cells + markdown |
| Languages | `result.detected_languages` | `result.detectedLanguages` | `result.detected_languages` | Detected languages (if enabled) |
| Chunks | `result.chunks` | `result.chunks` | `result.chunks` | Text chunks (if chunking enabled) |
| Images | `result.images` | `result.images` | `result.images` | Extracted images (if enabled) |
| Elements | `result.elements` | `result.elements` | `result.elements` | Semantic elements (if element_based format) |
| Pages | `result.pages` | `result.pages` | `result.pages` | Per-page content (if page extraction enabled) |
| Keywords | `result.keywords` | `result.keywords` | `result.keywords` | Extracted keywords (if enabled) |
## Error Handling
### Python
```python
from kreuzberg import (
extract_file_sync, KreuzbergError, ParsingError,
OCRError, ValidationError, MissingDependencyError,
)
try:
result = extract_file_sync("file.pdf")
except ParsingError as e:
print(f"Failed to parse: {e}")
except OCRError as e:
print(f"OCR failed: {e}")
except ValidationError as e:
print(f"Invalid input: {e}")
except MissingDependencyError as e:
print(f"Missing dependency: {e}")
except KreuzbergError as e:
print(f"Extraction failed: {e}")
```
### Node.js
```typescript
import {
extractFile, KreuzbergError, ParsingError,
OcrError, ValidationError, MissingDependencyError,
} from '@kreuzberg/node';
try {
const result = await extractFile('file.pdf');
} catch (e) {
if (e instanceof ParsingError) { /* ... */ }
else if (e instanceof OcrError) { /* ... */ }
else if (e instanceof ValidationError) { /* ... */ }
else if (e instanceof KreuzbergError) { /* ... */ }
}
```
### Rust
```rust
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};
let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
Ok(result) => println!("{}", result.content),
Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
Err(e) => eprintln!("Error: {e}"),
}
```
## Common Pitfalls
1. **Python ChunkingConfig fields**: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`.
2. **Rust extract_file signature**: Third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults.
3. **Rust feature gates**: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml.
4. **Rust async context**: `extract_file` is async. Use `#[tokio::main]` or call from an async context.
5. **CLI --format vs --output-format**: `--format` controls CLI output (text/json). `--output-format` controls content format (plain/markdown/djot/html).
6. **Node.js extractFile signature**: `extractFile(path, mimeType?, config?)` — mimeType is the second arg (pass `null` to skip).
7. **Python detect_mime_type**: The function for detecting from bytes is `detect_mime_type(data)`. For paths use `detect_mime_type_from_path(path)`.
8. **Config file field names**: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`).
## Supported Formats (Summary)
| Category | Extensions |
|----------|-----------|
| **PDF** | `.pdf` |
| **Word** | `.docx`, `.odt` |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` |
| **Presentations** | `.pptx`, `.ppt`, `.ppsx` |
| **eBooks** | `.epub`, `.fb2` |
| **Images** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`, `.svg` |
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml` |
| **Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` |
| **Text** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` |
| **Email** | `.eml`, `.msg` |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` |
| **Academic** | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, `.opml`, `.pod`, `.mdoc`, `.troff` |
See [references/supported-formats.md](references/supported-formats.md) for the complete format reference with MIME types.
## Additional Resources
Detailed reference files for specific topics:
- **[Python API Reference](references/python-api.md)** — All functions, config classes, plugin protocols, exact signatures
- **[Node.js API Reference](references/nodejs-api.md)** — All functions, TypeScript interfaces, worker pool APIs
- **[Rust API Reference](references/rust-api.md)** — All functions with feature gates, structs, Cargo.toml examples
- **[CLI Reference](references/cli-reference.md)** — All commands, flags, config precedence, exit codes
- **[Configuration Reference](references/configuration.md)** — TOML/YAML/JSON formats, auto-discovery, env vars, full schema
- **[Supported Formats](references/supported-formats.md)** — All 85+ formats with file extensions and MIME types
- **[Advanced Features](references/advanced-features.md)** — Plugins, embeddings, MCP server, API server, security limits
- **[Other Language Bindings](references/other-bindings.md)** — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker
Full documentation: <https://docs.kreuzberg.dev>
GitHub: <https://github.com/kreuzberg-dev/kreuzberg>Related Skills
wasm-constraints
wasm constraints
test-execution-patterns
test execution patterns
security-limits-dos-protection
security limits dos protection
plugin-architecture-patterns
plugin architecture patterns
ocr-backend-management
ocr uackend management
mime-detection-routing
mime detection routing
format-specific-extraction
format specific extraction
extraction-pipeline-patterns
extraction pipeline patterns
config-loading-precedence
config loading precedence
chunking-embeddings
chunking emueddings
api-server-mcp
api server mcp
registry-implementation
registry implementation