extraction-pipeline-patterns
extraction pipeline patterns
Best use case
extraction-pipeline-patterns is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
extraction pipeline patterns
Teams using extraction-pipeline-patterns should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/extraction-pipeline-patterns/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How extraction-pipeline-patterns Compares
| Feature / Agent | extraction-pipeline-patterns | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
extraction pipeline patterns
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
AI Agents for Marketing
Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.
SKILL.md Source
# Extraction Pipeline Patterns
**Kreuzberg's format detection -> extraction -> fallback orchestration for 75+ file formats**
## Core Pipeline Architecture
The extraction pipeline (`crates/kreuzberg/src/core/pipeline.rs`, `crates/kreuzberg/src/extraction/`) orchestrates:
1. **Format Detection** - MIME type inference + extension validation -> select appropriate extractor
2. **Intelligent Extraction** - Route to format-specific extractors (PDF, DOCX, Excel, HTML, images, archives, etc.)
3. **Fallback Strategies** - Password-protected PDFs, OCR for images, nested archive handling, corrupted file recovery
4. **Post-Processing Pipeline** - Validators, quality processing, chunking, custom hooks (see `core/pipeline.rs`)
## Format Detection Strategy
**Location**: `crates/kreuzberg/src/core/mime.rs`, `crates/kreuzberg/src/core/formats.rs`
Pattern: detect via magic bytes, validate extension alignment (prevent spoofing), route to extractor. Multiple extractors for same format -> choose highest confidence/specificity.
```rust
// Pseudocode: core/mime.rs
match (magic_bytes(content), extension) {
(Some(fmt), Some(ext)) if aligned -> Ok(fmt),
(Some(fmt), Some(ext)) if misaligned -> Err(FormatMismatch),
(Some(fmt), None) -> Ok(fmt), // magic bytes only
(None, Some(ext)) -> Ok(from_extension(ext)),
_ -> Err(UnknownFormat),
}
```
## Extraction Modules (75 Formats)
| Category | Extractors | Key Modules |
|----------|-----------|------------|
| **Office** | DOCX, XLSX, XLSM, XLSB, XLS, PPTX, ODP, ODS | `extraction/{docx,excel,pptx}.rs` |
| **PDF** | Standard + encrypted, password attempts | `pdf/` subdirectory (13 files) |
| **Images** | PNG, JPG, TIFF, WebP, JP2, SVG (OCR-enabled) | `extraction/image.rs` + `ocr/` |
| **Web** | HTML, XHTML, XML, SVG (DOM parsing) | `extraction/html.rs` (67KB - complex table handling) |
| **Email** | EML, MSG (headers, body, attachments, threading) | `extraction/email.rs` |
| **Archives** | ZIP, TAR, GZ, 7Z (recursive extraction) | `extraction/archive.rs` (31KB) |
| **Markdown** | MD, TXT, RST, Org Mode, RTF | `extraction/markdown.rs` |
| **Academic** | LaTeX, BibTeX, JATS, Jupyter, DocBook | `extraction/{structured,xml}.rs` |
## Extraction Dispatcher
```rust
// Pseudocode: extraction/mod.rs
let format = detect_format(source.bytes, source.extension);
let result = match format {
Pdf -> extract_pdf(source, config),
Docx -> extract_docx(source, config),
Image -> extract_image_with_ocr_fallback(source, config),
Archive -> extract_archive_recursive(source, config),
_ -> extract_with_plugin(format, source, config),
};
run_pipeline(result, config) // post-processing always runs
```
## Fallback Strategies
- **Password-Protected PDFs**: Try primary password -> secondary password list -> return `is_encrypted=true` in metadata on failure
- **OCR Fallback**: If image text extraction confidence < threshold, trigger OCR backend; return both results with scores
- **Nested Archives**: Recursive extraction with configurable depth limit; flatten or preserve hierarchy
- **Corrupted File Recovery**: Stream-based parsing, emit content up to error point, include error location in metadata
## Configuration Integration
**Location**: `crates/kreuzberg/src/core/config.rs`, `crates/kreuzberg/src/core/config_validation.rs`
`ExtractionConfig` holds format-specific configs (`pdf`, `image`, `html`, `office`), fallback orchestration (`fallback`), and post-processing (`postprocessor`, `chunking`, `keywords`). See struct definition in `config.rs`.
## Plugin System Integration
**Location**: `crates/kreuzberg/src/plugins/`
- **CustomExtractor**: Override built-in format extractors
- **PostProcessor**: Modify results after extraction (Early/Middle/Late stages)
- **Validator**: Fail-fast validation (e.g., minimum text length)
- **OCRBackend**: Swap OCR engine
Plugin registry loaded at startup, cached for zero-cost lookup.
## Feature Flag Strategy
**Location**: `Cargo.toml` (workspace), `crates/kreuzberg/Cargo.toml`, `FEATURE_MATRIX.md`
20+ features across 9 language bindings. Key feature groups:
| Group | Features | Notes |
|-------|----------|-------|
| OCR | `tesseract` (default), `tesseract-static`, `ocr-minimal` | Mutually exclusive recommendation |
| Formats | `pdf`, `pdf-minimal`, `office`, `office-minimal` | |
| AI/ML | `embeddings` (requires ONNX), `keywords-yake`, `keywords-rake`, `language-detection` | |
| Server | `api` (Axum), `mcp`, `tokio-runtime`, `lite-runtime` | |
| Bindings | `python-bindings`, `ruby-bindings`, `php-bindings`, `node-bindings`, `wasm` | |
Conditional compilation: modules gated with `#[cfg(feature = "...")]`. Runtime `validate_config()` warns if requested feature not compiled in.
### Feature Flag Critical Rules
1. **Never mix conflicting features** - e.g., `ocr-minimal` + `tesseract` should error at compile time
2. **Always provide feature diagnostics** - Config validation must warn if feature unavailable
3. **Default to maximum feature set** - Unless embedded/minimal specifically requested
4. **Test all feature combinations** - Matrix testing in CI catches regressions
5. **WASM incompatible** with embeddings, keywords, OCR
## Critical Rules
1. **Always use format detection** before routing to extractors (prevent confusion attacks)
2. **Stream-based parsing** for PDFs/archives to handle multi-GB files
3. **Post-pipeline is mandatory**: All extraction results flow through `run_pipeline()` for validators/hooks
4. **Plugin overrides are order-dependent**: Plugins registered first take priority
5. **Fallback timeouts**: Set reasonable OCR/archive extraction timeouts (config-driven)
6. **Metadata preservation**: Include format detection confidence, extraction method used, any fallbacks applied
## Related Skills
- **ocr-backend-management** - OCR engine selection and image preprocessing
- **chunking-embeddings** - Post-extraction text splitting with FastEmbed
- **api-server-mcp** - Axum endpoint for extraction pipeline exposure and MCP serverRelated Skills
test-execution-patterns
test execution patterns
plugin-architecture-patterns
plugin architecture patterns
format-specific-extraction
format specific extraction
gil-management-patterns
gil management patterns
table-extraction-and-reconstruction
taule extraction and reconstruction
image-preprocessing-pipeline
image preprocessing pipeline
extraction-quality-testing
extraction quality testing
kreuzberg
Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.
wasm-constraints
wasm constraints
security-limits-dos-protection
security limits dos protection
ocr-backend-management
ocr uackend management
mime-detection-routing
mime detection routing