format-specific-extraction

format specific extraction

7,385 stars

Best use case

format-specific-extraction is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using format-specific-extraction should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/format-specific-extraction/SKILL.md --create-dirs "https://raw.githubusercontent.com/kreuzberg-dev/kreuzberg/main/.ai-rulez/skills/format-specific-extraction/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/format-specific-extraction/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How format-specific-extraction Compares

Feature / Agent	format-specific-extraction	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

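It documents the format-specific extraction workflows used in Kreuzberg: Office XML, PDF, archives, structured text, and email, plus a checklist for adding a new format.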

Where can I find the source code?

The source is in the kreuzberg-dev/kreuzberg repository on GitHub, at .ai-rulez/skills/format-specific-extraction/SKILL.md.

SKILL.md Source

priority: high

# Format-Specific Extraction Workflows

## Office XML (DOCX/PPTX/ODT)

```text
ZIP archive → Security validation → XML parsing → Text + tables + metadata
```

1. `ZipBombValidator::new(limits).validate(&mut archive)?`
2. Extract XML files from archive (`word/document.xml`, `ppt/slides/*.xml`, `content.xml`)
3. Parse with `quick_xml::Reader` (streaming) + `DepthValidator` + `StringGrowthValidator`
4. Extract metadata via `crate::extraction::office_metadata::extract_metadata()`
5. See: `extractors/docx.rs`, `extractors/pptx.rs`, `extractors/odt.rs`
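
As a rough illustration of step 3, the sketch below enforces a nesting limit while collecting the text of `<w:t>` runs. It uses only the standard library and a naive tag scanner in place of `quick_xml::Reader`. The names `extract_runs` and `XmlGuardError` and the limit values are hypothetical and do not reflect Kreuzberg's actual `DepthValidator` API.

```rust
/// Hypothetical error type for this sketch (not Kreuzberg's).
#[derive(Debug, PartialEq)]
enum XmlGuardError {
    /// Element nesting exceeded the configured limit.
    TooDeep { limit: usize },
    /// A `<` had no closing `>`, or a close tag had no matching open tag.
    Malformed,
}

/// Collects the text of `<w:t>` runs while enforcing a nesting limit.
/// Naive scanner: no entity decoding, no CDATA, no `>` inside attributes.
fn extract_runs(xml: &str, max_depth: usize) -> Result<String, XmlGuardError> {
    let mut depth = 0usize;
    let mut in_text_run = false;
    let mut out = String::new();
    let mut rest = xml;

    while let Some(open) = rest.find('<') {
        // Characters before the next tag belong to the current element.
        if in_text_run {
            out.push_str(&rest[..open]);
        }
        let close = open + rest[open..].find('>').ok_or(XmlGuardError::Malformed)?;
        let tag = &rest[open + 1..close];
        rest = &rest[close + 1..];

        // Declarations and comments do not change the nesting depth.
        if tag.starts_with('?') || tag.starts_with('!') {
            continue;
        }
        if let Some(name) = tag.strip_prefix('/') {
            depth = depth.checked_sub(1).ok_or(XmlGuardError::Malformed)?;
            if name.trim() == "w:t" {
                in_text_run = false;
            }
        } else if !tag.ends_with('/') {
            // Opening tag: fail fast once the limit is exceeded.
            depth += 1;
            if depth > max_depth {
                return Err(XmlGuardError::TooDeep { limit: max_depth });
            }
            if tag.split_whitespace().next() == Some("w:t") {
                in_text_run = true;
            }
        }
    }
    Ok(out)
}
```

Checking the depth as each opening tag is seen, rather than after building a tree, means a hostile document is rejected before it can consume much memory.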

## PDF

```text
Bytes → pdfium-render → Per-page text + OCR fallback → Tables → Metadata
```

1. `pdfium.create_document_from_bytes(content, None)?`
2. Check whether OCR is needed: `config.force_ocr || !has_searchable_text()`
3. Extract text per page, and tables if `config.pages` is enabled
4. Feature-gated: `#[cfg(feature = "pdf")]`
5. See: `extractors/pdf/mod.rs`
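
The OCR decision in step 2 can be sketched in isolation. Only the expression `config.force_ocr || !has_searchable_text()` comes from the steps above. The `OcrConfig` struct, the slice-of-strings input, and the "any non-whitespace character" heuristic are assumptions made for this example.

```rust
/// Hypothetical subset of the extraction config.
struct OcrConfig {
    force_ocr: bool,
}

/// Assumed heuristic: a document is searchable if any page
/// yields at least one non-whitespace character.
fn has_searchable_text(page_texts: &[String]) -> bool {
    page_texts
        .iter()
        .any(|page| page.chars().any(|c| !c.is_whitespace()))
}

/// Mirrors `config.force_ocr || !has_searchable_text()`.
fn needs_ocr(config: &OcrConfig, page_texts: &[String]) -> bool {
    config.force_ocr || !has_searchable_text(page_texts)
}
```

A scanned PDF typically yields empty or whitespace-only page text, so it falls through to OCR even when `force_ocr` is off.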

## Archives (ZIP/TAR/7z/GZIP)

```text
Validate → Extract metadata → Extract plaintext files only
```

1. Run `ZipBombValidator` BEFORE any extraction
2. Extract metadata (file list, sizes)
3. Extract text content from plaintext files
4. Use `build_archive_result()` helper
5. See: `extractors/archive.rs`, `extraction/archive/*.rs`
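
To make the "validate before extracting" rule concrete, here is a minimal sketch of the kind of checks a zip-bomb validator might run on an archive's declared entry sizes. `ArchiveLimits`, `EntryInfo`, and `validate_entries` are hypothetical names, and the actual `ZipBombValidator` may apply different limits and checks.

```rust
/// Hypothetical limits; the real `ZipBombValidator` limits may differ.
struct ArchiveLimits {
    max_entries: usize,
    max_total_uncompressed: u64,
    max_ratio: u64,
}

/// Sizes as declared in the archive directory, read before extraction.
struct EntryInfo {
    compressed: u64,
    uncompressed: u64,
}

/// Rejects archives with too many entries, a suspicious per-entry
/// compression ratio, or too large a total uncompressed size.
fn validate_entries(entries: &[EntryInfo], limits: &ArchiveLimits) -> Result<(), String> {
    if entries.len() > limits.max_entries {
        return Err(format!("too many entries: {}", entries.len()));
    }
    let mut total: u64 = 0;
    for (index, entry) in entries.iter().enumerate() {
        // Treat a zero compressed size as 1 to avoid dividing by zero.
        let ratio = entry.uncompressed / entry.compressed.max(1);
        if ratio > limits.max_ratio {
            return Err(format!("entry {index} expands {ratio}x"));
        }
        // Saturate instead of overflowing on absurd declared sizes.
        total = total.saturating_add(entry.uncompressed);
        if total > limits.max_total_uncompressed {
            return Err(format!(
                "total uncompressed size exceeds {} bytes",
                limits.max_total_uncompressed
            ));
        }
    }
    Ok(())
}
```

Declared sizes can be forged, so a check like this is a first gate only. Counting bytes while decompressing is still needed to enforce the same limits on the real data.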

## Structured Text (JSON/YAML/TOML/XML)

```text
Detect format from MIME → Parse → Pretty-print → Metadata
```

A single `StructuredExtractor` handles multiple MIME types: it parses the input with the format-specific library, then pretty-prints the result to text.
See: `extractors/structured.rs`
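
The "detect format from MIME" step might look like the sketch below. The `StructuredFormat` enum, the `format_from_mime` function, and the exact MIME strings matched are assumptions for illustration. The set that `StructuredExtractor` actually accepts is defined in the source file named above.

```rust
/// Hypothetical format tag for a single multi-format extractor.
#[derive(Debug, PartialEq, Clone, Copy)]
enum StructuredFormat {
    Json,
    Yaml,
    Toml,
    Xml,
}

/// Maps a MIME type to a parser choice, ignoring parameters and case.
fn format_from_mime(mime: &str) -> Option<StructuredFormat> {
    // Drop parameters such as "; charset=utf-8" and normalise case.
    let essence = mime
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    match essence.as_str() {
        "application/json" => Some(StructuredFormat::Json),
        "application/yaml" | "application/x-yaml" | "text/yaml" => Some(StructuredFormat::Yaml),
        "application/toml" => Some(StructuredFormat::Toml),
        "application/xml" | "text/xml" => Some(StructuredFormat::Xml),
        _ => None,
    }
}
```

Returning `Option` keeps the "unsupported MIME type" case explicit, so the caller can fall back to another extractor instead of guessing a parser.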

## Email (EML/MSG)

```text
Parse headers → Extract body (text/html) → Process attachments
```

See: `extraction/email.rs`, `extractors/email.rs`
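
The first two stages of the email flow can be sketched with the standard library alone: split the raw message at the first blank line, then unfold and parse the header lines. `split_message` is a hypothetical helper for this example. It ignores MIME multipart bodies, encoded words, and attachments, all of which a real EML/MSG extractor has to handle.

```rust
/// Splits a raw message into parsed headers and the untouched body.
fn split_message(raw: &str) -> (Vec<(String, String)>, &str) {
    // Headers end at the first blank line (CRLF or bare LF).
    let (head, body) = match raw.find("\r\n\r\n") {
        Some(i) => (&raw[..i], &raw[i + 4..]),
        None => match raw.find("\n\n") {
            Some(i) => (&raw[..i], &raw[i + 2..]),
            None => (raw, ""),
        },
    };

    let mut headers: Vec<(String, String)> = Vec::new();
    for line in head.lines() {
        if line.starts_with(' ') || line.starts_with('\t') {
            // Folded continuation line: append to the previous value.
            if let Some((_, value)) = headers.last_mut() {
                value.push(' ');
                value.push_str(line.trim());
            }
        } else if let Some((name, value)) = line.split_once(':') {
            headers.push((name.trim().to_string(), value.trim().to_string()));
        }
    }
    (headers, body)
}
```

Headers are kept as an ordered list rather than a map because a message may repeat a header name, for example `Received`.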

## Common Helpers

| Helper | Location | Purpose |
|--------|----------|---------|
| `office_metadata::extract_metadata()` | `extraction/office.rs` | Office XML metadata |
| `cells_to_markdown()` | `extraction/mod.rs` | Convert cell grid to GFM table |
| `build_archive_result()` | `extraction/archive/mod.rs` | Standard archive result |
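
For readers unfamiliar with the cell-grid-to-GFM conversion, a minimal sketch follows. It is not the real `cells_to_markdown()`: the signature, the treatment of the first row as the header, and the padding and escaping rules are assumptions for this example.

```rust
/// Renders one table row, padding short rows to `width` cells.
fn render_row(row: &[String], width: usize) -> String {
    let mut line = String::from("|");
    for i in 0..width {
        let raw = row.get(i).map(String::as_str).unwrap_or("");
        // Escape pipes and flatten newlines so the cell stays on one line.
        let cell = raw.replace('|', "\\|").replace('\n', " ");
        line.push(' ');
        line.push_str(cell.trim());
        line.push_str(" |");
    }
    line.push('\n');
    line
}

/// Hypothetical stand-in for `cells_to_markdown()`; row 0 is the header.
fn cells_to_markdown_sketch(cells: &[Vec<String>]) -> String {
    let width = cells.iter().map(Vec::len).max().unwrap_or(0);
    if width == 0 {
        return String::new();
    }
    let mut out = render_row(&cells[0], width);
    // GFM requires a delimiter row directly under the header.
    out.push('|');
    out.push_str(&" --- |".repeat(width));
    out.push('\n');
    for row in &cells[1..] {
        out.push_str(&render_row(row, width));
    }
    out
}
```

Padding every row to the widest row keeps the column count constant, which GFM renderers need in order to treat the block as a table.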

## Adding a New Format

1. Add MIME type to `EXT_TO_MIME` in `core/mime.rs`
2. Create extractor implementing `DocumentExtractor` trait
3. Set `supported_mime_types()` and `priority()` (default: 50)
4. Register in `extractors/mod.rs` → `register_default_extractors()`
5. Feature-gate if optional: `#[cfg(feature = "my-format")]`
6. Apply security validators for user content
7. Add tests with fixture files
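
Steps 2 to 4 can be pictured with a reduced model of the trait and registry. Only the names `DocumentExtractor`, `supported_mime_types()`, `priority()` (default 50), and `register_default_extractors()` come from the checklist. The signatures, the `Registry` type, the two example extractors, and the rule that the highest priority wins are assumptions for this sketch.

```rust
/// Simplified, hypothetical shape of the trait; the real one has more methods.
trait DocumentExtractor {
    fn name(&self) -> &'static str;
    fn supported_mime_types(&self) -> &'static [&'static str];
    /// Default priority, as stated in the checklist.
    fn priority(&self) -> i32 {
        50
    }
}

/// Example extractor that keeps the default priority.
struct MyFormatExtractor;

impl DocumentExtractor for MyFormatExtractor {
    fn name(&self) -> &'static str {
        "my-format"
    }
    fn supported_mime_types(&self) -> &'static [&'static str] {
        &["application/x-my-format"]
    }
}

/// Example extractor that overrides the priority for the same MIME type.
struct TunedExtractor;

impl DocumentExtractor for TunedExtractor {
    fn name(&self) -> &'static str {
        "my-format-tuned"
    }
    fn supported_mime_types(&self) -> &'static [&'static str] {
        &["application/x-my-format"]
    }
    fn priority(&self) -> i32 {
        60
    }
}

#[derive(Default)]
struct Registry {
    extractors: Vec<Box<dyn DocumentExtractor>>,
}

impl Registry {
    fn register(&mut self, extractor: Box<dyn DocumentExtractor>) {
        self.extractors.push(extractor);
    }

    /// Assumption: the highest priority wins when several extractors match.
    fn find(&self, mime: &str) -> Option<&dyn DocumentExtractor> {
        self.extractors
            .iter()
            .filter(|e| e.supported_mime_types().iter().any(|m| *m == mime))
            .max_by_key(|e| e.priority())
            .map(|e| &**e)
    }
}

/// Mirrors step 4: one function registers every built-in extractor.
/// An optional format would be wrapped in a `cfg(feature = ...)` gate here.
fn register_default_extractors(registry: &mut Registry) {
    registry.register(Box::new(MyFormatExtractor));
    registry.register(Box::new(TunedExtractor));
}
```

With both extractors registered, a lookup for `application/x-my-format` resolves to `TunedExtractor`, because its priority of 60 outranks the default of 50.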

Related Skills

extraction-pipeline-patterns

7385

from kreuzberg-dev/kreuzberg

extraction pipeline patterns

table-extraction-and-reconstruction

7385

from kreuzberg-dev/kreuzberg

table extraction and reconstruction

extraction-quality-testing

7385

from kreuzberg-dev/kreuzberg

extraction quality testing

kreuzberg

7385

from kreuzberg-dev/kreuzberg

Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.