Kreuzberg Skill

The **Kreuzberg** skill provides a high-performance, polyglot document intelligence interface for extracting text, metadata, and structured information from 97+ file formats.

by codata

Best use case

Kreuzberg Skill is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using Kreuzberg Skill can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/kreuzberg/SKILL.md --create-dirs "https://raw.githubusercontent.com/codata/croissant-toolkit/main/.gemini/skills/kreuzberg/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/kreuzberg/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How Kreuzberg Skill Compares

| Feature / Agent | Kreuzberg Skill | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

The **Kreuzberg** skill provides a high-performance, polyglot document intelligence interface for extracting text, metadata, and structured information from 97+ file formats.

Where can I find the source code?

The source code is on GitHub in the codata/croissant-toolkit repository, under .gemini/skills/kreuzberg/.

SKILL.md Source

# Kreuzberg Skill

The **Kreuzberg** skill provides a high-performance, polyglot document intelligence interface for extracting text, metadata, and structured information from 97+ file formats.

## Features
- **Wide Format Support**: PDF, Word, Excel, PowerPoint, academic formats, and 305 programming languages.
- **Structured Output**: Extracts text as GFM (GitHub Flavored Markdown) with support for tables and code blocks.
- **Rust-Powered**: Extreme performance with a Rust core and native PDFium.
- **Intelligence**: Integrated Tree-sitter for code analysis and OCR support for images/PDFs.

## Tools and Scripts

### 1. `extract.py`
A comprehensive extraction tool for processing single files or directories.

**Usage:**
```bash
python3 .gemini/skills/kreuzberg/scripts/extract.py <INPUT_PATH> [OUTPUT_FILE]
```

**Options:**
- `<INPUT_PATH>`: Path to a file or directory.
- `[OUTPUT_FILE]`: (Optional) path to save the extracted Markdown. If omitted, results go to `data/extrated/`.
- `--json`: (Internal) Output full metadata including Croissant JSON-LD.
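
The same invocation can be assembled from Python when the extraction is one step of a larger script. This is a minimal sketch, not part of the skill: the helper names `build_extract_command` and `run_extract` are hypothetical, and placing `--json` after the positional arguments is an assumption about how `extract.py` parses its command line.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional

# Script location as given in the usage line above.
EXTRACT_SCRIPT = Path(".gemini/skills/kreuzberg/scripts/extract.py")


def build_extract_command(
    input_path: str,
    output_file: Optional[str] = None,
    as_json: bool = False,
) -> List[str]:
    """Assemble the argument list for extract.py (hypothetical helper)."""
    cmd = [sys.executable, str(EXTRACT_SCRIPT), input_path]
    if output_file is not None:
        cmd.append(output_file)
    if as_json:
        # Assumption: the flag is accepted after the positional arguments.
        cmd.append("--json")
    return cmd


def run_extract(input_path: str, output_file: Optional[str] = None) -> int:
    """Run extract.py and return its exit code without raising on failure."""
    cmd = build_extract_command(input_path, output_file)
    return subprocess.run(cmd, check=False).returncode
```

Keeping command construction separate from execution makes it possible to inspect or log the exact arguments before anything runs.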

## Examples

### Extracting a PDF to Markdown
```bash
python3 .gemini/skills/kreuzberg/scripts/extract.py research_paper.pdf
```

### Processing a DOCX file
```bash
python3 .gemini/skills/kreuzberg/scripts/extract.py legal_contract.docx
```

### Code Intelligence (Python File)
```bash
python3 .gemini/skills/kreuzberg/scripts/extract.py scripts/core.py
```
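
When several files should each get an explicitly named output instead of the default location, the input-to-output mapping can be planned up front. The sketch below is illustrative only: `plan_outputs` is a hypothetical helper, the `out` directory name is an assumption, and the loop prints the commands without executing them.

```python
from pathlib import Path
from typing import Dict, Iterable


def plan_outputs(inputs: Iterable[str], out_dir: str) -> Dict[str, str]:
    """Map each input file to a Markdown output path (hypothetical helper).

    The full input filename is kept and ".md" is appended, so that
    report.pdf and report.docx do not collide on the same output name.
    """
    out = Path(out_dir)
    return {src: (out / (Path(src).name + ".md")).as_posix() for src in inputs}


if __name__ == "__main__":
    plan = plan_outputs(["research_paper.pdf", "legal_contract.docx"], "out")
    for src, dst in plan.items():
        # Print the command line for each file; nothing is executed here.
        print("python3 .gemini/skills/kreuzberg/scripts/extract.py", src, dst)
```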

## Integration with Croissant
The skill automatically generates a Croissant-compatible `FileObject` for the extracted content, ensuring it can be seamlessly integrated into data pipelines.
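
To give a sense of what such a record might contain, here is a minimal sketch that builds a `FileObject`-style dictionary for an extracted Markdown file. The property names follow the Croissant 1.0 format; the helper `file_object_for`, the default `text/markdown` encoding, and the use of the filename as `@id` are assumptions, and the skill's actual output may differ.

```python
import hashlib
import json
from pathlib import Path


def file_object_for(path: str, encoding_format: str = "text/markdown") -> dict:
    """Build a Croissant-style FileObject dict for a local file (illustrative)."""
    p = Path(path)
    # Croissant records a SHA-256 checksum so consumers can verify the file.
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    return {
        "@type": "cr:FileObject",
        "@id": p.name,  # assumption: filename doubles as the identifier
        "name": p.name,
        "contentUrl": p.as_posix(),
        "encodingFormat": encoding_format,
        "sha256": digest,
    }


def to_json_ld(file_object: dict) -> str:
    """Serialize the record as indented JSON for inspection."""
    return json.dumps(file_object, indent=2)
```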

Related Skills

walker

from codata/croissant-toolkit

Deep crawl functionality that extracts and visits internal links from a webpage.

orchestrator_expert

from codata/croissant-toolkit

Orchestrator agent that has comprehensive knowledge and command over all available skills in this toolkit to create complex workflows.

neo4j_expert

from codata/croissant-toolkit

Store and query Croissant datasets in a Neo4j Graph Database for relational discovery and semantic search.

youtuber

from codata/croissant-toolkit

Search for videos on YouTube based on specific keywords. Get list of videos with title, description, and URL.

wizard

from codata/croissant-toolkit

The ultimate data integrator. Orchestrates transcription, translation, NLP analysis, and Croissant serialization into a single automated pipeline.

unf

from codata/croissant-toolkit

Universal Numeric Fingerprint (UNF) generator. For strings, it splits into words and sorts them alphabetically to provide order-invariant fingerprints. Supports dataframes and files too.

translator

from codata/croissant-toolkit

Recognize the language of input content or video scripts and translate them precisely into English using Gemini 3.

transcriber

from codata/croissant-toolkit

Fetch and store transcripts from YouTube videos for deep content analysis.

telegram_expert

from codata/croissant-toolkit

Send results and notifications to Telegram channels or users.

rohub

from codata/croissant-toolkit

Deposit research objects and add semantic annotations to the RO-Hub portal using the rohub library.

ro-crate-expert

from codata/croissant-toolkit

Specialized in creating RO-Crate packages from Dataverse metadata, with integrated ODRL-based DID (Decentralized Identifier) attribution and provenance via the ro-crate-py library.

📊 Presentation Expert Skill

from codata/croissant-toolkit

The **Presentation Expert** is responsible for transforming complex research data, metadata, and insights into high-impact presentation decks.