chunking-embeddings

chunking embeddings

7,385 stars

Best use case

chunking-embeddings is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using chunking-embeddings should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/chunking-embeddings/SKILL.md --create-dirs "https://raw.githubusercontent.com/kreuzberg-dev/kreuzberg/main/.ai-rulez/skills/chunking-embeddings/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/chunking-embeddings/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How chunking-embeddings Compares

Feature / Agent	chunking-embeddings	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

It documents Kreuzberg's text splitting strategies, embedding generation with FastEmbed, and RAG pipeline integration.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Chunking & Embeddings

**Text splitting strategies, embedding generation with FastEmbed, RAG pipeline integration**

## Chunking Architecture Overview

**Location**: `crates/kreuzberg/src/chunking/`, `crates/kreuzberg/src/embeddings.rs`

```text
Extracted Text
    |
[1. Normalization] -> Clean whitespace, remove control chars
    |
[2. Chunk Strategy Selection] -> Fixed-size, semantic, syntax-aware, recursive
    |
[3. Overlap Management] -> Control context window overlap
    |
[4. Optional Embedding] -> Generate vectors with FastEmbed
    |
Output: Vec<Chunk> with text, vectors, metadata
```
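
As an illustration of step 1, here is a minimal normalization sketch. It is not Kreuzberg's implementation: the function name `normalize` and the choice to keep paragraph breaks are assumptions made for the example.

```rust
/// Hypothetical helper (not part of Kreuzberg): collapse runs of whitespace
/// into single spaces and drop control characters, while keeping blank-line
/// paragraph breaks so later splitters can still use them.
fn normalize(text: &str) -> String {
    // Treat Windows line endings like Unix ones before splitting paragraphs.
    let unified = text.replace("\r\n", "\n");
    unified
        .split("\n\n")
        .map(|para| {
            para.chars()
                // Whitespace controls (tab, newline) survive this filter and
                // are collapsed below; other control chars are removed.
                .filter(|c| !c.is_control() || c.is_whitespace())
                .collect::<String>()
                .split_whitespace()
                .collect::<Vec<_>>()
                .join(" ")
        })
        .filter(|para| !para.is_empty())
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Keeping the `\n\n` boundaries matters for the recursive and syntax-aware strategies, which rely on paragraph separators that a more aggressive cleanup would erase.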

## Chunking Strategies

**Location**: `crates/kreuzberg/src/chunking/mod.rs`

| Strategy | Pattern | Best For |
|----------|---------|----------|
| **Fixed-Size** | Sliding window with configurable overlap | Uniform chunks for embedding models with fixed token limits |
| **Semantic** | Split by sentences, merge/split by similarity threshold | Smart context preservation for LLM consumption and semantic search |
| **Syntax-Aware** | Split by paragraph/section/heading/code-block structure | Preserving document structure (sections, code blocks) in RAG |
| **Recursive** (LangChain pattern) | Try separators in order: `\n\n`, `\n`, `" "`, `""` | General-purpose chunking; uses the coarsest separator that keeps each chunk within the size limit |
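
The recursive strategy can be sketched roughly as follows. This is a simplified illustration, not the code in `chunking/mod.rs`: the names `recursive_split` and `hard_cut` are hypothetical, sizes are counted in characters rather than tokens, and overlap is left out to keep the example short.

```rust
/// Hypothetical helper: cut text into pieces of at most `chunk_size`
/// characters, ignoring word boundaries.
fn hard_cut(text: &str, chunk_size: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(chunk_size)
        .map(|c| c.iter().collect::<String>())
        .collect()
}

/// Split on the first separator, greedily merge small pieces back together,
/// and recurse with the finer separators on any piece that is still too big.
fn recursive_split(text: &str, separators: &[&str], chunk_size: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    if text.is_empty() {
        return Vec::new();
    }
    if text.chars().count() <= chunk_size {
        return vec![text.to_string()];
    }
    // An empty separator (or none left) means: fall back to a hard cut.
    let (sep, finer) = match separators.split_first() {
        Some((sep, finer)) if !sep.is_empty() => (*sep, finer),
        _ => return hard_cut(text, chunk_size),
    };

    let mut chunks = Vec::new();
    let mut current = String::new();
    for piece in text.split(sep).filter(|p| !p.is_empty()) {
        let piece_len = piece.chars().count();
        if piece_len > chunk_size {
            // Flush what we have, then split the oversized piece more finely.
            if !current.is_empty() {
                chunks.push(std::mem::take(&mut current));
            }
            chunks.extend(recursive_split(piece, finer, chunk_size));
            continue;
        }
        let joined_len = current.chars().count() + sep.chars().count() + piece_len;
        if !current.is_empty() && joined_len > chunk_size {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push_str(sep);
        }
        current.push_str(piece);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

With a 12-character limit, `recursive_split("alpha beta\n\ngamma delta epsilon\nzeta", &["\n\n", "\n", " ", ""], 12)` keeps the first paragraph whole and only falls back to splitting on spaces inside the line that is too long.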

Key config fields per strategy (see struct definitions in `chunking/mod.rs`):

- Fixed-Size: `chunk_size`, `overlap`, `trim_whitespace`
- Semantic: `target_chunk_size`, `min/max_chunk_size`, `semantic_threshold`, `use_sentence_boundaries`
- Syntax-Aware: `chunk_by` (Paragraph/Section/Heading/Sentence/CodeBlock), `max_chunk_size`, `respect_code_blocks`
- Recursive: `separators[]`, `chunk_size`, `overlap`
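
To make the Fixed-Size fields concrete, here is a small sliding-window sketch. The struct `FixedSizeConfig` and the function `fixed_size_chunks` are hypothetical stand-ins that only reuse the field names listed above, and the example counts characters where the real presets count tokens.

```rust
/// Hypothetical config mirroring the Fixed-Size field names listed above.
struct FixedSizeConfig {
    chunk_size: usize,     // window length (characters in this sketch)
    overlap: usize,        // characters shared by neighbouring windows
    trim_whitespace: bool, // trim each window before emitting it
}

/// Slide a window of `chunk_size` over the text, advancing by
/// `chunk_size - overlap` each step.
fn fixed_size_chunks(text: &str, cfg: &FixedSizeConfig) -> Vec<String> {
    assert!(cfg.chunk_size > 0, "chunk_size must be positive");
    assert!(cfg.overlap < cfg.chunk_size, "overlap must be smaller than chunk_size");

    let chars: Vec<char> = text.chars().collect();
    let step = cfg.chunk_size - cfg.overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + cfg.chunk_size).min(chars.len());
        let window: String = chars[start..end].iter().collect();
        let window = if cfg.trim_whitespace {
            window.trim().to_string()
        } else {
            window
        };
        if !window.is_empty() {
            chunks.push(window);
        }
        if end == chars.len() {
            break; // the last window already reached the end of the text
        }
        start += step;
    }
    chunks
}
```

Requiring `overlap < chunk_size` guarantees the window always moves forward; without that check a misconfigured overlap would loop forever.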

## Chunking Configuration Presets

**Location**: `crates/kreuzberg/src/chunking/mod.rs`

| Preset | Chunk Size | Overlap | Strategy | Use Case |
|--------|-----------|---------|----------|----------|
| **Balanced** | 512 tokens | 50 | Semantic | RAG sweet spot |
| **Compact** | 256 tokens | 32 | Fixed-Size | Dense vectors |
| **Extended** | 1024 tokens | 100 | Recursive | Full context |
| **Minimal** | 128 tokens | 16 | (default) | Lightweight embeddings |

Usage: set `config.chunking.preset = Some("balanced")` in `ExtractionConfig`.
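
The preset table can be read as a simple name-to-parameters lookup. The sketch below is only an illustration of that mapping: `PresetParams`, `preset_params`, and `overlap_ratio` are hypothetical names, and the real resolution logic in `chunking/mod.rs` may differ.

```rust
/// Hypothetical container for the two numeric columns of the preset table.
#[derive(Debug, PartialEq)]
struct PresetParams {
    chunk_size: usize, // tokens
    overlap: usize,    // assumed to be in the same unit as chunk_size
}

/// Resolve a preset name (case-insensitive) to the values from the table.
fn preset_params(name: &str) -> Option<PresetParams> {
    let (chunk_size, overlap) = match name.to_ascii_lowercase().as_str() {
        "balanced" => (512, 50),
        "compact" => (256, 32),
        "extended" => (1024, 100),
        "minimal" => (128, 16),
        _ => return None, // unknown preset: let the caller decide what to do
    };
    Some(PresetParams { chunk_size, overlap })
}

/// Fraction of each chunk that is repeated in its neighbour.
fn overlap_ratio(p: &PresetParams) -> f64 {
    p.overlap as f64 / p.chunk_size as f64
}
```

Returning `Option` rather than silently falling back to a default makes a misspelled preset name visible to the caller.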

## Embedding Generation with FastEmbed

**Location**: `crates/kreuzberg/src/embeddings.rs`

### Model Selection

| Model | Dimensions | Notes |
|-------|-----------|-------|
| `BAAI/bge-small-en-v1.5` (default) | 384 | Fast, excellent for RAG |
| `BAAI/bge-small-zh-v1.5` | 384 | Chinese optimized |
| `BAAI/bge-base-en-v1.5` | 768 | Better quality, slower |
| `jinaai/jina-embeddings-v2-base-en` | 768 | Long context (up to 8192 tokens) |
| `Custom(path)` | varies | Custom ONNX model path |

### Embedding Pattern

`TextEmbeddingManager` provides singleton-cached models per config. Pattern:

1. `get_or_init_model()` -- lazy-loads ONNX model (downloads if needed), caches in `Arc<RwLock<HashMap>>`
2. `embed_chunks()` -- collects chunk texts, calls `model.embed(texts, batch_size)`, zips results back to `ChunkWithEmbedding`

Default config: `batch_size=256`, `device=CPU`, `parallel_requests=4`.
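
The lazy-load-and-cache idea behind `get_or_init_model()` can be sketched with the standard library alone. Everything here is a stand-in: `DummyModel` replaces the real ONNX model, and `ModelCache` with its `loads` counter is a hypothetical type used only to show that the model is initialized once per key.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};

/// Stand-in for a loaded embedding model (hypothetical, not FastEmbed's type).
struct DummyModel {
    name: String,
}

/// Hypothetical cache: one shared model instance per model name.
struct ModelCache {
    models: Arc<RwLock<HashMap<String, Arc<DummyModel>>>>,
    loads: AtomicUsize, // counts how often the expensive load ran
}

impl ModelCache {
    fn new() -> Self {
        ModelCache {
            models: Arc::new(RwLock::new(HashMap::new())),
            loads: AtomicUsize::new(0),
        }
    }

    fn get_or_init(&self, name: &str) -> Arc<DummyModel> {
        // Fast path: a shared read lock is enough when the model is cached.
        {
            let map = self.models.read().unwrap();
            if let Some(model) = map.get(name) {
                return Arc::clone(model);
            }
        }
        // Slow path: take the write lock. `entry` re-checks the key, so a
        // thread that lost the race does not load the model a second time.
        let mut map = self.models.write().unwrap();
        let model = map.entry(name.to_string()).or_insert_with(|| {
            self.loads.fetch_add(1, Ordering::SeqCst);
            Arc::new(DummyModel { name: name.to_string() })
        });
        Arc::clone(model)
    }
}
```

The read-then-write structure keeps the common case (model already loaded) cheap, since many readers can hold the lock at once.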

### ONNX Runtime Requirement

Embeddings require ONNX Runtime. Feature-gated via:

```toml
[features]
embeddings = ["dep:fastembed", "dep:ort"]
```

Install: `brew install onnxruntime` (macOS) / `apt install libonnxruntime libonnxruntime-dev` (Linux). Verify: `echo $ORT_DYLIB_PATH`.

## RAG Integration Pattern

The full extraction-to-RAG pipeline:

1. **Extract**: `extract_file(path, config)` -> `ExtractionResult`
2. **Chunk**: Apply preset strategy to `result.content` -> `Vec<Chunk>`
3. **Embed**: If embedding config present, `TextEmbeddingManager::embed_chunks()` -> `Vec<ChunkWithEmbedding>`
4. **Output**: `RagDocument { file_path, metadata, chunks }` ready for vector DB ingestion

See `ChunkWithEmbedding` struct in `types.rs`: contains `text`, `embedding: Vec<f32>`, `dimensions`, `norm`, `metadata`.

## Critical Rules

1. **Chunking is preprocessing** - Always chunk before embedding so every input fits within the embedding model's token limit
2. **Overlap prevents information loss** - Set overlap to 15-20% of chunk size
3. **Embedding models are stateful** - Lazy load and cache to avoid repeated initialization
4. **ONNX Runtime is required** - Gracefully degrade if not available (skip embeddings)
5. **Batch embedding for performance** - Never embed single chunks; batch 50-1000 chunks
6. **Normalize embeddings for search** - Use L2 norm for cosine similarity
7. **Cache embedding results** - Don't re-embed identical text chunks
8. **Model selection impacts quality** - bge-small (384) for speed, bge-base (768) for quality
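
Rule 6 can be illustrated with a few lines of plain Rust. The helper names `l2_norm`, `l2_normalize`, and `dot` are made up for this sketch; the point is that once vectors are scaled to unit length, their dot product equals their cosine similarity.

```rust
/// Euclidean (L2) length of a vector.
fn l2_norm(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// Scale a vector to unit length; a zero vector is returned unchanged.
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = l2_norm(v);
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Dot product; for unit-length inputs this is the cosine similarity.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Normalizing once at indexing time means each query only needs a dot product, which is what most vector stores optimize for.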

## Related Skills

- **extraction-pipeline-patterns** - Text extraction preceding chunking
- **api-server-mcp** - Endpoint for chunking + embedding operations
- **ocr-backend-management** - OCR text quality affects chunking success

Related Skills

kreuzberg

7385

from kreuzberg-dev/kreuzberg

Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.