Best use case
chunking-embeddings is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
chunking emueddings
Teams using chunking-embeddings should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/chunking-embeddings/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How chunking-embeddings Compares
| Feature / Agent | chunking-embeddings | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
chunking emueddings
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
AI Agents for Marketing
Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.
SKILL.md Source
# Chunking & Embeddings
**Text splitting strategies, embedding generation with FastEmbed, RAG pipeline integration**
## Chunking Architecture Overview
**Location**: `crates/kreuzberg/src/chunking/`, `crates/kreuzberg/src/embeddings.rs`
```text
Extracted Text
|
[1. Normalization] -> Clean whitespace, remove control chars
|
[2. Chunk Strategy Selection] -> Fixed-size, semantic, syntax-aware, recursive
|
[3. Overlap Management] -> Control context window overlap
|
[4. Optional Embedding] -> Generate vectors with FastEmbed
|
Output: Vec<Chunk> with text, vectors, metadata
```
## Chunking Strategies
**Location**: `crates/kreuzberg/src/chunking/mod.rs`
| Strategy | Pattern | Best For |
|----------|---------|----------|
| **Fixed-Size** | Sliding window with configurable overlap | Uniform chunks for embedding models with fixed token limits |
| **Semantic** | Split by sentences, merge/split by similarity threshold | Smart context preservation for LLM consumption and semantic search |
| **Syntax-Aware** | Split by paragraph/section/heading/code-block structure | Preserving document structure (sections, code blocks) in RAG |
| **Recursive** (LangChain pattern) | Try separators in order: `\n\n`, `\n`, ``,`` | Best general-purpose chunking; auto-finds optimal split points |
Key config fields per strategy (see struct definitions in `chunking/mod.rs`):
- Fixed-Size: `chunk_size`, `overlap`, `trim_whitespace`
- Semantic: `target_chunk_size`, `min/max_chunk_size`, `semantic_threshold`, `use_sentence_boundaries`
- Syntax-Aware: `chunk_by` (Paragraph/Section/Heading/Sentence/CodeBlock), `max_chunk_size`, `respect_code_blocks`
- Recursive: `separators[]`, `chunk_size`, `overlap`
## Chunking Configuration Presets
**Location**: `crates/kreuzberg/src/chunking/mod.rs`
| Preset | Chunk Size | Overlap | Strategy | Use Case |
|--------|-----------|---------|----------|----------|
| **Balanced** | 512 tokens | 50 | Semantic | RAG sweet spot |
| **Compact** | 256 tokens | 32 | Fixed-Size | Dense vectors |
| **Extended** | 1024 tokens | 100 | Recursive | Full context |
| **Minimal** | 128 tokens | 16 | (default) | Lightweight embeddings |
Usage: set `config.chunking.preset = Some("balanced")` in `ExtractionConfig`.
## Embedding Generation with FastEmbed
**Location**: `crates/kreuzberg/src/embeddings.rs`
### Model Selection
| Model | Dimensions | Notes |
|-------|-----------|-------|
| `BAAI/bge-small-en-v1.5` (default) | 384 | Fast, excellent for RAG |
| `BAAI/bge-small-zh-v1.5` | 384 | Chinese optimized |
| `BAAI/bge-base-en-v1.5` | 768 | Better quality, slower |
| `jinaai/jina-embeddings-v2-base-en` | 768 | Long context (up to 8192 tokens) |
| `Custom(path)` | varies | Custom ONNX model path |
### Embedding Pattern
`TextEmbeddingManager` provides singleton-cached models per config. Pattern:
1. `get_or_init_model()` -- lazy-loads ONNX model (downloads if needed), caches in `Arc<RwLock<HashMap>>`
2. `embed_chunks()` -- collects chunk texts, calls `model.embed(texts, batch_size)`, zips results back to `ChunkWithEmbedding`
Default config: `batch_size=256`, `device=CPU`, `parallel_requests=4`.
### ONNX Runtime Requirement
Embeddings require ONNX Runtime. Feature-gated via:
```toml
[features]
embeddings = ["dep:fastembed", "dep:ort"]
```
Install: `brew install onnxruntime` (macOS) / `apt install libonnxruntime libonnxruntime-dev` (Linux). Verify: `echo $ORT_DYLIB_PATH`.
## RAG Integration Pattern
The full extraction-to-RAG pipeline:
1. **Extract**: `extract_file(path, config)` -> `ExtractionResult`
2. **Chunk**: Apply preset strategy to `result.content` -> `Vec<Chunk>`
3. **Embed**: If embedding config present, `TextEmbeddingManager::embed_chunks()` -> `Vec<ChunkWithEmbedding>`
4. **Output**: `RagDocument { file_path, metadata, chunks }` ready for vector DB ingestion
See `ChunkWithEmbedding` struct in `types.rs`: contains `text`, `embedding: Vec<f32>`, `dimensions`, `norm`, `metadata`.
## Critical Rules
1. **Chunking is preprocessing** - Always apply before embedding to ensure consistent vector sizes
2. **Overlap prevents information loss** - Set overlap to 15-20% of chunk size
3. **Embedding models are stateful** - Lazy load and cache to avoid repeated initialization
4. **ONNX Runtime is required** - Gracefully degrade if not available (skip embeddings)
5. **Batch embedding for performance** - Never embed single chunks; batch 50-1000 chunks
6. **Normalize embeddings for search** - Use L2 norm for cosine similarity
7. **Cache embedding results** - Don't re-embed identical text chunks
8. **Model selection impacts quality** - bge-small (384) for speed, bge-base (768) for quality
## Related Skills
- **extraction-pipeline-patterns** - Text extraction preceding chunking
- **api-server-mcp** - Endpoint for chunking + embedding operations
- **ocr-backend-management** - OCR text quality affects chunking successRelated Skills
kreuzberg
Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.
wasm-constraints
wasm constraints
test-execution-patterns
test execution patterns
security-limits-dos-protection
security limits dos protection
plugin-architecture-patterns
plugin architecture patterns
ocr-backend-management
ocr uackend management
mime-detection-routing
mime detection routing
format-specific-extraction
format specific extraction
extraction-pipeline-patterns
extraction pipeline patterns
config-loading-precedence
config loading precedence
api-server-mcp
api server mcp
registry-implementation
registry implementation