ocr-backend-management

ocr uackend management

7,385 stars

Best use case

ocr-backend-management is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

ocr uackend management

Teams using ocr-backend-management should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/ocr-backend-management/SKILL.md --create-dirs "https://raw.githubusercontent.com/kreuzberg-dev/kreuzberg/main/.ai-rulez/skills/ocr-backend-management/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/ocr-backend-management/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How ocr-backend-management Compares

Feature / Agent	ocr-backend-management	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

ocr uackend management

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

AI Agents for Coding

Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.

Cursor vs Codex for AI Workflows

Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.

AI Agents for Marketing

Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.

SKILL.md Source

# OCR Backend Management

**Kreuzberg's pluggable OCR engine selection, image preprocessing, and table detection**

## OCR Architecture Overview

**Location**: `crates/kreuzberg/src/ocr/`, `crates/kreuzberg-tesseract/`, `crates/kreuzberg/src/extraction/image.rs`

Kreuzberg provides a pluggable OCR system with multiple backend implementations:

| Backend | Language | Module | Features |
|---------|----------|--------|----------|
| **Tesseract** (Default) | Rust FFI binding | `kreuzberg-tesseract` crate | Multi-language, HOCR output, Tesseract 5.x |
| **PaddleOCR** | Rust (ONNX Runtime) | `kreuzberg-paddle-ocr` crate | 80+ languages, 11 script families, auto model download |
| **EasyOCR** | Python-only | `kreuzberg-py` embedding | 80+ languages, GPU support |

## Core OCR Components

### 1. OCR Module Structure

```text
crates/kreuzberg/src/ocr/
├── mod.rs                 # OCR system orchestration
├── processor.rs           # Main OCR processing pipeline
├── tesseract_backend.rs   # Native Tesseract integration (30KB)
├── language_registry.rs   # Language pack management (17KB)
├── cache.rs              # OCR result caching (15KB)
├── hocr.rs               # HOCR parsing (6KB)
├── types.rs              # OCR configuration types (13KB)
└── table/
    ├── detection.rs      # Table boundary detection
    └── reconstruction.rs # Structured table output
```

### 2. Tesseract Backend Integration

**Location**: `crates/kreuzberg/src/ocr/tesseract_backend.rs`

```rust
pub struct TesseractConfig {
    // Engine configuration
    pub tessdata_prefix: Option<String>,  // Custom Tesseract data dir
    pub language_pack: Option<String>,    // Language: "eng", "fra+deu", etc.
    pub oem: Option<OcrEngineMode>,      // 0=Legacy, 1=Neural, 2=Both, 3=Default

    // Processing options
    pub psm: Option<PageSegmentationMode>, // PSM 0-13 (see Tesseract docs)
    pub dpi: Option<u32>,                  // Image DPI for scaling
    pub quality_threshold: f32,            // Min confidence 0.0-1.0

    // Output
    pub output_hocr: bool,                 // Enable hOCR bounding boxes
    pub extract_tables: bool,              // Auto table detection
}
```

**Key Pattern**: Lazy-initialize Tesseract instance per language pack to avoid repeated startup overhead.

```rust
// Location: ocr/processor.rs
lazy_static! {
    static ref TESSERACT_INSTANCES: RwLock<HashMap<String, TesseractInstance>> = RwLock::new(HashMap::new());
}

async fn get_or_init_tesseract(language: &str) -> Result<TesseractInstance> {
    let mut instances = TESSERACT_INSTANCES.write().map_err(|e| OCRError::LockPoisoned)?;

    if let Some(instance) = instances.get(language) {
        Ok(instance.clone())
    } else {
        let instance = TesseractInstance::new(language)?;
        instances.insert(language.to_string(), instance.clone());
        Ok(instance)
    }
}
```

### 3. Language Pack Management

**Location**: `crates/kreuzberg/src/ocr/language_registry.rs` (17KB)

```rust
pub struct LanguageRegistry {
    available_languages: HashMap<String, LanguageMetadata>,
    cache: RwLock<HashMap<String, Vec<u8>>>,  // Cached language packs
}

impl LanguageRegistry {
    pub fn validate_language(&self, lang_code: &str) -> Result<LanguageMetadata> {
        // ISO 639-1/3 validation
        // Check if tessdata available for language
        // Return metadata: script, rtl, confidence_threshold
    }

    pub fn list_available(&self) -> Vec<AvailableLanguage> {
        // Return all installed language packs via tessdata check
    }

    pub fn download_language(&self, lang: &str) -> Result<()> {
        // Download missing language pack if available
        // Cache in TESSDATA_PREFIX
    }
}
```

**Supported Languages**: Tesseract ships with 100+ languages; Kreuzberg validates on startup.

## Image Preprocessing Pipeline

**Location**: `crates/kreuzberg/src/extraction/image.rs` (16KB)

```text
Input Image
    ↓
[1. Load & Validate] → Check format, dimensions, color space
    ↓
[2. Preprocessing] → Denoise, deskew, binarize (optional, config-driven)
    ↓
[3. Scale/DPI Adjustment] → Resample to optimal DPI for OCR
    ↓
[4. Table Detection] → Identify table regions for separate OCR
    ↓
[5. OCR Processing] → Tesseract → hOCR output with bounding boxes
    ↓
[6. Table Reconstruction] → Structured table output (rows/cols)
    ↓
Output: ExtractionResult with text, tables, metadata (EXIF, dimensions)
```

### Preprocessing Configuration

```rust
pub struct ImagePreprocessingConfig {
    // Denoising
    pub denoise_enabled: bool,          // Bilateral filter or morphological
    pub denoise_strength: f32,          // 0.0-1.0 aggressiveness

    // Geometric
    pub deskew_enabled: bool,           // Auto-rotate skewed pages
    pub rotation_threshold_deg: f32,    // Min rotation angle to correct

    // Binarization
    pub binarize_enabled: bool,         // Convert to B&W (improves OCR)
    pub threshold_auto: bool,           // Otsu vs fixed threshold
    pub threshold_value: u8,            // If threshold_auto=false

    // Scaling
    pub target_dpi: u32,                // Resample to DPI (e.g., 300 DPI)
    pub preserve_aspect: bool,          // Don't distort
}
```

**Implementation Pattern**: Use `image` crate + SIMD operations for preprocessing.

```rust
// Location: extraction/image.rs
pub async fn preprocess_image(
    image_data: &[u8],
    config: &ImagePreprocessingConfig,
) -> Result<ImageBuffer<Rgba<u8>, Vec<u8>>> {
    let mut img = image::load_from_memory(image_data)?;

    // Deskew if needed
    if config.deskew_enabled {
        img = deskew_image(img, config.rotation_threshold_deg)?;
    }

    // Denoise if needed
    if config.denoise_enabled {
        img = denoise_image(img, config.denoise_strength)?;
    }

    // Binarize if needed
    if config.binarize_enabled {
        img = binarize_image(img, config.threshold_auto, config.threshold_value)?;
    }

    // DPI adjustment
    if config.target_dpi > 0 {
        img = resample_to_dpi(img, config.target_dpi)?;
    }

    Ok(img)
}
```

## Table Detection & Reconstruction

**Location**: `crates/kreuzberg/src/ocr/table/`

### Table Detection Strategy

```rust
pub struct TableDetectionConfig {
    pub enabled: bool,                 // Enable table detection
    pub min_cells: u32,               // Minimum cell count to consider table
    pub confidence_threshold: f32,    // Min confidence in table structure
    pub line_thickness_threshold: u32, // Min pixels for table borders
    pub merge_adjacent_cells: bool,   // Merge cells with same span
}
```

**Pattern**: Detect table regions via:

1. Horizontal/vertical line detection (Hough transform or morphological operations)
2. Grid structure validation (rows × columns consistency)
3. Cell content extraction (OCR each cell)
4. Reconstruct as `TableData { headers: Vec<String>, rows: Vec<Vec<String>> }`

### HOCR Processing

**Location**: `crates/kreuzberg/src/ocr/hocr.rs`

Tesseract's hOCR output contains XML with bounding boxes:

```xml
<span class="ocr_word" id="word_1_1" title="bbox 100 200 250 230">word</span>
```

Kreuzberg parses hOCR to:

- Extract text with character-level bounding boxes
- Preserve reading order (top-left → bottom-right)
- Identify table structures via bbox clustering
- Return as `TextWithBoundingBoxes { text: String, boxes: Vec<BBox> }`

## Plugin System: Custom OCR Backends

**Location**: `crates/kreuzberg/src/plugins/ocr.rs`

```rust
pub trait OcrBackend: Send + Sync + 'static {
    /// Process image and return text with optional HOCR/tables
    async fn process(
        &self,
        image: &[u8],
        config: &OcrConfig,
    ) -> Result<OcrResult>;

    /// List supported languages
    fn supported_languages(&self) -> Vec<String>;

    /// Health check (verify engine is available)
    async fn health_check(&self) -> Result<()>;
}

pub struct OcrPluginRegistry {
    backends: HashMap<String, Arc<dyn OcrBackend>>,
}

impl OcrPluginRegistry {
    pub fn register(&mut self, name: &str, backend: Arc<dyn OcrBackend>) {
        self.backends.insert(name.to_string(), backend);
    }

    pub fn get_backend(&self, name: &str) -> Result<Arc<dyn OcrBackend>> {
        self.backends.get(name).cloned()
            .ok_or_else(|| OCRError::BackendNotFound(name.to_string()))
    }
}
```

**Usage Pattern**: Register custom backend (e.g., cloud-based OCR, proprietary engine) at startup.

```rust
// Example: Custom AWS Textract backend
let textract_backend = Arc::new(AwsTextractBackend::new(credentials)?);
plugins::ocr::register_backend("aws-textract", textract_backend)?;

// Use via config
let config = ExtractionConfig {
    image: Some(ImageConfig {
        ocr_backend: Some("aws-textract".to_string()),
        ..Default::default()
    }),
    ..Default::default()
};
```

## Language Detection Integration

**Location**: `crates/kreuzberg/src/language_detection/`

Before OCR, auto-detect document language:

```rust
pub struct LanguageDetectionConfig {
    pub enabled: bool,                 // Auto-detect language
    pub confidence_threshold: f32,    // Min confidence 0.0-1.0
    pub fallback_languages: Vec<String>, // If detection fails
}

// Usage in image extraction
let detected_lang = detect_language(&image_text)?;
let ocr_config = ImageConfig {
    ocr_language: Some(detected_lang),
    ..Default::default()
};
```

## Performance Optimization

### Caching Strategy

**Location**: `crates/kreuzberg/src/ocr/cache.rs`

```rust
pub struct OcrResultCache {
    // LRU cache: (image_hash, language, config_hash) → OcrResult
    cache: Arc<RwLock<LruCache<CacheKey, OcrResult>>>,
}

impl OcrResultCache {
    pub async fn get_or_process(
        &self,
        image: &[u8],
        language: &str,
        config: &OcrConfig,
    ) -> Result<OcrResult> {
        let key = CacheKey::new(image, language, config);

        // Check cache
        {
            let cache = self.cache.read()?;
            if let Some(result) = cache.get(&key) {
                return Ok(result.clone());
            }
        }

        // Cache miss: process
        let result = process_with_tesseract(image, language, config).await?;
        self.cache.write()?.put(key, result.clone());

        Ok(result)
    }
}
```

**Cache Key**: Hash of image + language + config to avoid false hits.

### Batch OCR Processing

```rust
// Process multiple images in parallel
pub async fn batch_ocr(
    images: Vec<(String, Vec<u8>)>,  // (file_id, image_data)
    config: &OcrConfig,
) -> Result<Vec<(String, OcrResult)>> {
    let tasks: Vec<_> = images
        .into_iter()
        .map(|(id, data)| tokio::spawn(process_image(id, data, config)))
        .collect();

    futures::future::try_join_all(tasks).await?
}
```

## Configuration Priority (Feature Matrix)

**Location**: `crates/kreuzberg/src/core/config.rs`

OCR backend selection precedence:

1. **Explicit config** - `config.image.ocr_backend = Some("aws-textract")`
2. **Plugin registered** - Check if backend registered in plugins registry
3. **Platform-specific default** - Python: EasyOCR, Others: Tesseract
4. **Tesseract fallback** - Always available if installed
5. **Text-only** - Skip OCR if no backend available (log warning)

## Critical Rules

1. **Always validate language packs** before OCR (Tesseract installation check in health checks)
2. **Preprocess images** before OCR for 10-30% accuracy boost
3. **Cache OCR results** to avoid re-processing identical documents
4. **Table detection is optional** but recommended for document intelligence
5. **Timeout OCR operations** to prevent hanging on extremely large images (config: `ocr_timeout_secs`)
6. **Log confidence scores** in metadata for quality tracking
7. **Preserve EXIF data** from images (orientation, camera, GPS if present)

## Related Skills

- **extraction-pipeline-patterns** - OCR fallback integration in document extraction
- **chunking-embeddings** - Post-OCR text splitting and embedding generation
- **feature-flag-strategy** - Conditional OCR backend compilation (Tesseract, EasyOCR, PaddleOCR)
- **api-server-patterns** - OCR endpoint configuration and async processing

Related Skills

plugin-lifecycle-management

7385

from kreuzberg-dev/kreuzberg

plugin lifecycle management

ocr-backend-plugin

7385

from kreuzberg-dev/kreuzberg

ocr uackend plugin

gil-management-patterns

7385

from kreuzberg-dev/kreuzberg

gil management patterns

tesseract-backend-usage

7385

from kreuzberg-dev/kreuzberg

tesseract uackend usage

python-ocr-backend-integration

7385

from kreuzberg-dev/kreuzberg

python ocr uackend integration

configuration-management

7385

from kreuzberg-dev/kreuzberg

configuration management

kreuzberg

7385

from kreuzberg-dev/kreuzberg

Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.