Document Parser

## Overview

25 stars

byComeOnOliver

View on GitHub Installation ↓

Best use case

Document Parser is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

## Overview

Teams using Document Parser should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/doc-parser/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/TerminalSkills/skills/doc-parser/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/doc-parser/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How Document Parser Compares

Feature / Agent	Document Parser	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

## Overview

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Document Parser

## Overview

Parse complex documents containing tables, figures, multi-column layouts, headers, and mixed content using IBM's docling library. This skill goes beyond simple text extraction by understanding document structure, detecting layout regions, and preserving the logical reading order across complex formatting.

## Instructions

When a user asks to parse a complex document or extract structured content from a document with tables, figures, or multi-column layouts, follow these steps:

### Step 1: Install docling

```bash
pip install docling
```

### Step 2: Load and convert the document

Use docling's DocumentConverter to parse the document:

```python
from docling.document_converter import DocumentConverter

def parse_document(file_path):
    converter = DocumentConverter()
    result = converter.convert(file_path)
    return result
```

Supported input formats: PDF, DOCX, PPTX, HTML, images (PNG, JPG).

### Step 3: Export to the desired format

**Export as Markdown** (preserves headings, tables, lists):

```python
def to_markdown(result):
    return result.document.export_to_markdown()
```

**Export as structured JSON** (full document tree):

```python
import json

def to_json(result):
    doc_dict = result.document.export_to_dict()
    return json.dumps(doc_dict, indent=2)
```

**Extract only tables:**

```python
def extract_tables(result):
    tables = []
    for table in result.document.tables:
        df = table.export_to_dataframe()
        tables.append(df)
    return tables
```

### Step 4: Handle specific content types

**For tables**: Export each table as a pandas DataFrame or CSV:

```python
import pandas as pd

def save_tables(result, output_dir="."):
    for i, table in enumerate(result.document.tables):
        df = table.export_to_dataframe()
        output_path = f"{output_dir}/table_{i+1}.csv"
        df.to_csv(output_path, index=False)
    return f"Saved {len(result.document.tables)} tables"
```

**For figures**: Extract figure metadata and captions:

```python
def extract_figures(result):
    figures = []
    for item in result.document.items:
        if hasattr(item, 'image') and item.image:
            figures.append({
                "caption": getattr(item, 'caption', 'No caption'),
                "page": getattr(item, 'page_no', None)
            })
    return figures
```

### Step 5: Present the results

Format output based on user needs:
- **Full document**: Markdown export with all content preserved
- **Tables only**: CSV files or formatted markdown tables
- **Summary**: Document structure outline with section headings and content types found
- **Specific sections**: Extract targeted parts by heading or page range

## Examples

### Example 1: Parse an academic paper

**User request:** "Parse this research paper and extract all the tables: paper.pdf"

**Actions taken:**
1. Convert paper.pdf using docling
2. Identify 4 tables across 12 pages
3. Export each table as CSV

**Output:**
```
Parsed paper.pdf (12 pages)

Document structure:
  - Title: "Machine Learning Approaches for Climate Prediction"
  - Sections: Abstract, Introduction, Methods, Results, Discussion, References
  - Tables found: 4
  - Figures found: 6

Tables extracted:
  1. table_1.csv - "Dataset Characteristics" (5 cols, 8 rows) - Page 4
  2. table_2.csv - "Model Hyperparameters" (4 cols, 12 rows) - Page 6
  3. table_3.csv - "Performance Comparison" (6 cols, 5 rows) - Page 8
  4. table_4.csv - "Ablation Study Results" (5 cols, 7 rows) - Page 9

All tables saved to current directory.
```

### Example 2: Convert a financial report to Markdown

**User request:** "Convert this quarterly report to Markdown so I can edit it: Q3-report.pdf"

**Actions taken:**
1. Parse Q3-report.pdf with docling (handles multi-column layout)
2. Export as clean Markdown preserving structure

**Output:**
```
Converted Q3-report.pdf to Markdown

Document details:
  - Pages: 24
  - Layout: Multi-column (2-column on pages 3-20)
  - Tables: 8 (all converted to Markdown tables)
  - Figures: 5 (referenced with captions)
  - Sections: 12 headings detected

Output saved to: Q3-report.md (14,200 words)

The multi-column layout has been linearized into single-column reading order.
All tables are preserved as Markdown tables.
```

### Example 3: Extract structured data from an invoice

**User request:** "Parse this invoice and give me the line items as JSON"

**Actions taken:**
1. Parse invoice.pdf with docling
2. Identify the main line items table
3. Export as structured JSON

**Output:**
```json
{
  "invoice_number": "INV-2024-0847",
  "date": "2024-11-15",
  "vendor": "Acme Supplies Inc.",
  "line_items": [
    {"item": "Office Chairs", "qty": 10, "unit_price": 299.99, "total": 2999.90},
    {"item": "Standing Desks", "qty": 5, "unit_price": 549.00, "total": 2745.00},
    {"item": "Monitor Arms", "qty": 15, "unit_price": 89.50, "total": 1342.50}
  ],
  "subtotal": 7087.40,
  "tax": 637.87,
  "total": 7725.27
}
```

## Guidelines

- Docling handles complex layouts automatically including multi-column text, nested tables, and mixed content. Let it do the heavy lifting.
- For very large documents (200+ pages), process in sections to manage memory.
- When tables have merged cells or irregular structures, validate the DataFrame output and flag any parsing anomalies.
- Prefer Markdown export for human-readable output and JSON/dict export for programmatic use.
- If docling fails to install or parse a specific format, fall back to pdfplumber for PDFs or python-docx for DOCX files.
- Always report the document structure (sections, tables, figures found) before detailed output so the user knows what is available.
- For scanned documents without a text layer, docling may use its built-in OCR. If quality is poor, suggest the pdf-ocr skill for better preprocessing.

Related Skills

sdk-documentation-generator

from ComeOnOliver/skillshub

Sdk Documentation Generator - Auto-activating skill for Technical Documentation. Triggers on: sdk documentation generator, sdk documentation generator Part of the Technical Documentation skill category.

pdf-parser

from ComeOnOliver/skillshub

Pdf Parser - Auto-activating skill for Business Automation. Triggers on: pdf parser, pdf parser Part of the Business Automation skill category.

email-parser

from ComeOnOliver/skillshub

Email Parser - Auto-activating skill for Business Automation. Triggers on: email parser, email parser Part of the Business Automation skill category.

document-merger

from ComeOnOliver/skillshub

Document Merger - Auto-activating skill for Business Automation. Triggers on: document merger, document merger Part of the Business Automation skill category.

database-documentation-gen

from ComeOnOliver/skillshub

Process use when you need to work with database documentation. This skill provides automated documentation generation with comprehensive guidance and automation. Trigger with phrases like "generate docs", "document schema", or "create database documentation".

code-documentation-analyzer

from ComeOnOliver/skillshub

Code Documentation Analyzer - Auto-activating skill for Technical Documentation. Triggers on: code documentation analyzer, code documentation analyzer Part of the Technical Documentation skill category.

document-writer

from ComeOnOliver/skillshub

多风格文档写作技能。支持乔木、小红书、Dankoe、微信公众号、Twitter等5种写作风格。Claude 根据内容智能选择风格，按规范撰写文章。

latex-document

from ComeOnOliver/skillshub

Universal LaTeX document skill: create, compile, and convert any document to professional PDF with PNG previews. Supports resumes, reports, cover letters, invoices, academic papers, theses/dissertations, academic CVs, presentations (Beamer), scientific posters, formal letters, exams/quizzes, books, cheat sheets, reference cards, exam formula sheets, fillable PDF forms (hyperref form fields), conditional content (etoolbox toggles), mail merge from CSV/JSON (Jinja2 templates), version diffing (latexdiff), charts (pgfplots + matplotlib), tables (booktabs + CSV import), images (TikZ), Mermaid diagrams, AI-generated images, watermarks, landscape pages, bibliography/citations (BibTeX/biblatex), multi-language/CJK (auto XeLaTeX), algorithms/pseudocode, colored boxes (tcolorbox), SI units (siunitx), Pandoc format conversion (Markdown/DOCX/HTML ↔ LaTeX), and PDF-to-LaTeX conversion of handwritten or printed documents (math, business, legal, general). Compile script supports pdflatex, xelatex, lualatex with auto-detection, latexmk backend, texfot log filtering, PDF/A output, and verbosity control (--verbose/--quiet). Empirically optimized scaling: single agent 1-10 pages, split 11-20, batch-7 pipeline 21+. Use when user asks to: (1) create a resume/CV/cover letter, (2) write a LaTeX document, (3) create PDF with tables/charts/images, (4) compile a .tex file, (5) make a report/invoice/presentation, (6) anything involving LaTeX or pdflatex, (7) convert/OCR a PDF to LaTeX, (8) convert handwritten notes, (9) create charts/graphs/diagrams, (10) create slides, (11) write a thesis or dissertation, (12) create an academic CV, (13) create a poster, (14) create an exam/quiz, (15) create a book, (16) convert between document formats (Markdown, DOCX, HTML to/from LaTeX), (17) generate Mermaid diagrams for LaTeX, (18) create a formal business letter, (19) create a cheat sheet or reference card, (20) create an exam formula sheet or crib sheet, (21) condense lecture notes/PDFs into a cheat sheet, (22) create a fillable PDF form with text fields/checkboxes/dropdowns, (23) create a document with conditional content/toggles (show/hide sections), (24) generate batch/mail-merge documents from CSV/JSON data, (25) create a version diff PDF (latexdiff) highlighting changes between documents, (26) create a homework or assignment submission with problems and solutions, (27) create a lab report with data tables, graphs, and error analysis, (28) encrypt or password-protect a PDF, (29) merge multiple PDFs into one, (30) optimize/compress a PDF for web or email, (31) lint or check a LaTeX document for common issues, (32) count words in a LaTeX document, (33) analyze document statistics (figures, tables, citations), (34) fetch BibTeX from a DOI, (35) convert a Graphviz .dot file to PDF/PNG, (36) convert a PlantUML .puml file to PDF/PNG, (37) create a one-pager/fact sheet/executive summary, (38) create a datasheet or product specification sheet, (39) extract pages from a PDF (page ranges, odd/even), (40) check LaTeX package availability before compiling, (41) analyze citations and cross-reference with .bib files, (42) debug LaTeX compilation errors, (43) make a document accessible (PDF/A, tagged PDF), (44) create lecture notes or course handouts, (45) fill an existing PDF form (fillable fields or non-fillable with annotations), (46) extract text or tables from a PDF (pdfplumber, pypdf), (47) OCR a scanned PDF to text (pytesseract), (48) create a PDF programmatically with reportlab (Canvas, Platypus), (49) rotate or crop PDF pages (pypdf), (50) add a watermark to an existing PDF, (51) extract metadata from a PDF (title, author, subject).

oo-component-documentation

from ComeOnOliver/skillshub

Create or update standardized object-oriented component documentation using a shared template plus mode-specific guidance for new and existing docs.

go-documentation

from ComeOnOliver/skillshub

Use when writing or reviewing documentation for Go packages, types, functions, or methods. Also use proactively when creating new exported types, functions, or packages, even if the user doesn't explicitly ask about documentation. Does not cover code comments for non-exported symbols (see go-style-core).

skywork-document

from ComeOnOliver/skillshub

Generate professional documents in multiple formats (docx, pdf, html, md) from scratch or based on user files. Supports web search for up-to-date content. Use when the expected output is longer than a short answer and benefits from structure and formatting. Do NOT use for short plain-text answers, code files, or casual Q&A.

documentation-templates

from ComeOnOliver/skillshub

Documentation templates and structure guidelines. README, API docs, code comments, and AI-friendly documentation.