doc-to-markdown

Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, simple tables, image paths, CJK bold spacing, attribute noise, and code blocks. Benchmarked best-in-class (7.6/10) against Docling, MarkItDown, Pandoc raw, and Mammoth. Trigger on "convert document", "docx to markdown", "parse word", "doc to markdown", "解析word" ("parse Word"), "转换文档" ("convert document").

Best use case

doc-to-markdown is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using doc-to-markdown should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/doc-to-markdown/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/daymade/claude-code-skills/doc-to-markdown/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/doc-to-markdown/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How doc-to-markdown Compares

| Feature / Agent | doc-to-markdown | Standard Approach |
|-----------------|-----------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Doc to Markdown

Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.

**Architecture**: Pandoc (best-in-class extraction) + 8 post-processing fixes (our value-add).

## Quick Start

```bash
# DOCX → Markdown (one command, zero manual fixes)
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.docx -o output.md --assets-dir ./media

# PDF → Markdown
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

# Run tests
uv run --with pytest pytest scripts/test_convert.py -v
```

## Dual Mode

| Mode | Speed | Quality | Use Case |
|------|-------|---------|----------|
| **Quick** (default) | Fast | Good | Drafts, simple documents |
| **Heavy** | Slower | Best | Final documents, complex layouts |

## Tool Selection

| Format | Quick Mode | Heavy Mode |
|--------|-----------|------------|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc + post-processing | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |
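The table above can be read as a lookup from file extension and mode to an ordered tool list. The following is a minimal sketch of that lookup, not the actual logic in `convert.py`; the names `TOOLS` and `select_tools` are hypothetical.

```python
from pathlib import Path

# Mirrors the Tool Selection table. The dict layout is an assumption
# made for illustration, not the structure used by convert.py.
TOOLS = {
    ".pdf": {"quick": ["pymupdf4llm"], "heavy": ["pymupdf4llm", "markitdown"]},
    ".docx": {"quick": ["pandoc"], "heavy": ["pandoc", "markitdown"]},
    ".pptx": {"quick": ["markitdown"], "heavy": ["markitdown", "pandoc"]},
    ".xlsx": {"quick": ["markitdown"], "heavy": ["markitdown"]},
}


def select_tools(path: str, mode: str = "quick") -> list[str]:
    """Return the tools to run for a file, based on its extension and mode."""
    suffix = Path(path).suffix.lower()
    if suffix not in TOOLS:
        raise ValueError(f"unsupported format: {suffix or path}")
    if mode not in ("quick", "heavy"):
        raise ValueError(f"unknown mode: {mode}")
    return list(TOOLS[suffix][mode])
```

For example, `select_tools("report.docx", "heavy")` yields the pandoc and markitdown pair listed in the DOCX row.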

## DOCX Post-Processing (automatic)

When converting DOCX via pandoc, 8 cleanups are applied automatically:

| Problem | Fix | Test coverage |
|---------|-----|---------------|
| Grid tables (`+:---+`) | Single-column → blockquote, multi-column → pipe table | `TestPostprocessPipeline` |
| Simple tables (`  ---- ----`) | Multi-column images → pipe table with captions | `TestSimpleTable` |
| Image path nesting (`media/media/`) | Flatten to `media/`, absolute → relative | `test_stats_tracking` |
| Pandoc attributes (`{width="..."}`) | Removed | `test_pandoc_attributes_removed` |
| CJK bold spacing (`**粗体**中文`) | Add space around `**` for CJK bold spans | `TestCjkBoldSpacing` (15 cases) |
| Indented dashed code blocks | → fenced ``` with language detection | `test_code_block_with_language` |
| Escaped brackets (`\[...\]`) | → `[...]` | `test_escaped_brackets_fixed` |
| Double-bracket links (`[[text]](url)`) | → `[text](url)` | `test_double_bracket_links_fixed` |
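Four of the simpler fixes in the table are plain text substitutions. The sketch below shows one way they could be written with regular expressions; the function names and patterns are assumptions for illustration and may differ from what `convert.py` does.

```python
import re


def unescape_brackets(text: str) -> str:
    """Turn pandoc's escaped \\[ and \\] back into plain brackets."""
    return re.sub(r"\\([\[\]])", r"\1", text)


def fix_double_bracket_links(text: str) -> str:
    """Rewrite [[text]](url) as [text](url)."""
    return re.sub(r"\[\[([^\[\]]+)\]\]\(([^()\s]+)\)", r"[\1](\2)", text)


def strip_pandoc_attributes(text: str) -> str:
    """Drop attribute blocks such as {width="..."} that follow a link or image."""
    return re.sub(r"(?<=\))\{[^{}\n]*\}", "", text)


def flatten_media_paths(text: str) -> str:
    """Collapse nested media/media/ prefixes into a single media/."""
    return re.sub(r"(?:media/){2,}", "media/", text)


def postprocess(text: str) -> str:
    """Apply the four fixes in order (the ordering here is an assumption)."""
    for fix in (
        unescape_brackets,
        fix_double_bracket_links,
        strip_pandoc_attributes,
        flatten_media_paths,
    ):
        text = fix(text)
    return text
```

The grid-table, simple-table, and code-block conversions need line-by-line parsing and are not covered by this sketch.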

### CJK Bold Spacing — why and how

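DOCX stores formatting at the run level, so CJK text has no spaces between bold and normal runs. Some Markdown renderers fail to recognize `**` as a bold delimiter when it sits directly against CJK text or punctuation, so whitespace is added around the span.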

**Rule**: if a `**content**` span contains any CJK character, ensure both sides have a space — unless already spaced or at line boundary. This handles CJK punctuation, emoji adjacency, and mixed content.

```
Before: 打开**飞书**,就可以    → some renderers fail to bold
After:  打开 **飞书** ,就可以  → universally renders correctly
```
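The rule can be sketched as a single pass over each bold span. This is a simplified illustration under stated assumptions, not the tested implementation: the CJK ranges cover only Han, kana, and Hangul, the helper name `space_cjk_bold` is hypothetical, and code spans or fenced blocks are not excluded.

```python
import re

# Han, Hiragana/Katakana, and Hangul ranges (an assumed definition of "CJK").
CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")
# A bold span whose content neither starts nor ends with whitespace.
BOLD = re.compile(r"\*\*(?=\S)(.+?)(?<=\S)\*\*")


def space_cjk_bold(text: str) -> str:
    """Pad CJK-containing **bold** spans with spaces where none exist."""
    out = []
    pos = 0
    for m in BOLD.finditer(text):
        out.append(text[pos:m.start()])
        span = m.group(0)
        if CJK.search(m.group(1)):
            before = text[m.start() - 1] if m.start() > 0 else ""
            after = text[m.end()] if m.end() < len(text) else ""
            # Empty string means start/end of text; a newline counts as whitespace.
            if before and not before.isspace():
                span = " " + span
            if after and not after.isspace():
                span = span + " "
        out.append(span)
        pos = m.end()
    out.append(text[pos:])
    return "".join(out)


print(space_cjk_bold("打开**飞书**,就可以"))
```

Bold spans without CJK content, such as `a**b**c`, are left untouched, and spans that are already spaced or sit at a line boundary gain no extra whitespace.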

## Heavy Mode Workflow

Heavy Mode runs multiple tools in parallel and selects the best segments:

1. **Parallel Execution**: Run all applicable tools simultaneously
2. **Segment Analysis**: Parse each output into segments (tables, headings, images, paragraphs)
3. **Quality Scoring**: Score each segment based on completeness and structure
4. **Intelligent Merge**: Select best version of each segment across tools

### Merge Criteria

| Segment Type | Selection Criteria |
|--------------|-------------------|
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
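As a rough illustration of the scoring and merge steps, the sketch below scores competing versions of one segment and keeps the highest-scoring one. The `Segment` class, the weights, and the assumption that segments from different tools are already aligned are all hypothetical; `references/heavy-mode-guide.md` describes the real behavior.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str  # e.g. "table" or "paragraph"
    text: str
    tool: str  # which converter produced this version


def score(seg: Segment) -> float:
    """Score a segment; higher is better. The weights are illustrative only."""
    lines = [ln for ln in seg.text.splitlines() if ln.strip()]
    if seg.kind == "table":
        rows = [ln for ln in lines if ln.lstrip().startswith("|")]
        cols = max((ln.count("|") - 1 for ln in rows), default=0)
        # A header separator row contains only pipes, dashes, colons, spaces.
        has_sep = any(
            "-" in ln and set(ln.replace("|", "").strip()) <= set("-: ")
            for ln in rows
        )
        return float(len(rows) + cols + (5 if has_sep else 0))
    # For other kinds, use text length as a crude proxy for completeness.
    return float(len(seg.text.strip()))


def merge(candidates: list[list[Segment]]) -> list[Segment]:
    """Pick the best version from each group of competing segments."""
    return [max(group, key=score) for group in candidates]
```

Each inner list passed to `merge` holds the versions of one segment produced by different tools; the result keeps document order and records which tool won through the `tool` field.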

## Image Extraction

```bash
# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets

# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md
```

Output:
- Images: `assets/img_page1_1.png`, `assets/img_page2_1.jpg`
- Metadata: `assets/images_metadata.json` (page, position, dimensions)

## Quality Validation

```bash
# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md

# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html
```

### Quality Metrics

| Metric | Pass | Warn | Fail |
|--------|------|------|------|
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |
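The thresholds in the table translate into a small grading function. This is a sketch with a hypothetical name, `grade`; it assumes retention is given in percent and treats any table or image value below 100 but at or above the warn floor as a warning, since the table leaves values between 99% and 100% unspecified.

```python
def grade(metric: str, retention: float) -> str:
    """Map a retention percentage to "pass", "warn", or "fail"."""
    if metric == "text":
        if retention > 95:
            return "pass"
        return "warn" if retention >= 85 else "fail"
    # Table and image retention pass only at 100%.
    warn_floor = {"table": 90, "image": 80}[metric]
    if retention >= 100:
        return "pass"
    return "warn" if retention >= warn_floor else "fail"
```

For instance, `grade("text", 95)` falls in the warn band because the pass threshold is strictly greater than 95%.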

## Merge Outputs Manually

```bash
# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md

# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose
```

## Path Conversion (Windows/WSL)

```bash
# Windows to WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf
```
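The mapping shown above is a drive-letter rewrite plus a separator swap. A minimal sketch of that transformation follows; `windows_to_wsl` is a hypothetical name, and the bundled `convert_path.py` may handle more cases (UNC paths, for example, are not covered here).

```python
import re


def windows_to_wsl(path: str) -> str:
    """Convert a drive-letter Windows path to its /mnt/<drive>/ WSL form."""
    m = re.match(r"^([A-Za-z]):[\\/](.*)$", path)
    if not m:
        # Not a drive-letter path: only normalize the separators.
        return path.replace("\\", "/")
    drive, rest = m.groups()
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")


print(windows_to_wsl(r"C:\Users\name\Documents\file.pdf"))
# prints /mnt/c/Users/name/Documents/file.pdf
```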

## Common Issues

**"No conversion tools available"**
```bash
# Install all tools
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc
```
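Before installing anything, it can help to see which of the three tools is actually missing. The sketch below checks for the `pandoc` and `markitdown` executables on `PATH` and for the `pymupdf4llm` and `markitdown` Python packages; the function name `available_tools` is hypothetical and the check is not necessarily the same one `convert.py` performs.

```python
import importlib.util
import shutil


def available_tools() -> dict[str, bool]:
    """Report which conversion tools can be found in this environment."""
    return {
        "pandoc": shutil.which("pandoc") is not None,
        "markitdown": (
            shutil.which("markitdown") is not None
            or importlib.util.find_spec("markitdown") is not None
        ),
        "pymupdf4llm": importlib.util.find_spec("pymupdf4llm") is not None,
    }


if __name__ == "__main__":
    for name, found in available_tools().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Note that running scripts through `uv run --with ...` supplies the Python packages on the fly, so a "missing" result in a plain interpreter does not necessarily mean the Quick Start commands will fail.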

**FontBBox warnings during PDF conversion**
- These are harmless font-parsing warnings; the output is still correct

**Images missing from output**
- Use Heavy Mode for better image preservation
- Or extract separately with `scripts/extract_pdf_images.py`

**Tables broken in output**
- Use Heavy Mode - it selects the most complete table version
- Or validate with `scripts/validate_output.py`

## Bundled Scripts

| Script | Purpose |
|--------|---------|
| `convert.py` | Main orchestrator with Quick/Heavy mode + DOCX post-processing |
| `test_convert.py` | 31 tests covering all post-processing functions |
| `merge_outputs.py` | Merge multiple markdown outputs |
| `validate_output.py` | Quality validation with HTML report |
| `extract_pdf_images.py` | PDF image extraction with metadata |
| `convert_path.py` | Windows to WSL path converter |

## References

- `references/benchmark-2026-03-22.md` - 5-tool benchmark (Docling/MarkItDown/Pandoc/Mammoth/ours)
- `references/heavy-mode-guide.md` - Detailed Heavy Mode documentation
- `references/tool-comparison.md` - Tool capabilities comparison
- `references/conversion-examples.md` - Batch operation examples
