markitdown

Convert various file formats (PDF, Office documents, images, audio, web content, structured data) to Markdown optimized for LLM processing. Use when converting documents to markdown, extracting text from PDFs/Office files, transcribing audio, performing OCR on images, extracting YouTube transcripts, or processing batches of files. Supports 20+ formats including DOCX, XLSX, PPTX, PDF, HTML, EPUB, CSV, JSON, images with OCR, and audio with transcription.

7 stars

Best use case

markitdown is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Convert various file formats (PDF, Office documents, images, audio, web content, structured data) to Markdown optimized for LLM processing. Use when converting documents to markdown, extracting text from PDFs/Office files, transcribing audio, performing OCR on images, extracting YouTube transcripts, or processing batches of files. Supports 20+ formats including DOCX, XLSX, PPTX, PDF, HTML, EPUB, CSV, JSON, images with OCR, and audio with transcription.

Teams using markitdown should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/markitdown/SKILL.md --create-dirs "https://raw.githubusercontent.com/sanand0/scientific-research/main/.claude/skills/markitdown/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/markitdown/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How markitdown Compares

Feature / AgentmarkitdownStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Convert various file formats (PDF, Office documents, images, audio, web content, structured data) to Markdown optimized for LLM processing. Use when converting documents to markdown, extracting text from PDFs/Office files, transcribing audio, performing OCR on images, extracting YouTube transcripts, or processing batches of files. Supports 20+ formats including DOCX, XLSX, PPTX, PDF, HTML, EPUB, CSV, JSON, images with OCR, and audio with transcription.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# MarkItDown

## Overview

MarkItDown is a Python utility that converts various file formats into Markdown format, optimized for use with large language models and text analysis pipelines. It preserves document structure (headings, lists, tables, hyperlinks) while producing clean, token-efficient Markdown output.

## When to Use This Skill

Use this skill when users request:
- Converting documents to Markdown format
- Extracting text from PDF, Word, PowerPoint, or Excel files
- Performing OCR on images to extract text
- Transcribing audio files to text
- Extracting YouTube video transcripts
- Processing HTML, EPUB, or web content to Markdown
- Converting structured data (CSV, JSON, XML) to readable Markdown
- Batch converting multiple files or ZIP archives
- Preparing documents for LLM analysis or RAG systems

## Core Capabilities

### 1. Document Conversion

Convert Office documents and PDFs to Markdown while preserving structure.

**Supported formats:**
- PDF files (with optional Azure Document Intelligence integration)
- Word documents (DOCX)
- PowerPoint presentations (PPTX)
- Excel spreadsheets (XLSX, XLS)

**Basic usage:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

**Command-line:**
```bash
markitdown document.pdf -o output.md
```

See `references/document_conversion.md` for detailed documentation on document-specific features.

### 2. Media Processing

Extract text from images using OCR and transcribe audio files to text.

**Supported formats:**
- Images (JPEG, PNG, GIF, etc.) with EXIF metadata extraction
- Audio files with speech transcription (requires speech_recognition)

**Image with OCR:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("image.jpg")
print(result.text_content)  # Includes EXIF metadata and OCR text
```

**Audio transcription:**
```python
result = md.convert("audio.wav")
print(result.text_content)  # Transcribed speech
```

See `references/media_processing.md` for advanced media handling options.

### 3. Web Content Extraction

Convert web-based content and e-books to Markdown.

**Supported formats:**
- HTML files and web pages
- YouTube video transcripts (via URL)
- EPUB books
- RSS feeds

**YouTube transcript:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```

See `references/web_content.md` for web extraction details.

### 4. Structured Data Handling

Convert structured data formats to readable Markdown tables.

**Supported formats:**
- CSV files
- JSON files
- XML files

**CSV to Markdown table:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.csv")
print(result.text_content)  # Formatted as Markdown table
```

See `references/structured_data.md` for format-specific options.

### 5. Advanced Integrations

Enhance conversion quality with AI-powered features.

**Azure Document Intelligence:**
For enhanced PDF processing with better table extraction and layout analysis:
```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<endpoint>", docintel_key="<key>")
result = md.convert("complex.pdf")
```

**LLM-Powered Image Descriptions:**
Generate detailed image descriptions using GPT-4o:
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")  # Images described with LLM
```

See `references/advanced_integrations.md` for integration details.

### 6. Batch Processing

Process multiple files or entire ZIP archives at once.

**ZIP file processing:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("archive.zip")
print(result.text_content)  # All files converted and concatenated
```

**Batch script:**
Use the provided batch processing script for directory conversion:
```bash
python scripts/batch_convert.py /path/to/documents /path/to/output
```

See `scripts/batch_convert.py` for implementation details.

## Installation

**Full installation (all features):**
```bash
uv pip install 'markitdown[all]'
```

**Modular installation (specific features):**
```bash
uv pip install 'markitdown[pdf]'           # PDF support
uv pip install 'markitdown[docx]'          # Word support
uv pip install 'markitdown[pptx]'          # PowerPoint support
uv pip install 'markitdown[xlsx]'          # Excel support
uv pip install 'markitdown[audio]'         # Audio transcription
uv pip install 'markitdown[youtube]'       # YouTube transcripts
```

**Requirements:**
- Python 3.10 or higher

## Output Format

MarkItDown produces clean, token-efficient Markdown optimized for LLM consumption:
- Preserves headings, lists, and tables
- Maintains hyperlinks and formatting
- Includes metadata where relevant (EXIF, document properties)
- No temporary files created (streaming approach)

## Common Workflows

**Preparing documents for RAG:**
```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert knowledge base documents
docs = ["manual.pdf", "guide.docx", "faq.html"]
markdown_content = []

for doc in docs:
    result = md.convert(doc)
    markdown_content.append(result.text_content)

# Now ready for embedding and indexing
```

**Document analysis pipeline:**
```bash
# Convert all PDFs in directory
for file in documents/*.pdf; do
    markitdown "$file" -o "markdown/$(basename "$file" .pdf).md"
done
```

## Plugin System

MarkItDown supports extensible plugins for custom conversion logic. Plugins are disabled by default for security:

```python
from markitdown import MarkItDown

# Enable plugins if needed
md = MarkItDown(enable_plugins=True)
```

## Resources

This skill includes comprehensive reference documentation for each capability:

- **references/document_conversion.md** - Detailed PDF, DOCX, PPTX, XLSX conversion options
- **references/media_processing.md** - Image OCR and audio transcription details
- **references/web_content.md** - HTML, YouTube, and EPUB extraction
- **references/structured_data.md** - CSV, JSON, XML conversion formats
- **references/advanced_integrations.md** - Azure Document Intelligence and LLM integration
- **scripts/batch_convert.py** - Batch processing utility for directories

Related Skills

zinc-database

7
from sanand0/scientific-research

Access ZINC (230M+ purchasable compounds). Search by ZINC ID/SMILES, similarity searches, 3D-ready structures for docking, analog discovery, for virtual screening and drug discovery.

zarr-python

7
from sanand0/scientific-research

Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.

vaex

7
from sanand0/scientific-research

Use this skill for processing and analyzing large tabular datasets (billions of rows) that exceed available RAM. Vaex excels at out-of-core DataFrame operations, lazy evaluation, fast aggregations, efficient visualization of big data, and machine learning on large datasets. Apply when users need to work with large CSV/HDF5/Arrow/Parquet files, perform fast statistics on massive datasets, create visualizations of big data, or build ML pipelines that don't fit in memory.

uspto-database

7
from sanand0/scientific-research

Access USPTO APIs for patent/trademark searches, examination history (PEDS), assignments, citations, office actions, TSDR, for IP analysis and prior art searches.

uniprot-database

7
from sanand0/scientific-research

Direct REST API access to UniProt. Protein searches, FASTA retrieval, ID mapping, Swiss-Prot/TrEMBL. For Python workflows with multiple databases, prefer bioservices (unified interface to 40+ services). Use this for direct HTTP/REST work or UniProt-specific control.

umap-learn

7
from sanand0/scientific-research

UMAP dimensionality reduction. Fast nonlinear manifold learning for 2D/3D visualization, clustering preprocessing (HDBSCAN), supervised/parametric UMAP, for high-dimensional data.

transformers

7
from sanand0/scientific-research

This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.

torchdrug

7
from sanand0/scientific-research

Graph-based drug discovery toolkit. Molecular property prediction (ADMET), protein modeling, knowledge graph reasoning, molecular generation, retrosynthesis, GNNs (GIN, GAT, SchNet), 40+ datasets, for PyTorch-based ML on molecules, proteins, and biomedical graphs.

torch-geometric

7
from sanand0/scientific-research

Graph Neural Networks (PyG). Node/graph classification, link prediction, GCN, GAT, GraphSAGE, heterogeneous graphs, molecular property prediction, for geometric deep learning.

tooluniverse

7
from sanand0/scientific-research

Use this skill when working with scientific research tools and workflows across bioinformatics, cheminformatics, genomics, structural biology, proteomics, and drug discovery. This skill provides access to 600+ scientific tools including machine learning models, datasets, APIs, and analysis packages. Use when searching for scientific tools, executing computational biology workflows, composing multi-step research pipelines, accessing databases like OpenTargets/PubChem/UniProt/PDB/ChEMBL, performing tool discovery for research tasks, or integrating scientific computational resources into LLM workflows.

sympy

7
from sanand0/scientific-research

Use this skill when working with symbolic mathematics in Python. This skill should be used for symbolic computation tasks including solving equations algebraically, performing calculus operations (derivatives, integrals, limits), manipulating algebraic expressions, working with matrices symbolically, physics calculations, number theory problems, geometry computations, and generating executable code from mathematical expressions. Apply this skill when the user needs exact symbolic results rather than numerical approximations, or when working with mathematical formulas that contain variables and parameters.

string-database

7
from sanand0/scientific-research

Query STRING API for protein-protein interactions (59M proteins, 20B interactions). Network analysis, GO/KEGG enrichment, interaction discovery, 5000+ species, for systems biology.