mineru-pdf-extractor

Extract PDF content to Markdown using MinerU API. Supports formulas, tables, OCR. Provides both local file and online URL parsing methods.

3,891 stars

Best use case

mineru-pdf-extractor is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Extract PDF content to Markdown using MinerU API. Supports formulas, tables, OCR. Provides both local file and online URL parsing methods.

Teams using mineru-pdf-extractor should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/mineru-pdf-extractor/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/a-i-r/mineru-pdf-extractor/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/mineru-pdf-extractor/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How mineru-pdf-extractor Compares

Feature / Agentmineru-pdf-extractorStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Extract PDF content to Markdown using MinerU API. Supports formulas, tables, OCR. Provides both local file and online URL parsing methods.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# MinerU PDF Extractor

Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.

> **Note**: This is a community skill, not an official MinerU product. You need to obtain your own API key from [MinerU](https://mineru.net/).

---

## 📁 Skill Structure

```
mineru-pdf-extractor/
├── SKILL.md                          # English documentation
├── SKILL_zh.md                       # Chinese documentation
├── docs/                             # Documentation
│   ├── Local_File_Parsing_Guide.md   # Local PDF parsing detailed guide (English)
│   ├── Online_URL_Parsing_Guide.md   # Online PDF parsing detailed guide (English)
│   ├── MinerU_本地文档解析完整流程.md  # Local parsing complete guide (Chinese)
│   └── MinerU_在线文档解析完整流程.md  # Online parsing complete guide (Chinese)
└── scripts/                          # Executable scripts
    ├── local_file_step1_apply_upload_url.sh    # Local parsing Step 1
    ├── local_file_step2_upload_file.sh         # Local parsing Step 2
    ├── local_file_step3_poll_result.sh         # Local parsing Step 3
    ├── local_file_step4_download.sh            # Local parsing Step 4
    ├── online_file_step1_submit_task.sh        # Online parsing Step 1
    └── online_file_step2_poll_result.sh        # Online parsing Step 2
```

---

## 🔧 Requirements

### Required Environment Variables

Scripts automatically read MinerU Token from environment variables (choose one):

```bash
# Option 1: Set MINERU_TOKEN
export MINERU_TOKEN="your_api_token_here"

# Option 2: Set MINERU_API_KEY
export MINERU_API_KEY="your_api_token_here"
```

### Required Command-Line Tools

- `curl` - For HTTP requests (usually pre-installed)
- `unzip` - For extracting results (usually pre-installed)

### Optional Tools

- `jq` - For enhanced JSON parsing and security (recommended but not required)
  - If not installed, scripts will use fallback methods
  - Install: `apt-get install jq` (Debian/Ubuntu) or `brew install jq` (macOS)

### Optional Configuration

```bash
# Set API base URL (default is pre-configured)
export MINERU_BASE_URL="https://mineru.net/api/v4"
```

> 💡 **Get Token**: Visit https://mineru.net/apiManage/docs to register and obtain an API Key

---

## 📄 Feature 1: Parse Local PDF Documents

For locally stored PDF files. Requires 4 steps.

### Quick Start

```bash
cd scripts/

# Step 1: Apply for upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx

# Step 2: Upload file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf

# Step 3: Poll for results
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx

# Step 4: Download results
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/
```

### Script Descriptions

#### local_file_step1_apply_upload_url.sh

Apply for upload URL and batch_id.

**Usage:**
```bash
./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]
```

**Parameters:**
- `language`: `ch` (Chinese), `en` (English), `auto` (auto-detect), default `ch`
- `layout_model`: `doclayout_yolo` (fast), `layoutlmv3` (accurate), default `doclayout_yolo`

**Output:**
```
BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...
```

---

#### local_file_step2_upload_file.sh

Upload PDF file to the presigned URL.

**Usage:**
```bash
./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>
```

---

#### local_file_step3_poll_result.sh

Poll extraction results until completion or failure.

**Usage:**
```bash
./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]
```

**Output:**
```
FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip
```

---

#### local_file_step4_download.sh

Download result ZIP and extract.

**Usage:**
```bash
./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]
```

**Output Structure:**
```
extracted/
├── full.md              # 📄 Markdown document (main result)
├── images/              # 🖼️ Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data
```

### Detailed Documentation

📚 **Complete Guide**: See `docs/Local_File_Parsing_Guide.md`

---

## 🌐 Feature 2: Parse Online PDF Documents (URL Method)

For PDF files already available online (e.g., arXiv, websites). Only 2 steps, more concise and efficient.

### Quick Start

```bash
cd scripts/

# Step 1: Submit parsing task (provide URL directly)
./online_file_step1_submit_task.sh "https://arxiv.org/pdf/2410.17247.pdf"
# Output: TASK_ID=xxx

# Step 2: Poll results and auto-download/extract
./online_file_step2_poll_result.sh "$TASK_ID" extracted/
```

### Script Descriptions

#### online_file_step1_submit_task.sh

Submit parsing task for online PDF.

**Usage:**
```bash
./online_file_step1_submit_task.sh <pdf_url> [language] [layout_model]
```

**Parameters:**
- `pdf_url`: Complete URL of the online PDF (required)
- `language`: `ch` (Chinese), `en` (English), `auto` (auto-detect), default `ch`
- `layout_model`: `doclayout_yolo` (fast), `layoutlmv3` (accurate), default `doclayout_yolo`

**Output:**
```
TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```

---

#### online_file_step2_poll_result.sh

Poll extraction results, automatically download and extract when complete.

**Usage:**
```bash
./online_file_step2_poll_result.sh <task_id> [output_directory] [max_retries] [retry_interval_seconds]
```

**Output Structure:**
```
extracted/
├── full.md              # 📄 Markdown document (main result)
├── images/              # 🖼️ Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data
```

### Detailed Documentation

📚 **Complete Guide**: See `docs/Online_URL_Parsing_Guide.md`

---

## 📊 Comparison of Two Parsing Methods

| Feature | **Local PDF Parsing** | **Online PDF Parsing** |
|---------|----------------------|------------------------|
| **Steps** | 4 steps | 2 steps |
| **Upload Required** | ✅ Yes | ❌ No |
| **Average Time** | 30-60 seconds | 10-20 seconds |
| **Use Case** | Local files | Files already online (arXiv, websites, etc.) |
| **File Size Limit** | 200MB | Limited by source server |

---

## ⚙️ Advanced Usage

### Batch Process Local Files

```bash
for pdf in /path/to/pdfs/*.pdf; do
    echo "Processing: $pdf"
    
    # Step 1
    result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
    batch_id=$(echo "$result" | grep BATCH_ID | cut -d= -f2)
    upload_url=$(echo "$result" | grep UPLOAD_URL | cut -d= -f2)
    
    # Step 2
    ./local_file_step2_upload_file.sh "$upload_url" "$pdf"
    
    # Step 3
    zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep FULL_ZIP_URL | cut -d= -f2)
    
    # Step 4
    filename=$(basename "$pdf" .pdf)
    ./local_file_step4_download.sh "$zip_url" "${filename}.zip" "${filename}_extracted"
done
```

### Batch Process Online Files

```bash
for url in \
  "https://arxiv.org/pdf/2410.17247.pdf" \
  "https://arxiv.org/pdf/2409.12345.pdf"; do
    echo "Processing: $url"
    
    # Step 1
    result=$(./online_file_step1_submit_task.sh "$url" 2>&1)
    task_id=$(echo "$result" | grep TASK_ID | cut -d= -f2)
    
    # Step 2
    filename=$(basename "$url" .pdf)
    ./online_file_step2_poll_result.sh "$task_id" "${filename}_extracted"
done
```

---

## ⚠️ Notes

1. **Token Configuration**: Scripts prioritize `MINERU_TOKEN`, fall back to `MINERU_API_KEY` if not found
2. **Token Security**: Do not hard-code tokens in scripts; use environment variables
3. **URL Accessibility**: For online parsing, ensure the provided URL is publicly accessible
4. **File Limits**: Single file recommended not exceeding 200MB, maximum 600 pages
5. **Network Stability**: Ensure stable network when uploading large files
6. **Security**: This skill includes input validation and sanitization to prevent JSON injection and directory traversal attacks
7. **Optional jq**: Installing `jq` provides enhanced JSON parsing and additional security checks

---

## 📚 Reference Documentation

| Document | Description |
|----------|-------------|
| `docs/Local_File_Parsing_Guide.md` | Detailed curl commands and parameters for local PDF parsing |
| `docs/Online_URL_Parsing_Guide.md` | Detailed curl commands and parameters for online PDF parsing |

External Resources:
- 🏠 **MinerU Official**: https://mineru.net/
- 📖 **API Documentation**: https://mineru.net/apiManage/docs
- 💻 **GitHub Repository**: https://github.com/opendatalab/MinerU

---

*Skill Version: 1.0.0*  
*Release Date: 2026-02-18*  
*Community Skill - Not affiliated with MinerU official*

Related Skills

recipe-video-extractor

3891
from openclaw/skills

Extract a structured cooking recipe from a shared video URL when the user sends `recipe <url>`. Prioritize caption/description and comments via browser automation, then use web search/fetch as fallback with clear source attribution.

pdf-process-mineru

3891
from openclaw/skills

PDF document parsing tool based on local MinerU, supports converting PDF to Markdown, JSON, and other machine-readable formats.

invoice-extractor

3891
from openclaw/skills

Extract invoice information from images and PDF files using Baidu OCR API, export to Excel. Supports single file, multiple files, or entire directory processing. Use when the user mentions invoices, invoice recognition, extracting invoice data, processing receipts, converting invoices to Excel, or batch processing invoice files.

methodology-extractor

3891
from openclaw/skills

Batch extraction of experimental methods from multiple papers for protocol.

clinical-data-extractor

3891
from openclaw/skills

Extract clinical trial data from pharmaceutical conference websites or PDF documents. Use when user provides a URL or PDF file containing innovative drug clinical trial data and needs structured extraction of: drug name, manufacturer, indication, clinical phase, trial name, conference, efficacy and safety data (presented as tables), and markdown output to "药品名称@适应症.md" file.

terabox-link-extractor

3891
from openclaw/skills

Direct link extraction from TeraBox URLs using the XAPIverse protocol. Extracts high-speed download and stream links (All Resolutions) without browser session requirements. Use when the user provides a TeraBox link and wants to download or stream content directly.

pdf-figure-extractor

3891
from openclaw/skills

从PDF论文中精确提取Figure图片,自动分析PDF结构、定位caption位置、裁剪干净图形,并验证图片质量。支持学术新闻稿、论文写作等场景的自动化图片处理。

Web Content Extractor - 网页内容提取器

3891
from openclaw/skills

**版本**: 2.0

---

3891
from openclaw/skills

name: article-factory-wechat

Content & Documentation

humanizer

3891
from openclaw/skills

Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases.

Content & Documentation

find-skills

3891
from openclaw/skills

Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.

General Utilities

tavily-search

3891
from openclaw/skills

Use Tavily API for real-time web search and content extraction. Use when: user needs real-time web search results, research, or current information from the web. Requires Tavily API key.

Data & Research