multimodal-parser

A unified multimodal content parser for images, PDFs, DOCX files, and audio. It runs OCR or transcription automatically and outputs structured text for LLM processing.

3,891 stars

Best use case

multimodal-parser is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using multimodal-parser should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/multimodal-parser/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/ayalili/multimodal-parser/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/multimodal-parser/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How multimodal-parser Compares

| Feature / Agent | multimodal-parser | Standard Approach |
|------|------|------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It parses images, PDFs, DOCX files, and audio through a single interface, running OCR or transcription as needed, and returns plain text, Markdown, or structured JSON for LLM processing.

Where can I find the source code?

The source code is on GitHub in the openclaw/skills repository, under skills/ayalili/multimodal-parser.

SKILL.md Source

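# 📄 Multimodal Content Parser
## Key Features
1. 🔄 **Unified interface**: one API parses four categories of format (image, PDF, Word, audio), so you do not need to integrate multiple services
2. 🚀 **Ready to use**: OCR, speech-to-text, and document parsing are built in and need no configuration once the system dependencies are installed
3. 📝 **Multiple output formats**: plain text, Markdown, and structured JSON, to suit different LLM processing needs
4. 💡 **Helpful error messages**: when a dependency is missing, the skill prints the command that installs it, so beginners can get started quickly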

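## 🎯 Use Cases
- The content-parsing layer of a multimodal agent
- File preprocessing for document Q&A and knowledge-base construction
- Image OCR and speech-to-text
- Batch document parsing and structured processing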

## 📝 Parameters
| Parameter | Type | Required | Default | Description |
|------|------|------|--------|------|
| file_path | string | Yes | - | Path of the file to parse |
| file_type | string | No | auto | File type: image/pdf/docx/audio/auto |
| output_format | string | No | text | Output format: text/markdown/structured |
| options.ocr_lang | string | No | chi_sim+eng | OCR recognition language(s) |
| options.audio_model | string | No | base | Whisper model size (base/small/medium/large) |
| options.pdf_page_range | tuple | No | undefined | PDF page range to parse; for example, [1, 10] parses pages 1 to 10 |
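
The parameter table can also be written down as TypeScript types. The sketch below is one way a caller might model the request and fill in the documented defaults; the names `ParseRequest`, `ResolvedRequest`, and `withDefaults` are hypothetical and are not taken from the skill's source.

```typescript
// Hypothetical types mirroring the parameter table; not the skill's actual API.
type FileType = "image" | "pdf" | "docx" | "audio" | "auto";
type OutputFormat = "text" | "markdown" | "structured";
type AudioModel = "base" | "small" | "medium" | "large";
type PageRange = [number, number]; // inclusive [first, last], as in the table's [1, 10] example

interface ParseRequest {
  file_path: string; // the only required field
  file_type?: FileType;
  output_format?: OutputFormat;
  options?: {
    ocr_lang?: string;
    audio_model?: AudioModel;
    pdf_page_range?: PageRange;
  };
}

// The same request with every documented default applied.
interface ResolvedRequest {
  file_path: string;
  file_type: FileType;
  output_format: OutputFormat;
  options: {
    ocr_lang: string;
    audio_model: AudioModel;
    pdf_page_range: PageRange | undefined;
  };
}

// Apply the defaults listed in the table: auto, text, chi_sim+eng, base.
function withDefaults(req: ParseRequest): ResolvedRequest {
  if (req.file_path.trim() === "") {
    throw new Error("file_path is required");
  }
  return {
    file_path: req.file_path,
    file_type: req.file_type ?? "auto",
    output_format: req.output_format ?? "text",
    options: {
      ocr_lang: req.options?.ocr_lang ?? "chi_sim+eng",
      audio_model: req.options?.audio_model ?? "base",
      pdf_page_range: req.options?.pdf_page_range,
    },
  };
}
```

Resolving defaults in one place keeps the rest of a caller's code free of `undefined` checks for everything except the page range, which has no documented default.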

## 💡 Ready-to-Use Examples
### Image OCR
```typescript
const result = await skills.multimodalParser({
  file_path: "./resume.jpg",
  file_type: "image",
  output_format: "markdown"
});
```
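
For orientation, here is a sketch of how an `ocr_lang` value could map onto a Tesseract command line. It assumes the skill shells out to the `tesseract` CLI, which the source does not confirm; `tesseractArgs` is a hypothetical helper.

```typescript
// Hypothetical helper: build argv for the Tesseract CLI so recognized text goes to stdout.
// Assumes the skill invokes `tesseract` directly; the source does not say how it does.
function tesseractArgs(imagePath: string, ocrLang: string = "chi_sim+eng"): string[] {
  // Tesseract takes several languages joined by "+", e.g. "chi_sim+eng".
  const langs = ocrLang.split("+").map((l) => l.trim());
  if (langs.some((l) => l === "")) {
    throw new Error(`invalid ocr_lang "${ocrLang}"`);
  }
  // "stdout" as the output base tells tesseract to print instead of writing a file.
  return [imagePath, "stdout", "-l", langs.join("+")];
}

// Example argv for the resume image from the example above.
const ocrArgv = tesseractArgs("./resume.jpg");
```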

### PDF parsing (with a page range)
```typescript
const result = await skills.multimodalParser({
  file_path: "./document.pdf",
  output_format: "structured",
  options: {
    pdf_page_range: [1, 50] // parse only the first 50 pages
  }
});
```
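
The page range is given as an inclusive `[first, last]` pair. The sketch below validates such a pair and translates it into the `-f` (first page) and `-l` (last page) flags of Poppler's `pdftotext`. Whether the skill calls `pdftotext` this way is an assumption, and `pageRangeArgs` is a hypothetical helper.

```typescript
// Hypothetical helper: validate an inclusive, 1-based page range and turn it
// into the -f / -l flags accepted by Poppler's pdftotext.
function pageRangeArgs(range?: [number, number]): string[] {
  if (range === undefined) return []; // no range given: parse the whole document
  const [first, last] = range;
  if (!Number.isInteger(first) || !Number.isInteger(last)) {
    throw new RangeError("page numbers must be integers");
  }
  if (first < 1 || last < first) {
    throw new RangeError(`invalid page range [${first}, ${last}]`);
  }
  return ["-f", String(first), "-l", String(last)];
}

// Example argv: extract pages 1-50 of document.pdf to stdout ("-" as the output file).
const pdfArgv = [...pageRangeArgs([1, 50]), "document.pdf", "-"];
```

Rejecting a reversed or zero-based range up front gives the caller a clear error instead of an empty result from the underlying tool.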

### Audio transcription
```typescript
const result = await skills.multimodalParser({
  file_path: "./meeting.mp3",
  options: {
    audio_model: "small" // more accurate than the default base model, but slower
  }
});
```
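
As a rough illustration of what `audio_model` controls, the sketch below builds an argument list for the `whisper` command that the `openai-whisper` package installs. It assumes the skill drives Whisper through that CLI, which the source does not state; `whisperArgs` is a hypothetical helper.

```typescript
type WhisperModel = "base" | "small" | "medium" | "large";

// Hypothetical helper: build argv for the `whisper` CLI from openai-whisper.
// Larger models are generally more accurate and slower; "base" is the documented default.
function whisperArgs(
  audioPath: string,
  model: WhisperModel = "base",
  outputDir: string = "."
): string[] {
  return [
    audioPath,
    "--model", model,
    "--output_format", "txt", // plain-text transcript
    "--output_dir", outputDir,
  ];
}

// Example argv for the meeting recording from the example above.
const whisperArgv = whisperArgs("./meeting.mp3", "small");
```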

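## 🔧 Installing Dependencies
Install the dependencies for the file types you need to parse:
```bash
# Install all dependencies (recommended)
## macOS
# ffmpeg is a system binary, so install it with Homebrew rather than pip
brew install tesseract tesseract-lang poppler pandoc ffmpeg
pip install openai-whisper

## Ubuntu/Debian
apt install tesseract-ocr tesseract-ocr-chi-sim poppler-utils pandoc ffmpeg
pip install openai-whisper
```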
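
The skill is described as printing an install command when a dependency is missing. A minimal sketch of such a lookup follows, using the package names from the install commands in this section. The `Tool` and `Platform` types and the `installHint` function are hypothetical, the message wording is invented for illustration, and installing ffmpeg through Homebrew on macOS is an assumption.

```typescript
type Platform = "macos" | "debian";
type Tool = "tesseract" | "pdftotext" | "pandoc" | "ffmpeg" | "whisper";

// Install command per tool and platform, based on the commands in this section.
// Assumption: ffmpeg comes from Homebrew on macOS.
const INSTALL_COMMANDS: Record<Tool, Record<Platform, string>> = {
  tesseract: {
    macos: "brew install tesseract tesseract-lang",
    debian: "apt install tesseract-ocr tesseract-ocr-chi-sim",
  },
  pdftotext: {
    macos: "brew install poppler",
    debian: "apt install poppler-utils",
  },
  pandoc: {
    macos: "brew install pandoc",
    debian: "apt install pandoc",
  },
  ffmpeg: {
    macos: "brew install ffmpeg",
    debian: "apt install ffmpeg",
  },
  whisper: {
    macos: "pip install openai-whisper",
    debian: "pip install openai-whisper",
  },
};

// Hypothetical helper: format a hint for a missing tool.
function installHint(tool: Tool, platform: Platform): string {
  return `Missing dependency "${tool}". Install it with: ${INSTALL_COMMANDS[tool][platform]}`;
}
```

Keeping the commands in a single table means a new platform or tool is a data change rather than a new code path.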

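## Implementation Notes
- Built on a mature open-source toolchain (Tesseract/Poppler/Whisper/Pandoc)
- Detects the file type automatically, so you do not have to specify the format
- Modular design that is easy to extend to more formats
- Standardized output that an LLM can process directly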
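
The source does not say how automatic file type detection works. One simple approach is to map the file extension to a type, sketched below; the extension lists and the `detectFileType` name are assumptions for illustration, and a real implementation might inspect file contents instead.

```typescript
type DetectedType = "image" | "pdf" | "docx" | "audio";

// Assumed extension lists; the skill's actual supported extensions are not documented.
const EXTENSIONS: Record<DetectedType, readonly string[]> = {
  image: ["jpg", "jpeg", "png", "gif", "bmp", "tiff", "webp"],
  pdf: ["pdf"],
  docx: ["docx"],
  audio: ["mp3", "wav", "m4a", "flac", "ogg"],
};

// Hypothetical helper: resolve file_type "auto" from the file extension.
function detectFileType(filePath: string): DetectedType {
  // Take the last path segment, handling both "/" and "\" separators.
  const name = filePath.split(/[\\/]/).pop() ?? "";
  const dot = name.lastIndexOf(".");
  // dot > 0 skips dotfiles such as ".hidden", which have no real extension.
  const ext = dot > 0 ? name.slice(dot + 1).toLowerCase() : "";
  for (const type of Object.keys(EXTENSIONS) as DetectedType[]) {
    if (EXTENSIONS[type].indexOf(ext) !== -1) return type;
  }
  throw new Error(`Cannot detect file type for "${filePath}"; pass file_type explicitly`);
}
```

Throwing on an unknown extension, rather than guessing, points the caller to the explicit `file_type` parameter.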

Related Skills

content-parser

3891
from openclaw/skills

Extract and parse content from URLs. Triggers on: user provides a URL to extract content from, another skill needs to parse source material, "parse this URL", "extract content", "解析链接", "提取内容".

Data & Research

resume-parser

3891
from openclaw/skills

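An intelligent resume parsing system that extracts structured information from resumes in PDF, Word, or image format, analyzes job match, and generates improvement suggestions. Runs entirely locally with no external API. Use cases: (1) parse an uploaded resume file to extract key information, (2) compute how well a resume matches a given job description, (3) generate resume improvement suggestions, (4) export structured resume data.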

document-parser

3891
from openclaw/skills

A high-accuracy document parsing skill that extracts structured data from PDFs, images, and Word documents.

pdf-parser

3891
from openclaw/skills

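Uses the MinerU API to parse PDFs into Markdown, with support for formulas, tables, and OCR. Offers two parsing modes: local files and online URLs. Triggers: (1) the user says "解析 PDF [路径]" (parse PDF [path]), (2) the user says "将 PDF 转为 Markdown" (convert PDF to Markdown), (3) called automatically within paper-workflow. Use cases: academic paper parsing, document extraction, knowledge-base construction.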

unidoc_parser

3891
from openclaw/skills

Parse documents using UniDoc API for conversion to Markdown or JSON format. Supports both synchronous and asynchronous parsing with automatic status polling.

u2-doc-parser

3891
from openclaw/skills

Parse documents using UniDoc API for conversion to Markdown or JSON format. Supports both synchronous and asynchronous parsing with automatic status polling.

clinicaltrials-gov-parser

3880
from openclaw/skills

Monitor and summarize competitor clinical trial status changes from ClinicalTrials.gov. Trigger: When user asks to track clinical trials, monitor trial status changes, get updates on specific trials, or analyze competitor trial activities. Use cases: Pharma competitive intelligence, trial monitoring, status tracking, recruitment updates, completion alerts.

article-factory-wechat

3891
from openclaw/skills

Content & Documentation

humanizer

3891
from openclaw/skills

Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases.

Content & Documentation

find-skills

3891
from openclaw/skills

Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.

General Utilities

tavily-search

3891
from openclaw/skills

Use Tavily API for real-time web search and content extraction. Use when: user needs real-time web search results, research, or current information from the web. Requires Tavily API key.

Data & Research

baidu-search

3891
from openclaw/skills

Search the web using Baidu AI Search Engine (BDSE). Use for live information, documentation, or research topics.

Data & Research