word-parsing

An atomic Word-parsing skill for parsing Word documents, suited to general, cross-industry document parsing scenarios.

by aifinlab

Best use case

word-parsing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using word-parsing should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/word-parsing/SKILL.md --create-dirs "https://raw.githubusercontent.com/aifinlab/FinClaw/main/skills/archive/word-parsing/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/word-parsing/SKILL.md inside your project.
3. Restart your AI agent so that it auto-discovers the skill.

How word-parsing Compares

Feature / Agent	word-parsing	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

It is an atomic Word-parsing skill for parsing Word documents, suited to general, cross-industry document parsing scenarios.

Where can I find the source code?

The source is in the aifinlab/FinClaw repository on GitHub, under skills/archive/word-parsing/.

SKILL.md Source

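# Word Parsing Skill

## Data Sources

This Skill accepts several Word document input formats. The core data sources are:

### 1. Word document types
- **.doc**: Microsoft Word 97-2003 documents
- **.docx**: Microsoft Word 2007 and later documents
- **.rtf**: Rich Text Format documents
- **.odt**: OpenDocument Text documents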
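
Because a file extension can be wrong, a parser may want to check the container format from the file's leading bytes before choosing a reader. The sketch below is an illustration only, not this Skill's implementation: `sniff_format` is a hypothetical helper, and the `"doc"` result only means "OLE2 compound file", a header that other legacy Office files share.

```python
import io
import zipfile

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file header
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header
RTF_MAGIC = b"{\\rtf"                          # RTF files begin with "{\rtf"
ODT_MIMETYPE = b"application/vnd.oasis.opendocument.text"


def sniff_format(data: bytes) -> str:
    """Guess the container format from content, not from the file extension."""
    if data.startswith(OLE_MAGIC):
        # Legacy .doc; other OLE2 files (e.g. .xls) share this header.
        return "doc"
    if data.startswith(RTF_MAGIC):
        return "rtf"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = set(zf.namelist())
                # A .docx package keeps its main part at word/document.xml.
                if "word/document.xml" in names:
                    return "docx"
                # An .odt package declares its type in a "mimetype" member.
                if "mimetype" in names and zf.read("mimetype").strip() == ODT_MIMETYPE:
                    return "odt"
        except zipfile.BadZipFile:
            pass
    return "unknown"
```

A caller would read the file in binary mode and pass the bytes, for example `sniff_format(open(path, "rb").read())`.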

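### 2. Document content types
- **Financial reports**: annual reports, quarterly reports, research reports, announcements
- **Contracts and agreements**: loan contracts, guarantee contracts, investment agreements
- **Legal documents**: legal opinions, compliance documents
- **Business documents**: business descriptions, product introductions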

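### 3. Document characteristics
- **Language**: Chinese, English, mixed multilingual
- **Layout**: standard layout, complex layout, scanned layout
- **Size**: small documents (under 10 MB), large documents (over 10 MB)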

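### 4. Data format requirements
- **File path**: a local file path or a network file URL
- **File encoding**: UTF-8, GBK, GB2312, and similar
- **File permissions**: read permission is required

> Note: this Skill does not collect documents; the user must supply the Word files. Well-formed documents are recommended so that parsing is accurate.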

---

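## Features

This Skill provides comprehensive Word document parsing, covering the following capabilities:

### 1. Text extraction
- **Body text**: extract the document's body content
- **Headings**: extract headings at every level
- **Paragraphs**: extract paragraph content
- **Lists**: extract ordered and unordered lists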
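
As an illustration of how headings, paragraphs, and list items can be told apart in a `.docx` file, here is a minimal standard-library sketch. It is not this Skill's implementation: `classify_paragraphs` and `read_docx_paragraphs` are hypothetical names, and the heading rule assumes Word's default English style IDs (`Heading1`, `Heading2`, ...), which localized or custom templates may not use.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
# Assumption: default English style IDs; other templates need another rule.
HEADING_RE = re.compile(r"^Heading(\d)$")


def classify_paragraphs(document_xml):
    """Return dicts describing each non-empty top-level body paragraph."""
    root = ET.fromstring(document_xml)
    body = root.find("w:body", NS)
    if body is None:
        return []
    items = []
    for p in body.findall("w:p", NS):
        # A paragraph's text is split across runs; join every w:t element.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if not text.strip():
            continue
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val", "") if style is not None else ""
        match = HEADING_RE.match(style_id)
        if match:
            items.append({"kind": "heading", "level": int(match.group(1)), "text": text})
        elif p.find("w:pPr/w:numPr", NS) is not None:
            # Numbering properties mark ordered and unordered list items.
            items.append({"kind": "list_item", "level": None, "text": text})
        else:
            items.append({"kind": "paragraph", "level": None, "text": text})
    return items


def read_docx_paragraphs(path):
    """Open a .docx package and classify the paragraphs of its main part."""
    with zipfile.ZipFile(path) as zf:
        return classify_paragraphs(zf.read("word/document.xml"))
```

Paragraphs inside table cells are deliberately skipped here, since only direct children of the body are visited.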

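### 2. Structure recognition
- **Section hierarchy**: identify the document's chapter and section hierarchy
- **Table of contents**: identify and extract the table of contents
- **Headers and footers**: extract header and footer content
- **Footnotes and endnotes**: extract footnote and endnote content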
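
One common way to turn a flat sequence of headings into a section hierarchy is a stack that tracks the currently open ancestors. The sketch below shows that technique under the assumption that headings arrive as `(level, title)` pairs in document order; `build_section_tree` is a hypothetical helper, not a function this Skill exposes.

```python
def build_section_tree(headings):
    """Nest (level, title) pairs, given in document order, into a tree."""
    root = {"level": 0, "subsections": []}
    stack = [root]  # open ancestors, outermost first
    for level, title in headings:
        if level < 1:
            raise ValueError("heading levels start at 1")
        node = {"level": level, "title": title, "subsections": []}
        # Close every section at the same depth or deeper than the new one.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["subsections"].append(node)
        stack.append(node)
    return root["subsections"]
```

A heading that skips a level (for example a level 3 directly under a level 1) is attached to the nearest shallower ancestor instead of being rejected.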

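### 3. Table recognition
- **Table extraction**: identify and extract table content
- **Table structure**: identify the row and column structure
- **Table formatting**: preserve the table's formatting information
- **Table position**: record where each table appears in the document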
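
For `.docx` input, rows and cells can be read straight from the table markup in `word/document.xml`. The following is a minimal sketch under stated assumptions rather than this Skill's algorithm: `tables_from_xml` is a hypothetical name, merged cells are not expanded, nested tables are ignored, and no page position is produced because this sketch does not lay the document out.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def tables_from_xml(document_xml):
    """Return top-level tables as dicts with row/column counts and cell text."""
    root = ET.fromstring(document_xml)
    body = root.find("w:body", NS)
    if body is None:
        return []
    tables = []
    for table_id, tbl in enumerate(body.findall("w:tbl", NS), start=1):
        data = []
        for tr in tbl.findall("w:tr", NS):
            row = []
            for tc in tr.findall("w:tc", NS):
                # A cell may hold several paragraphs; keep them on separate lines.
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
                    for p in tc.findall("w:p", NS)
                ]
                row.append("\n".join(paragraphs))
            data.append(row)
        tables.append({
            "table_id": table_id,
            "rows": len(data),
            "columns": max((len(r) for r in data), default=0),
            "data": data,
        })
    return tables
```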

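### 4. Chart and image recognition
- **Image extraction**: extract the images embedded in the document
- **Chart recognition**: identify chart types and content
- **Chart position**: record where each chart appears in the document
- **Chart description**: generate a text description of each chart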
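
Word conventionally stores embedded pictures as ordinary members under `word/media/` in the `.docx` package, so a first pass at image extraction can simply list them. This sketch assumes that convention and uses a hypothetical helper, `list_embedded_images`; it reports the stored byte size only, because pixel dimensions would require decoding each image.

```python
import io
import zipfile
from pathlib import PurePosixPath

MEDIA_DIR = PurePosixPath("word/media")  # conventional location, an assumption


def list_embedded_images(docx_bytes):
    """List image members of a .docx package given as raw bytes."""
    images = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        for info in zf.infolist():
            path = PurePosixPath(info.filename)
            if info.is_dir() or path.parent != MEDIA_DIR:
                continue
            images.append({
                "image_id": len(images) + 1,
                "name": path.name,
                "format": path.suffix.lstrip(".").lower(),
                "byte_size": info.file_size,  # uncompressed size in bytes
            })
    return images
```

The raw image data for any listed entry can then be read with `zf.read("word/media/" + name)`.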

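### 5. Formatting information extraction
- **Font information**: extract font, size, color, and related attributes
- **Paragraph formatting**: extract alignment, indentation, and related attributes
- **Style information**: extract the document's style definitions
- **Metadata**: extract document properties such as author and creation time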
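
For `.docx` files, properties such as title, author, and creation time are normally stored in the package's `docProps/core.xml` part. The sketch below reads a few of them; `core_properties_from_xml` and `read_core_properties` are hypothetical names, the output keys are chosen for illustration, and splitting keywords on commas and semicolons is an assumption about how authors enter them.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

CORE_NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def core_properties_from_xml(core_xml):
    """Extract a few core properties; missing fields come back as None."""
    root = ET.fromstring(core_xml)

    def text(path):
        node = root.find(path, CORE_NS)
        return node.text if node is not None and node.text else None

    keywords = text("cp:keywords")
    return {
        "title": text("dc:title"),
        "author": text("dc:creator"),
        "created_date": text("dcterms:created"),
        "modified_date": text("dcterms:modified"),
        # Assumption: keywords are separated by ASCII or full-width , and ;
        "keywords": [k.strip() for k in re.split(r"[,;，；]", keywords) if k.strip()]
        if keywords else [],
    }


def read_core_properties(path):
    """Read docProps/core.xml from a .docx package on disk."""
    with zipfile.ZipFile(path) as zf:
        return core_properties_from_xml(zf.read("docProps/core.xml"))
```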

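### 6. Advanced processing
- **OCR**: run OCR on scanned Word documents
- **Multilingual recognition**: identify multilingual content in the document
- **Layout reconstruction**: reproduce the document's original layout as closely as possible
- **Structured output**: emit the document content in structured form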

---

## Usage Example

### Sample output
```json
{
  "document_info": {
    "filename": "document.docx",
    "file_size": 1024000,
    "page_count": 25,
    "language": "zh-CN",
    "created_date": "2024-01-15",
    "modified_date": "2024-03-20"
  },
  "structure": {
    "title": "2024 Annual Report",
    "sections": [
      {
        "level": 1,
        "title": "Chapter 1 Company Overview",
        "content": "Company overview content...",
        "page": 1,
        "subsections": [
          {
            "level": 2,
            "title": "1.1 Basic Company Information",
            "content": "Basic information content...",
            "page": 1
          }
        ]
      }
    ]
  },
  "tables": [
    {
      "table_id": 1,
      "position": {
        "page": 5,
        "section": "Chapter 2"
      },
      "rows": 10,
      "columns": 5,
      "data": [
        ["Item", "2024", "2023", "2022", "2021"],
        ["Operating revenue", "1000", "900", "800", "700"]
      ]
    }
  ],
  "images": [
    {
      "image_id": 1,
      "position": {
        "page": 8,
        "section": "Chapter 3"
      },
      "format": "png",
      "size": [800, 600]
    }
  ],
  "metadata": {
    "author": "Zhang San",
    "company": "Example Company",
    "keywords": ["annual report", "financial report"]
  }
}
```
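
In the sample output above, each table's `data` array carries its header in the first row. A consumer that prefers one record per data row could reshape it as sketched here; `table_to_records` is a hypothetical helper, and treating the first row as the header is an assumption that holds for the sample but not necessarily for every table.

```python
def table_to_records(table):
    """Turn a table's "data" rows into dicts keyed by the first (header) row."""
    data = table.get("data") or []
    if not data:
        return []
    header, *rows = data
    # zip() stops at the shorter sequence, so ragged rows lose extra cells.
    return [dict(zip(header, row)) for row in rows]


sample_table = {
    "table_id": 1,
    "data": [
        ["Item", "2024", "2023"],
        ["Operating revenue", "1000", "900"],
    ],
}
records = table_to_records(sample_table)
```

Here `records[0]["2024"]` is the string `"1000"`; values stay as text because the sample output stores cells as strings.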

---

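## Notes and Limitations

### 1. Document format requirements
- Standard Word document formats are supported
- Complex layouts may reduce parsing accuracy
- Scanned documents require OCR support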

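### 2. Parsing accuracy
- Text extraction is generally accurate
- Table recognition may be limited for complex tables
- Chart recognition requires good image quality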

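### 3. Performance considerations
- Large documents may take a long time to process
- Memory use grows in proportion to document size
- Processing very large documents in chunks is recommended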
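
The chunking recommendation above can be sketched as a greedy grouping of already extracted paragraphs under a character budget. This is one possible approach, not a method prescribed by the Skill: `chunk_paragraphs` and the default `max_chars` value are illustrative assumptions.

```python
def chunk_paragraphs(paragraphs, max_chars=4000):
    """Group paragraphs into chunks of at most max_chars characters each.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, size = [], [], 0
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and size + len(para) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps sentences intact, at the cost of uneven chunk sizes.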

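### 4. Encoding issues
- The document encoding must be identified correctly
- Special characters may affect parsing results
- UTF-8 encoding is recommended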
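
When the encoding of a text payload is unknown, a simple fallback chain is often enough: try UTF-8 first, then GB18030, which is backward compatible with GBK and GB2312. The sketch below illustrates that idea with a hypothetical helper, `decode_text`; it applies to raw text such as plain-text or RTF content, since the XML parts of a `.docx` declare their own encoding to the XML parser.

```python
def decode_text(data, candidates=("utf-8-sig", "gb18030")):
    """Decode bytes with the first candidate encoding that succeeds.

    Returns (text, encoding_name). "utf-8-sig" also accepts UTF-8 without a
    byte order mark. If nothing fits, undecodable bytes are replaced.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

Order matters: UTF-8 is strict enough that GBK-encoded Chinese text rarely decodes as UTF-8 by accident, so it is tried first.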

### 5. Usage restrictions
- This Skill does not include document editing
- Parsing results require manual review
- Protected documents may not be parseable

---

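## References
- See the related documents in the references/ directory, including:
  - Word document parsing method handbook
  - Table recognition algorithm notes
  - OCR usage guide
  - Performance optimization guide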

## License
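- The code in this skill is released under the MIT License; see the `LICENSE` file
- Dependencies and the runtime environment are defined by `requirements.txt`
- Documentation content is licensed under CC BY 4.0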