unstructured-document-ingestion

An atomic skill for ingesting unstructured documents (PDF, Word, email), suited to general-purpose industry data-ingestion scenarios.

105 stars

Best use case

unstructured-document-ingestion is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using unstructured-document-ingestion should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/unstructured-document-ingestion/SKILL.md --create-dirs "https://raw.githubusercontent.com/aifinlab/FinClaw/main/skills/archive/unstructured-document-ingestion/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/unstructured-document-ingestion/SKILL.md inside your project
Restart your AI agent; it will auto-discover the skill

How unstructured-document-ingestion Compares

Feature / Agent	unstructured-document-ingestion	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

It ingests unstructured documents (PDF, Word, email) as an atomic skill, suited to general-purpose industry data-ingestion scenarios.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

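# Unstructured Document Ingestion Skill

## Data Sources

This Skill supports several unstructured document input formats. The core data sources are:

### 1. Document Types
- **PDF documents**: documents in PDF format
- **Word documents**: documents in Word format
- **Email documents**: email files and email attachments
- **Other documents**: text files, RTF files, and so on

### 2. Document Origins
- **Local files**: documents on the local file system
- **Network files**: documents at network URLs
- **Mail systems**: emails and attachments from a mail system
- **Document repositories**: documents in a document management system

### 3. Document Characteristics
- **Document size**: small documents (<1MB), large documents (>100MB)
- **Document format**: standard and non-standard formats
- **Document language**: Chinese, English, mixed multilingual
- **Document quality**: clean documents, scanned documents, low-quality documents

### 4. Data Format Requirements
- **File path**: local file path or network file URL
- **File format**: PDF, Word, email, and similar formats
- **File encoding**: UTF-8, GBK, GB2312, and so on
- **File permissions**: read permission is required

> Note: This Skill does not include document collection; users must supply the unstructured document files. Well-formed documents are recommended so that ingestion is accurate.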

---

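## Features

This Skill provides broad unstructured document ingestion capabilities, covering the following functions:

### 1. Document Reading
- **PDF reading**: read the content of PDF documents
- **Word reading**: read the content of Word documents
- **Email reading**: read email content and attachments
- **File format identification**: automatically identify the file format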
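
As an illustration of automatic format identification, the following is a minimal sketch that inspects a file's leading bytes and falls back to the extension. The function name `detect_format` and the returned labels are hypothetical and are not taken from this Skill's implementation; only the Python standard library is used.

```python
import zipfile
from pathlib import Path

# Signature of an OLE2 compound file (legacy .doc, but also .xls/.ppt/.msg).
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def detect_format(path):
    """Return a best-effort format label for the file at `path` (hypothetical helper)."""
    path = Path(path)
    with path.open("rb") as fh:
        head = fh.read(8)
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head.startswith(OLE2_MAGIC):
        # The container alone does not prove it is Word, so confirm by extension.
        return "doc" if path.suffix.lower() == ".doc" else "ole2"
    if zipfile.is_zipfile(path):
        # A .docx file is a ZIP archive that contains word/document.xml.
        with zipfile.ZipFile(path) as archive:
            if "word/document.xml" in archive.namelist():
                return "docx"
        return "zip"
    # No recognizable signature: fall back to the file extension.
    return {".eml": "email", ".txt": "text"}.get(path.suffix.lower(), "unknown")
```

Checking content before the extension means a renamed or mislabeled file is still routed to a suitable reader.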

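### 2. Document Parsing
- **Text extraction**: extract the text content of a document
- **Structure recognition**: recognize a document's structure and formatting
- **Metadata extraction**: extract a document's metadata
- **Content segmentation**: split document content into segments for processing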
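
Content segmentation can be sketched as follows, assuming paragraphs are separated by blank lines and that a character budget per segment is an acceptable unit. The function `split_into_segments` and the `max_chars` parameter are hypothetical choices for illustration, not the Skill's actual segmentation rule.

```python
import re


def split_into_segments(text, max_chars=500):
    """Pack blank-line-separated paragraphs into segments of at most `max_chars`."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    segments, current = [], ""
    for para in paragraphs:
        # Hard-split any paragraph that is longer than the limit on its own.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # The piece does not fit, so close the current segment and start a new one.
                segments.append(current)
                current = piece
    if current:
        segments.append(current)
    return segments
```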

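### 3. Document Conversion
- **Format conversion**: convert the document format
- **Encoding conversion**: convert the document encoding
- **Structure conversion**: convert the document structure
- **Content normalization**: normalize document content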
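
For encoding conversion, one possible approach is to try a short list of candidate encodings in order and re-encode the result as UTF-8. This is a minimal sketch: the candidate list mirrors the encodings named in this document, and `decode_with_fallback` and `to_utf8` are hypothetical helper names. Some byte sequences are valid in more than one encoding, so the detected label is a heuristic and not a guarantee.

```python
# Candidate encodings, tried in order; strict UTF-8 first because it rejects most non-UTF-8 input.
CANDIDATE_ENCODINGS = ("utf-8", "gb2312", "gbk")


def decode_with_fallback(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) using the first candidate that decodes `raw` without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


def to_utf8(raw):
    """Re-encode bytes in any candidate encoding as UTF-8 bytes."""
    text, _ = decode_with_fallback(raw)
    return text.encode("utf-8")
```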

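### 4. Document Indexing
- **Content index**: build an index of document content
- **Metadata index**: build an index of document metadata
- **Full-text search**: support full-text search
- **Classification tags**: add classification tags to documents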
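
The idea behind a content index with full-text search can be illustrated with a small in-memory inverted index. This is a sketch under simplifying assumptions: the tokenizer splits on word characters, which is not adequate for Chinese text (that would need a proper word segmenter), and the names `tokenize`, `build_index`, and `search` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercased word tokens; CJK text would need a real segmenter instead."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

A query is treated as a conjunction, so adding terms narrows the result set.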

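### 5. Document Validation
- **Integrity validation**: verify that a document is complete
- **Readability validation**: verify that a document can be read
- **Quality assessment**: assess the quality of a document
- **Format validation**: verify the document format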
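
A few of these checks can be approximated with basic file-system tests. The sketch below is illustrative only: `validate_document` and its issue codes are hypothetical, and the PDF check relies on the convention that a PDF file ends with an `%%EOF` marker, which is a rough truncation heuristic and not a full integrity check.

```python
import os


def validate_document(path):
    """Return a list of issue codes; an empty list means the basic checks passed."""
    if not os.path.isfile(path):
        return ["not_found"]
    if not os.access(path, os.R_OK):
        return ["not_readable"]
    size = os.path.getsize(path)
    if size == 0:
        return ["empty"]
    issues = []
    with open(path, "rb") as fh:
        if fh.read(5) == b"%PDF-":
            # Look for the end-of-file marker in the last 1024 bytes.
            fh.seek(max(0, size - 1024))
            if b"%%EOF" not in fh.read():
                issues.append("pdf_missing_eof")
    return issues
```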

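### 6. Advanced Processing
- **Batch processing**: process multiple documents in one batch
- **Incremental processing**: process only newly added documents
- **Document deduplication**: identify and remove duplicate documents
- **Ingestion report**: generate a document ingestion report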
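
Deduplication and incremental processing can both be built on a content hash. The following is a minimal sketch assuming exact byte-level duplicates are the target (near-duplicate detection is a different problem); `content_hash` and `deduplicate` are hypothetical names.

```python
import hashlib


def content_hash(path, chunk_size=1024 * 1024):
    """SHA-256 of a file, read in chunks so large documents do not load into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate(paths, seen=None):
    """Split `paths` into (unique, duplicates) by content hash.

    Pass a `seen` set that persists between runs to skip documents ingested earlier.
    """
    seen = set() if seen is None else seen
    unique, duplicates = [], []
    for path in paths:
        file_hash = content_hash(path)
        if file_hash in seen:
            duplicates.append(path)
        else:
            seen.add(file_hash)
            unique.append(path)
    return unique, duplicates
```

Reusing the same `seen` set across batches gives a simple form of incremental processing: a second run only reports files whose content has not been hashed before.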

---

## Usage Example

### Sample Output
```json
{
  "source_info": {
    "source_type": "file_system",
    "source_path": "/documents",
    "document_count": 100
  },
  "ingestion_config": {
    "supported_formats": ["pdf", "docx", "doc"],
    "extract_text": true,
    "extract_metadata": true,
    "index_content": true
  },
  "ingestion_results": {
    "total_documents": 100,
    "successful_documents": 95,
    "failed_documents": 5,
    "ingestion_time": "2024-03-15T10:00:00",
    "duration": "300s"
  },
  "document_samples": [
    {
      "document_id": "DOC001",
      "filename": "annual_report.pdf",
      "file_size": 5120000,
      "file_type": "pdf",
      "page_count": 200,
      "text_length": 50000,
      "metadata": {
        "title": "2024年度报告",
        "author": "示例公司",
        "created_date": "2024-03-01",
        "modified_date": "2024-03-10"
      },
      "extraction_status": "success",
      "indexed": true
    }
  ],
  "statistics": {
    "documents_processed": 100,
    "documents_indexed": 95,
    "total_text_extracted": 5000000,
    "average_processing_time": "3s",
    "success_rate": 0.95
  }
}
```
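
The `statistics` block in the sample above could be derived from per-document results along the following lines. This sketch assumes each result is a dictionary carrying the `extraction_status`, `indexed`, and `text_length` fields shown in `document_samples`; the `summarize` function is hypothetical, and the timing fields are omitted.

```python
def summarize(results):
    """Aggregate per-document results into summary statistics."""
    total = len(results)
    successful = sum(1 for r in results if r.get("extraction_status") == "success")
    return {
        "documents_processed": total,
        "documents_indexed": sum(1 for r in results if r.get("indexed")),
        "total_text_extracted": sum(r.get("text_length", 0) for r in results),
        # Guard against division by zero for an empty batch.
        "success_rate": successful / total if total else 0.0,
    }
```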

---

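## Notes and Limitations

### 1. Document Format Requirements
- Documents in standard formats are ingested more accurately
- Non-standard formats may degrade ingestion
- Scanned documents require OCR support

### 2. Parsing Accuracy
- Clean documents are parsed more accurately
- Blurry documents may degrade parsing
- Complex layouts may require special handling

### 3. Document Size
- Small documents are processed quickly
- Large documents may take longer
- Very large documents may need to be processed in segments

### 4. Document Quality
- High-quality documents give better ingestion results
- Low-quality documents may degrade ingestion
- Low-quality documents need preprocessing

### 5. Usage Limits
- This Skill does not include document editing
- Ingestion results require manual review
- Complex documents may require manual processing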

---

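## References
- See the related documents in the references/ directory, including:
  - Unstructured document ingestion method handbook
  - PDF/Word parsing guide
  - Document indexing notes
  - Performance optimization guide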

Related Skills

structured-data-ingestion

105

from aifinlab/FinClaw

An atomic skill for ingesting structured data (tables, APIs, databases), suited to general-purpose industry data-ingestion scenarios.

semi-structured-data-ingestion

105

from aifinlab/FinClaw

An atomic skill for ingesting semi-structured data (Excel, forms), suited to general-purpose industry data-ingestion scenarios.

realtime-stream-ingestion

105

from aifinlab/FinClaw

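An atomic skill for ingesting real-time stream data (market quotes, trades, event streams), suited to general-purpose industry data-ingestion scenarios.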

fundraising-document-comparison-assistant

105

from aifinlab/FinClaw

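Fundraising document comparison assistant, suited to compliance review, internal review, investment decisions, version management, and similar scenarios. Trigger this skill proactively in the following cases:
- The user provides two or more files or texts and needs the differences found (even without saying "compare")
- The user says "what changed in the new version", "help me look at the changes", "how does it differ from the old version", or "where were adjustments made"
- The user mentions revisions, updates, or adjustments to a prospectus, fund contract, partnership agreement, private placement plan, asset management plan, or marketing materials
- The user describes changes to certain clauses and needs their impact assessed (even with only one file)
- The user needs the result organized into a compliance report, approval materials, or an internal review opinion
- The user asks "is this change important", "does it need to be re-filed", or "is this change compliant"
Do not wait for the user to say "document comparison" explicitly: whenever version changes, clause adjustments, or difference identification in fundraising, fund, asset management, or financing documents are involved, start this skill proactively.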