pdf-smart-split

Intelligent PDF splitting skill. Split multi-page scanned PDF documents into separate files based on content analysis. Features: (1) Convert PDF to high-quality images (2) Analyze and intelligently name images by content (3) Auto-detect and remove duplicate pages (4) Identify and reorder scrambled pages (5) Merge related images into separate PDFs by content (6) Generate summary reports. Use for: bond filing documents, application materials, contracts and agreements that need intelligent content-based splitting.

16 stars

Best use case

pdf-smart-split is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Intelligent PDF splitting skill. Split multi-page scanned PDF documents into separate files based on content analysis. Features: (1) Convert PDF to high-quality images (2) Analyze and intelligently name images by content (3) Auto-detect and remove duplicate pages (4) Identify and reorder scrambled pages (5) Merge related images into separate PDFs by content (6) Generate summary reports. Use for: bond filing documents, application materials, contracts and agreements that need intelligent content-based splitting.

Teams using pdf-smart-split should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/pdf-smart-split/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/content-media/pdf-smart-split/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/pdf-smart-split/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How pdf-smart-split Compares

Feature / Agentpdf-smart-splitStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Intelligent PDF splitting skill. Split multi-page scanned PDF documents into separate files based on content analysis. Features: (1) Convert PDF to high-quality images (2) Analyze and intelligently name images by content (3) Auto-detect and remove duplicate pages (4) Identify and reorder scrambled pages (5) Merge related images into separate PDFs by content (6) Generate summary reports. Use for: bond filing documents, application materials, contracts and agreements that need intelligent content-based splitting.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# PDF Smart Split Skill

Split multi-page scanned PDF documents into separate files based on content analysis. Supports duplicate detection, page reordering, and intelligent naming.

## Workflow

### 1. PDF to Images
```python
from pdf2image import convert_from_path
images = convert_from_path(pdf_path, dpi=200, thread_count=4)
```
- DPI=200 for quality
- Multi-threading for speed

### 2. Duplicate Detection
```python
import hashlib
def get_image_hash(path):
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()
```
- Use MD5 hash comparison
- Record duplicates for summary

### 3. Content Analysis and Naming
Use `view_file` tool to analyze each image:
- File type (cover, content, signature page)
- Document category (application, agreement, statement)
- Page sequence number

Naming format: `sequence_filetype_description.jpg`

### 4. Page Reordering
Analyze content continuity:
- Check table sequence numbers
- Check chapter numbering progression
- Ensure signature pages follow content

Reorder by logical content, not physical order.

### 5. Merge Images to PDF
```python
from PIL import Image
def images_to_pdf(image_paths, output_pdf):
    images = [Image.open(p).convert('RGB') for p in image_paths]
    images[0].save(output_pdf, "PDF", save_all=True, append_images=images[1:])
```
- Merge signature pages with their content
- Combine continuous content into single files

### 6. Generate Summary
Summary should include:
- Source file info (name, original pages)
- Duplicate removal records (if any)
- Page reorder records (if any)
- Split file list (name, original pages, page count, description)
- Output directory paths

## Script Usage

### Full Processing
```bash
python scripts/convert_dedup.py <pdf_path> [output_dir]
python scripts/merge_images.py <image1> <image2> ... <output.pdf>
```

## Output Structure
```
filename_images/     # Intelligently named images
filename_split_PDF/  # Content-grouped PDFs
```

## Requirements
- Install: `pdf2image`, `Pillow`
- Windows needs `poppler`: `conda install -c conda-forge poppler`

Related Skills

error-diagnostics-smart-debug

16
from diegosouzapw/awesome-omni-skill

Use when working with error diagnostics smart debug

Smart Contracts

16
from diegosouzapw/awesome-omni-skill

Smart contracts are self-executing programs on blockchain. This guide covers Solidity basics, contract deployment, interaction, and frontend integration for building decentralized applications with au

bgo

10
from diegosouzapw/awesome-omni-skill

Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.

Coding & Development

large-data-with-dask

16
from diegosouzapw/awesome-omni-skill

Specific optimization strategies for Python scripts working with larger-than-memory datasets via Dask.

langsmith-fetch

16
from diegosouzapw/awesome-omni-skill

Debug LangChain and LangGraph agents by fetching execution traces from LangSmith Studio. Use when debugging agent behavior, investigating errors, analyzing tool calls, checking memory operations, or examining agent performance. Automatically fetches recent traces and analyzes execution patterns. Requires langsmith-fetch CLI installed.

langchain-tool-calling

16
from diegosouzapw/awesome-omni-skill

How chat models call tools - includes bind_tools, tool choice strategies, parallel tool calling, and tool message handling

langchain-notes

16
from diegosouzapw/awesome-omni-skill

LangChain 框架学习笔记 - 快速查找概念、代码示例和最佳实践。包含 Core components、Middleware、Advanced usage、Multi-agent patterns、RAG retrieval、Long-term memory 等主题。当用户询问 LangChain、Agent、RAG、向量存储、工具使用、记忆系统时使用此 Skill。

langchain-js

16
from diegosouzapw/awesome-omni-skill

Builds LLM-powered applications with LangChain.js for chat, agents, and RAG. Use when creating AI applications with chains, memory, tools, and retrieval-augmented generation in JavaScript.

langchain-agents

16
from diegosouzapw/awesome-omni-skill

Expert guidance for building LangChain agents with proper tool binding, memory, and configuration. Use when creating agents, configuring models, or setting up tool integrations in LangConfig.

lang-python

16
from diegosouzapw/awesome-omni-skill

Python 3.13+ development specialist covering FastAPI, Django, async patterns, data science, testing with pytest, and modern Python features. Use when developing Python APIs, web applications, data pipelines, or writing tests.

kramme:agents-md

16
from diegosouzapw/awesome-omni-skill

This skill should be used when the user asks to "update AGENTS.md", "add to AGENTS.md", "maintain agent docs", or needs to add guidelines to agent instructions. Guides discovery of local skills and enforces structured, keyword-based documentation style.

kontent-ai-automation

16
from diegosouzapw/awesome-omni-skill

Automate Kontent AI tasks via Rube MCP (Composio). Always search tools first for current schemas.