doc-cleaner

Convert PDF, DOCX, XLSX, and text files to clean, structured Markdown. CJK-friendly, table-friendly, privacy-first.

162 stars
Complexity: easy

About this skill

The doc-cleaner skill converts PDFs, Word documents (DOCX), Excel spreadsheets (XLSX), and text files into consistent, structured Markdown. It extracts the content, cleans it, and structures it so the result is ready for downstream use.

Conversion can run locally without AI, which is the fastest option, or use an AI model (Gemini, Groq, or Ollama) to improve cleaning and produce more semantically organized output. Users who need privacy can keep all processing on their local machine. The skill also handles complex tables and CJK (Chinese, Japanese, Korean) text.

Typical uses include automating document-processing workflows, extracting data for analysis, standardizing content for knowledge bases, and preparing documents for version control. Its command-line interface makes it straightforward to call from AI agent workflows.

Best use case

The primary use case for doc-cleaner is converting documents in disparate formats into unified, structured Markdown. It suits researchers, data analysts, content managers, and developers who need to extract, clean, and standardize information from various sources for analysis, publication, or system input, and it replaces tedious manual extraction and formatting.

The output is a clean, structured Markdown file for each input document, optionally enhanced by AI, plus an optional machine-readable JSON summary of the processing results.

Practical example

Example input

Convert the `quarterly_report.pdf` file to clean Markdown, using AI to help structure the content. Save the output to a new directory named 'markdown_reports'.

Example output

```json
{"version":"1.0.0","total":1,"success":1,"failed":0,"files":[{"file":"quarterly_report.pdf","output":"./markdown_reports/quarterly_report.md","status":"ok"}]}
```

When to use this skill

  • When a user asks to convert a document (PDF, DOCX, XLSX, TXT) to Markdown.
  • When there's a need to extract text or tables from structured or semi-structured documents.
  • When processing financial documents or other sensitive information, with an option for privacy-focused local conversion.
  • When automating the conversion of a batch of documents within a directory.

When not to use this skill

  • When the desired output format is not Markdown (e.g., HTML, JSON, or the original format).
  • When the user needs to *edit* the original document, rather than convert its content.
  • When the documents are highly unstructured or purely image-based, with no embedded text, and need advanced OCR beyond basic PDF text extraction.
  • When complex formatting or layout preservation is more critical than structured content extraction.

How doc-cleaner Compares

| Feature / Agent | doc-cleaner | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Easy | N/A |

SKILL.md Source

# doc-cleaner

Convert documents (PDF, DOCX, XLSX, TXT) to clean, structured Markdown.

## When to use

- User asks to convert a document to Markdown
- User wants to extract text or tables from PDF/DOCX/XLSX files
- User wants to clean up bank statements or financial documents
- User asks to process a batch of documents in a directory

## Commands

### Convert a single file (no AI, fastest)
```bash
python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai none
```

### Convert a single file with Gemini structuring
```bash
python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai gemini
```

### Convert a single file with Groq structuring
```bash
python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai groq
```

### Convert all files in a directory
```bash
python3 {baseDir}/cleaner.py --input "{{directory}}" --ai none --output-dir "{{output_dir}}"
```

### Preview without writing (dry run)
```bash
python3 {baseDir}/cleaner.py --input "{{file_path}}" --dry-run --verbose
```

### Get machine-readable result summary
```bash
python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai none --summary
```

The `--summary` flag prints a JSON summary to stdout after processing:
```json
{"version":"1.0.0","total":3,"success":2,"failed":1,"files":[{"file":"report.pdf","output":"./output/report.md","status":"ok"},{"file":"scan.pdf","output":null,"status":"no_content"},{"file":"data.xlsx","output":"./output/data.md","status":"ok"}]}
```
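
As a minimal sketch, the summary can be parsed in Python to find the files that did not convert. The helper name `failed_files` is hypothetical. The sketch also makes two assumptions: stdout contains only the JSON document, and any status other than `"ok"` counts as unsuccessful. The source shows only the `ok` and `no_content` statuses.

```python
import json

def failed_files(summary_text):
    """Return (file, status) pairs for entries whose status is not "ok".

    Assumption: every non-"ok" status means the file produced no usable output.
    """
    summary = json.loads(summary_text)
    return [(entry["file"], entry["status"])
            for entry in summary["files"]
            if entry["status"] != "ok"]

# Sample taken from the summary example above.
sample = (
    '{"version":"1.0.0","total":3,"success":2,"failed":1,"files":['
    '{"file":"report.pdf","output":"./output/report.md","status":"ok"},'
    '{"file":"scan.pdf","output":null,"status":"no_content"},'
    '{"file":"data.xlsx","output":"./output/data.md","status":"ok"}]}'
)

print(failed_files(sample))  # [('scan.pdf', 'no_content')]
```

An agent could use the returned list to decide which files to retry, for example by rerunning a scanned PDF in an AI mode.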

## Options

| Flag | Description |
|---|---|
| `--input, -i` | File or directory to process (required, non-recursive) |
| `--output-dir, -o` | Output directory (default: ./output) |
| `--ai` | `gemini`, `groq`, `ollama`, or `none` (default: the config value if set, otherwise `gemini`) |
| `--password` | PDF decryption password |
| `--config` | Path to config JSON |
| `--summary` | Print JSON summary to stdout after processing |
| `--dry-run` | Preview without writing files |
| `--verbose` | Enable debug logging |
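
When the skill is called from a script rather than typed by hand, the flags in the table can be assembled into an argument list. The following is a sketch only: `build_command` and its parameter names are hypothetical helpers and not part of doc-cleaner. Only the flags themselves come from the table, and the `script` argument stands in for the `{baseDir}/cleaner.py` placeholder.

```python
import shlex

# Values accepted by --ai according to the options table.
VALID_AI = {"gemini", "groq", "ollama", "none"}

def build_command(script, input_path, ai="none", output_dir=None,
                  password=None, config=None,
                  summary=False, dry_run=False, verbose=False):
    """Assemble an argv list for cleaner.py (hypothetical helper)."""
    if ai not in VALID_AI:
        raise ValueError(f"unsupported --ai value: {ai!r}")
    argv = ["python3", script, "--input", input_path, "--ai", ai]
    # Flags that take a value are added only when a value is given.
    for flag, value in (("--output-dir", output_dir),
                        ("--password", password),
                        ("--config", config)):
        if value is not None:
            argv += [flag, value]
    # Boolean switches are added only when enabled.
    for flag, enabled in (("--summary", summary),
                          ("--dry-run", dry_run),
                          ("--verbose", verbose)):
        if enabled:
            argv.append(flag)
    return argv

cmd = build_command("cleaner.py", "./statements", output_dir="./md", summary=True)
print(shlex.join(cmd))
```

Passing the command as a list instead of a single shell string avoids quoting problems when file paths contain spaces or CJK characters.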

## Supported formats

PDF (native, scanned, encrypted), DOCX, XLSX, XLS, CSV, TXT, MD

## Exit codes

| Code | Meaning |
|---|---|
| 0 | All files processed successfully |
| 1 | Some files failed (partial success) |
| 2 | No processable files found or config error |
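
A caller can branch on these codes after running the command. The sketch below maps each code to its meaning from the table. The names `describe_exit` and `run_and_describe` are hypothetical, and the fallback message for codes the table does not list is an assumption.

```python
import subprocess

# Meanings copied from the exit-code table.
EXIT_MEANINGS = {
    0: "All files processed successfully",
    1: "Some files failed (partial success)",
    2: "No processable files found or config error",
}

def describe_exit(code):
    """Translate an exit code into the meaning listed in the table."""
    return EXIT_MEANINGS.get(code, f"Unexpected exit code {code}")

def run_and_describe(argv):
    """Run a command and return (exit code, meaning, captured stdout)."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    return completed.returncode, describe_exit(completed.returncode), completed.stdout

print(describe_exit(1))  # Some files failed (partial success)
```

Because code 1 signals partial success, some Markdown files may still have been written. Combining it with the `--summary` output shows which inputs failed.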

## Notes

- Output defaults to `./output/` relative to current directory
- For scanned PDFs, AI mode (`gemini`, `groq`, or `ollama`) gives much better results
- `--ai none` requires zero API keys and zero network access
- CJK encodings (Big5, CP950, UTF-16) are auto-detected
- Tables in DOCX and XLSX are preserved as Markdown pipe tables
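
To illustrate the pipe-table output mentioned in the last note, here is a minimal sketch of rendering rows as a Markdown pipe table. It is not doc-cleaner's actual implementation. The function `to_pipe_table` and the sample rows are hypothetical, and the escaping and padding rules are assumptions about what a reasonable converter would do.

```python
def to_pipe_table(rows):
    """Render rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape literal pipes and flatten newlines so each row stays on one line.
        cells = [str(cell).replace("|", "\\|").replace("\n", " ") for cell in row]
        # Pad short rows so every row has the same number of columns.
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "|" + "---|" * width]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)

# Illustrative data only, including CJK text.
print(to_pipe_table([["日期", "Amount"], ["2024-01-05", "1,200"]]))
```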
