doc-cleaner
Convert PDF, DOCX, XLSX, and text files to clean, structured Markdown. CJK-friendly, table-friendly, privacy-first.
About this skill
The doc-cleaner skill converts PDFs, Word documents (DOCX), Excel spreadsheets (XLSX), and text files into consistent, structured Markdown. It extracts the content, cleans it, and structures it for downstream use. Conversion can run quickly and locally without AI, or it can use Gemini, Groq, or Ollama to structure and clean the output further. Because the no-AI mode keeps all processing on the local machine, users who need privacy can choose it. The skill handles complex tables and CJK (Chinese, Japanese, Korean) text. Typical uses include automating document-processing workflows, extracting data for analysis, standardizing content for knowledge bases, and preparing documents for version control. Its command-line interface lets AI agents call it as part of automated tasks.
Best use case
The primary use case for doc-cleaner is converting documents in different formats into unified, structured Markdown. It suits researchers, data analysts, content managers, and developers who need to extract, clean, and standardize information from varied sources for analysis, publication, or system input, in place of manual extraction and formatting.
Users can expect clean, structured Markdown files derived from their input documents, potentially enhanced by AI, along with an optional machine-readable JSON summary of the processing results.
Practical example
Example input
Convert the `quarterly_report.pdf` file to clean Markdown, using AI to help structure the content. Save the output to a new directory named 'markdown_reports'.
Example output
```json
{"version":"1.0.0","total":1,"success":1,"failed":0,"files":[{"file":"quarterly_report.pdf","output":"./markdown_reports/quarterly_report.md","status":"ok"}]}
```
When to use this skill
- When a user asks to convert a document (PDF, DOCX, XLSX, TXT) to Markdown.
- When there's a need to extract text or tables from structured or semi-structured documents.
- When processing financial documents or other sensitive information, with an option for privacy-focused local conversion.
- When automating the conversion of a batch of documents within a directory.
When not to use this skill
- When the desired output format is not Markdown (e.g., HTML, JSON, or the original format).
- When the user needs to *edit* the original document, rather than convert its content.
- When dealing with highly unstructured or purely image-based documents without embedded text that require advanced OCR beyond basic PDF text extraction.
- When complex formatting or layout preservation is more critical than structured content extraction.
How doc-cleaner Compares
| Feature / Agent | doc-cleaner | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Easy | N/A |
Frequently Asked Questions
What does this skill do?
Convert PDF, DOCX, XLSX, and text files to clean, structured Markdown. CJK-friendly, table-friendly, privacy-first.
How difficult is it to install?
The installation complexity is rated as easy. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# doc-cleaner
Convert documents (PDF, DOCX, XLSX, TXT) to clean, structured Markdown.
## When to use
- User asks to convert a document to Markdown
- User wants to extract text or tables from PDF/DOCX/XLSX files
- User wants to clean up bank statements or financial documents
- User asks to process a batch of documents in a directory
## Commands
### Convert a single file (no AI, fastest)
```bash
python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai none
```
### Convert a single file with Gemini structuring
```bash
python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai gemini
```
### Convert a single file with Groq structuring
```bash
python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai groq
```
### Convert all files in a directory
```bash
python3 {baseDir}/cleaner.py --input "{{directory}}" --ai none --output-dir "{{output_dir}}"
```
### Preview without writing (dry run)
```bash
python3 {baseDir}/cleaner.py --input "{{file_path}}" --dry-run --verbose
```
### Get machine-readable result summary
```bash
python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai none --summary
```
The `--summary` flag prints a JSON summary to stdout after processing:
```json
{"version":"1.0.0","total":3,"success":2,"failed":1,"files":[{"file":"report.pdf","output":"./output/report.md","status":"ok"},{"file":"scan.pdf","output":null,"status":"no_content"},{"file":"data.xlsx","output":"./output/data.md","status":"ok"}]}
```
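A calling script or agent would typically parse this summary rather than read it by eye. The following is a minimal sketch that assumes the field names shown in the sample above (`files`, `file`, `output`, `status`); `split_results` is a hypothetical helper, not part of doc-cleaner, and the sample string stands in for the stdout of a real run.

```python
import json

# Sample copied from the summary shown above; a real run would supply its own stdout.
SAMPLE_SUMMARY = (
    '{"version":"1.0.0","total":3,"success":2,"failed":1,"files":['
    '{"file":"report.pdf","output":"./output/report.md","status":"ok"},'
    '{"file":"scan.pdf","output":null,"status":"no_content"},'
    '{"file":"data.xlsx","output":"./output/data.md","status":"ok"}]}'
)


def split_results(summary_text):
    """Return (converted, skipped) from a --summary JSON string.

    Hypothetical helper: relies only on the fields visible in the sample.
    """
    summary = json.loads(summary_text)
    # Output paths of files whose status is "ok".
    converted = [f["output"] for f in summary["files"] if f["status"] == "ok"]
    # (input file, status) pairs for everything else.
    skipped = [(f["file"], f["status"]) for f in summary["files"] if f["status"] != "ok"]
    return converted, skipped


converted, skipped = split_results(SAMPLE_SUMMARY)
print(converted)
print(skipped)
```

Only `ok` and `no_content` appear as status values in the sample; the sketch treats any value other than `ok` as skipped rather than guessing at other names.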
## Options
| Flag | Description |
|---|---|
| `--input, -i` | File or directory to process (required, non-recursive) |
| `--output-dir, -o` | Output directory (default: ./output) |
| `--ai` | `gemini`, `groq`, `ollama`, or `none` (default: the value from config, otherwise `gemini`) |
| `--password` | PDF decryption password |
| `--config` | Path to config JSON |
| `--summary` | Print JSON summary to stdout after processing |
| `--dry-run` | Preview without writing files |
| `--verbose` | Enable debug logging |
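When the command is assembled programmatically, building an argument list avoids quoting problems with paths that contain spaces. This is a rough sketch using only the flags from the table above; `build_command` and the `cleaner.py` script path are illustrative assumptions, and the helper always passes `--ai` explicitly instead of relying on the tool's configured default.

```python
import shlex

ALLOWED_AI = ("gemini", "groq", "ollama", "none")


def build_command(script, input_path, ai="none", output_dir=None,
                  password=None, config=None, summary=False,
                  dry_run=False, verbose=False):
    """Assemble an argv list for cleaner.py (hypothetical helper)."""
    if ai not in ALLOWED_AI:
        raise ValueError(f"unsupported --ai value: {ai}")
    argv = ["python3", script, "--input", input_path, "--ai", ai]
    # Options that take a value are added only when one is given.
    if output_dir is not None:
        argv += ["--output-dir", output_dir]
    if password is not None:
        argv += ["--password", password]
    if config is not None:
        argv += ["--config", config]
    # Boolean switches.
    if summary:
        argv.append("--summary")
    if dry_run:
        argv.append("--dry-run")
    if verbose:
        argv.append("--verbose")
    return argv


argv = build_command("cleaner.py", "reports/q1 final.pdf",
                     output_dir="out", summary=True)
# shlex.join renders the list as a shell-safe string for logging.
print(shlex.join(argv))
```

The resulting list can be passed directly to `subprocess.run` without `shell=True`.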
## Supported formats
PDF (native, scanned, encrypted), DOCX, XLSX, XLS, CSV, TXT, MD
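Before a directory run, it can help to preview which files are likely candidates. The sketch below assumes the tool selects files by extension and that the format list above maps to the extensions shown; both are assumptions, and `candidate_files` is a hypothetical helper. It looks only at the top level of the directory, mirroring the non-recursive behavior noted for `--input`.

```python
import tempfile
from pathlib import Path

# Assumed mapping from the format list above to file extensions.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".xls", ".csv", ".txt", ".md"}


def candidate_files(directory):
    """Names of files directly inside `directory` with a supported extension."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()  # top level only, no recursion
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("report.PDF", "data.xlsx", "photo.png"):
        (root / name).touch()
    (root / "nested").mkdir()
    (root / "nested" / "deep.txt").touch()  # ignored: inside a subdirectory
    found = candidate_files(root)

print(found)
```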
## Exit codes
| Code | Meaning |
|---|---|
| 0 | All files processed successfully |
| 1 | Some files failed (partial success) |
| 2 | No processable files found or config error |
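A wrapper script can branch on these codes. The following sketch encodes the table above as a lookup; `describe_exit` and `has_output` are hypothetical helpers, and treating code 1 as "some output may exist" is an interpretation of "partial success", not documented behavior.

```python
# Meanings copied from the exit-code table above.
EXIT_MEANINGS = {
    0: "all files processed successfully",
    1: "some files failed (partial success)",
    2: "no processable files found or config error",
}


def describe_exit(code):
    """Human-readable meaning of a cleaner.py exit code."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def has_output(code):
    """True when at least some Markdown output may have been written.

    Assumption: a partial success (1) still leaves output for the
    files that converted.
    """
    return code in (0, 1)


for code in (0, 1, 2):
    print(code, describe_exit(code), has_output(code))
```

In practice the code would come from `subprocess.run(...).returncode`; pairing it with the `--summary` JSON shows which individual files failed.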
## Notes
- Output defaults to `./output/` relative to current directory
- For scanned PDFs, AI mode (`gemini`, `groq`, or `ollama`) gives much better results
- `--ai none` requires zero API keys and zero network access
- CJK encodings (Big5, CP950, UTF-16) are auto-detected
- Tables in DOCX and XLSX are preserved as Markdown pipe tables
Related Skills
visa-doc-translate
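Translate visa application documents (images) into English and create a bilingual PDF containing the original text and the translation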
writer
Document creation, format conversion (ODT/DOCX/PDF), mail merge, and automation with LibreOffice Writer.
latex-paper-conversion
This skill should be used when the user asks to convert an academic paper in LaTeX from one format (e.g., Springer, IPOL) to another format (e.g., MDPI, IEEE, Nature). It automates extraction, injection, fixing formatting, and compiling.
docx-official
A user may ask you to create, edit, or analyze the contents of a .docx file. A .docx file is essentially a ZIP archive containing XML files and other resources that you can read or edit. You have different tools and workflows available for different tasks.
ai-slop-cleaner
Clean AI-generated code slop with a regression-safe, deletion-first workflow and optional reviewer-only mode
ai-slop-cleaner
Run an anti-slop cleanup/refactor/deslop workflow
mac-cleaner
Analyze and safely clean disk space on macOS. Use when the user asks about Mac storage, "System Data" taking too much space, disk cleanup, freeing up space, or managing storage on macOS. Covers caches, iOS simulators, Xcode data, trash, logs, and browser caches. Safe for everyday Mac users.
apple-photos-cleaner
Analyze, clean up, and organize Apple Photos libraries. Find and report junk photos (screenshots, low-quality, burst leftovers, duplicates), analyze storage usage, generate photo timeline recaps, plan smart exports, analyze Live Photos, check iCloud sync, audit shared libraries, detect similar photos, curate seasonal highlights, and score face quality. All analysis operations are READ-ONLY on the database (safe). macOS only. Requires Python 3.9+ (stdlib only) and access to the Apple Photos SQLite database. Trigger on: Photos cleanup, photo storage, duplicate photos, junk photos, screenshot cleanup, Photos analysis, photo timeline, photo export, Photos library stats, burst cleanup, storage hogs, photo organization, Live Photos, iCloud sync, shared library, similar photos, seasonal highlights, face quality, portraits.
clinical-data-cleaner
Use when cleaning clinical trial data, preparing data for FDA/EMA submission, standardizing SDTM datasets, handling missing values in clinical studies, detecting outliers in lab results, or converting raw CRF data to CDISC format. Cleans and standardizes clinical trial data for regulatory compliance with audit trails.
mole-mac-cleaner
Deep clean and optimize your Mac using the Mole CLI tool
ai-slop-cleaner
Post-implementation cleanup that removes AI-generated bloat while preserving functionality. Runs pass-by-pass with test verification after each pass. Activate after kraken/spark complete a feature, or when a codebase needs hygiene work.
openclaw-session-cleaner
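OpenClaw session cleanup assistant. Use when the user mentions cleaning up OpenClaw sessions, deleting old cron sessions, compacting or rebuilding sessions.json, or troubleshooting session file bloat. Once triggered, first check the number of session files under ~/.openclaw/agents/main/sessions/ and the size of sessions.json, then perform the cleanup as instructed.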