arxiv-doc-builder

Automatically convert arXiv papers to well-structured Markdown documentation. Invoke with an arXiv ID to fetch materials (LaTeX source or PDF), convert to Markdown, and generate implementation-ready reference documentation with preserved mathematics and section structure.


by diegosouzapw


Best use case

arxiv-doc-builder is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using arxiv-doc-builder can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/arxiv-doc-builder/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/documentation/arxiv-doc-builder/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/arxiv-doc-builder/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How arxiv-doc-builder Compares

Feature / Agent	arxiv-doc-builder	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

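What does this skill do?

It fetches an arXiv paper's LaTeX source (or the PDF when no source is available) and converts it to structured Markdown with the mathematics and section hierarchy preserved.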

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# arXiv Document Builder

Automatically converts arXiv papers into structured Markdown documentation for implementation reference.

## Capabilities

This skill automatically:

1. **Fetches paper materials from arXiv**
   - Attempts to download LaTeX source first (preferred for accuracy)
   - Falls back to PDF if source is unavailable
   - Handles all HTTP requests, extraction, and directory setup

2. **Converts to structured Markdown**
   - LaTeX source → Markdown via pandoc (preserves all math and structure)
   - PDF → Markdown via text extraction with multiple conversion modes:
     - Simple single-column conversion (default)
     - Full double-column conversion for academic papers
     - Page-wise extraction with mixed column support
   - Preserves mathematical formulas in MathJax/LaTeX format (`$...$`, `$$...$$`)
   - Maintains section hierarchy and document structure
   - Includes abstracts, figures, and references

3. **Generates implementation-ready documentation**
   - Output saved to `papers/{ARXIV_ID}/{ARXIV_ID}.md`
   - Easy to reference during code implementation
   - Optimized for Claude to read and understand

## When to Use This Skill

Invoke this skill when the user requests:
- "Convert arXiv paper {ID} to markdown"
- "Fetch and process paper {ID}"
- "Create documentation for arXiv:{ID}"
- "I need to read/reference paper {ID}"
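
Each of the requests above carries an arXiv identifier that has to be pulled out before the orchestrator can run. The following is a minimal sketch of that step, not code from the skill: `ARXIV_ID_PATTERN` and `extract_arxiv_id` are hypothetical names, and the pattern assumes the two public identifier schemes (new-style `YYMM.NNNNN` and old-style `archive/YYMMNNN`).

```python
import re
from typing import Optional

# New-style identifiers: YYMM.NNNN or YYMM.NNNNN, optionally with a version (v2).
# Old-style identifiers: archive[.SUBJECT]/YYMMNNN, for example hep-th/9901001.
# Hypothetical helper pattern; the skill itself may parse identifiers differently.
ARXIV_ID_PATTERN = re.compile(
    r"(?<![\w.])"
    r"(\d{4}\.\d{4,5}(?:v\d+)?|[a-z][a-z-]*(?:\.[A-Z]{2})?/\d{7}(?:v\d+)?)"
    r"(?!\d)"
)


def extract_arxiv_id(request: str) -> Optional[str]:
    """Return the first arXiv identifier found in a user request, or None."""
    match = ARXIV_ID_PATTERN.search(request)
    return match.group(1) if match else None


print(extract_arxiv_id("Convert arXiv paper 1706.03762 to markdown"))
```

A request with no recognizable identifier yields `None`, which is a natural point to ask the user for the ID instead of guessing.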

## How It Works

### Single Entry Point

Use the main orchestrator script which handles everything automatically:

```bash
python scripts/convert_paper.py ARXIV_ID [--output-dir DIR]
```

The orchestrator:
1. Calls `fetch_paper.py` to download materials (with automatic source→PDF fallback)
2. Detects available format (LaTeX source or PDF)
3. Calls the appropriate converter (`convert_latex.py` or `convert_pdf_simple.py`)
4. Outputs structured Markdown to `papers/{ARXIV_ID}/{ARXIV_ID}.md`

All HTTP requests (curl), file extraction (tar), and directory creation (mkdir) are handled automatically.
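
Steps 2 and 3 of the orchestrator can be sketched as a small dispatch function. This is an illustration under stated assumptions rather than the contents of `convert_paper.py`: `choose_converter` is a hypothetical name, and the rule "source is available when `source/` contains at least one `.tex` file" is an assumption about how detection might work.

```python
from pathlib import Path


def choose_converter(paper_dir: Path) -> str:
    """Pick a converter script based on what the fetch step left on disk."""
    source_dir = paper_dir / "source"
    pdf_dir = paper_dir / "pdf"

    # Assumption: LaTeX source counts as available when source/ holds a .tex file.
    if source_dir.is_dir() and any(source_dir.rglob("*.tex")):
        return "convert_latex.py"

    # Otherwise fall back to the single-column PDF converter.
    if pdf_dir.is_dir() and any(pdf_dir.glob("*.pdf")):
        return "convert_pdf_simple.py"

    raise FileNotFoundError(f"no LaTeX source or PDF found under {paper_dir}")
```

Checking for LaTeX source before PDF mirrors the preference stated above: source conversion keeps the math intact, so it wins whenever both formats are present.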

### Automatic Source Detection and Fallback

The fetcher tries LaTeX source first, then PDF:
- **LaTeX source available**: Downloads `.tar.gz`, extracts to `papers/{ID}/source/`, converts with pandoc
- **PDF only**: Downloads PDF to `papers/{ID}/pdf/`, extracts text with pdfplumber

No manual intervention is needed: the skill handles format detection and fallback automatically.
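
The source-then-PDF fallback can be sketched as follows. The names `fetch_with_fallback` and `Downloader` are hypothetical, and the two URL templates are assumed to be arXiv's public download endpoints; `fetch_paper.py` may build its requests differently. The downloader is passed in as a callable so the fallback logic can be exercised without network access.

```python
from typing import Callable, Optional, Tuple

# Assumption: arXiv's public endpoints for source bundles and PDFs.
SOURCE_URL = "https://arxiv.org/e-print/{arxiv_id}"
PDF_URL = "https://arxiv.org/pdf/{arxiv_id}"

# A downloader returns the response body, or None when the download fails.
Downloader = Callable[[str], Optional[bytes]]


def fetch_with_fallback(arxiv_id: str, download: Downloader) -> Tuple[str, bytes]:
    """Try the LaTeX source first, then the PDF. Returns (format, payload)."""
    for fmt, template in (("source", SOURCE_URL), ("pdf", PDF_URL)):
        payload = download(template.format(arxiv_id=arxiv_id))
        if payload:
            return fmt, payload
    raise RuntimeError(f"neither source nor PDF could be downloaded for {arxiv_id}")
```

In real use, `download` would wrap an HTTP client and return `None` on a 404 or network error, so a missing source bundle moves on to the PDF without further handling.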

## Output Structure

Generated Markdown includes:
- Title, authors, and abstract
- Full paper content with section hierarchy
- Inline math: `$f(x) = x^2$`
- Display math: `$$\int_0^\infty e^{-x} dx = 1$$`
- Preserved LaTeX commands for complex formulas
- References section

Output location: `papers/{ARXIV_ID}/{ARXIV_ID}.md`

## PDF Conversion Scripts

Three specialized scripts handle direct PDF conversion:

### convert_pdf_simple.py

Convert all pages as single-column layout.

```bash
uv run convert_pdf_simple.py paper.pdf -o output.md
```

### convert_pdf_double_column.py

Convert all pages as double-column layout (for academic papers).

```bash
uv run convert_pdf_double_column.py paper.pdf -o output.md
```
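
One common way to read a two-column page in the right order with pdfplumber is to crop it into vertical strips and extract each strip separately. The sketch below illustrates that idea; it is not the code in `convert_pdf_double_column.py`. `column_boxes` and `extract_double_column` are hypothetical names, and the sketch assumes equal-width columns and a page whose origin is at (0, 0).

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom), as pdfplumber expects


def column_boxes(width: float, height: float, columns: int = 2) -> List[BBox]:
    """Split a page into equal-width vertical strips, ordered left to right."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    step = width / columns
    boxes = []
    for i in range(columns):
        # Pin the last strip to the page edge so rounding never overshoots it.
        x1 = width if i == columns - 1 else (i + 1) * step
        boxes.append((i * step, 0.0, x1, height))
    return boxes


def extract_double_column(pdf_path: str) -> str:
    """Extract text column by column so each page stays in reading order."""
    import pdfplumber  # third-party; imported here so column_boxes works without it

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            parts = [
                page.crop(box).extract_text() or ""
                for box in column_boxes(page.width, page.height)
            ]
            pages.append("\n".join(parts))
    return "\n\n".join(pages)
```

A fixed midpoint split is the simplest choice and can mis-handle full-width titles, abstracts, or figures, which is one reason per-page control over column handling is useful.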

### convert_pdf_extract.py

Extract specific pages with optional double-column processing.

```bash
# Extract specific pages
uv run convert_pdf_extract.py paper.pdf --pages 1-5,10 -o output.md

# Extract with mixed column layouts
uv run convert_pdf_extract.py paper.pdf --pages 1-10 --double-column-pages 3-7 -o output.md
```

**Note:** `--double-column-pages` must be a subset of `--pages`. An invalid page range causes an immediate error.
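
The subset rule can be illustrated with a short validation sketch. `parse_pages` and `check_double_column_subset` are hypothetical helpers, not functions from the skill, and the sketch assumes page specs are comma-separated 1-based numbers or inclusive `start-end` ranges, as in the examples above.

```python
from typing import Set


def parse_pages(spec: str) -> Set[int]:
    """Parse a spec such as "1-5,10" into a set of 1-based page numbers."""
    pages: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)  # raises ValueError on empty or non-numeric input
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return pages


def check_double_column_subset(pages: str, double_column_pages: str) -> Set[int]:
    """Fail fast when a double-column page was not selected for extraction."""
    selected = parse_pages(pages)
    double = parse_pages(double_column_pages)
    extra = double - selected
    if extra:
        raise ValueError(f"--double-column-pages not in --pages: {sorted(extra)}")
    return double


print(sorted(check_double_column_subset("1-10", "3-7")))  # [3, 4, 5, 6, 7]
```

Validating both specs before any page is opened keeps the failure immediate, matching the behavior described in the note.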

### Architecture

All three scripts share common conversion logic through `pdf_converter_lib.py`, ensuring consistent behavior while keeping each script focused on its specific use case.

## Advanced: Vision-Based PDF Conversion

For papers with complex mathematical formulas where text extraction fails, a vision-based approach is available as a manual fallback:

```bash
# Generate high-resolution images from PDF
python scripts/convert_pdf_with_vision.py paper.pdf --dpi 300 --columns 2
```

This creates page images (with optional column splitting) that can be read manually with Claude's vision capabilities for maximum accuracy. This is NOT part of the automatic workflow; use it only when automatic conversion produces poor results.

See [references/pdf-conversion.md](references/pdf-conversion.md) for details on vision-based conversion.
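
To get a feel for what `--dpi 300` implies, the size of each rendered image can be estimated from the page size. PDF page dimensions are measured in points, with 72 points per inch, so the pixel size scales by `dpi / 72`. The helper `rendered_size` below is a hypothetical illustration and is not part of `convert_pdf_with_vision.py`.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, dpi: int = 300) -> Tuple[int, int]:
    """Return the pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# A US Letter page is 612 x 792 points.
print(rendered_size(612, 792))  # (2550, 3300)
```

With `--columns 2`, an even split of that page would give two images about 1275 pixels wide each, assuming the split is at the midpoint.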

## Directory Structure

```
papers/
└── {ARXIV_ID}/
    ├── source/           # LaTeX source files (if available)
    ├── pdf/              # PDF file
    ├── {ARXIV_ID}.md     # Generated Markdown output
    └── figures/          # Extracted figures (if any)
```
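
The layout above can be created with a few `pathlib` calls. This is a minimal sketch, assuming new-style identifiers without a slash; `prepare_paper_dirs` is a hypothetical name and the skill's own scripts may create these directories differently.

```python
from pathlib import Path


def prepare_paper_dirs(arxiv_id: str, output_dir: Path = Path("papers")) -> Path:
    """Create papers/{ARXIV_ID}/ with its subdirectories; return the Markdown path."""
    paper_dir = output_dir / arxiv_id
    for name in ("source", "pdf", "figures"):
        # exist_ok lets a repeated conversion reuse the directories from an earlier run.
        (paper_dir / name).mkdir(parents=True, exist_ok=True)
    return paper_dir / f"{arxiv_id}.md"


print(prepare_paper_dirs("1706.03762"))
```

Old-style identifiers such as `hep-th/9901001` contain a slash and would need to be sanitized before being used as a directory or file name.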
