arxiv-latex-source

Download and parse LaTeX source files from arXiv preprints

191 stars

Best use case

arxiv-latex-source is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using arxiv-latex-source can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/arxiv-latex-source/SKILL.md --create-dirs "https://raw.githubusercontent.com/wentorai/research-plugins/main/skills/literature/fulltext/arxiv-latex-source/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/arxiv-latex-source/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How arxiv-latex-source Compares

Feature / Agent	arxiv-latex-source	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Download and parse LaTeX source files from arXiv preprints

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# arXiv LaTeX Source Access Guide

## Overview

arXiv stores the original LaTeX source files for the vast majority of its 2.4 million+ preprints. Accessing LaTeX source provides major advantages over PDF parsing: exact mathematical notation as written by the author, structured sections and labels, machine-readable bibliography entries, and intact figure captions, table data, and cross-references.

For formula extraction, citation graph construction, section-level text analysis, or training data curation for scientific language models, LaTeX source is the gold standard. PDF parsing introduces OCR errors in equations, loses structural hierarchy, and mangles complex tables.

The e-print endpoint serves source bundles as gzip-compressed tarballs (`.tar.gz`) containing `.tex` files, figures, `.bib`/`.bbl` bibliography files, style files, and supplementary materials. No authentication is required.

## Authentication

No authentication or API key is required. The e-print endpoint is publicly accessible. However, arXiv asks that automated tools set a descriptive `User-Agent` header and comply with rate limits.

## Core Endpoints

### Download LaTeX Source

- **URL**: `GET https://arxiv.org/e-print/{arxiv_id}`
- **Response**: `application/gzip` — a `.tar.gz` archive containing the source files
- **Parameters**:
  | Param | Type | Required | Description |
  |-------|------|----------|-------------|
  | arxiv_id | string | Yes | arXiv identifier, e.g. `2301.00001` or `2301.00001v2` for a specific version |

- **Example**:
  ```bash
  # Download source archive (response: 200, application/gzip, ~1.3 MB)
  curl -sL -o source.tar.gz "https://arxiv.org/e-print/2301.00001"

  # List archive contents
  tar -tzf source.tar.gz | head -10
  # ACM-Reference-Format.bbx
  # ACM-Reference-Format.bst
  # Image_1.jpg
  # README.txt
  # acmart.cls
  ```

- **Content-Disposition header**: `attachment; filename="arXiv-2301.00001v1.tar.gz"`
- **ETag**: SHA-256 hash provided for caching: `sha256:f1ffe8ec...`
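
When an ID is requested without a version suffix, the `Content-Disposition` filename is one way to record which version was actually served. The sketch below parses a header value of the shape shown above; `parse_eprint_filename` is a hypothetical helper, and it assumes the `arXiv-<id>v<N>.tar.gz` naming holds for the response in question.

```python
import re

def parse_eprint_filename(content_disposition):
    """Return (arxiv_id, version) from a Content-Disposition header value.

    Assumes the 'arXiv-<id>v<N>.tar.gz' filename shape; returns (None, None)
    when the header does not match it.
    """
    match = re.search(r'filename="?arXiv-(.+?)(v\d+)?\.tar\.gz"?', content_disposition)
    if not match:
        return None, None
    return match.group(1), match.group(2)

header = 'attachment; filename="arXiv-2301.00001v1.tar.gz"'
print(parse_eprint_filename(header))
```

Storing the parsed version next to the ETag makes it easier to tell later whether a cached archive is stale.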

### Format Detection

The endpoint almost always returns a gzip-compressed tar archive. In rare cases (very old or single-file submissions) it may return a single gzip-compressed `.tex` file without a tar wrapper. Always verify the format before extracting:

```bash
curl -sL "https://arxiv.org/e-print/{arxiv_id}" -o source.gz
file source.gz  # "gzip compressed data, was 'XXXX.tar', ..."
```
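
The same check can be done in Python without shelling out to `file`. This is a minimal sketch, not a complete classifier: `detect_source_format` is a hypothetical helper that looks at the gzip magic number and then tries to open the payload as a tar archive.

```python
import gzip
import io
import tarfile

def detect_source_format(data):
    """Classify raw e-print bytes as 'tar.gz', 'gzip', or 'unknown'."""
    if data[:2] != b'\x1f\x8b':  # gzip magic number
        return 'unknown'
    try:
        # Opening succeeds only if the decompressed stream starts with a tar header.
        with tarfile.open(fileobj=io.BytesIO(data), mode='r:gz') as tar:
            tar.getmembers()
        return 'tar.gz'
    except tarfile.TarError:
        return 'gzip'

# A lone gzip-compressed .tex file is reported as 'gzip'.
single_file = gzip.compress(b'\\documentclass{article}')
print(detect_source_format(single_file))
```

Anything reported as `'unknown'` is best logged and skipped rather than passed to an extractor.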

### Metadata API (Companion)

Pair source downloads with the arXiv Atom API for structured metadata:

- **URL**: `GET https://export.arxiv.org/api/query?id_list={arxiv_id}`
- **Response**: Atom XML with `<title>`, `<author>`, `<summary>`, `<category>`, `<published>`
- **Example**: `curl -s "https://export.arxiv.org/api/query?id_list=2301.00001"`
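
As a sketch of how the Atom response might be read, the snippet below pulls the listed fields from the first `<entry>` using the standard library. `parse_atom_entry` is a hypothetical helper, and `SAMPLE_FEED` is a hand-written stand-in for a real response, not actual API output.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace

# Hand-written sample with the same element layout as an Atom entry.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>An Example
      Title</title>
    <summary>Example abstract.</summary>
    <published>2023-01-01T00:00:00Z</published>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
    <category term="cs.LG"/>
  </entry>
</feed>"""

def parse_atom_entry(xml_text):
    """Return metadata for the first entry in the feed, or None if there is none."""
    entry = ET.fromstring(xml_text).find(f"{ATOM}entry")
    if entry is None:
        return None

    def text(tag):
        node = entry.find(f"{ATOM}{tag}")
        # Collapse line breaks and indentation inside wrapped titles and abstracts.
        return " ".join(node.text.split()) if node is not None and node.text else None

    return {
        "title": text("title"),
        "summary": text("summary"),
        "published": text("published"),
        "authors": [a.findtext(f"{ATOM}name") for a in entry.findall(f"{ATOM}author")],
        "categories": [c.get("term") for c in entry.findall(f"{ATOM}category")],
    }

print(parse_atom_entry(SAMPLE_FEED))
```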

## LaTeX Source Parsing Guide

### Locating the Main .tex File

A source archive typically contains multiple files. To find the main document:

1. Look for `\documentclass` in `.tex` files — this marks the root document
2. Check for a `README.txt` that may specify the main file
3. If multiple `.tex` files contain `\documentclass`, prefer the one with `\begin{document}`

```python
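import tarfile

def find_main_tex(tar_path):
    """Return (name, content) of the root .tex file, or (None, None) if none is found."""
    with tarfile.open(tar_path, 'r:gz') as tar:
        # Keep regular files only: extractfile() returns None for directories.
        tex_files = [m for m in tar.getmembers() if m.isfile() and m.name.endswith('.tex')]
        for member in tex_files:
            content = tar.extractfile(member).read().decode('utf-8', errors='ignore')
            # The root document has both a preamble and a document body.
            if r'\documentclass' in content and r'\begin{document}' in content:
                return member.name, content
    return None, None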
```

### Extracting Sections

LaTeX sections follow a predictable hierarchy:

```python
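import re

def extract_sections(tex_content):
    # Also matches starred headings and an optional short title:
    # \section*{Title}, \section[Short]{Full Title}.
    # Titles with nested braces (e.g. \textbf{...}) are cut at the first '}'.
    pattern = r'\\(section|subsection|subsubsection)\*?(?:\[[^\]]*\])?\{([^}]+)\}'
    return re.findall(pattern, tex_content)
    # [('section', 'Introduction'), ('section', 'Related Work'), ...]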
```
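
For section-level analysis, the heading matches can also be used to slice the document into section bodies. The sketch below is one simple way to do that for top-level sections only. `split_by_section` is a hypothetical helper, and it makes no attempt to strip `\end{document}` or bibliography commands from the last section.

```python
import re

SECTION_RE = re.compile(r'\\section\*?\{([^}]+)\}')

def split_by_section(tex_content):
    """Map each top-level section title to the text up to the next \\section."""
    matches = list(SECTION_RE.finditer(tex_content))
    bodies = {}
    for i, m in enumerate(matches):
        # A section ends where the next one starts, or at the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(tex_content)
        bodies[m.group(1).strip()] = tex_content[m.end():end].strip()
    return bodies

doc = r"\section{Introduction} Hello. \section{Related Work} Prior art. \subsection{Details} More."
print(split_by_section(doc)["Introduction"])
```

Subsections stay inside the body of their parent section, so a lookup such as `split_by_section(doc)["Related Work"]` returns the full section text.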

### Extracting Equations

```python
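import re

def extract_equations(tex_content):
    patterns = [
        # \[ ... \]; the lookbehind skips line-break spacing such as \\[2pt]
        r'(?<!\\)\\\[(.+?)\\\]',
        r'\\begin\{equation\*?\}(.+?)\\end\{equation\*?\}',
        r'\\begin\{align\*?\}(.+?)\\end\{align\*?\}',
    ]
    # Results are grouped by pattern, not ordered by position in the document.
    equations = []
    for pat in patterns:
        equations.extend(re.findall(pat, tex_content, re.DOTALL))
    return equations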
```

### Extracting Bibliography

Parse `.bib` files (BibTeX entries) or `.bbl` files (compiled `\bibitem` commands):

```python
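import re
import tarfile

def extract_bibliography(tar_path):
    """Return (key, raw_fields) tuples; raw_fields is '' for .bbl entries."""
    refs = []
    with tarfile.open(tar_path, 'r:gz') as tar:
        for member in tar.getmembers():
            if not member.isfile() or not member.name.endswith(('.bib', '.bbl')):
                continue
            content = tar.extractfile(member).read().decode('utf-8', errors='ignore')
            if member.name.endswith('.bib'):
                # BibTeX entry: @type{key, field = ..., ... }
                refs.extend(re.findall(r'@\w+\{([^,]+),(.+?)\n\}', content, re.DOTALL))
            else:
                # Skip the optional [label] so the captured group is the citation key.
                keys = re.findall(r'\\bibitem(?:\[[^\]]*\])?\s*\{([^}]+)\}', content)
                refs.extend((key, '') for key in keys)
    return refs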
```
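
Reference keys from `.bib`/`.bbl` files become more useful for citation analysis once they are matched against the keys actually cited in the text. The following is a rough sketch of that second half. `extract_cited_keys` is a hypothetical helper, and it assumes citations use `\cite`-family commands (`\cite`, `\citep`, `\citet`, and so on) with comma-separated keys.

```python
import re

# \cite, \citep, \citet*, ... with any number of optional [..] arguments
CITE_RE = re.compile(r'\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]+)\}')

def extract_cited_keys(tex_content):
    """Return citation keys in order of appearance (duplicates preserved)."""
    keys = []
    for group in CITE_RE.findall(tex_content):
        keys.extend(k.strip() for k in group.split(',') if k.strip())
    return keys

print(extract_cited_keys(r"See \cite{a, b} and \citep[p.~3]{c}."))
```

Keys that appear in the bibliography but never in this list were loaded but not cited, which is worth filtering out before building a graph.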

## Rate Limits

- **Maximum**: 4 requests per second for automated access
- **Recommended**: 1 request/second with delays between sequential downloads
- **Bulk access**: For 1000+ papers, use the arXiv S3 bulk data mirror instead
- **HTTP 429**: Rate limit exceeded; implement exponential backoff
- **User-Agent**: Required — set a descriptive string: `MyTool/1.0 (mailto:user@university.edu)`
- Persistent abuse may result in IP-level blocks
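
The backoff advice above can be sketched as a small retry wrapper. This is an illustration under stated assumptions rather than a recommended client: `fetch_with_backoff` and its parameters are hypothetical, and `fetch` stands for any zero-argument callable that returns a `(status_code, body)` pair.

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 status, doubling the wait each retry."""
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # waits base_delay, then 2x, 4x, ...
    raise RuntimeError("rate limit still exceeded after retries")

# Simulated server: two 429 responses, then success. Waits are recorded, not slept.
responses = iter([(429, b""), (429, b""), (200, b"ok")])
waits = []
print(fetch_with_backoff(lambda: next(responses), sleep=waits.append), waits)
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays; in a real client it would stay at the default `time.sleep`.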

## Academic Use Cases

- **Formula extraction for ML training** — Build equation datasets with ground-truth LaTeX notation, free of OCR noise from PDF parsing
- **Citation network analysis** — Parse `.bib`/`.bbl` files for exact reference keys to construct citation graphs
- **Section-level text analysis** — Extract specific sections (e.g., all "Related Work" across a subfield) for systematic reviews
- **Reproducibility auditing** — Examine algorithm environments, hyperparameter tables, and methodology sections
- **Cross-paper notation alignment** — Compare and normalize equation environments across papers in a subfield

## Complete Python Example

```python
import requests, tarfile, io, re, time, gzip

def download_arxiv_source(arxiv_id, delay=1.0):
    """Download and extract all .tex files from an arXiv paper's source."""
    url = f"https://arxiv.org/e-print/{arxiv_id}"
    headers = {"User-Agent": "ResearchTool/1.0 (mailto:user@example.com)"}
    resp = requests.get(url, headers=headers, timeout=30)  # avoid hanging on a stalled connection
    resp.raise_for_status()
    time.sleep(delay)

    buf = io.BytesIO(resp.content)
    try:
        with tarfile.open(fileobj=buf, mode='r:gz') as tar:
            return {m.name: tar.extractfile(m).read().decode('utf-8', errors='ignore')
                    for m in tar.getmembers() if m.name.endswith('.tex') and m.isfile()}
    except tarfile.ReadError:
        buf.seek(0)
        return {"main.tex": gzip.decompress(buf.read()).decode('utf-8', errors='ignore')}

# Usage
sources = download_arxiv_source("2301.00001")
for fname, content in sources.items():
    if r'\documentclass' in content:
        sections = re.findall(r'\\section\{([^}]+)\}', content)
        equations = re.findall(r'\\begin\{equation\}(.+?)\\end\{equation\}', content, re.DOTALL)
        print(f"{fname}: {len(sections)} sections, {len(equations)} equations")
```

## References

- arXiv e-print access: https://info.arxiv.org/help/bulk_data_s3.html
- arXiv API documentation: https://info.arxiv.org/help/api/index.html
- arXiv terms of use: https://info.arxiv.org/help/api/tou.html
- arXiv S3 bulk data: https://info.arxiv.org/help/bulk_data_s3.html
