arxiv-paper-processor

Process and analyze arXiv papers systematically for research workflows

191 stars

Best use case

arxiv-paper-processor is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using arxiv-paper-processor should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/arxiv-paper-processor/SKILL.md --create-dirs "https://raw.githubusercontent.com/wentorai/research-plugins/main/skills/literature/search/arxiv-paper-processor/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/arxiv-paper-processor/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How arxiv-paper-processor Compares

| Feature / Agent | arxiv-paper-processor | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Process and analyze arXiv papers systematically for research workflows

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# arXiv Paper Processor

## Overview

The arXiv Paper Processor skill provides a complete pipeline for downloading, parsing, and analyzing arXiv papers programmatically. While the arXiv API provides metadata, researchers often need to work with the full text—extracting sections, equations, figures, and references for deeper analysis.

This skill covers the entire processing chain: retrieving papers by ID or search query, downloading PDF and LaTeX source files, extracting structured content, and producing analysis-ready outputs. It is particularly valuable for researchers conducting large-scale literature analysis, building training datasets from academic text, or automating evidence extraction for systematic reviews.

The pipeline handles common challenges in academic PDF processing including multi-column layouts, mathematical notation, table extraction, and reference parsing. It integrates with tools like GROBID for PDF parsing and can work directly with arXiv LaTeX sources for higher-fidelity extraction.

## Paper Retrieval and Download

### Fetching by arXiv ID

The most reliable method is to fetch papers by their arXiv identifier:

```python
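import urllib.request
import feedparser

# Fetch metadata via the arXiv Atom feed
arxiv_id = "2301.07041"
url = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
with urllib.request.urlopen(url, timeout=30) as response:
    feed = feedparser.parse(response.read())

# An unknown ID yields a feed with no entries, so check before indexing
if not feed.entries:
    raise ValueError(f"No arXiv entry found for {arxiv_id}")

entry = feed.entries[0]
title = entry.title
abstract = entry.summary
authors = [a.name for a in entry.authors]
# Pick the PDF link by its MIME type instead of relying on link order
pdf_url = next(
    (link.href for link in entry.links if link.get("type") == "application/pdf"),
    None,
)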
```

### Downloading Source Files

arXiv stores LaTeX source files for most papers. These provide much richer structure than PDFs:

```bash
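# Download LaTeX source (typically a .tar.gz; some papers ship a single gzipped .tex file)
wget https://arxiv.org/e-print/2301.07041 -O paper_source.tar.gz
# tar -C needs the target directory to exist
mkdir -p paper_source
tar -xzf paper_source.tar.gz -C paper_source/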
```

Source files contain the original `.tex` files, figures, bibliography files, and any custom style files. Parsing LaTeX directly gives you access to section structure, equations in their original notation, citation keys, and figure captions without the ambiguity of PDF extraction.
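The same download-and-unpack step can be done from Python. The sketch below is a minimal illustration, not part of the original skill: `unpack_eprint` is a hypothetical helper, and it assumes the e-print payload is either a (possibly compressed) tar archive or a single gzipped `.tex` file, which it saves under the assumed name `main.tex`. Anything else, such as a PDF-only submission, is rejected.

```python
import gzip
import io
import tarfile
from pathlib import Path


def unpack_eprint(payload: bytes, dest: Path) -> list[str]:
    """Unpack an arXiv e-print payload into dest; return the file names written.

    Hypothetical helper. Assumes a tar archive or a single gzipped file.
    """
    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()
    try:
        with tarfile.open(fileobj=io.BytesIO(payload), mode="r:*") as tar:
            names = []
            for member in tar.getmembers():
                target = (root / member.name).resolve()
                # Skip directories, links, and any path that escapes dest
                if not member.isfile() or root not in target.parents:
                    continue
                target.parent.mkdir(parents=True, exist_ok=True)
                target.write_bytes(tar.extractfile(member).read())
                names.append(member.name)
            return sorted(names)
    except tarfile.ReadError:
        pass  # not a tar archive; fall through to the single-file case

    try:
        text = gzip.decompress(payload)
    except OSError as exc:  # gzip.BadGzipFile is a subclass of OSError
        raise ValueError("payload is neither a tar archive nor a gzipped file") from exc
    (root / "main.tex").write_bytes(text)  # "main.tex" is an assumed file name
    return ["main.tex"]
```

Writing members one by one, rather than calling `extractall`, keeps the path-traversal check explicit and avoids depending on a particular Python version's extraction filters.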

### Batch Download Guidelines

When downloading multiple papers, respect arXiv's usage policies:

- Limit API calls to one request every 3 seconds
- Use the arXiv bulk data access (S3 or GCS) for large-scale processing (1000+ papers)
- Cache all downloaded files locally and check before re-downloading
- Include a descriptive User-Agent header in your HTTP requests
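One way to combine the rate limit, local cache, and User-Agent guidelines is a small wrapper around `urllib.request`. This is a sketch under stated assumptions: `fetch_cached`, `MIN_INTERVAL`, and the `USER_AGENT` string are illustrative names and values, not part of arXiv's API or of this skill.

```python
import time
import urllib.request
from pathlib import Path

MIN_INTERVAL = 3.0  # seconds between requests, matching the guideline above
# Hypothetical identifier; replace with your own project name and contact address
USER_AGENT = "example-paper-pipeline/0.1 (mailto:you@example.org)"
_last_request = float("-inf")  # monotonic timestamp of the previous download


def fetch_cached(url: str, cache_path: Path) -> bytes:
    """Return cached bytes if present; otherwise download politely and cache."""
    global _last_request
    if cache_path.exists():
        return cache_path.read_bytes()  # cache hit: no network traffic

    # Sleep just long enough to keep MIN_INTERVAL between consecutive requests
    wait = MIN_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)

    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=60) as response:
        data = response.read()
    _last_request = time.monotonic()

    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(data)
    return data
```

A call such as `fetch_cached("https://arxiv.org/e-print/2301.07041", Path("cache/2301.07041.tar.gz"))` downloads the file once and serves later calls from disk. The module-level timestamp only throttles a single process; parallel workers would need a shared limiter.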

## Content Extraction Pipeline

### PDF Extraction with GROBID

For papers where only PDF is available, use GROBID (GeneRation Of BIbliographic Data) for structured extraction:

```bash
# Run GROBID as a local service
docker run --rm -p 8070:8070 grobid/grobid:0.8.0

# Process a PDF
curl -X POST "http://localhost:8070/api/processFulltextDocument" \
  -F "input=@paper.pdf" \
  -F "consolidateHeader=1" \
  -F "consolidateCitations=1" \
  > paper_tei.xml
```

GROBID outputs TEI-XML with structured sections including:
- Header metadata (title, authors, affiliations, abstract)
- Body text with section hierarchy
- Equations (as MathML or raw text)
- Figure and table references
- Parsed bibliography entries with DOIs where available
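The TEI-XML can be read with the standard library. The sketch below assumes the element layout commonly seen in GROBID output (`titleStmt/title` for the title, `body/div` with `head` and `p` children for sections); `parse_tei` is a hypothetical helper, and the paths may need adjusting for your GROBID version.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by the XML elements
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}


def parse_tei(xml_bytes: bytes) -> dict:
    """Extract the title and top-level body sections from a TEI document."""
    root = ET.fromstring(xml_bytes)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI)

    sections = []
    for div in root.iterfind(".//tei:body/tei:div", TEI):
        heading = div.findtext("tei:head", default="", namespaces=TEI)
        # itertext() flattens inline children such as citation <ref> elements
        paragraphs = ["".join(p.itertext()).strip() for p in div.iterfind("tei:p", TEI)]
        sections.append({"heading": heading, "text": "\n\n".join(paragraphs)})

    return {"title": title, "sections": sections}
```

Typical usage would be `parse_tei(Path("paper_tei.xml").read_bytes())`, with `Path` imported from `pathlib`. Nested `div` elements, formulas, and the bibliography are left out to keep the sketch short.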

### LaTeX Source Parsing

When LaTeX source is available, parse it directly for higher fidelity:

1. Identify the main `.tex` file (look for `\documentclass` or `\begin{document}`)
2. Resolve `\input{}` and `\include{}` directives to build the complete document
3. Extract sections using `\section{}`, `\subsection{}` markers
4. Extract equations from `equation`, `align`, `gather` environments
5. Parse `\cite{}` commands and cross-reference with the `.bib` file
6. Extract figure captions from `\caption{}` commands
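Steps 1 to 3 and the citation half of step 5 can be sketched with regular expressions. This is a rough illustration rather than a full LaTeX parser: all function names are hypothetical, and the patterns assume no nested braces in headings, do not skip commented-out lines, and resolve `\input` paths relative to the main file's directory.

```python
import re
from pathlib import Path

INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")
SECTION_RE = re.compile(r"\\(section|subsection)\*?\{([^}]*)\}")
CITE_RE = re.compile(r"\\cite\w*\*?(?:\[[^\]]*\])*\{([^}]+)\}")
LEVELS = {"section": 1, "subsection": 2}


def find_main_tex(source_dir: Path) -> Path:
    """Step 1: return the first .tex file that declares a document class."""
    for path in sorted(source_dir.rglob("*.tex")):
        if "\\documentclass" in path.read_text(encoding="utf-8", errors="ignore"):
            return path
    raise FileNotFoundError("no .tex file containing \\documentclass")


def flatten(path: Path, base: Path, depth: int = 0) -> str:
    """Step 2: inline \\input/\\include targets, resolved relative to base."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    if depth >= 10:  # guard against circular includes
        return text

    def inline(match: re.Match) -> str:
        name = match.group(1).strip()
        child = base / (name if name.endswith(".tex") else name + ".tex")
        # Leave the directive untouched when the target file is missing
        return flatten(child, base, depth + 1) if child.is_file() else match.group(0)

    return INPUT_RE.sub(inline, text)


def extract_sections(tex: str) -> list[dict]:
    """Step 3: split the document at \\section and \\subsection markers."""
    matches = list(SECTION_RE.finditer(tex))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(tex)
        sections.append({
            "heading": m.group(2),
            "level": LEVELS[m.group(1)],
            "text": tex[m.end():end].strip(),
        })
    return sections


def extract_cite_keys(tex: str) -> list[str]:
    """Step 5 (first half): collect citation keys in order of first appearance."""
    keys = []
    for group in CITE_RE.findall(tex):
        for key in (k.strip() for k in group.split(",")):
            if key and key not in keys:
                keys.append(key)
    return keys
```

A typical call chain is `main = find_main_tex(Path("paper_source"))`, then `tex = flatten(main, main.parent)`, then `extract_sections(tex)` and `extract_cite_keys(tex)`. Equation and caption extraction follow the same pattern but are omitted here.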

### Structured Output Schema

Produce a standardized JSON output for each processed paper:

```json
{
  "arxiv_id": "2301.07041",
  "title": "Paper Title",
  "authors": ["Author One", "Author Two"],
  "abstract": "...",
  "sections": [
    {"heading": "Introduction", "level": 1, "text": "..."},
    {"heading": "Related Work", "level": 1, "text": "..."}
  ],
  "equations": ["E = mc^2", "..."],
  "figures": [{"id": "fig1", "caption": "..."}],
  "references": [{"key": "smith2020", "title": "...", "doi": "..."}],
  "processed_date": "2026-03-10"
}
```
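To keep every record consistent with this schema, the fields can be modelled as dataclasses and serialized with `json`. The class names `Section` and `PaperRecord` below are illustrative assumptions; only the field names come from the schema above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class Section:
    heading: str
    level: int
    text: str


@dataclass
class PaperRecord:
    arxiv_id: str
    title: str
    authors: list[str]
    abstract: str
    sections: list[Section] = field(default_factory=list)
    equations: list[str] = field(default_factory=list)
    figures: list[dict] = field(default_factory=list)
    references: list[dict] = field(default_factory=list)
    # Defaults to the day the record is created, in ISO format (YYYY-MM-DD)
    processed_date: str = field(default_factory=lambda: date.today().isoformat())

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses such as Section
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)
```

For example, `PaperRecord(arxiv_id="2301.07041", title="Paper Title", authors=["Author One"], abstract="...").to_json()` yields an object with the same keys, in the same order, as the schema.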

## Analysis and Integration

Once papers are processed into structured format, several downstream analyses become possible:

- **Section-level search**: Search across the methods sections of hundreds of papers to find specific techniques.
- **Equation extraction**: Build a database of mathematical formulations used in your subfield.
- **Citation graph construction**: Map which papers cite which, using extracted reference lists.
- **Terminology tracking**: Monitor how specific terms evolve in usage frequency over time.
- **Dataset identification**: Extract mentions of datasets and benchmarks from experimental sections.

Integrate processed outputs with your reference manager by generating BibTeX entries enriched with extracted metadata, or feed structured JSON into a local search index for full-text retrieval across your paper collection.
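As a minimal sketch of two of these analyses, the functions below run section-level search and build a simple citation map over records that follow the JSON schema above. The function names are hypothetical, and using DOIs as node identifiers is an assumption; references without a DOI are skipped.

```python
from collections import defaultdict


def search_sections(papers: list[dict], heading_keyword: str, term: str) -> list[tuple[str, str]]:
    """Return (arxiv_id, heading) pairs whose heading and text match, case-insensitively."""
    hits = []
    for paper in papers:
        for section in paper.get("sections", []):
            if (heading_keyword.lower() in section["heading"].lower()
                    and term.lower() in section["text"].lower()):
                hits.append((paper["arxiv_id"], section["heading"]))
    return hits


def build_citation_graph(papers: list[dict]) -> dict[str, set[str]]:
    """Map each paper's arxiv_id to the set of DOIs found in its reference list."""
    graph = defaultdict(set)
    for paper in papers:
        for ref in paper.get("references", []):
            if ref.get("doi"):  # skip references with no DOI
                graph[paper["arxiv_id"]].add(ref["doi"])
    return dict(graph)
```

Calling `search_sections(papers, "method", "contrastive")` would list every methods-like section mentioning that term. For collections beyond a few thousand papers, a dedicated search index is likely a better fit than this linear scan.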

## References

- arXiv API: https://info.arxiv.org/help/api/index.html
- GROBID: https://github.com/kermitt2/grobid
- GPT Paper Assistant: https://github.com/tatsu-lab/gpt_paper_assistant
- arXiv bulk data access: https://info.arxiv.org/help/bulk_data/index.html
