arxiv-paper-extract

Extract, translate and save arXiv CS.CV papers for a specific date. Use when user asks to fetch arXiv papers, download paper lists, extract CV papers, translate paper titles to Chinese, or save paper metadata from arxiv.org/list/cs.CV.

16 stars

by diegosouzapw

Best use case

arxiv-paper-extract is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using arxiv-paper-extract should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/arxiv-paper-extract/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/arxiv-paper-extract/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/arxiv-paper-extract/SKILL.md inside your project.
3. Restart your AI agent. It will auto-discover the skill.

How arxiv-paper-extract Compares

| Feature / Agent | arxiv-paper-extract | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

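What does this skill do?

It extracts the arXiv cs.CV papers listed for a specific date, translates their titles to Chinese, and saves the metadata as JSON.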

Where can I find the source code?

The source code is on GitHub in the diegosouzapw/awesome-omni-skill repository, under skills/data-ai/arxiv-paper-extract.

SKILL.md Source

# arXiv CS.CV Paper Extraction Skill

Extract papers from arXiv Computer Vision category for a specific date, translate titles to Chinese using a subagent, and save to JSON.

## Workflow

### Step 1: Download the arXiv page

```bash
curl -s "https://arxiv.org/list/cs.CV/recent?skip=0&show=2000" -o /tmp/arxiv_cs_cv.html
```
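If `curl` is not available, the same download can be sketched in Python with the standard library. The helper names `build_listing_url` and `download_listing` are hypothetical, and the `User-Agent` value is a placeholder assumption, not something the skill specifies.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_listing_url(category: str = "cs.CV", skip: int = 0, show: int = 2000) -> str:
    """Build the same 'recent' listing URL that the curl command requests."""
    query = urlencode({"skip": skip, "show": show})
    return f"https://arxiv.org/list/{category}/recent?{query}"


def download_listing(dest: str = "/tmp/arxiv_cs_cv.html", timeout: float = 30.0) -> Path:
    """Fetch the listing page and write it to dest (performs a network request)."""
    # The User-Agent string below is a placeholder assumption.
    request = Request(build_listing_url(), headers={"User-Agent": "arxiv-paper-extract-sketch"})
    with urlopen(request, timeout=timeout) as response:
        body = response.read()
    path = Path(dest)
    path.write_bytes(body)
    return path
```

Calling `download_listing()` writes the page to the path that Step 2 reads from.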

### Step 2: Parse papers for the target date

Use Python to extract papers. The HTML structure uses `<h3>` tags for date headers (e.g., `<h3>Thu, 25 Dec 2025`).

```python
import re
from html import unescape

with open('/tmp/arxiv_cs_cv.html', 'r', encoding='utf-8') as f:
    html = f.read()

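# Target date, written the way arXiv prints it (see "Date Format Reference").
# Example values; replace them with the requested date.
weekday, day, month, year = "Thu", 25, "Dec", 2025

# Date headers look like "<h3>Thu, 25 Dec 2025 ..."
header = f"{weekday}, {day} {month} {year}"
date_pattern = r'<h3>\s*' + re.escape(header)
next_header_pattern = r'<h3>\s*\w+, \d+ \w+ \d{4}'

start_match = re.search(date_pattern, html)
if start_match is None:
    raise ValueError(f"No section found for {header}; it may be outside the 'recent' window")

# The section runs from this header to the next date header (the previous day),
# or to the end of the page if this is the last date listed.
start_pos = start_match.start()
end_match = re.search(next_header_pattern, html[start_match.end():])
if end_match:
    section = html[start_pos:start_match.end() + end_match.start()]
else:
    section = html[start_pos:]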

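# Extract paper IDs and titles.
# The original pattern expects `<a href ="/abs/..."` (with a space before `=`);
# \s* tolerates either spacing, and ["'] tolerates either quote style.
id_pattern = r'<a href\s*=\s*"/abs/(\d+\.\d+)"'
title_pattern = r'''<div class=["']list-title mathjax["']>(.*?)</div>'''

ids = re.findall(id_pattern, section)
titles_raw = re.findall(title_pattern, section, re.DOTALL)

# zip() silently truncates, so fail loudly if the two lists are misaligned
if len(ids) != len(titles_raw):
    raise ValueError(f"Found {len(ids)} IDs but {len(titles_raw)} titles")

papers = []
for paper_id, title_html in zip(ids, titles_raw):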
    title = re.sub(r'<[^>]+>', '', title_html)
    title = re.sub(r'\s+', ' ', title).strip()
    title = unescape(title)
    # Remove "Title:" prefix if present
    if title.startswith("Title:"):
        title = title[6:].strip()
    papers.append({
        "id": f"https://arxiv.org/abs/{paper_id}",
        "title_en": title,
        "title_cn": ""
    })
```
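The title cleanup inside the loop can be checked in isolation. The sketch below wraps it in a hypothetical `clean_title` helper and runs it on a made-up fragment shaped like the inner HTML of a `list-title` div. The sample markup is an assumption for illustration, not copied from arXiv. Unlike the loop above, it unescapes entities before collapsing whitespace.

```python
import re
from html import unescape


def clean_title(title_html: str) -> str:
    """Turn the inner HTML of a list-title div into a plain-text title."""
    text = re.sub(r"<[^>]+>", "", title_html)   # drop tags such as <span>
    text = unescape(text)                        # decode entities like &amp;
    text = re.sub(r"\s+", " ", text).strip()     # collapse newlines and runs of spaces
    if text.startswith("Title:"):
        text = text[len("Title:"):].strip()
    return text


# Hypothetical sample fragment
sample = "<span class='descriptor'>Title:</span>\n          Fast &amp; Accurate\n  3D Segmentation\n        "
print(clean_title(sample))  # prints: Fast & Accurate 3D Segmentation
```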

### Step 3: Save extracted papers to temporary JSON

Save the extracted papers first, then delegate translation to a subagent.

```python
import json
from pathlib import Path

# Target date; use the same values as in Step 2 (example shown)
day, month, year = 25, "Dec", 2025

# Output name format: YYYY-MM-DD.json (e.g., 2025-12-25.json)
# Convert month name to number: Dec -> 12
month_map = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "May": "05", "Jun": "06",
             "Jul": "07", "Aug": "08", "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}
month_num = month_map[month]
filename = f"{year}-{month_num}-{day:02d}.json"

output_path = Path.home() / "paper_list" / filename
output_path.parent.mkdir(parents=True, exist_ok=True)

with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(papers, f, ensure_ascii=False, indent=2)
```

### Step 4: Translate titles using Task subagent (CRITICAL)

**You MUST use the Task tool to spawn a `general` subagent for translation.** Batch translation is delegated to the subagent rather than done in the main workflow.

Call the Task tool with the following parameters:

```
Tool: Task
Parameters:
  description: "Translate arXiv paper titles to Chinese"
  subagent_type: "general"
  prompt: |
    Read the JSON file at ~/paper_list/{YYYY-MM-DD}.json containing arXiv paper metadata.
    
    For each paper, translate the "title_en" field to Chinese and update the "title_cn" field.
    
    Translation guidelines:
    1. Keep model names as-is: SAM, SAM2, NeRF, CLIP, Transformer, Mamba, Diffusion, etc.
    2. Keep acronyms unchanged: VLM, LLM, 3D, 2D, 6DoF, MRI, CT, GAN, CNN, RNN, etc.
    3. Translate descriptive phrases accurately
    4. Maintain colon separator for titled papers (use Chinese colon：)
    5. Remove any "Title:" prefix from English titles before translating
    
    Save the updated JSON back to the same file.
    
    Return the count of translated papers and show 3-5 example translations.
```
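After the subagent reports back, its work can be spot-checked in the main workflow. The sketch below is one possible check, not part of the skill. `protected_tokens` and `review_translations` are hypothetical helpers. The rule used to spot names and acronyms (two or more capital letters, or digits mixed with letters) is an assumption that covers tokens like SAM2, NeRF, 3D and 6DoF but misses names such as Transformer or Mamba.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9]+")


def protected_tokens(title_en):
    """Heuristic: tokens with 2+ capitals (SAM, NeRF) or mixed digits and letters (3D, 6DoF)."""
    tokens = []
    for token in TOKEN.findall(title_en):
        capitals = sum(ch.isupper() for ch in token)
        mixed = any(ch.isdigit() for ch in token) and any(ch.isalpha() for ch in token)
        if capitals >= 2 or mixed:
            tokens.append(token)
    return tokens


def review_translations(papers):
    """Return (ids with empty title_cn, {id: protected tokens missing from title_cn})."""
    untranslated, dropped = [], {}
    for paper in papers:
        title_cn = paper.get("title_cn", "").strip()
        if not title_cn:
            untranslated.append(paper["id"])
            continue
        missing = [t for t in protected_tokens(paper["title_en"]) if t not in title_cn]
        if missing:
            dropped[paper["id"]] = missing
    return untranslated, dropped
```

To use it, load the saved file with `json.load` and pass the resulting list to `review_translations`. A non-empty result is a prompt to re-run the translation for those entries, not proof that a translation is wrong.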

## Output Format

File naming: `YYYY-MM-DD.json` (e.g., `2025-12-25.json`)
Location: `~/paper_list/`

```json
[
  {
    "id": "https://arxiv.org/abs/2512.21338",
    "title_en": "HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming",
    "title_cn": "HiStream：通过消除冗余的流式传输实现高效高分辨率视频生成"
  }
]
```
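A saved file can be checked against this shape before it is handed to other tools. The sketch below is a minimal validator under stated assumptions: `validate_entry` is a hypothetical helper, and the ID pattern assumes new-style arXiv identifiers (`YYMM.NNNNN`, optionally with a version suffix).

```python
import re

# Assumption: new-style arXiv IDs, e.g. 2512.21338 or 2512.21338v2
ABS_URL = re.compile(r"https://arxiv\.org/abs/\d{4}\.\d{4,5}(v\d+)?")
REQUIRED_KEYS = ("id", "title_en", "title_cn")


def validate_entry(entry):
    """Return a list of problems found in one paper entry (empty list means OK)."""
    if not isinstance(entry, dict):
        return ["entry is not an object"]
    problems = []
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        problems.append(f"missing keys: {missing}")
    for key in REQUIRED_KEYS:
        if key in entry and not isinstance(entry[key], str):
            problems.append(f"{key} is not a string")
    paper_id = entry.get("id")
    if isinstance(paper_id, str) and not ABS_URL.fullmatch(paper_id):
        problems.append("id is not an arXiv abs URL")
    return problems
```

An empty `title_cn` passes this check on purpose, because that is the valid state of the file between Step 3 and Step 4.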

## Date Format Reference

arXiv uses format: `{Weekday}, {Day} {Month} {Year}`
- Weekday: Mon, Tue, Wed, Thu, Fri
- Day: 1-31 (no leading zero)
- Month: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
- Year: four digits (e.g., 2025)

Example: `Thu, 25 Dec 2025`
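The conversions between this header format and the `YYYY-MM-DD.json` file name can also be done with `datetime` instead of a hand-written month table. This is a sketch with hypothetical helper names. It assumes the default English (C) locale, since `%a` and `%b` are locale-dependent.

```python
from datetime import date, datetime


def parse_arxiv_date(header: str) -> date:
    """Parse a header date such as 'Thu, 25 Dec 2025'."""
    return datetime.strptime(header.strip(), "%a, %d %b %Y").date()


def format_arxiv_date(d: date) -> str:
    """Format a date the way the listing prints it (day without a leading zero)."""
    # d.day is used instead of %d because %d would zero-pad single-digit days
    return f"{d:%a}, {d.day} {d:%b} {d.year}"


def json_filename(d: date) -> str:
    """Build the output file name, e.g. 2025-12-25.json."""
    return f"{d.isoformat()}.json"


target = parse_arxiv_date("Thu, 25 Dec 2025")
print(format_arxiv_date(target))  # prints: Thu, 25 Dec 2025
print(json_filename(target))      # prints: 2025-12-25.json
```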

## Notes

- arXiv cs.CV typically has 50-150 papers per day
- The page includes both primary cs.CV papers and cross-listed papers
- Cross-listed papers appear after the main cs.CV section
- Ensure all papers are captured by checking the count matches the arXiv listing
- **Always use Task subagent for translation** - do not attempt to translate manually in the main workflow
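
The count check mentioned above can be partly automated. The sketch below compares the number of extracted IDs and titles with the number of entries in the date section. It assumes each listing entry opens with a `<dt>` element, which should be confirmed against the downloaded HTML. `check_counts` is a hypothetical helper.

```python
import re


def check_counts(section_html, ids, titles):
    """Compare extracted ID and title counts with the number of <dt> entries in the section."""
    # Assumption: one <dt> element per paper entry in the listing markup
    entries = len(re.findall(r"<dt\b", section_html))
    return {
        "entries": entries,
        "ids": len(ids),
        "titles": len(titles),
        "duplicate_ids": len(ids) - len(set(ids)),
        "consistent": entries == len(ids) == len(titles),
    }
```

If `consistent` is false, the regular expressions in Step 2 probably no longer match the page markup and should be re-checked before anything is saved.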
