arxiv-paper-extract
Extract, translate and save arXiv CS.CV papers for a specific date. Use when user asks to fetch arXiv papers, download paper lists, extract CV papers, translate paper titles to Chinese, or save paper metadata from arxiv.org/list/cs.CV.
Best use case
arxiv-paper-extract is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using arxiv-paper-extract should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/arxiv-paper-extract/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How arxiv-paper-extract Compares
| Feature / Agent | arxiv-paper-extract | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Extract, translate and save arXiv CS.CV papers for a specific date. Use when user asks to fetch arXiv papers, download paper lists, extract CV papers, translate paper titles to Chinese, or save paper metadata from arxiv.org/list/cs.CV.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# arXiv CS.CV Paper Extraction Skill
Extract papers from arXiv Computer Vision category for a specific date, translate titles to Chinese using a subagent, and save to JSON.
## Workflow
### Step 1: Download the arXiv page
```bash
curl -s "https://arxiv.org/list/cs.CV/recent?skip=0&show=2000" -o /tmp/arxiv_cs_cv.html
```
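Where `curl` is not available, the same download can be sketched with Python's standard library. This is a minimal sketch, not part of the skill: the names `listing_url` and `download_listing` are hypothetical, and the `User-Agent` value is an assumption (some servers reject requests that send none). The URL and query parameters are the ones from the command above.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://arxiv.org/list/cs.CV/recent"


def listing_url(skip: int = 0, show: int = 2000) -> str:
    """Build the listing URL used by the curl command above."""
    return f"{BASE_URL}?{urlencode({'skip': skip, 'show': show})}"


def download_listing(path: str = "/tmp/arxiv_cs_cv.html", timeout: float = 60.0) -> int:
    """Fetch the listing page, write it to `path`, and return the byte count."""
    # The User-Agent string is an assumed placeholder, not a documented requirement.
    request = Request(listing_url(), headers={"User-Agent": "arxiv-paper-extract/0.1"})
    with urlopen(request, timeout=timeout) as response:
        body = response.read()
    with open(path, "wb") as f:
        f.write(body)
    return len(body)
```

Calling `download_listing()` writes the page to the same path that Step 2 reads from.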
### Step 2: Parse papers for the target date
Use Python to extract papers. The HTML structure uses `<h3>` tags for date headers (e.g., `<h3>Thu, 25 Dec 2025`).
```python
import re
from html import unescape

# Target date, written the way arXiv prints it (see Date Format Reference)
weekday, day, month, year = "Thu", 25, "Dec", 2025

with open('/tmp/arxiv_cs_cv.html', 'r', encoding='utf-8') as f:
    html = f.read()

# Find the target date section
# Format: "<h3>Day, DD Mon YYYY" e.g. "<h3>Thu, 25 Dec 2025"
date_pattern = rf'<h3>{weekday}, {day} {month} {year}'
start_match = re.search(date_pattern, html)
if not start_match:
    # Without this check, `section` would be undefined further down
    raise SystemExit(f"No section found for {weekday}, {day} {month} {year}")

start_pos = start_match.start()
# Find the next date header (the previous day); skip 10 characters
# so the search does not match the current header again
remaining = html[start_pos + 10:]
end_match = re.search(r'<h3>\w+, \d+ \w+ \d{4}', remaining)
if end_match:
    section = html[start_pos:start_pos + 10 + end_match.start()]
else:
    section = html[start_pos:]

# Extract paper IDs and titles
id_pattern = r'<a href ="/abs/(\d+\.\d+)"'
title_pattern = r"<div class='list-title mathjax'>(.*?)</div>"
ids = re.findall(id_pattern, section)
titles_raw = re.findall(title_pattern, section, re.DOTALL)

papers = []
for paper_id, title_html in zip(ids, titles_raw):
    title = re.sub(r'<[^>]+>', '', title_html)  # strip inner HTML tags
    title = re.sub(r'\s+', ' ', title).strip()  # collapse whitespace
    title = unescape(title)                     # decode entities such as &amp;
    # Remove "Title:" prefix if present
    if title.startswith("Title:"):
        title = title[6:].strip()
    papers.append({
        "id": f"https://arxiv.org/abs/{paper_id}",
        "title_en": title,
        "title_cn": ""
    })
```
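To try the two regular expressions without downloading anything, the parsing loop above can be wrapped in a function and run on a small fragment. The sketch below assumes a hand-written `SAMPLE` string that imitates the structure the patterns expect; it is not copied from arxiv.org, and `parse_section` is a hypothetical name. It also raises an error when the ID and title counts differ, since `zip()` would otherwise drop the unmatched tail silently.

```python
import re
from html import unescape

ID_PATTERN = r'<a href ="/abs/(\d+\.\d+)"'
TITLE_PATTERN = r"<div class='list-title mathjax'>(.*?)</div>"


def parse_section(section: str) -> list[dict]:
    """Pair each paper ID with its cleaned title, as in the loop above."""
    ids = re.findall(ID_PATTERN, section)
    titles_raw = re.findall(TITLE_PATTERN, section, re.DOTALL)
    if len(ids) != len(titles_raw):
        # zip() would silently truncate to the shorter list, so fail loudly
        raise ValueError(f"{len(ids)} IDs but {len(titles_raw)} titles")
    papers = []
    for paper_id, title_html in zip(ids, titles_raw):
        title = re.sub(r'<[^>]+>', '', title_html)
        title = unescape(re.sub(r'\s+', ' ', title).strip())
        if title.startswith("Title:"):
            title = title[6:].strip()
        papers.append({"id": f"https://arxiv.org/abs/{paper_id}",
                       "title_en": title, "title_cn": ""})
    return papers


# Hand-written fragment (hypothetical markup, for illustration only)
SAMPLE = """
<dt><a href ="/abs/2512.00001" title="Abstract">arXiv:2512.00001</a></dt>
<dd><div class='list-title mathjax'><span class='descriptor'>Title:</span>
  Fast &amp; Robust Depth Estimation
</div></dd>
"""

papers = parse_section(SAMPLE)
print(papers)
```

If the real page markup differs from what the patterns assume, the count check is the first place the mismatch is likely to show up.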
### Step 3: Save extracted papers to temporary JSON
Save the extracted papers first, then delegate translation to a subagent.
```python
import json
from pathlib import Path

# Target date; use the same values as in Step 2
day, month, year = 25, "Dec", 2025

# Format: YYYY-MM-DD.json (e.g., 2025-12-25.json)
# Convert month name to number: Dec -> 12
month_map = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "May": "05", "Jun": "06",
             "Jul": "07", "Aug": "08", "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}
month_num = month_map[month]
filename = f"{year}-{month_num}-{day:02d}.json"  # day must be an int for :02d
output_path = Path.home() / "paper_list" / filename
output_path.parent.mkdir(parents=True, exist_ok=True)

# `papers` is the list built in Step 2
with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(papers, f, ensure_ascii=False, indent=2)
```
### Step 4: Translate titles using Task subagent (CRITICAL)
**MUST use the Task tool to spawn a `general` subagent for translation.** This is the recommended approach for batch translation tasks.
Call the Task tool with the following parameters:
```
Tool: Task
Parameters:
  description: "Translate arXiv paper titles to Chinese"
  subagent_type: "general"
  prompt: |
    Read the JSON file at ~/paper_list/{YYYY-MM-DD}.json containing arXiv paper metadata.

    For each paper, translate the "title_en" field to Chinese and update the "title_cn" field.

    Translation guidelines:
    1. Keep model names as-is: SAM, SAM2, NeRF, CLIP, Transformer, Mamba, Diffusion, etc.
    2. Keep acronyms unchanged: VLM, LLM, 3D, 2D, 6DoF, MRI, CT, GAN, CNN, RNN, etc.
    3. Translate descriptive phrases accurately
    4. Maintain colon separator for titled papers (use Chinese colon：)
    5. Remove any "Title:" prefix from English titles before translating

    Save the updated JSON back to the same file.
    Return the count of translated papers and show 3-5 example translations.
```
## Output Format
File naming: `YYYY-MM-DD.json` (e.g., `2025-12-25.json`)
Location: `~/paper_list/`
```json
[
  {
    "id": "https://arxiv.org/abs/2512.21338",
    "title_en": "HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming",
    "title_cn": "HiStream：通过消除冗余的流式传输实现高效高分辨率视频生成"
  }
]
```
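A quick structural check can catch malformed or untranslated entries after the subagent has written the file back. The sketch below is an illustration only: `check_papers`, `REQUIRED_KEYS`, and `ID_RE` are hypothetical names, and the rules simply mirror the example entry above (three string fields, an `https://arxiv.org/abs/<id>` URL).

```python
import re

REQUIRED_KEYS = {"id", "title_en", "title_cn"}
ID_RE = re.compile(r"^https://arxiv\.org/abs/\d+\.\d+$")


def check_papers(papers: list) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    for i, paper in enumerate(papers):
        if not isinstance(paper, dict) or set(paper) != REQUIRED_KEYS:
            problems.append(f"entry {i}: expected keys {sorted(REQUIRED_KEYS)}")
            continue
        if not ID_RE.match(paper["id"]):
            problems.append(f"entry {i}: unexpected id {paper['id']!r}")
        if not paper["title_en"].strip():
            problems.append(f"entry {i}: empty title_en")
        if not paper["title_cn"].strip():
            problems.append(f"entry {i}: title_cn not translated yet")
    return problems


# An entry as it looks after Step 3, before translation
example = [
    {"id": "https://arxiv.org/abs/2512.21338",
     "title_en": "HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming",
     "title_cn": ""},
]
print(check_papers(example))
```

To check a saved file, load it with `json.load` and pass the resulting list to `check_papers`.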
## Date Format Reference
arXiv uses format: `{Weekday}, {Day} {Month} {Year}`
- Weekday: Mon, Tue, Wed, Thu, Fri
- Day: 1-31 (no leading zero)
- Month: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
- Year: four digits (e.g., 2025)
Example: `Thu, 25 Dec 2025`
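The header string and the output file name can both be derived from a single `datetime.date`, which avoids typing the weekday by hand. This is a minimal sketch with hypothetical helper names (`arxiv_header`, `output_filename`). It uses explicit name tables rather than `strftime("%a")` and `strftime("%b")`, whose output depends on the active locale.

```python
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def arxiv_header(d: date) -> str:
    """Format a date in the listing-header style, with no leading zero on the day."""
    return f"{WEEKDAYS[d.weekday()]}, {d.day} {MONTHS[d.month - 1]} {d.year}"


def output_filename(d: date) -> str:
    """File name in the YYYY-MM-DD.json form described under Output Format."""
    return f"{d.isoformat()}.json"


target = date(2025, 12, 25)
print(arxiv_header(target))     # Thu, 25 Dec 2025
print(output_filename(target))  # 2025-12-25.json
```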
## Notes
- arXiv cs.CV typically has 50-150 papers per day
- The page includes both primary cs.CV papers and cross-listed papers
- Cross-listed papers appear after the main cs.CV section
- Ensure all papers are captured by checking the count matches the arXiv listing
- **Always use Task subagent for translation** - do not attempt to translate manually in the main workflow
Related Skills
extracting-ai-context
Extracts and manages AI context (skills, AGENTS.md) from workflow-kotlin library JARs. Use when setting up AI tooling for a workflow-kotlin project, updating skills after a library version change, or configuring agent-specific directories.
extracta-ai-automation
Automate Extracta AI tasks via Rube MCP (Composio). Always search tools first for current schemas.
email-extractor
Expert in email content extraction and analysis. **Use whenever the user mentions .eml files, email messages, says "Extract email information", "Using the email information", or requests to extract, parse, analyze, or process email files.** Handles email thread parsing, attachment extraction, and converting emails to structured markdown format for AI processing. (project, gitignored)
extract-page
Extract a single page from a PDF as a PNG image for quick preview.
article-extractor
Extract clean article content from URLs (blog posts, articles, tutorials) and save as readable text. Use when user wants to download, extract, or save an article/blog post from a URL without ads, navigation, or clutter.
IMRAD Research Paper Scripting
Creates engaging, step-by-step video scripts explaining the 17 parts of a research paper in IMRAD format, tailored for animation and AI voiceover.
arxivterminal
CLI tool (arxivterminal) for fetching, searching, and managing arXiv papers locally. Use when working with arXiv papers using the arxivterminal command - fetching new papers by category, searching the local database, viewing papers from specific dates, or managing the local paper database.
arxiv-reader
A skill that fetches and summarizes the content of arXiv papers. Use when the URL has the form arxiv.org/abs/{paper ID}. Downloads the PDF and reads it with the Read tool.
arxiv-search
Search arXiv preprint repository for papers in physics, mathematics, computer science, quantitative biology, and related fields
arxiv-mcp
Search and retrieve academic papers from arXiv.org using WebFetch and Exa. No MCP server required - uses existing tools to access arXiv API directly.
adr-decision-extraction
Extract architectural decisions from conversations. Identifies problem-solution pairs, trade-off discussions, and explicit choices. Use when analyzing session transcripts for ADR generation.
bgo
Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.