scientific-papers-to-dataset
Build structured datasets from academic papers. Use when the user wants to extract structured data from scientific literature, traverse citation graphs, search OpenAlex for papers, or create datasets from PDFs for research purposes.
Best use case
scientific-papers-to-dataset is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using scientific-papers-to-dataset should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/scientific-papers-to-dataset/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
SKILL.md Source
# scientific-papers-to-dataset
Build datasets by extracting structured data from academic papers and traversing citation graphs.
## When to Use This Skill
Use this skill when the user wants to:
- Create a dataset from academic papers
- Extract structured information from PDFs
- Search for papers on a topic using OpenAlex
- Traverse citation graphs to find related papers
## Architecture: Subagent Pattern
> [!IMPORTANT]
> Use **subagents** for PDF download, relevance checking, data extraction, and citation traversal to keep the main context clean.
### Recommended Subagents
1. **pdf-downloader** - Downloads PDF for a paper ID
2. **relevance-checker** - Evaluates paper relevance from title/abstract
3. **data-extractor** - Reads PDF and extracts structured data (use thinking model)
4. **citation-traverser** - Fetches related/cited/citing papers from OpenAlex
## Workflow
### Step 1: Project Setup
Generate the project assets from the user's description. The user should provide:
- **Goal**: What dataset they want to create
- **Domain**: Research area and key terminology
- **Data fields**: What information to extract from papers
Create project directory with these files:
```
projects/<project_name>/
├── prompt.txt # Data extraction instructions
├── relevance_prompt.txt # Relevance criteria for papers
├── search_query.txt # OpenAlex search terms
├── bfs_queue.json # BFS queue state (see assets)
├── pdfs/ # Downloaded PDFs
└── data/ # Extracted JSON files
```
**Generate assets by creating:**
1. **prompt.txt**: Detailed instructions for extracting data from PDFs
   - What fields to extract
   - Domain context and terminology
   - Output format (JSON structure)
   - Guidelines for handling missing/ambiguous data
2. **relevance_prompt.txt**: Criteria for filtering papers
   - What makes a paper relevant
   - Template: `{title}` and `{abstract}` placeholders
3. **search_query.txt**: OpenAlex search query
   - Domain-specific terms
   - Broad enough for coverage, specific enough for relevance
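The project layout above could be scaffolded with a few lines of Python. This is a minimal sketch, not part of the skill's own scripts: the function name `create_project` and its parameters are hypothetical, and the empty queue mirrors the `bfs_queue.json` format described in this document.

```python
import json
from pathlib import Path


def create_project(root: Path, name: str, prompt: str,
                   relevance_prompt: str, search_query: str) -> Path:
    """Create projects/<name>/ with the files and folders listed above."""
    project = root / "projects" / name
    (project / "pdfs").mkdir(parents=True, exist_ok=True)
    (project / "data").mkdir(exist_ok=True)

    (project / "prompt.txt").write_text(prompt, encoding="utf-8")
    (project / "relevance_prompt.txt").write_text(relevance_prompt, encoding="utf-8")
    (project / "search_query.txt").write_text(search_query, encoding="utf-8")

    queue_path = project / "bfs_queue.json"
    if not queue_path.exists():
        # Never overwrite an existing queue, so a rerun can resume.
        empty = {"queue": [], "processed": [], "skipped": {}, "failed": {}}
        queue_path.write_text(json.dumps(empty, indent=2), encoding="utf-8")
    return project
```

Leaving an existing `bfs_queue.json` untouched is a design choice in this sketch so that rerunning setup does not discard progress.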
### Step 2: Initial Paper Search
Search OpenAlex to populate the BFS queue:
```
GET https://api.openalex.org/works?search=<query>&per-page=25&mailto=<email>
```
Extract OpenAlex IDs (e.g., `W2741809807`) from results and add to `bfs_queue.json`.
**Options:**
- Use [search_openalex.py](scripts/search_openalex.py) script
- Write equivalent code in preferred language
- Install uv (`curl -LsSf https://astral.sh/uv/install.sh | sh`) and use Python directly
See [bfs_queue.py](references/bfs_queue.py) for queue implementation reference.
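For readers writing the equivalent code themselves, the search request and ID extraction might look like the sketch below. It uses only the standard library. The function names are hypothetical and this is not the content of `search_openalex.py`; it assumes the works-list response carries a `results` array whose items have an `id` URL of the form `https://openalex.org/W2741809807`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_search_url(query: str, email: str, per_page: int = 25) -> str:
    """Build the GET URL shown above, with the query and email URL-encoded."""
    params = {"search": query, "per-page": per_page, "mailto": email}
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def extract_work_ids(response: dict) -> list:
    """Return short IDs such as 'W2741809807' from a works-list response."""
    ids = []
    for work in response.get("results", []):
        # OpenAlex reports IDs as URLs; keep only the last path segment.
        work_id = (work.get("id") or "").rsplit("/", 1)[-1]
        if work_id.startswith("W"):
            ids.append(work_id)
    return ids


def search(query: str, email: str) -> list:
    """Perform the request (needs network access) and return work IDs."""
    with urlopen(build_search_url(query, email), timeout=30) as resp:
        return extract_work_ids(json.load(resp))
```

The returned IDs can then be appended to the `queue` list in `bfs_queue.json`.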
### Step 3: Process Queue (Loop)
Pop paper ID from queue and process with subagents:
#### 3a. Download PDF (subagent: pdf-downloader)
```
Download PDF for OpenAlex ID: <id>
Save to: projects/<name>/pdfs/<id>.pdf
Return: success/failure
```
If failed → mark as `failed: no_pdf` in queue, continue to next paper from queue.
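A pdf-downloader subagent could be backed by something like the following sketch. It is an assumption-laden illustration rather than the logic in `download_pdf.py`: it assumes the OpenAlex work record exposes `best_oa_location` and `locations` entries with a `pdf_url` field, and the function names and the `User-Agent` string are hypothetical.

```python
from pathlib import Path
from typing import Optional
from urllib.request import Request, urlopen


def pick_pdf_url(work: dict) -> Optional[str]:
    """Choose a candidate PDF URL from an OpenAlex work record, if any."""
    best = work.get("best_oa_location") or {}
    if best.get("pdf_url"):
        return best["pdf_url"]
    for location in work.get("locations") or []:
        if location.get("pdf_url"):
            return location["pdf_url"]
    return None


def looks_like_pdf(data: bytes) -> bool:
    # Servers sometimes return an HTML landing page instead of the file,
    # so check the PDF magic bytes before saving.
    return data[:5] == b"%PDF-"


def download_pdf(work: dict, dest: Path) -> str:
    """Return 'success', or 'no_pdf' as the failure reason for the queue."""
    url = pick_pdf_url(work)
    if url is None:
        return "no_pdf"
    try:
        req = Request(url, headers={"User-Agent": "papers-to-dataset-sketch"})
        with urlopen(req, timeout=60) as resp:
            data = resp.read()
    except OSError:  # URLError, HTTPError and timeouts are OSError subclasses
        return "no_pdf"
    if not looks_like_pdf(data):
        return "no_pdf"
    dest.write_bytes(data)
    return "success"
```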
#### 3b. Check Relevance (subagent: relevance-checker)
```
Given title and abstract from OpenAlex metadata,
evaluate using: [relevance_prompt.txt]
Return: {is_relevant: bool, reason: string}
```
If not relevant → mark as `skipped: <reason>` in queue, continue to next paper from queue.
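Preparing the relevance-checker input can be sketched as below. OpenAlex delivers abstracts as an `abstract_inverted_index` (word to list of positions) rather than plain text, so the sketch rebuilds the text first and then fills the `{title}` and `{abstract}` placeholders. The function names are hypothetical.

```python
from typing import Optional


def abstract_from_inverted_index(index: Optional[dict]) -> str:
    """Rebuild plain text from an OpenAlex abstract_inverted_index."""
    if not index:
        return ""
    positions = []
    for word, places in index.items():
        for place in places:
            positions.append((place, word))
    # Sorting by position restores the original word order.
    return " ".join(word for _, word in sorted(positions))


def build_relevance_prompt(template: str, work: dict) -> str:
    """Fill the {title} and {abstract} placeholders of relevance_prompt.txt."""
    title = work.get("title") or ""
    abstract = abstract_from_inverted_index(work.get("abstract_inverted_index"))
    # str.replace is used instead of str.format so that other braces in the
    # template (for example a JSON output spec) do not raise errors.
    return template.replace("{title}", title).replace("{abstract}", abstract)
```

Papers without an abstract yield an empty string here; whether to skip them or judge on title alone is left to the relevance criteria.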
#### 3c. Extract Data (subagent: data-extractor with thinking model)
```
Read PDF: projects/<name>/pdfs/<id>.pdf
Extract data following: [prompt.txt]
Return: structured JSON
```
Save result to `projects/<name>/data/<id>.json`.
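Saving the extractor's reply might be handled as in this sketch, which parses the returned text as JSON before writing so that malformed output is caught early. The function name `save_extraction` is hypothetical, and the check for an object or array is an assumption about what "structured JSON" means here.

```python
import json
from pathlib import Path


def save_extraction(project_dir: Path, work_id: str, raw_output: str) -> Path:
    """Parse the subagent's JSON reply and write it to data/<id>.json."""
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed output.
    record = json.loads(raw_output)
    if not isinstance(record, (dict, list)):
        raise ValueError("expected a JSON object or array")
    data_dir = project_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    out_path = data_dir / f"{work_id}.json"
    out_path.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_path
```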
#### 3d. Traverse Citations (subagent: citation-traverser)
```
For OpenAlex ID: <id>
Fetch: referenced_works, related_works, citing works
Return: list of new paper IDs
```
Add new IDs to queue (skip already processed/skipped/failed).
Mark current paper as `processed`.
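The ID collection in this step could be sketched as follows. It assumes the work record lists `referenced_works` and `related_works` as OpenAlex URLs, and that citing works are retrieved through the `cites:` filter on the works endpoint. The function names are hypothetical, and results from the citing-works URL are paginated, so a single request returns only the first page.

```python
from urllib.parse import urlencode


def short_id(openalex_url: str) -> str:
    """Turn 'https://openalex.org/W123' into 'W123'."""
    return openalex_url.rsplit("/", 1)[-1]


def outgoing_ids(work: dict) -> list:
    """IDs this paper points to: its references plus OpenAlex's related works."""
    urls = list(work.get("referenced_works") or [])
    urls += list(work.get("related_works") or [])
    return [short_id(u) for u in urls]


def citing_works_url(work_id: str, email: str, per_page: int = 25) -> str:
    """URL listing works that cite work_id (first page only)."""
    params = {"filter": f"cites:{work_id}", "per-page": per_page, "mailto": email}
    # safe=":" keeps the filter readable; other characters are still encoded.
    return "https://api.openalex.org/works?" + urlencode(params, safe=":")
```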
### Step 4: Continue Until Done
Repeat Step 3 until:
- User stops the process
- Queue is empty (all papers in processed/skipped/failed state)
- User provides new seed papers or search queries (add them to the queue, then resume the loop)
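Steps 3 and 4 together amount to a breadth-first loop over the queue state. The sketch below shows one way to structure it, with the four subagents passed in as plain callables. All names are hypothetical, the `max_papers` cap is an added convenience for stopping early, and the state dictionary follows the `bfs_queue.json` layout.

```python
from typing import Callable, Iterable, Optional, Tuple


def process_queue(
    state: dict,
    download: Callable[[str], bool],
    check_relevance: Callable[[str], Tuple[bool, str]],
    extract: Callable[[str], None],
    traverse: Callable[[str], Iterable[str]],
    max_papers: Optional[int] = None,
) -> int:
    """Run steps 3a-3d until the queue is empty or max_papers is reached."""
    handled = 0
    while state["queue"] and (max_papers is None or handled < max_papers):
        work_id = state["queue"].pop(0)  # FIFO pop gives breadth-first order
        handled += 1
        if not download(work_id):                      # 3a
            state["failed"][work_id] = "no_pdf"
            continue
        relevant, reason = check_relevance(work_id)    # 3b
        if not relevant:
            state["skipped"][work_id] = reason
            continue
        extract(work_id)                               # 3c
        seen = (set(state["queue"]) | set(state["processed"])
                | set(state["skipped"]) | set(state["failed"]) | {work_id})
        for new_id in traverse(work_id):               # 3d
            if new_id not in seen:
                seen.add(new_id)
                state["queue"].append(new_id)
        state["processed"].append(work_id)
    return handled
```

In practice the state would be written back to disk after each paper so the run can be stopped and resumed.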
## BFS Queue Format
Use `bfs_queue.json` for stop/resume:
```json
{
  "queue": ["W123", "W456"],
  "processed": ["W789"],
  "skipped": {"W111": "review article, no experimental data"},
  "failed": {"W222": "pdf not available"}
}
```
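Stop/resume depends on this file surviving interruptions. A minimal sketch of loading and saving it is shown below. The helper names are hypothetical and this is not the implementation in `bfs_queue.py`. Writing to a temporary file and renaming it is one common way to avoid a half-written queue if the process is killed mid-save.

```python
import json
import os
from pathlib import Path


def load_state(path: Path) -> dict:
    """Load bfs_queue.json, or return an empty state if it does not exist."""
    if not path.exists():
        return {"queue": [], "processed": [], "skipped": {}, "failed": {}}
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path: Path, state: dict) -> None:
    """Write the state to a temporary file, then swap it into place."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    # os.replace overwrites the target in one step on the same filesystem.
    os.replace(tmp, path)
```

Calling `save_state` after every processed, skipped, or failed paper keeps the file in step with actual progress.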
## Key Principles
1. **Use subagents** for each processing step to preserve main context
2. **Use thinking model** for data extraction (complex reasoning needed)
3. **Handle failures gracefully** - ~30-50% of papers won't have accessible PDFs
4. **Track everything** - `bfs_queue.json` enables stop/resume at any point
5. **Rate limit OpenAlex** - 10 req/sec with email, 1 req/sec without
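The rate-limit principle can be enforced client-side with a small throttle that spaces out requests. This is a sketch under the limits stated above (10 requests per second with an email); the `RateLimiter` class is hypothetical, and the injectable `clock` and `sleep` arguments exist only to make it testable.

```python
import time


class RateLimiter:
    """Block so that calls to wait() are at least 1/max_per_second apart."""

    def __init__(self, max_per_second: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next call may proceed

    def wait(self) -> None:
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

A single instance, for example `RateLimiter(10)`, would be shared by every subagent that calls OpenAlex, with `wait()` invoked before each request.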
## References
- [OPENALEX.md](references/OPENALEX.md) - OpenAlex API reference
- [WORKFLOW.md](references/WORKFLOW.md) - Detailed workflow steps
- [bfs_queue.py](references/bfs_queue.py) - Queue implementation reference
- [download_pdf.py](references/download_pdf.py) - PDF download reference with some of the logic for downloading PDFs