scientific-papers-to-dataset

Build structured datasets from academic papers. Use when the user wants to extract structured data from scientific literature, traverse citation graphs, search OpenAlex for papers, or create datasets from PDFs for research purposes.


by diegosouzapw


Best use case

scientific-papers-to-dataset is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using scientific-papers-to-dataset should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/scientific-papers-to-dataset/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/scientific-papers-to-dataset/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/scientific-papers-to-dataset/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How scientific-papers-to-dataset Compares

Feature / Agent	scientific-papers-to-dataset	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

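What does this skill do?

It builds structured datasets from academic papers by searching OpenAlex, downloading PDFs, extracting structured data, and traversing citation graphs.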

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# scientific-papers-to-dataset

Build datasets by extracting structured data from academic papers and traversing citation graphs.

## When to Use This Skill

Use this skill when the user wants to:

- Create a dataset from academic papers
- Extract structured information from PDFs
- Search for papers on a topic using OpenAlex
- Traverse citation graphs to find related papers

## Architecture: Subagent Pattern

> [!IMPORTANT]
> Use **subagents** for PDF download, relevance checking, data extraction, and citation traversal to keep the main context clean.

### Recommended Subagents

1. **pdf-downloader** - Downloads PDF for a paper ID
2. **relevance-checker** - Evaluates paper relevance from title/abstract
3. **data-extractor** - Reads PDF and extracts structured data (use thinking model)
4. **citation-traverser** - Fetches related/cited/citing papers from OpenAlex

## Workflow

### Step 1: Project Setup

From the user's description, generate the project assets. The user should provide:

- **Goal**: What dataset they want to create
- **Domain**: Research area and key terminology
- **Data fields**: What information to extract from papers

Create project directory with these files:

```
projects/<project_name>/
├── prompt.txt           # Data extraction instructions
├── relevance_prompt.txt # Relevance criteria for papers
├── search_query.txt     # OpenAlex search terms
├── bfs_queue.json       # BFS queue state (see assets)
├── pdfs/                # Downloaded PDFs
└── data/                # Extracted JSON files
```

**Generate assets by creating:**

1. **prompt.txt**: Detailed instructions for extracting data from PDFs
   - What fields to extract
   - Domain context and terminology
   - Output format (JSON structure)
   - Guidelines for handling missing/ambiguous data

2. **relevance_prompt.txt**: Criteria for filtering papers
   - What makes a paper relevant
   - Template: `{title}` and `{abstract}` placeholders

3. **search_query.txt**: OpenAlex search query
   - Domain-specific terms
   - Broad enough for coverage, specific enough for relevance
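
The layout above can be scaffolded with a few lines of Python. This is a minimal sketch, not the skill's own setup code: the function name `scaffold_project` and its parameters are hypothetical, and the empty queue mirrors the `bfs_queue.json` keys shown under "BFS Queue Format".

```python
import json
from pathlib import Path


def scaffold_project(root, name, prompt, relevance_prompt, search_query):
    """Create projects/<name>/ with the files and folders the workflow expects.

    Hypothetical helper: the name and signature are illustrative only.
    """
    project = Path(root) / "projects" / name
    (project / "pdfs").mkdir(parents=True, exist_ok=True)
    (project / "data").mkdir(parents=True, exist_ok=True)

    (project / "prompt.txt").write_text(prompt, encoding="utf-8")
    (project / "relevance_prompt.txt").write_text(relevance_prompt, encoding="utf-8")
    (project / "search_query.txt").write_text(search_query, encoding="utf-8")

    queue_path = project / "bfs_queue.json"
    if not queue_path.exists():  # never overwrite saved state when re-running setup
        empty = {"queue": [], "processed": [], "skipped": {}, "failed": {}}
        queue_path.write_text(json.dumps(empty, indent=2), encoding="utf-8")
    return project
```

Leaving an existing `bfs_queue.json` untouched means setup can be re-run to refresh the prompts without losing progress.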

### Step 2: Initial Paper Search

Search OpenAlex to populate the BFS queue:

```
GET https://api.openalex.org/works?search=<query>&per-page=25&mailto=<email>
```

Extract OpenAlex IDs (e.g., `W2741809807`) from results and add to `bfs_queue.json`.
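
As an illustration of this step, the sketch below builds the search URL and pulls short IDs out of a decoded response. It is not the bundled `search_openalex.py`. The helper names are hypothetical, and it assumes each entry in `results` carries an `id` of the form `https://openalex.org/W...`.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_search_url(query, email, per_page=25):
    """Return the works-search URL for the given query (hypothetical helper)."""
    params = {"search": query, "per-page": per_page, "mailto": email}
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def extract_ids(payload):
    """Return short work IDs such as 'W2741809807' from a decoded JSON response.

    Assumes each result has an 'id' like 'https://openalex.org/W2741809807'.
    """
    ids = []
    for work in payload.get("results", []):
        short = (work.get("id") or "").rsplit("/", 1)[-1]
        if short.startswith("W"):
            ids.append(short)
    return ids
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the decoded JSON to `extract_ids` gives the list to append to the `queue` key of `bfs_queue.json`.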

**Options:**

- Use [search_openalex.py](scripts/search_openalex.py) script
- Write equivalent code in preferred language
- Install uv (`curl -LsSf https://astral.sh/uv/install.sh | sh`) and use Python directly

See [bfs_queue.py](references/bfs_queue.py) for queue implementation reference.

### Step 3: Process Queue (Loop)

Pop paper ID from queue and process with subagents:

#### 3a. Download PDF (subagent: pdf-downloader)

```
Download PDF for OpenAlex ID: <id>
Save to: projects/<name>/pdfs/<id>.pdf
Return: success/failure
```

If failed → mark as `failed: no_pdf` in queue, continue to next paper from queue.
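
One way the pdf-downloader could decide what to fetch is sketched below. It is not the referenced `download_pdf.py`. It assumes the OpenAlex work record exposes `best_oa_location`, `primary_location`, and `locations`, each with a `pdf_url` field, and the function names are hypothetical.

```python
def candidate_pdf_urls(work):
    """Collect distinct pdf_url values from a work record, best location first.

    Assumes the field names best_oa_location, primary_location, locations
    and pdf_url; any of them may be missing or null.
    """
    locations = [work.get("best_oa_location"), work.get("primary_location")]
    locations += work.get("locations") or []
    urls = []
    for location in locations:
        url = (location or {}).get("pdf_url")
        if url and url not in urls:
            urls.append(url)
    return urls


def looks_like_pdf(data):
    """Cheap sanity check: PDF files start with the bytes '%PDF-'."""
    return data[:5] == b"%PDF-"
```

The byte check is there because a URL advertised as a PDF can return an HTML landing or login page. Treating that case as `failed: no_pdf` keeps unusable files out of `pdfs/`.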

#### 3b. Check Relevance (subagent: relevance-checker)

```
Given title and abstract from OpenAlex metadata,
evaluate using: [relevance_prompt.txt]
Return: {is_relevant: bool, reason: string}
```

If not relevant → mark as `skipped: <reason>` in queue, continue to next paper from queue.
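
Filling the template is mostly string substitution, with one wrinkle: OpenAlex returns abstracts as an `abstract_inverted_index` (word to positions) rather than plain text. The sketch below rebuilds the text and fills the `{title}` and `{abstract}` placeholders. The function names are hypothetical.

```python
def abstract_from_inverted_index(index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not index:
        return ""
    words_by_position = {}
    for word, positions in index.items():
        for position in positions:
            words_by_position[position] = word
    return " ".join(words_by_position[p] for p in sorted(words_by_position))


def render_relevance_prompt(template, work):
    """Fill {title} and {abstract} in the relevance prompt template.

    Uses str.replace rather than str.format so that literal braces in the
    template (for example a JSON output example) are left alone.
    """
    abstract = abstract_from_inverted_index(work.get("abstract_inverted_index"))
    return (
        template
        .replace("{title}", work.get("title") or "")
        .replace("{abstract}", abstract)
    )
```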

#### 3c. Extract Data (subagent: data-extractor with thinking model)

```
Read PDF: projects/<name>/pdfs/<id>.pdf
Extract data following: [prompt.txt]
Return: structured JSON
```

Save result to `projects/<name>/data/<id>.json`.
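
A small guard before saving helps keep malformed subagent output out of `data/`. The sketch below assumes the subagent returns a JSON object as text. The `save_extraction` name and the added `openalex_id` field are hypothetical conveniences, not part of the skill.

```python
import json
from pathlib import Path


def save_extraction(project_dir, paper_id, raw_output):
    """Parse the extractor's JSON text and write it to data/<id>.json.

    Raises ValueError if the text is not valid JSON or not a JSON object.
    """
    record = json.loads(raw_output)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object at the top level")
    record.setdefault("openalex_id", paper_id)  # hypothetical provenance field

    out_path = Path(project_dir) / "data" / f"{paper_id}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
    return out_path
```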

#### 3d. Traverse Citations (subagent: citation-traverser)

```
For OpenAlex ID: <id>
Fetch: referenced_works, related_works, citing works
Return: list of new paper IDs
```

Add new IDs to queue (skip already processed/skipped/failed).
Mark current paper as `processed`.
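
The traversal step can be sketched as follows. It assumes `referenced_works` and `related_works` are lists of full OpenAlex URLs on the work record, and that citing works are fetched separately through the `cites:` filter on the works endpoint. The helper names are hypothetical.

```python
from urllib.parse import urlencode


def citing_works_url(paper_id, email, per_page=25):
    """URL listing works that cite paper_id, using the 'cites:' filter."""
    params = {"filter": f"cites:{paper_id}", "per-page": per_page, "mailto": email}
    return "https://api.openalex.org/works?" + urlencode(params, safe=":")


def neighbor_ids(work, citing_payload=None):
    """Return short IDs of referenced, related and citing works, without duplicates.

    work: decoded work record; citing_payload: decoded response from
    citing_works_url, or None if that request was skipped.
    """
    urls = list(work.get("referenced_works") or [])
    urls += work.get("related_works") or []
    urls += [w["id"] for w in (citing_payload or {}).get("results", []) if w.get("id")]
    short_ids = (url.rsplit("/", 1)[-1] for url in urls)
    return list(dict.fromkeys(short_ids))  # dict keys keep first-seen order
```

The returned list still has to be filtered against the queue state before enqueueing, as described above.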

### Step 4: Continue Until Done

Repeat Step 3 until:

- User stops the process
- Queue is empty (all papers in processed/skipped/failed state)
- User provides new seed papers or search queries
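
Steps 3 and 4 together amount to a breadth-first loop over the queue. The sketch below shows that control flow only: the four subagents are passed in as plain callables, and every name here (`process_one`, `run`, `checkpoint`, `should_stop`) is hypothetical.

```python
def process_one(state, pid, download, check_relevance, extract, traverse):
    """Run one paper through steps 3a-3d and record the outcome in state."""
    if not download(pid):
        state["failed"][pid] = "no_pdf"
        return
    verdict = check_relevance(pid)
    if not verdict["is_relevant"]:
        state["skipped"][pid] = verdict["reason"]
        return
    extract(pid)
    # Anything already queued or finished must not be enqueued again.
    seen = {pid, *state["queue"], *state["processed"], *state["skipped"], *state["failed"]}
    for new_id in traverse(pid):
        if new_id not in seen:
            state["queue"].append(new_id)
            seen.add(new_id)
    state["processed"].append(pid)


def run(state, steps, checkpoint=lambda s: None, should_stop=lambda: False):
    """Process the queue front to back until it is empty or should_stop() is true."""
    while state["queue"] and not should_stop():
        pid = state["queue"].pop(0)  # FIFO order is what makes the traversal breadth-first
        process_one(state, pid, **steps)
        checkpoint(state)  # e.g. write bfs_queue.json after every paper
    return state
```

Checkpointing after every paper keeps the cost of an interruption to at most the paper in flight. A stricter variant would remove the ID from the queue only after its outcome has been recorded.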

## BFS Queue Format

Use `bfs_queue.json` for stop/resume:

```json
{
  "queue": ["W123", "W456"],
  "processed": ["W789"],
  "skipped": {"W111": "review article, no experimental data"},
  "failed": {"W222": "pdf not available"}
}
```
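
For stop/resume to be dependable, the file should never be left half-written. The sketch below loads the state and saves it through a temporary file plus `os.replace`. It is an illustration under that assumption, not the referenced `bfs_queue.py`, and the function names are hypothetical.

```python
import copy
import json
import os
import tempfile
from pathlib import Path

EMPTY_STATE = {"queue": [], "processed": [], "skipped": {}, "failed": {}}


def load_state(path):
    """Load bfs_queue.json, or return a fresh empty state if it does not exist."""
    path = Path(path)
    if not path.exists():
        return copy.deepcopy(EMPTY_STATE)
    state = json.loads(path.read_text(encoding="utf-8"))
    for key, default in EMPTY_STATE.items():
        state.setdefault(key, copy.deepcopy(default))  # tolerate older or partial files
    return state


def save_state(path, state):
    """Write the state to a temp file in the same directory, then swap it in."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, path)  # replaces the old file in one step
```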

## Key Principles

1. **Use subagents** for each processing step to preserve main context
2. **Use thinking model** for data extraction (complex reasoning needed)
3. **Handle failures gracefully** - ~30-50% of papers won't have accessible PDFs
4. **Track everything** - `bfs_queue.json` enables stop/resume at any point
5. **Rate limit OpenAlex** - 10 req/sec with email, 1 req/sec without
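
Principle 5 can be enforced with a small client-side limiter that spaces out requests. The sketch below is one possible approach: the `RateLimiter` class is hypothetical, and the rate passed in should be whichever limit from the list above applies.

```python
import time


class RateLimiter:
    """Block so that consecutive wait() calls are at least 1/max_per_second apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock    # injectable so the limiter can be tested without real delays
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Sleep if the previous call was too recent, then reserve the next slot."""
        delay = self._next_allowed - self._clock()
        if delay > 0:
            self._sleep(delay)
        self._next_allowed = self._clock() + self.min_interval


# Usage: call limiter.wait() immediately before every OpenAlex request.
limiter = RateLimiter(max_per_second=10)
```

Sharing a single limiter across subagents that call OpenAlex keeps the combined request rate under the limit, not just the rate of each caller.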

## References

- [OPENALEX.md](references/OPENALEX.md) - OpenAlex API reference
- [WORKFLOW.md](references/WORKFLOW.md) - Detailed workflow steps
- [bfs_queue.py](references/bfs_queue.py) - Queue implementation reference
- [download_pdf.py](references/download_pdf.py) - PDF download reference (covers part of the download logic)
