data-extractor

Extract numerical data from scientific figure images using Claude vision + OpenCV calibration. Supports 26+ plot types including bar charts, scatter plots, forest plots, Kaplan-Meier curves, box plots, and more.

564 stars

Best use case

data-extractor is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Extract numerical data from scientific figure images using Claude vision + OpenCV calibration. Supports 26+ plot types including bar charts, scatter plots, forest plots, Kaplan-Meier curves, box plots, and more.

Teams using data-extractor should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/data-extractor/SKILL.md --create-dirs "https://raw.githubusercontent.com/beita6969/ScienceClaw/main/skills/data-extractor/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/data-extractor/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How data-extractor Compares

Feature / Agentdata-extractorStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Extract numerical data from scientific figure images using Claude vision + OpenCV calibration. Supports 26+ plot types including bar charts, scatter plots, forest plots, Kaplan-Meier curves, box plots, and more.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# 📊 Data Extractor

You are the **Data Extractor**, a ClawBio skill for digitizing scientific figures. Your role is to extract numerical data from plot images for meta-analyses and systematic reviews.

## When to Use This Skill

Route to this skill when the user:
- Provides an image file (PNG, JPG, TIFF) containing a scientific figure
- Asks to "extract data from a figure", "digitize a plot", "read values from a chart"
- Mentions "meta-analysis data extraction" or "figure digitization"
- Wants to convert a bar chart, scatter plot, or other figure to CSV/JSON

## Capabilities

### Supported Plot Types (26)
scatter, bar, line, box, violin, histogram, heatmap, forest, kaplan_meier,
dot_strip, stacked_bar, funnel, roc, volcano, waterfall, bland_altman,
paired, bubble, area, dose_response, manhattan, correlation_matrix,
error_bar, table, other

### Pipeline (4 phases)
1. **Panel Detection** — Identify sub-panels in multi-panel figures (Claude vision)
2. **Pre-Analysis** — Identify axes, scale (linear/log), legend entries, error bars (Claude tool calling)
3. **CV Calibration + Extraction** — OpenCV detects markers/bars at pixel level, Claude extracts numerical data with calibration context
4. **Validation** — Heuristic checks for axis range, series count, error bar polarity

### Output Formats
- **CSV** — One row per data point with series name, x/y values, error bars
- **JSON** — Structured ExtractedData objects with full metadata
- **Web UI** — Interactive table + SVG preview with editable cells

## Usage

### CLI
```bash
python data_extractor.py --image figure.png --output results/
python data_extractor.py --web --port 8765
python data_extractor.py --demo
```

### API (importable)
```python
from api import run
result = run(options={"image_path": "figure.png", "output_dir": "results/"})
```

### Web UI
Launch with `--web` flag. Upload images, draw boxes around plots, extract and edit data interactively.

## Input Formats
- PNG, JPG, JPEG, TIFF image files
- Screenshots from papers, posters, slides
- Multi-panel composite figures (auto-detected and split)

## Notes
- Requires ANTHROPIC_API_KEY environment variable
- Uses Claude Sonnet for pre-analysis/detection, Claude Opus for extraction
- OpenCV calibration improves accuracy for scatter/bar plots with clear markers
- Error bars are reported as ± extent (delta from mean), not absolute positions

Related Skills

world-bank-data

564
from beita6969/ScienceClaw

World Bank Open Data API for development indicators. Use when: user asks about GDP, population, poverty, health, or education statistics by country. NOT for: real-time financial data or stock prices.

wikidata-knowledge

564
from beita6969/ScienceClaw

Query Wikidata for structured knowledge using SPARQL and entity search. Use when: (1) finding structured facts about entities (people, places, organizations), (2) querying relationships between entities, (3) cross-referencing external identifiers (Wikipedia, VIAF, GND, ORCID), (4) building knowledge graphs from linked data. NOT for: full-text article content (use Wikipedia API), scientific literature (use semantic-scholar), geospatial data (use OpenStreetMap).

uniprot-database

564
from beita6969/ScienceClaw

Direct REST API access to UniProt. Protein searches, FASTA retrieval, ID mapping, Swiss-Prot/TrEMBL. For Python workflows with multiple databases, prefer bioservices (unified interface to 40+ services). Use this for direct HTTP/REST work or UniProt-specific control.

string-database

564
from beita6969/ScienceClaw

Query STRING API for protein-protein interactions (59M proteins, 20B interactions). Network analysis, GO/KEGG enrichment, interaction discovery, 5000+ species, for systems biology.

reactome-database

564
from beita6969/ScienceClaw

Query Reactome REST API for pathway analysis, enrichment, gene-pathway mapping, disease pathways, molecular interactions, expression analysis, for systems biology studies.

pubmed-database

564
from beita6969/ScienceClaw

Direct REST API access to PubMed. Advanced Boolean/MeSH queries, E-utilities API, batch processing, citation management. For Python workflows, prefer biopython (Bio.Entrez). Use this for direct HTTP/REST work or custom API implementations.

pubchem-database

564
from beita6969/ScienceClaw

Query PubChem via PUG-REST API/PubChemPy (110M+ compounds). Search by name/CID/SMILES, retrieve properties, similarity/substructure searches, bioactivity, for cheminformatics.

pdb-database

564
from beita6969/ScienceClaw

Access RCSB PDB for 3D protein/nucleic acid structures. Search by text/sequence/structure, download coordinates (PDB/mmCIF), retrieve metadata, for structural biology and drug discovery.

opentargets-database

564
from beita6969/ScienceClaw

Query Open Targets Platform for target-disease associations, drug target discovery, tractability/safety data, genetics/omics evidence, known drugs, for therapeutic target identification.

openalex-database

564
from beita6969/ScienceClaw

Query and analyze scholarly literature using the OpenAlex database. This skill should be used when searching for academic papers, analyzing research trends, finding works by authors or institutions, tracking citations, discovering open access publications, or conducting bibliometric analysis across 240M+ scholarly works. Use for literature searches, research output analysis, citation analysis, and academic database queries.

kegg-database

564
from beita6969/ScienceClaw

Direct REST API access to KEGG (academic use only). Pathway analysis, gene-pathway mapping, metabolic pathways, drug interactions, ID conversion. For Python workflows with multiple databases, prefer bioservices. Use this for direct HTTP/REST work or KEGG-specific control.

gnomad-database

564
from beita6969/ScienceClaw

Query gnomAD (Genome Aggregation Database) for population allele frequencies, variant constraint scores (pLI, LOEUF), and loss-of-function intolerance. Essential for variant pathogenicity interpretation, rare disease genetics, and identifying loss-of-function intolerant genes.