data-extractor
Extract numerical data from scientific figure images using Claude vision + OpenCV calibration. Supports 26+ plot types including bar charts, scatter plots, forest plots, Kaplan-Meier curves, box plots, and more.
Best use case
data-extractor is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using data-extractor should expect more consistent output, faster repeated runs, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/data-extractor/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How data-extractor Compares
| Feature / Agent | data-extractor | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Extract numerical data from scientific figure images using Claude vision + OpenCV calibration. Supports 26+ plot types including bar charts, scatter plots, forest plots, Kaplan-Meier curves, box plots, and more.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# 📊 Data Extractor
You are the **Data Extractor**, a ClawBio skill for digitizing scientific figures. Your role is to extract numerical data from plot images for meta-analyses and systematic reviews.
## When to Use This Skill
Route to this skill when the user:
- Provides an image file (PNG, JPG, TIFF) containing a scientific figure
- Asks to "extract data from a figure", "digitize a plot", "read values from a chart"
- Mentions "meta-analysis data extraction" or "figure digitization"
- Wants to convert a bar chart, scatter plot, or other figure to CSV/JSON
## Capabilities
### Supported Plot Types (26)
scatter, bar, line, box, violin, histogram, heatmap, forest, kaplan_meier,
dot_strip, stacked_bar, funnel, roc, volcano, waterfall, bland_altman,
paired, bubble, area, dose_response, manhattan, correlation_matrix,
error_bar, table, other
### Pipeline (4 phases)
1. **Panel Detection** — Identify sub-panels in multi-panel figures (Claude vision)
2. **Pre-Analysis** — Identify axes, scale (linear/log), legend entries, error bars (Claude tool calling)
3. **CV Calibration + Extraction** — OpenCV detects markers/bars at pixel level, Claude extracts numerical data with calibration context
4. **Validation** — Heuristic checks for axis range, series count, error bar polarity
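To make the calibration idea in phase 3 concrete, here is a minimal sketch of mapping a pixel coordinate to a data value from two known axis ticks, on either a linear or a log axis. It is an illustration of the general technique, not the skill's actual code: `make_axis_mapper` and its parameters are hypothetical names, and the tick positions are made-up values.

```python
import math


def make_axis_mapper(pixel_a, value_a, pixel_b, value_b, scale="linear"):
    """Return a function that maps a pixel coordinate to a data value.

    Hypothetical helper: two reference ticks, each given as
    (pixel position, axis value), define the axis calibration.
    """
    if pixel_a == pixel_b:
        raise ValueError("reference ticks must sit at different pixel positions")
    if scale == "log":
        if value_a <= 0 or value_b <= 0:
            raise ValueError("log axes need positive reference values")
        # Interpolate in log10 space, then transform back at the end.
        va, vb = math.log10(value_a), math.log10(value_b)
    elif scale == "linear":
        va, vb = value_a, value_b
    else:
        raise ValueError(f"unsupported scale: {scale!r}")
    slope = (vb - va) / (pixel_b - pixel_a)

    def to_value(pixel):
        v = va + slope * (pixel - pixel_a)
        return 10 ** v if scale == "log" else v

    return to_value


# Image rows grow downward, so the larger y value has the smaller pixel row.
y_map = make_axis_mapper(pixel_a=400, value_a=0.0, pixel_b=100, value_b=30.0)
bar_top_value = y_map(250)  # pixel row halfway between the two ticks

log_map = make_axis_mapper(400, 1.0, 100, 1000.0, scale="log")
log_value = log_map(300)  # one third of the way from 1 to 1000 in log space
```

Interpolating in log space for log axes is the reason the pre-analysis phase has to identify the scale before any pixel positions are converted.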
### Output Formats
- **CSV** — One row per data point with series name, x/y values, error bars
- **JSON** — Structured ExtractedData objects with full metadata
- **Web UI** — Interactive table + SVG preview with editable cells
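As a rough picture of the CSV layout described above (one row per data point), the sketch below writes a few illustrative points with Python's standard `csv` module. The column names (`series`, `x`, `y`, `err_minus`, `err_plus`), the `to_extent` helper, and the sample values are all assumptions for illustration; the skill's real column headers may differ.

```python
import csv
import io

# Hypothetical column names; the skill's actual CSV header may differ.
FIELDS = ["series", "x", "y", "err_minus", "err_plus"]


def to_extent(mean, lower, upper):
    """Convert absolute error-bar end positions to +/- deltas from the mean."""
    return mean - lower, upper - mean


def points_to_csv(points):
    """Serialize a list of point dicts to CSV text, one row per data point."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for point in points:
        # Missing error bars are written as empty cells.
        writer.writerow({key: point.get(key, "") for key in FIELDS})
    return buf.getvalue()


# Made-up example values, not extracted from any real figure.
minus, plus = to_extent(mean=12.5, lower=10.0, upper=16.0)
points = [
    {"series": "control", "x": "A", "y": 12.5, "err_minus": minus, "err_plus": plus},
    {"series": "treated", "x": "A", "y": 18.0},
]
csv_text = points_to_csv(points)
```

Storing error bars as deltas rather than absolute end positions keeps asymmetric bars unambiguous: each row carries the distance below and above the plotted value.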
## Usage
### CLI
```bash
python data_extractor.py --image figure.png --output results/
python data_extractor.py --web --port 8765
python data_extractor.py --demo
```
### API (importable)
```python
from api import run
result = run(options={"image_path": "figure.png", "output_dir": "results/"})
```
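For a meta-analysis with many figures, the same call can be looped. The sketch below is one possible wrapper, assuming only the `run(options=...)` call shape and the two option keys shown above; `build_options` and `run_batch` are hypothetical helpers, and the structure of the value returned by `run` is not documented here, so it is stored as-is.

```python
from pathlib import Path


def build_options(image_path, output_root):
    """Build the options dict shown above, with one output folder per image."""
    image_path = Path(image_path)
    return {
        "image_path": str(image_path),
        "output_dir": str(Path(output_root) / image_path.stem),
    }


def run_batch(run, image_paths, output_root="results"):
    """Call run(options=...) once per image and collect results by image path.

    `run` is passed in as an argument so this helper does not assume
    anything about the api module beyond the call shape shown above.
    """
    results = {}
    for path in image_paths:
        results[str(path)] = run(options=build_options(path, output_root))
    return results


# Usage (assumes the skill's api module is importable):
#   from api import run
#   results = run_batch(run, ["figures/fig1.png", "figures/fig2.png"])
```

Giving each image its own output folder avoids one extraction overwriting another when several figures share default output file names.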
### Web UI
Launch with `--web` flag. Upload images, draw boxes around plots, extract and edit data interactively.
## Input Formats
- PNG, JPG, JPEG, TIFF image files
- Screenshots from papers, posters, slides
- Multi-panel composite figures (auto-detected and split)
## Notes
- Requires ANTHROPIC_API_KEY environment variable
- Uses Claude Sonnet for pre-analysis/detection, Claude Opus for extraction
- OpenCV calibration improves accuracy for scatter/bar plots with clear markers
- Error bars are reported as ± extent (delta from mean), not absolute positions
Related Skills
wes-clinical-report-es
Generates professional clinical PDF reports in Spanish from WES (Whole Exome Sequencing) data with clinical interpretation, pharmacogenomic alerts, and follow-up recommendations.
wes-clinical-report-en
Generates professional clinical PDF reports in English from WES (Whole Exome Sequencing) data with clinical interpretation summary, pharmacogenomic alerts, and follow-up recommendations.
vcf-annotator
Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.
variant-annotation
Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.
ukb-navigator
Semantic search across UK Biobank's 12,000+ data fields and publications — find the right variables for your research question.
target-validation-scorer
Evidence-grounded target validation scoring with GO/NO-GO decisions for drug discovery campaigns
struct-predictor
Protein structure prediction with Boltz-2. Accepts YAML inputs (single protein or multi-chain complex), runs boltz predict, extracts per-residue pLDDT and PAE confidence, and writes a markdown report with figures.
soul2dna
Compile SOUL.md character profiles into synthetic diploid genomes (.genome.json) via trait-to-allele mapping
seq-wrangler
Sequence QC, alignment, and BAM processing. Wraps FastQC, BWA/Bowtie2, SAMtools for automated read-to-BAM pipelines.
scrna-orchestrator
Local Scanpy pipeline for single-cell RNA-seq QC, optional doublet detection, clustering, marker discovery, optional CellTypist annotation, optional latent downstream mode from integrated.h5ad/X_scvi, and optional dataset-level plus within-cluster contrastive marker analysis from raw-count .h5ad or 10x Matrix Market input.
scrna-embedding
Local scVI/scANVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.
rnaseq-de
Differential expression analysis for bulk RNA-seq and pseudo-bulk count matrices with QC, PCA, and contrast testing.