extract-from-pdfs
This skill should be used when extracting structured data from scientific PDFs for systematic reviews, meta-analyses, or database creation. Use when working with collections of research papers that need to be converted into analyzable datasets with validation metrics.
Best use case
extract-from-pdfs is best used when you need a repeatable AI agent workflow instead of a one-off prompt, for example when a team must convert a collection of research papers into an analyzable dataset with validation metrics for a systematic review, meta-analysis, or database project.
Users should expect more consistent workflow outputs, faster repeated execution, and less time spent rewriting prompts from scratch.
Practical example
Example input
Use the "extract-from-pdfs" skill to help with this workflow task. Context: This skill should be used when extracting structured data from scientific PDFs for systematic reviews, meta-analyses, or database creation. Use when working with collections of research papers that need to be converted into analyzable datasets with validation metrics.
Example output
A validated, structured dataset exported to your analysis environment (Python, R, CSV, Excel, or SQLite), with per-field precision/recall metrics and a pipeline you can rerun on the next batch of PDFs.
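For instance, one extracted record in that dataset might look like the sketch below. This is a hypothetical ecology example; the actual fields are whatever your extraction schema defines.

```python
# Hypothetical record from an ecology extraction run; every field name here
# is illustrative and would be defined by your own schema.
record = {
    "paper_id": "smith_2021",
    "plant_species": "Trifolium pratense",
    "visitor_species": "Bombus terrestris",
    "visit_count": 12,
}
```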
When to use this skill
- Use this skill when you want a reusable workflow rather than writing the same prompt again and again.
When not to use this skill
- Do not use this when you only need a one-off answer and do not need a reusable workflow.
- Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/extract-from-pdfs/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How extract-from-pdfs Compares
| Feature / Agent | extract-from-pdfs | Standard Approach |
|---|---|---|
| Platform Support | Claude Code, Cursor, Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Low (single SKILL.md file) | N/A |
Frequently Asked Questions
What does this skill do?
It extracts structured data from scientific PDFs for systematic reviews, meta-analyses, or database creation, converting collections of research papers into analyzable datasets with validation metrics.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agent for Product Research
Browse AI agent skills for product research, competitive analysis, customer discovery, and structured product decision support.
AI Agent for SaaS Idea Validation
Use AI agent skills for SaaS idea validation, market research, customer discovery, competitor analysis, and documenting startup hypotheses.
SKILL.md Source
# Extract Structured Data from Scientific PDFs

## Purpose

Extract standardized, structured data from scientific PDF literature using Claude's vision capabilities. Transform PDF collections into validated databases ready for statistical analysis in Python, R, or other frameworks.

**Core capabilities:**

- Organize metadata from BibTeX, RIS, directories, or DOI lists
- Filter papers by abstract using Claude (Haiku/Sonnet) or local models (Ollama)
- Extract structured data from PDFs with customizable schemas
- Repair and validate JSON outputs automatically
- Enrich with external databases (GBIF, WFO, GeoNames, PubChem, NCBI)
- Calculate precision/recall metrics for quality assurance
- Export to Python, R, CSV, Excel, or SQLite

## When to Use This Skill

Use when:

- Conducting systematic literature reviews requiring data extraction
- Building databases from scientific publications
- Converting PDF collections to structured datasets
- Validating extraction quality with ground truth metrics
- Comparing extraction approaches (different models, prompts)

Do not use for:

- Single PDF summarization (use basic PDF reading instead)
- Full-text PDF search (use document search tools)
- PDF editing or manipulation

## Getting Started

### 1. Initial Setup

Read the setup guide for installation and configuration:

```bash
cat references/setup_guide.md
```

Key setup steps:

- Install dependencies: `conda env create -f environment.yml`
- Set API keys: `export ANTHROPIC_API_KEY='your-key'`
- Optional: Install Ollama for free local filtering

### 2. Define Extraction Requirements

**Ask the user:**

- Research domain and extraction goals
- How PDFs are organized (reference manager, directory, DOI list)
- Approximate collection size
- Preferred analysis environment (Python, R, etc.)

**Provide 2-3 example PDFs** to analyze structure and design the schema.

### 3. Design Extraction Schema

Create a custom schema from the template:

```bash
cp assets/schema_template.json my_schema.json
```

Customize it for the specific domain:

- Set `objective` describing what to extract
- Define `output_schema` with field types and descriptions
- Add domain-specific `instructions` for Claude
- Provide `output_example` showing the desired format

See `assets/example_flower_visitors_schema.json` for a real-world ecology example.
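For illustration, a customized schema might look like the sketch below, written in Python for convenience. The top-level keys (`objective`, `instructions`, `output_schema`, `output_example`) are the fields named above, but the exact nesting expected by `assets/schema_template.json` may differ, so treat the structure as an assumption and copy the real template.

```python
# A minimal sketch of a customized extraction schema for a flower-visitor
# study. Field nesting is illustrative; start from assets/schema_template.json.
import json

schema = {
    "objective": "Extract flower-visitor interaction records from ecology papers.",
    "instructions": [
        "Record each plant-visitor pair reported in tables or text.",
        "Use null when a field is not reported; do not guess values.",
    ],
    "output_schema": {
        "plant_species": {"type": "string", "description": "Latin binomial of the plant"},
        "visitor_species": {"type": "string", "description": "Latin binomial of the visitor"},
        "visit_count": {"type": "integer", "description": "Number of observed visits, if reported"},
    },
    "output_example": {
        "plant_species": "Trifolium pratense",
        "visitor_species": "Bombus terrestris",
        "visit_count": 12,
    },
}

# Write the schema to the file passed to the pipeline via --schema
with open("my_schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```

Generating the schema programmatically makes it easy to version and regenerate, but editing the copied JSON template directly works just as well.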
## Workflow Execution

### Complete Pipeline

Run the 6-step pipeline (plus optional validation):

```bash
# Step 1: Organize metadata
python scripts/01_organize_metadata.py \
  --source-type bibtex \
  --source library.bib \
  --pdf-dir pdfs/ \
  --output metadata.json

# Step 2: Filter papers (optional - recommended)
# Choose backend: anthropic-haiku (cheap), anthropic-sonnet (accurate), ollama (free)
python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend anthropic-haiku \
  --use-batches \
  --output filtered_papers.json

# Step 3: Extract from PDFs
python scripts/03_extract_from_pdfs.py \
  --metadata filtered_papers.json \
  --schema my_schema.json \
  --method batches \
  --output extracted_data.json

# Step 4: Repair JSON
python scripts/04_repair_json.py \
  --input extracted_data.json \
  --schema my_schema.json \
  --output cleaned_data.json

# Step 5: Validate with APIs
python scripts/05_validate_with_apis.py \
  --input cleaned_data.json \
  --apis my_api_config.json \
  --output validated_data.json

# Step 6: Export to analysis format
python scripts/06_export_database.py \
  --input validated_data.json \
  --format python \
  --output results
```

### Validation (Optional but Recommended)

Calculate extraction quality metrics:

```bash
# Step 7: Sample papers for annotation
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy stratified \
  --output validation_set.json

# Step 8: Manually annotate (edit validation_set.json)
# Fill ground_truth field for each sampled paper

# Step 9: Calculate metrics
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --output validation_metrics.json \
  --report validation_report.txt
```

Validation produces precision, recall, and F1 metrics per field and overall.

## Detailed Documentation

Access comprehensive guides in the `references/` directory:

**Setup and installation:**

```bash
cat references/setup_guide.md
```

**Complete workflow with examples:**

```bash
cat references/workflow_guide.md
```

**Validation methodology:**

```bash
cat references/validation_guide.md
```

**API integration details:**

```bash
cat references/api_reference.md
```

## Customization

### Schema Customization

Modify `my_schema.json` to match the research domain:

1. **Objective:** Describe what data to extract
2. **Instructions:** Step-by-step extraction guidance
3. **Output schema:** JSON schema defining structure
4. **Important notes:** Domain-specific rules
5. **Examples:** Show desired output format

Use imperative language in instructions. Be specific about data types, required vs optional fields, and edge cases.

### API Configuration

Configure external database validation in `my_api_config.json` by mapping extracted fields to validation APIs:

- `gbif_taxonomy` - Biological taxonomy
- `wfo_plants` - Plant names specifically
- `geonames` - Geographic locations
- `geocode` - Address to coordinates
- `pubchem` - Chemical compounds
- `ncbi_gene` - Gene identifiers

See `assets/example_api_config_ecology.json` for an ecology-specific example.

### Filtering Customization

Edit the filtering criteria in `scripts/02_filter_abstracts.py` (line 74), replacing the TODO section with domain-specific criteria:

- What constitutes primary data vs review?
- What data types are relevant?
- What scope (geographic, temporal, taxonomic) is needed?

Use conservative criteria (when in doubt, include the paper) to avoid false negatives.
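For example, the replaced TODO section could encode the criteria as a prompt fragment like the sketch below. The constant name `FILTER_CRITERIA` is hypothetical; adapt it to whatever the script actually expects at line 74.

```python
# Hedged sketch of domain-specific filtering criteria for
# scripts/02_filter_abstracts.py. The constant name is hypothetical.
FILTER_CRITERIA = """
Include the paper only if the abstract indicates ALL of the following:
1. Primary data: original field or experimental observations
   (exclude pure reviews, meta-analyses, and opinion pieces).
2. Relevant data type: quantitative plant-pollinator interaction records.
3. Scope: temperate-zone study sites, any publication year.
When in doubt, INCLUDE the paper; conservative filtering avoids false negatives.
"""
```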
## Cost Optimization

**Backend selection for filtering (Step 2):**

- Ollama (local): $0 - Best for privacy and high volume
- Haiku (API): ~$0.25/M tokens - Best balance of cost/quality
- Sonnet (API): ~$3/M tokens - Best for complex filtering

**Typical costs for 100 papers:**

- With filtering (Haiku + Sonnet): ~$4
- With local Ollama + Sonnet: ~$3.75
- Without filtering (Sonnet only): ~$7.50

**Optimization strategies:**

- Use abstract filtering to reduce PDF processing
- Use local Ollama for filtering (free)
- Enable prompt caching with `--use-caching`
- Process in batches with `--use-batches`

## Quality Assurance

**Validation workflow provides:**

- Precision: % of extracted items that are correct
- Recall: % of true items that were extracted
- F1 score: Harmonic mean of precision and recall
- Per-field metrics: Identify weak fields

**Use metrics to:**

- Establish baseline extraction quality
- Compare different approaches (models, prompts, schemas)
- Identify areas for improvement
- Report extraction quality in publications

**Recommended sample sizes:**

- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers

## Iterative Improvement

1. Run initial extraction with baseline schema
2. Validate on a sample using Steps 7-9
3. Analyze field-level metrics and error patterns
4. Revise schema, prompts, or model selection
5. Re-extract and re-validate
6. Compare metrics to verify improvement
7. Repeat until acceptable quality is achieved

See `references/validation_guide.md` for detailed guidance on interpreting metrics and improving extraction quality.

## Available Scripts

**Data organization:**

- `scripts/01_organize_metadata.py` - Standardize PDFs and metadata

**Filtering:**

- `scripts/02_filter_abstracts.py` - Filter by abstract (Haiku/Sonnet/Ollama)

**Extraction:**

- `scripts/03_extract_from_pdfs.py` - Extract from PDFs with Claude vision

**Processing:**

- `scripts/04_repair_json.py` - Repair and validate JSON
- `scripts/05_validate_with_apis.py` - Enrich with external databases
- `scripts/06_export_database.py` - Export to analysis formats

**Validation:**

- `scripts/07_prepare_validation_set.py` - Sample papers for annotation
- `scripts/08_calculate_validation_metrics.py` - Calculate P/R/F1 metrics

## Assets

**Templates:**

- `assets/schema_template.json` - Blank extraction schema template
- `assets/api_config_template.json` - API validation configuration template

**Examples:**

- `assets/example_flower_visitors_schema.json` - Ecology extraction example
- `assets/example_api_config_ecology.json` - Ecology API validation example
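To make the precision/recall/F1 definitions in the source above concrete, here is a minimal sketch of the per-field metrics, assuming extracted and ground-truth values compare as sets of items; the real `scripts/08_calculate_validation_metrics.py` may match items differently.

```python
# Minimal per-field precision/recall/F1 sketch, assuming set-based matching.
def field_metrics(extracted: set, ground_truth: set) -> dict:
    tp = len(extracted & ground_truth)  # items extracted AND in ground truth
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: 3 of 4 extracted species are correct, out of 5 true species.
extracted = {"Bombus terrestris", "Apis mellifera", "Syrphus ribesii", "Vespula vulgaris"}
truth = {"Bombus terrestris", "Apis mellifera", "Syrphus ribesii",
         "Bombus lapidarius", "Eristalis tenax"}
print(field_metrics(extracted, truth))
# {'precision': 0.75, 'recall': 0.6, 'f1': 0.666...}
```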
Related Skills
security-requirement-extraction
Derive security requirements from threat models and business context. Use when translating threats into actionable requirements, creating security user stories, or building security test cases.
extract
Extract and consolidate reusable components, design tokens, and patterns into your design system. Identifies opportunities for systematic reuse and enriches your component library.
screenshot-feature-extractor
Analyze product screenshots to extract feature lists and generate development task checklists. Use when: (1) Analyzing competitor product screenshots for feature extraction, (2) Generating PRD/task lists from UI designs, (3) Batch analyzing multiple app screens, (4) Conducting competitive analysis from visual references.
control-loop-extraction
Extract and analyze agent reasoning loops, step functions, and termination conditions. Use when needing to (1) understand how an agent framework implements reasoning (ReAct, Plan-and-Solve, Reflection, etc.), (2) locate the core decision-making logic, (3) analyze loop mechanics and termination conditions, (4) document the step-by-step execution flow of an agent, or (5) compare reasoning patterns across frameworks.
star-story-extraction
Auto-invoke after task completion to extract interview-ready STAR stories from completed work.
resume-bullet-extraction
Auto-invoke after task completion to generate powerful resume bullet points from completed work.
design-spec-extraction
Extract comprehensive JSON design specifications from visual sources including Figma exports, UI mockups, screenshots, or live website captures. Produces W3C DTCG-compliant output with component trees, suitable for code generation, design documentation, and developer handoff.
standards-extraction
Extract coding standards and conventions from CONTRIBUTING.md, .editorconfig, linter configs. Use for onboarding and ensuring consistent contributions.
competitive-ads-extractor
Extracts and analyzes competitors' ads from ad libraries (Facebook, LinkedIn, etc.) to understand what messaging, problems, and creative approaches are working. Helps inspire and improve your own ad campaigns.
extract-design-system
Extract design primitives from a public website and generate starter token files for your project.
epub-chapter-extractor
Extract all chapters from an EPUB file into separate markdown files. Use when the user wants to split an EPUB into individual chapter files, extract EPUB chapters, or convert an ebook to separate markdown documents.
pdf-page-extract
Extract rich data from PDF pages including text spans with metadata, rendered PNG images, and page mapping. Creates persistent artifacts for downstream processing.