biopython-entrez

Use Bio.Entrez to access NCBI databases (e.g., PubMed/GenBank) for searching, fetching summaries, and downloading records when your workflow needs to call the NCBI E-utilities API over the network.

by aipoch

Best use case

biopython-entrez is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using biopython-entrez should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/biopython-entrez/SKILL.md --create-dirs "https://raw.githubusercontent.com/aipoch/medical-research-skills/main/scientific-skills/Evidence%20Insight/biopython-entrez/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/biopython-entrez/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How biopython-entrez Compares

Feature / Agent	biopython-entrez	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Use Bio.Entrez to access NCBI databases (e.g., PubMed/GenBank) for searching, fetching summaries, and downloading records when your workflow needs to call the NCBI E-utilities API over the network.

Where can I find the source code?

The source code is on GitHub in the aipoch/medical-research-skills repository: https://github.com/aipoch/medical-research-skills

SKILL.md Source

> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)

## When to Use

- You need to search PubMed for articles by keyword, author, journal, or date range and then retrieve metadata or abstracts.
- You want to download GenBank records (e.g., nucleotide/protein sequences) in batch given accession IDs or search queries.
- You need to convert identifiers or discover related records across NCBI databases (e.g., PubMed ↔ PMC, Gene ↔ Protein) via cross-links.
- You must retrieve lightweight summaries (titles, IDs, basic metadata) before deciding which full records to fetch.
- You are integrating NCBI E-utilities into an automated pipeline and need API key usage and rate-limit-aware requests.

## Key Features

- Supports core NCBI E-utilities via `Bio.Entrez`: `esearch`, `efetch`, `esummary`, `elink`.
- Query-based searching and ID list retrieval for downstream batch operations.
- Batch downloading of records in common formats (e.g., GenBank, FASTA, XML).
- API key configuration and rate-limit-friendly request patterns.
- XML response parsing using Biopython’s Entrez parsers for structured results.
- Standardized configuration and invocation conventions:
  - Write runtime configuration to `config/task_config.json`.
  - Invoke tasks via `python scripts/<task_name>.py`.
  - Avoid stacking many CLI `--` parameters; prefer config files.
  - Use explicit UTF-8 encoding for file I/O and `ensure_ascii=False` for JSON output.

## Dependencies

- `biopython>=1.80`

## Example Usage

The following example is a complete, runnable script that:
1) searches PubMed, 2) retrieves summaries for the top results, and 3) writes output to JSON.

**1) Create `config/task_config.json`:**
```json
{
  "email": "your-email@example.com",
  "api_key": "",
  "db": "pubmed",
  "term": "CRISPR Cas9 2020[PDAT]",
  "retmax": 5,
  "out_json": "outputs/pubmed_summaries.json"
}
```

**2) Create `scripts/pubmed_summaries.py`:**
```python
import json
import os
import time
from typing import Any, Dict, List

from Bio import Entrez


def load_config(path: str) -> Dict[str, Any]:
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def ensure_parent_dir(path: str) -> None:
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)


def main() -> None:
    cfg = load_config("config/task_config.json")

    Entrez.email = cfg["email"]
    api_key = cfg.get("api_key") or ""
    if api_key:
        Entrez.api_key = api_key

    db = cfg.get("db", "pubmed")
    term = cfg["term"]
    retmax = int(cfg.get("retmax", 20))
    out_json = cfg.get("out_json", "outputs/pubmed_summaries.json")

    # 1) ESearch: get IDs
    with Entrez.esearch(db=db, term=term, retmax=retmax, usehistory="n") as handle:
        search_result = Entrez.read(handle)

    id_list: List[str] = search_result.get("IdList", [])
    if not id_list:
        ensure_parent_dir(out_json)
        with open(out_json, "w", encoding="utf-8") as f:
            json.dump({"query": term, "count": 0, "items": []}, f, ensure_ascii=False, indent=2)
        return

    # Be polite with NCBI: small delay (especially without API key)
    time.sleep(0.34 if api_key else 0.5)

    # 2) ESummary: get summaries for IDs
    with Entrez.esummary(db=db, id=",".join(id_list), retmode="xml") as handle:
        summary_result = Entrez.read(handle)

    items = []
    for docsum in summary_result:
        items.append({
            "id": str(docsum.get("Id", "")),
            "title": str(docsum.get("Title", "")),
            "pubdate": str(docsum.get("PubDate", "")),
            "source": str(docsum.get("Source", "")),
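            # PubMed ESummary parses AuthorList into a list of plain strings, not dicts,
            # so calling .get() on each entry would raise AttributeError.
            "authors": [str(a) for a in docsum.get("AuthorList", [])],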
        })

    payload = {
        "query": term,
        "count": len(items),
        "items": items,
    }

    ensure_parent_dir(out_json)
    with open(out_json, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    main()
```

**3) Run:**
```bash
python scripts/pubmed_summaries.py
```

## Implementation Details

- **Core E-utilities mapping**
  - `ESearch`: builds a query against an NCBI database and returns matching IDs (and optionally WebEnv/QueryKey for history-based batching).
  - `ESummary`: returns lightweight document summaries for a list of IDs.
  - `EFetch`: downloads full records (e.g., GenBank/FASTA/XML) for IDs; choose `rettype`/`retmode` based on the target database.
  - `ELink`: discovers cross-database relationships (e.g., PubMed → PMC, Gene → Protein).
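
As a rough illustration of the `ELink` entry above, the sketch below collects linked IDs from a parsed `ELink` result. The helper names `linked_ids` and `pubmed_to_pmc` are hypothetical, the `pubmed_pmc` link name is an assumption to confirm against the response you actually receive, and the sample IDs are made up. `Bio.Entrez` is imported inside the network function so the parsing helper can be tried offline.

```python
from typing import Any, Iterable, List, Mapping, Optional


def linked_ids(link_sets: Iterable[Mapping[str, Any]],
               link_name: Optional[str] = None) -> List[str]:
    """Collect target IDs from a parsed ELink result (a list of LinkSet mappings)."""
    ids: List[str] = []
    for link_set in link_sets:
        for link_db in link_set.get("LinkSetDb", []):
            # Optionally keep only one relationship type.
            if link_name is not None and str(link_db.get("LinkName", "")) != link_name:
                continue
            ids.extend(str(link["Id"]) for link in link_db.get("Link", []))
    return ids


def pubmed_to_pmc(pmids: List[str]) -> List[str]:
    """Hypothetical wrapper: map PubMed IDs to PMC IDs (needs network access)."""
    from Bio import Entrez  # lazy import; assumes Entrez.email is already set

    with Entrez.elink(dbfrom="pubmed", db="pmc", id=",".join(pmids)) as handle:
        return linked_ids(Entrez.read(handle), link_name="pubmed_pmc")


# Offline check with a hand-written structure shaped like a parsed ELink result.
# The IDs below are placeholders, not real records.
sample = [{
    "LinkSetDb": [
        {"DbTo": "pmc", "LinkName": "pubmed_pmc",
         "Link": [{"Id": "111"}, {"Id": "222"}]},
        {"DbTo": "pmc", "LinkName": "pubmed_pmc_refs",
         "Link": [{"Id": "333"}]},
    ]
}]
print(linked_ids(sample, link_name="pubmed_pmc"))
```

Filtering by link name matters because a single `ELink` call can return several relationship types for the same database pair.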

- **Batching strategy**
  - Prefer `ESearch` to obtain IDs, then call `ESummary`/`EFetch` in chunks (e.g., 100–500 IDs per request depending on payload size).
  - For large jobs, consider `usehistory="y"` in `ESearch` and then fetch via `WebEnv`/`QueryKey` to avoid very long ID lists.
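
The two batching approaches above could be sketched as follows. The function names (`chunked`, `history_windows`, `fetch_fasta_via_history`) and the default batch size of 200 are illustrative assumptions, not part of the skill. The network function is defined but not called here, and it assumes `Entrez.email` has already been set.

```python
from typing import Iterator, List, Sequence, Tuple


def chunked(ids: Sequence[str], size: int = 200) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs for ESummary/EFetch calls."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def history_windows(total: int, batch_size: int = 200) -> Iterator[Tuple[int, int]]:
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_fasta_via_history(db: str, term: str, batch_size: int = 200) -> str:
    """Hypothetical helper: search once with history, then page through EFetch."""
    from Bio import Entrez  # lazy import; assumes Entrez.email is already set

    # Store the result set on the NCBI history server instead of listing IDs.
    with Entrez.esearch(db=db, term=term, usehistory="y") as handle:
        result = Entrez.read(handle)

    parts: List[str] = []
    for retstart, retmax in history_windows(int(result["Count"]), batch_size):
        with Entrez.efetch(db=db, rettype="fasta", retmode="text",
                           retstart=retstart, retmax=retmax,
                           webenv=result["WebEnv"],
                           query_key=result["QueryKey"]) as handle:
            parts.append(handle.read())
    return "".join(parts)


print(list(history_windows(450, 200)))
```

Explicit ID chunks suit cases where you already hold the IDs, while the history route avoids sending long ID lists at all.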

- **Rate limiting and API key**
  - NCBI enforces request limits (roughly 3 requests per second without an API key and 10 per second with one); using an API key increases allowed throughput.
  - Implement a small delay between requests and retry on transient network errors (HTTP 429/5xx) with backoff.
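
One way to sketch the retry-with-backoff advice above is a small wrapper around any request callable. The name `with_retries`, the set of retryable status codes, and the delay values are assumptions chosen for illustration. Recent Biopython releases also appear to expose their own retry settings (`Entrez.max_tries`, `Entrez.sleep_between_tries`), so check what your installed version already does before layering this on top.

```python
import time
from typing import Callable, TypeVar
from urllib.error import HTTPError, URLError

T = TypeVar("T")

# Assumed set of status codes worth retrying (rate limit and server errors).
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def with_retries(call: Callable[[], T],
                 max_attempts: int = 4,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    attempt = 1
    while True:
        try:
            return call()
        except HTTPError as err:
            # Non-transient HTTP errors (e.g., 400, 404) are re-raised at once.
            if err.code not in RETRYABLE_STATUS or attempt >= max_attempts:
                raise
        except URLError:
            # Connection-level failures are retried until attempts run out.
            if attempt >= max_attempts:
                raise
        # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts.
        sleep(base_delay * 2 ** (attempt - 1))
        attempt += 1
```

A request could then be wrapped as `with_retries(lambda: Entrez.esearch(db=db, term=term))`. The `sleep` parameter exists only so the delay can be replaced in tests.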

- **Parsing**
  - Use `Entrez.read(handle)` for structured parsing of XML responses into Python objects.
  - For raw text formats (e.g., FASTA), use `handle.read()` and write to disk with `encoding="utf-8"` where applicable.
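
For the raw-text case above, a minimal sketch might look like this. `save_text_handle` and `download_fasta` are hypothetical names. The demonstration at the bottom uses an in-memory handle with a made-up sequence in place of a real `EFetch` response, so it runs without network access.

```python
import io
import os
import tempfile
from typing import List, TextIO


def save_text_handle(handle: TextIO, out_path: str) -> int:
    """Copy a text handle to `out_path` as UTF-8 and return the character count."""
    parent = os.path.dirname(out_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    text = handle.read()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    return len(text)


def download_fasta(db: str, ids: List[str], out_path: str) -> int:
    """Hypothetical helper: fetch FASTA records and save them (needs network)."""
    from Bio import Entrez  # lazy import; assumes Entrez.email is already set

    with Entrez.efetch(db=db, id=",".join(ids),
                       rettype="fasta", retmode="text") as handle:
        return save_text_handle(handle, out_path)


# Offline demonstration: an in-memory handle stands in for an EFetch response.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "outputs", "records.fasta")
    written = save_text_handle(io.StringIO(">seq1\nACGT\n"), demo_path)
    with open(demo_path, "r", encoding="utf-8") as f:
        saved = f.read()
print(written, repr(saved))
```

For XML responses, `Entrez.read(handle)` remains the better fit, since it returns structured Python objects rather than text to be written out.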

- **Configuration and I/O conventions**
  - Store runtime parameters in `config/task_config.json` as an intermediate artifact.
  - Avoid complex CLI flags; keep scripts callable as `python scripts/<task_name>.py`.
  - Always specify `encoding="utf-8"` for file I/O and use `ensure_ascii=False` for JSON outputs.

- **Reference**
  - See `references/databases.md` for database notes and selection guidance.

Related Skills

biopython

from aipoch/medical-research-skills

A comprehensive toolbox for computational molecular biology; use it when you need programmatic sequence/structure parsing, batch bioinformatics pipelines, or automated NCBI/BLAST workflows.

biopython-structure

from aipoch/medical-research-skills

Use Bio.PDB to parse and analyze protein structures (PDB/mmCIF) for structural bioinformatics tasks; use when you need structure parsing, geometry calculations, or structural comparison/superposition.

biopython-sequence-io

from aipoch/medical-research-skills

Use Biopython to read/write/convert biological sequence files (FASTA/GenBank/FASTQ, etc.) and perform basic sequence operations; use when you need reliable sequence I/O, lightweight sequence manipulation, or scalable processing of large sequence datasets.

biopython-phylo

from aipoch/medical-research-skills

Use Bio.Phylo to read/write phylogenetic trees and perform visualization and statistics; use when tree parsing/conversion, pruning/rerooting, distance calculation, or plotting is required.

biopython-alignment

from aipoch/medical-research-skills

Sequence alignment and alignment file processing with Biopython (Bio.Align/Bio.AlignIO), triggered when you need global/local pairwise alignment, MSA read/write/format conversion, or alignment statistics/filtering.

biopython-advanced

from aipoch/medical-research-skills

Advanced Biopython modules for motifs, population genetics, sequence utilities, restriction analysis, clustering, and GenomeDiagram visualization; use when you need extended bioinformatics analysis beyond basic sequence I/O and alignment.

skill-auditor

from aipoch/medical-research-skills

A comprehensive auditor for any agent skill — including Manus, OpenClaw/ClawHub, Claude, LobeHub, or custom SKILL.md-based skills. Use this skill whenever a user wants to evaluate, audit, review, score, or quality-check an agent skill before publishing, updating, or deploying. Covers two hard veto gates (structural redlines + research integrity redlines), static quality scoring across 25 criteria (ISO 25010 + OpenSSF + Agent), dynamic test input generation, multi-mode execution testing, multi-layer output evaluation with five specialized category rubrics (Evidence Insight / Protocol Design / Data Analysis / Academic Writing / Other), a Research Veto that applies to all four research categories, human eval viewer generation, actionable P0/P1/P2 optimization recommendations, and automatic skill improvement that outputs a polished, production-ready SKILL.md. Also use whenever a user says "audit my skill", "evaluate my skill", "improve my skill", or wants a corrected version after evaluation.

two-sample-mr-research-planner

from aipoch/medical-research-skills

Generates complete two-sample Mendelian randomization (MR) research designs from a user-provided research direction. Use when users want to design, plan, or build a study using two-sample MR to test causal relationships. Triggers: "design a two-sample MR study", "build a publishable MR paper", "test whether this biomarker causally affects this disease", "generate Lite/Standard/Advanced MR plans", "screen multiple exposures with MR", "bidirectional MR design", "causal inference using GWAS summary statistics", or "I want to study X and Y using MR". Always outputs four workload configurations (Lite / Standard / Advanced / Publication+) with a recommended primary plan, step-by-step workflow, figure plan, validation strategy, minimal executable version, and publication upgrade path.

research-proposal-generator

from aipoch/medical-research-skills

Generates a comprehensive research proposal design based on input literature, including hypothesis, mechanism verification, and budget. Use when the user wants to design a research project from a paper.

research-grants

from aipoch/medical-research-skills

Write competitive research proposals for NSF, NIH, DOE, DARPA, and Taiwan's NSTC when you need agency-compliant narratives, budgets, and review-criteria alignment for a specific solicitation/FOA/BAA.

protocol-standardization

from aipoch/medical-research-skills

Standardize fragmented experimental steps into reproducible protocol documents when you need method organization, lab SOP drafting, or cross-operator reproducibility; missing parameters must be explicitly marked as "To be supplemented/Not provided".

prospero-registration-helper

from aipoch/medical-research-skills

Assists researchers in generating PROSPERO registration content for meta-analyses from a title and optional protocol. Use when the user wants to draft a PROSPERO registration form.