biopython
A comprehensive toolbox for computational molecular biology; use it when you need programmatic sequence/structure parsing, batch bioinformatics pipelines, or automated NCBI/BLAST workflows.
Best use case
biopython is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using biopython can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/biopython/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How biopython Compares
| Feature / Agent | biopython | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)
## When to Use
Use this skill when you need to:
- Batch-process DNA/RNA/protein sequences (translation, reverse complement, statistics) as part of a custom pipeline.
- Parse, validate, convert, or stream large bioinformatics files (FASTA/FASTQ/GenBank/PDB/mmCIF) without loading everything into memory.
- Programmatically query and download records from NCBI (GenBank, PubMed, Gene, Protein) via `Bio.Entrez`, respecting rate limits.
- Automate BLAST searches (web or local) and parse results to extract top hits and metadata.
- Build or manipulate phylogenetic trees from alignments or distance matrices (e.g., NJ trees) for downstream analysis.
> Note: For quick one-off queries, tools like **gget** may be more convenient; for multi-service API aggregation, **bioservices** may be a better fit.
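As an illustration of the streaming use case above, the sketch below converts FASTQ to FASTA one record at a time while dropping short reads. The in-memory `StringIO` handles, the two sample reads, the `long_enough` helper, and the 5-base cutoff are assumptions made for the example; in a real pipeline the handles would be opened files.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical sample data: two reads with Sanger-encoded qualities.
fastq_text = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nGGCC\n+\nIIII\n"


def long_enough(records, min_len):
    """Yield only records whose sequence has at least min_len bases."""
    for rec in records:
        if len(rec.seq) >= min_len:
            yield rec


source = StringIO(fastq_text)
target = StringIO()

# SeqIO.parse is lazy and SeqIO.write accepts any iterable of records,
# so only one record needs to be held in memory at a time.
count = SeqIO.write(long_enough(SeqIO.parse(source, "fastq"), 5), target, "fasta")
fasta_out = target.getvalue()

print(f"Wrote {count} record(s)")
print(fasta_out)
```

Because the filter is a generator, the same pattern should scale to files far larger than available memory.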
## Key Features
- **Sequence objects and utilities**: `Bio.Seq`, `Bio.SeqRecord`, `Bio.SeqUtils` (GC fraction, molecular weight, translation, etc.).
- **File I/O and format conversion**: `Bio.SeqIO`, `Bio.AlignIO` for FASTA/FASTQ/GenBank and alignment formats.
- **NCBI access**: `Bio.Entrez` for `esearch`, `efetch`, `elink`, and structured parsing via `Entrez.read`.
- **BLAST**: `Bio.Blast.NCBIWWW` for remote BLAST and `Bio.Blast.NCBIXML` for XML parsing.
- **Structural bioinformatics**: `Bio.PDB` for PDB/mmCIF parsing, hierarchy traversal, and geometry calculations.
- **Phylogenetics**: `Bio.Phylo` and `Bio.Phylo.TreeConstruction` for tree I/O, distances, and construction.
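A minimal sketch of the sequence utilities listed above, assuming a short made-up coding sequence and NCBI translation table 1 (the standard code). The sequence and variable names are illustrative only and do not come from the skill itself.

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction

# Hypothetical 24-base coding sequence ending in a stop codon (TGA).
dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")

# Reverse complement of the DNA strand.
rev_comp = dna.reverse_complement()

# Translate with an explicit table; to_stop=True drops the stop codon
# and anything after it.
protein = dna.translate(table=1, to_stop=True)

# GC content as a fraction between 0 and 1.
gc = gc_fraction(dna)

print(f"Reverse complement: {rev_comp}")
print(f"Protein: {protein}")
print(f"GC fraction: {gc:.2%}")
```

Passing `table=` explicitly, rather than relying on the default, keeps the translation reproducible if the code is later reused for mitochondrial or bacterial sequences.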
Reference guides (if present in this repository) can be consulted for deeper module-specific patterns:
- `references/sequence_io.md`
- `references/alignment.md`
- `references/databases.md`
- `references/blast.md`
- `references/structure.md`
- `references/phylogenetics.md`
- `references/advanced.md`
## Dependencies
- Python **>= 3.9** (Biopython 1.85 supports Python 3.9 through 3.13)
- `biopython==1.85`
- `numpy>=1.20` (required by Biopython)
Install:
```bash
python -m pip install "biopython==1.85" "numpy>=1.20"
```
## Example Usage
A complete, runnable example that:
1) parses a FASTA file,
2) computes GC fraction,
3) runs a remote BLAST (optional),
4) fetches the top hit from NCBI,
5) prints basic results.
Create `example_biopython_pipeline.py`:
```python
from __future__ import annotations

import os
import time
from typing import Optional

from Bio import Entrez, SeqIO
from Bio.SeqUtils import gc_fraction

# Optional BLAST (remote). Comment out if you do not want network calls.
from Bio.Blast import NCBIWWW, NCBIXML


def configure_entrez() -> None:
    """
    NCBI requires an email. An API key increases rate limits.
    Set these via environment variables to avoid hardcoding secrets.
    """
    email = os.environ.get("NCBI_EMAIL")
    if not email:
        raise RuntimeError("Set NCBI_EMAIL env var (required by NCBI). Example: export NCBI_EMAIL='you@org.org'")
    Entrez.email = email

    api_key = os.environ.get("NCBI_API_KEY")
    if api_key:
        Entrez.api_key = api_key


def read_first_fasta_record(path: str):
    with open(path, "r", encoding="utf-8") as handle:
        return next(SeqIO.parse(handle, "fasta"))


def blast_top_accession(sequence: str, program: str = "blastn", database: str = "nt") -> Optional[str]:
    """
    Remote BLAST can be slow and rate-limited. For large-scale BLAST, prefer local BLAST+.
    """
    result_handle = NCBIWWW.qblast(program, database, sequence)
    blast_record = NCBIXML.read(result_handle)

    if not blast_record.alignments:
        return None

    # Many BLAST titles include multiple identifiers; accession is usually available directly.
    return blast_record.alignments[0].accession


def fetch_fasta_by_accession(accession: str) -> str:
    with Entrez.efetch(db="nucleotide", id=accession, rettype="fasta", retmode="text") as handle:
        return handle.read()


def main() -> None:
    configure_entrez()

    record = read_first_fasta_record("input.fasta")
    seq = record.seq

    print(f"ID: {record.id}")
    print(f"Length: {len(seq)}")
    print(f"GC fraction: {gc_fraction(seq):.2%}")

    # Be polite to NCBI services in batch workflows.
    time.sleep(0.34)

    top_acc = blast_top_accession(str(seq))
    if not top_acc:
        print("No BLAST hits found.")
        return

    print(f"Top BLAST accession: {top_acc}")

    time.sleep(0.34)
    fasta_text = fetch_fasta_by_accession(top_acc)
    print("Top hit FASTA:")
    print(fasta_text)


if __name__ == "__main__":
    main()
```
Run:
```bash
export NCBI_EMAIL="your.email@example.com"
# export NCBI_API_KEY="your_ncbi_api_key" # optional
python example_biopython_pipeline.py
```
Provide an `input.fasta` in the same directory, e.g.:
```text
>demo
ATCGATCGATCGATCGATCG
```
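The pipeline above issues each network call exactly once. In a batch job, such calls are usually wrapped in a retry with backoff so that a transient failure does not abort the run. The following is a standard-library sketch of that idea, not part of Biopython: `call_with_retries`, its parameters, and the `flaky` stand-in are all hypothetical names introduced for illustration. It retries on `OSError`, which is the base class of `urllib.error.URLError` and `HTTPError`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); on a listed exception, wait and retry with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


# Demonstration with a stand-in that fails twice before succeeding.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient failure")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky, attempts=4, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

In the pipeline, a call such as `fetch_fasta_by_accession(top_acc)` could be wrapped as `call_with_retries(lambda: fetch_fasta_by_accession(top_acc))`. Injecting `sleep` keeps the helper testable without real waiting.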
## Implementation Details
- **Streaming I/O for large datasets**: Prefer iterator-based parsing (`SeqIO.parse`) to avoid loading entire files into memory. Use `SeqIO.read` only when exactly one record is expected.
- **Entrez configuration and rate limits**:
  - Always set `Entrez.email` (NCBI requirement).
  - Optionally set `Entrez.api_key` to increase request limits.
  - In batch jobs, add delays (e.g., `time.sleep(0.34)` as a conservative baseline) and implement retries for transient HTTP failures.
- **BLAST considerations**:
  - `NCBIWWW.qblast(...)` is convenient but can be slow and is not ideal for high-throughput workloads.
  - Parse results with `NCBIXML.read(...)` (single record) or `NCBIXML.parse(...)` (multiple records).
  - Filter hits by HSP metrics (e-value, identity) by iterating `alignment.hsps`.
- **Sequence statistics and transformations**:
  - Use `Bio.SeqUtils.gc_fraction(seq)` for GC fraction (returns 0–1).
  - Use `seq.translate(table=...)` with the correct genetic code table for reproducibility.
- **Structure parsing (if used)**:
  - Use `Bio.PDB.PDBParser(QUIET=True)` to suppress warnings when appropriate.
  - Navigate the SMCRA hierarchy (Structure → Model → Chain → Residue → Atom) for robust traversal and geometry calculations.
- **Reproducibility**:
  - Record key parameters (file formats, translation table, BLAST program/database, e-value thresholds, NCBI query terms).
  - Cache downloaded records when iterating to avoid repeated network calls.

Related Skills
biopython-entrez
Use Bio.Entrez to access NCBI databases (e.g., PubMed/GenBank) for searching, fetching summaries, and downloading records when your workflow needs to call the NCBI E-utilities API over the network.
biopython-structure
Use Bio.PDB to parse and analyze protein structures (PDB/mmCIF) for structural bioinformatics tasks; use when you need structure parsing, geometry calculations, or structural comparison/superposition.
biopython-sequence-io
Use Biopython to read/write/convert biological sequence files (FASTA/GenBank/FASTQ, etc.) and perform basic sequence operations; use when you need reliable sequence I/O, lightweight sequence manipulation, or scalable processing of large sequence datasets.
biopython-phylo
Use Bio.Phylo to read/write phylogenetic trees and perform visualization and statistics; use when tree parsing/conversion, pruning/rerooting, distance calculation, or plotting is required.
biopython-alignment
Sequence alignment and alignment file processing with Biopython (Bio.Align/Bio.AlignIO), triggered when you need global/local pairwise alignment, MSA read/write/format conversion, or alignment statistics/filtering.
biopython-advanced
Advanced Biopython modules for motifs, population genetics, sequence utilities, restriction analysis, clustering, and GenomeDiagram visualization; use when you need extended bioinformatics analysis beyond basic sequence I/O and alignment.
skill-auditor
A comprehensive auditor for any agent skill — including Manus, OpenClaw/ClawHub, Claude, LobeHub, or custom SKILL.md-based skills. Use this skill whenever a user wants to evaluate, audit, review, score, or quality-check an agent skill before publishing, updating, or deploying. Covers two hard veto gates (structural redlines + research integrity redlines), static quality scoring across 25 criteria (ISO 25010 + OpenSSF + Agent), dynamic test input generation, multi-mode execution testing, multi-layer output evaluation with five specialized category rubrics (Evidence Insight / Protocol Design / Data Analysis / Academic Writing / Other), a Research Veto that applies to all four research categories, human eval viewer generation, actionable P0/P1/P2 optimization recommendations, and automatic skill improvement that outputs a polished, production-ready SKILL.md. Also use whenever a user says "audit my skill", "evaluate my skill", "improve my skill", or wants a corrected version after evaluation.
two-sample-mr-research-planner
Generates complete two-sample Mendelian randomization (MR) research designs from a user-provided research direction. Use when users want to design, plan, or build a study using two-sample MR to test causal relationships. Triggers: "design a two-sample MR study", "build a publishable MR paper", "test whether this biomarker causally affects this disease", "generate Lite/Standard/Advanced MR plans", "screen multiple exposures with MR", "bidirectional MR design", "causal inference using GWAS summary statistics", or "I want to study X and Y using MR". Always outputs four workload configurations (Lite / Standard / Advanced / Publication+) with a recommended primary plan, step-by-step workflow, figure plan, validation strategy, minimal executable version, and publication upgrade path.
research-proposal-generator
Generates a comprehensive research proposal design based on input literature, including hypothesis, mechanism verification, and budget. Use when the user wants to design a research project from a paper.
research-grants
Write competitive research proposals for NSF, NIH, DOE, DARPA, and Taiwan's NSTC when you need agency-compliant narratives, budgets, and review-criteria alignment for a specific solicitation/FOA/BAA.
protocol-standardization
Standardize fragmented experimental steps into reproducible protocol documents when you need method organization, lab SOP drafting, or cross-operator reproducibility; missing parameters must be explicitly marked as "To be supplemented/Not provided".
prospero-registration-helper
Assists researchers in generating PROSPERO registration content for meta-analyses from a title and optional protocol. Use when the user wants to draft a PROSPERO registration form.