pdb-database

Access the RCSB Protein Data Bank (PDB) to search, download, and programmatically retrieve 3D macromolecular structures and metadata; use when you need structure discovery (text/sequence/3D similarity) or automated structural data ingestion for structural biology and drug discovery workflows.

by aipoch

Best use case

pdb-database is best used when you need a repeatable AI agent workflow for PDB search and retrieval rather than a one-off prompt.

Teams using pdb-database should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/pdb-database/SKILL.md --create-dirs "https://raw.githubusercontent.com/aipoch/medical-research-skills/main/scientific-skills/Evidence%20Insight/pdb-database/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/pdb-database/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How pdb-database Compares

Feature / Agent	pdb-database	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

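What does this skill do?

It gives an AI agent a reusable workflow for searching the RCSB Protein Data Bank and for downloading or programmatically retrieving 3D macromolecular structures and their metadata.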

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)

## When to Use

Use this skill when you need to:

- Find protein/nucleic acid 3D structures by **keywords**, **organism**, **experimental method**, or **resolution**.
- Identify related structures via **sequence similarity** (e.g., homolog search for modeling).
- Identify related structures via **3D structure similarity** (e.g., fold-level comparisons).
- **Download coordinates** (PDB/mmCIF) for downstream analysis, visualization, docking, or modeling.
- Run **batch retrieval** of metadata/coordinates to feed pipelines in drug discovery, protein engineering, or structural bioinformatics.

## Key Features

- Text and attribute-based search over RCSB PDB entries.
- Sequence similarity search with configurable thresholds (e-value, identity).
- Structure similarity search using an existing entry as a query.
- Programmatic metadata retrieval via the RCSB Data API (schema-based or GraphQL).
- Direct coordinate downloads in **PDB** and **mmCIF** formats.
- Batch processing patterns for multiple PDB IDs.

## Dependencies

- `rcsb-api` (latest recommended; provides `rcsbapi.search` and `rcsbapi.data`)
- `requests>=2.0` (HTTP downloads)
- `biopython>=1.80` (optional; parsing/analyzing PDB coordinates)

Install (example):

```bash
uv pip install rcsb-api requests biopython
```

## Example Usage

The following script is end-to-end runnable: it searches for a target, fetches metadata, downloads coordinates, and parses the structure.

```python
#!/usr/bin/env python3
import pathlib
from itertools import islice

import requests
from Bio.PDB import PDBParser
from rcsbapi.data import DataQuery
from rcsbapi.search import AttributeQuery, TextQuery


def download_text(url: str, out_path: pathlib.Path) -> None:
    """Download a text file and write it to out_path."""
    r = requests.get(url, timeout=60)
    r.raise_for_status()
    out_path.write_text(r.text, encoding="utf-8")


def main():
    out_dir = pathlib.Path("pdb_out")
    out_dir.mkdir(exist_ok=True)

    # 1) Search: hemoglobin entries with resolution < 2.0 Å
    q_text = TextQuery(value="hemoglobin")
    q_res = AttributeQuery(
        attribute="rcsb_entry_info.resolution_combined",  # attribute path as a string
        operator="less",
        value=2.0,
    )
    query = q_text & q_res

    # Take only the first five hits instead of materializing every result
    pdb_ids = list(islice(query(), 5))
    if not pdb_ids:
        raise SystemExit("No results found.")
    pdb_id = pdb_ids[0]
    print(f"Selected PDB ID: {pdb_id}")

    # 2) Fetch entry metadata (only the fields printed below)
    data_query = DataQuery(
        input_type="entries",
        input_ids=[pdb_id],
        return_data_list=[
            "struct.title",
            "exptl.method",
            "rcsb_entry_info.resolution_combined",
            "rcsb_accession_info.deposit_date",
        ],
    )
    entries = (data_query.exec().get("data") or {}).get("entries") or []
    entry = entries[0] if entries else {}
    title = (entry.get("struct") or {}).get("title")
    method = (entry.get("exptl") or [{}])[0].get("method")
    resolution = (entry.get("rcsb_entry_info") or {}).get("resolution_combined")
    deposit_date = (entry.get("rcsb_accession_info") or {}).get("deposit_date")

    print("Metadata:")
    print(f"  Title: {title}")
    print(f"  Method: {method}")
    print(f"  Resolution: {resolution}")
    print(f"  Deposit date: {deposit_date}")

    # 3) Download coordinates (PDB and mmCIF)
    pdb_path = out_dir / f"{pdb_id}.pdb"
    cif_path = out_dir / f"{pdb_id}.cif"

    download_text(f"https://files.rcsb.org/download/{pdb_id}.pdb", pdb_path)
    download_text(f"https://files.rcsb.org/download/{pdb_id}.cif", cif_path)
    print(f"Downloaded: {pdb_path} and {cif_path}")

    # 4) Parse PDB coordinates (example: count atoms)
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure(pdb_id, str(pdb_path))

    atom_count = sum(1 for _ in structure.get_atoms())
    chain_ids = sorted({chain.id for chain in structure.get_chains()})
    print("Parsed structure:")
    print(f"  Chains: {chain_ids}")
    print(f"  Atom count: {atom_count}")


if __name__ == "__main__":
    main()
```

## Implementation Details

### Search Modes and Query Composition

- **Text search** uses free-text matching over entry annotations (titles, keywords, descriptions).
- **Attribute search** filters by structured fields (e.g., organism, method, resolution).
- **Sequence similarity search** typically supports:
  - `evalue_cutoff`: lower is more stringent (fewer, more confident hits).
  - `identity_cutoff`: fraction identity threshold (e.g., `0.9` for near-identical).
- **Structure similarity search** uses an existing structure (e.g., an `entry_id`) as the geometric reference.
- Queries can be combined with boolean logic:
  - `query1 & query2` (AND)
  - `query1 | query2` (OR)
  - `~query` (NOT), where supported by the client
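
The composition rules above can also be pictured as the JSON query tree that a search request carries. The sketch below builds such a tree by hand rather than through `rcsb-api`. The helper names (`terminal`, `group`, `build_request`) are hypothetical, the service names and parameter keys are assumed to follow the RCSB Search API v2 request format, and the sequence is an illustrative fragment only, so check all of these against the current API schema before relying on them.

```python
import json


def terminal(service: str, **parameters) -> dict:
    """Build a leaf node: one search service plus its parameters."""
    return {"type": "terminal", "service": service, "parameters": parameters}


def group(operator: str, *nodes: dict) -> dict:
    """Combine nodes with boolean logic ("and" or "or")."""
    if operator not in ("and", "or"):
        raise ValueError(f"unsupported operator: {operator}")
    return {"type": "group", "logical_operator": operator, "nodes": list(nodes)}


def build_request(query: dict, return_type: str = "entry") -> dict:
    """Wrap a query tree in the top-level request object."""
    return {"query": query, "return_type": return_type}


# Free-text match over entry annotations
text_node = terminal("full_text", value="hemoglobin")

# Structured attribute filter: resolution better than 2.0 Å
resolution_node = terminal(
    "text",
    attribute="rcsb_entry_info.resolution_combined",
    operator="less",
    value=2.0,
)

# Sequence similarity: a lower e-value cutoff and a higher identity cutoff are both stricter
sequence_node = terminal(
    "sequence",
    value="VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF",  # illustrative fragment
    sequence_type="protein",
    evalue_cutoff=0.001,
    identity_cutoff=0.9,
)

# (text OR sequence) AND resolution
request = build_request(
    group("and", group("or", text_node, sequence_node), resolution_node)
)
print(json.dumps(request, indent=2))
```

Nesting `group` calls mirrors how `&` and `|` nest in the client: each operator produces one group node whose children are the operands.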

### Data Retrieval (Schema vs GraphQL)

- **Schema-based retrieval** requests a predefined object type (e.g., entry or polymer entity) and is convenient for common objects and stable access patterns.
- **GraphQL retrieval** is best when you need a custom selection of fields in one request, which reduces round-trips and payload size.

Example GraphQL pattern:

```python
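import requests

# Request only the fields that are needed for one entry
query = """
{
  entry(entry_id: "4HHB") {
    struct { title }
    exptl { method }
    rcsb_entry_info { resolution_combined deposited_atom_count }
  }
}
"""
# POST the query to the RCSB Data API GraphQL endpoint
response = requests.post(
    "https://data.rcsb.org/graphql", json={"query": query}, timeout=60
)
response.raise_for_status()
data = response.json()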
```
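
A single GraphQL request can also cover several entries, which is where the saving in round-trips shows up. The following is a minimal sketch under stated assumptions: `build_entries_query` and `index_by_id` are hypothetical helpers, and the `entries(entry_ids: [...])` root field and the `rcsb_id` field are assumed from the RCSB Data API GraphQL schema.

```python
import json


def build_entries_query(entry_ids: list[str]) -> str:
    """Build one GraphQL query that selects the same fields for many entries."""
    # json.dumps produces a double-quoted list literal, which GraphQL accepts for string IDs
    ids = json.dumps([entry_id.upper() for entry_id in entry_ids])
    return (
        "{\n"
        f"  entries(entry_ids: {ids}) {{\n"
        "    rcsb_id\n"
        "    struct { title }\n"
        "    exptl { method }\n"
        "    rcsb_entry_info { resolution_combined }\n"
        "  }\n"
        "}"
    )


def index_by_id(response: dict) -> dict:
    """Turn a decoded GraphQL response into a {PDB ID: entry} mapping."""
    entries = (response.get("data") or {}).get("entries") or []
    return {entry["rcsb_id"]: entry for entry in entries}


print(build_entries_query(["4hhb", "1crn"]))
```

The query string would typically be sent as `{"query": ...}` in a JSON POST body to `https://data.rcsb.org/graphql`, and the decoded JSON response passed to `index_by_id`.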

### Coordinate Downloads and Formats

- **PDB**: legacy text format; widely supported but less expressive for large/complex structures.
- **mmCIF (PDBx)**: modern standard; preferred for completeness and large structures.

Direct download endpoints:

- `https://files.rcsb.org/download/{PDB_ID}.pdb`
- `https://files.rcsb.org/download/{PDB_ID}.cif`
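
The endpoints above can be wrapped in a small helper that builds the URL and skips files already on disk. This is a sketch, not the skill's own implementation: `coordinate_url` and `download_coordinates` are hypothetical names, and the standard-library `urllib` is used here only to keep the example dependency-free.

```python
import pathlib
import urllib.request

DOWNLOAD_BASE = "https://files.rcsb.org/download"
SUPPORTED_FORMATS = {"pdb", "cif"}


def coordinate_url(pdb_id: str, fmt: str = "cif") -> str:
    """Return the direct download URL for one entry in the given format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{DOWNLOAD_BASE}/{pdb_id.upper()}.{fmt}"


def download_coordinates(
    pdb_id: str, out_dir: pathlib.Path, fmt: str = "cif"
) -> pathlib.Path:
    """Download coordinates unless the file already exists locally."""
    out_path = out_dir / f"{pdb_id.upper()}.{fmt}"
    if out_path.exists():
        return out_path  # reuse the local copy instead of downloading again
    out_dir.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(coordinate_url(pdb_id, fmt), timeout=60) as resp:
        out_path.write_bytes(resp.read())
    return out_path


print(coordinate_url("4hhb", "cif"))
```

Defaulting to mmCIF follows the format guidance above; pass `fmt="pdb"` when a downstream tool only reads the legacy format.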

### Batch Processing Pattern

For batch metadata retrieval, iterate over the IDs and issue one entry-level metadata request per ID; handle exceptions per ID so that a single failure does not abort the pipeline. For large batches, add rate limiting and cache results to avoid repeated downloads.
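
One way to sketch this pattern is a loop that takes the per-ID retrieval function as a parameter, so the same code works with any client. Everything here is illustrative: `fetch_many`, `fetch_one`, and the default delay are hypothetical choices, not values taken from RCSB documentation.

```python
import time
from typing import Callable, Iterable, Optional


def fetch_many(
    pdb_ids: Iterable[str],
    fetch_one: Callable[[str], dict],
    cache: Optional[dict] = None,
    delay_s: float = 0.2,
) -> tuple[dict, dict]:
    """Fetch metadata for many IDs, isolating failures and reusing cached results."""
    cache = {} if cache is None else cache
    results: dict = {}
    errors: dict = {}
    for pdb_id in pdb_ids:
        key = pdb_id.upper()
        if key in cache:
            results[key] = cache[key]  # cache hit: no request, no delay
            continue
        try:
            results[key] = cache[key] = fetch_one(key)
        except Exception as exc:  # record the failure and continue with the next ID
            errors[key] = str(exc)
        time.sleep(delay_s)  # simple fixed-delay rate limiting between requests
    return results, errors


def demo_fetch(pdb_id: str) -> dict:
    """Stand-in for a real metadata request; fails for one ID on purpose."""
    if pdb_id == "BAD1":
        raise RuntimeError("not found")
    return {"rcsb_id": pdb_id}


results, errors = fetch_many(["4hhb", "bad1", "4HHB"], demo_fetch, delay_s=0)
print(results)
print(errors)
```

Passing a persistent mapping as `cache` (for example, one loaded from disk) lets repeated pipeline runs skip IDs that were already retrieved.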

### Reference Documentation

If present in this repository, consult:

- `references/api_reference.md` for advanced endpoint usage, query patterns, schema notes, rate limits, and troubleshooting.

Related Skills

uspto-database

from aipoch/medical-research-skills

Access USPTO data (Patent Search, PEDS, TSDR, assignments) when you need to query patents/trademarks and retrieve prosecution or status information programmatically.

zinc-database

from aipoch/medical-research-skills

Access the ZINC (230M+ purchasable compounds) database when you need to look up compounds by ZINC ID/SMILES, run similarity/analog searches, or download 3D ready-to-dock structures for virtual screening and drug discovery.

uniprot-database

from aipoch/medical-research-skills

Direct REST API access to UniProt for protein search, entry retrieval, and identifier mapping; use when you need programmatic UniProtKB queries or cross-database ID conversion.

string-database

from aipoch/medical-research-skills

Access the STRING database to map identifiers, retrieve protein–protein interaction networks, and run functional/PPI enrichment when you need interaction context for a gene/protein set.

semantic-scholar-database

from aipoch/medical-research-skills

Access the Semantic Scholar Graph API to search papers and retrieve paper/author/citation data when you need literature discovery or citation graph exploration.

scite-database

from aipoch/medical-research-skills

Access Scite.ai Smart Citations to classify how a paper is cited (supporting, contrasting, mentioning) and assess scientific claims; use it when you need to evaluate a paper’s reliability or its acceptance in the literature.

pubchem-database-skill

from aipoch/medical-research-skills

Programmatic access to the PubChem database (via PUG-REST API and PubChemPy) for searching chemical compounds, retrieving physicochemical properties, performing structure similarity/substructure searches, and obtaining bioactivity data.

kegg-database

from aipoch/medical-research-skills

Direct access to KEGG via the REST API for academic-only pathway/gene/compound/drug queries; use when you need precise HTTP-level control or targeted KEGG ID mapping.

hmdb-database

from aipoch/medical-research-skills

Access the Human Metabolome Database (HMDB) to search metabolites by name/structure/ID and extract chemical/biological/clinical fields when you need metabolomics research data or automated HMDB XML mining.

gwas-database

from aipoch/medical-research-skills

Query the NHGRI-EBI GWAS Catalog to retrieve SNP–trait associations, study metadata, and (when available) summary statistics when you need evidence for a variant, trait/disease, gene, or genomic region.

gene-database

from aipoch/medical-research-skills

Query the NCBI Gene database via E-utilities and the NCBI Datasets API; use it when you need to search genes by symbol/ID and retrieve annotations (RefSeq, GO, location, phenotype) for single or batch gene lists.

fda-database

from aipoch/medical-research-skills

Query the openFDA API to retrieve FDA regulatory datasets (drugs, devices, adverse events, recalls, submissions, UNII) when you need programmatic safety/regulatory evidence for analysis or research.