gwas-database
Query the NHGRI-EBI GWAS Catalog to retrieve SNP–trait associations, study metadata, and (when available) summary statistics when you need evidence for a variant, trait/disease, gene, or genomic region.
Best use case
gwas-database is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using gwas-database should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/gwas-database/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How gwas-database Compares
| Feature / Agent | gwas-database | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
Where can I find the source code?
The source code is hosted on GitHub: https://github.com/aipoch/medical-research-skills
SKILL.md Source
> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)
## When to Use
Use this skill when you need to:
1. **Look up a specific variant (rsID)** to see all reported trait/disease associations and their p-values/effect sizes.
2. **Find variants associated with a trait/disease** (via free text or an EFO trait ID) for downstream interpretation or reporting.
3. **Perform gene-centric exploration** to identify GWAS hits within/near a gene of interest.
4. **Retrieve study-level metadata** (GCST accession, PMID, cohorts, ancestry, sample size) to assess evidence quality and applicability.
5. **Access or filter summary statistics** (when available) for genome-wide analyses (e.g., fine-mapping, colocalization, PRS development).
## Key Features
- **Multiple query entry points**: rsID, EFO trait ID, gene symbol, chromosomal region, GCST accession, PMID.
- **Structured entities**: studies, associations, variants (SNPs), and traits (EFO-mapped).
- **Programmatic access** via:
  - GWAS Catalog REST API: `https://www.ebi.ac.uk/gwas/rest/api`
  - Summary Statistics API: `https://www.ebi.ac.uk/gwas/summary-statistics/api`
- **Association-level fields** commonly used in analysis: p-value, strongest allele, odds ratio/beta, mapped trait labels.
- **Pagination support** for bulk extraction (`page`, `size`, and `_links` navigation).
## Dependencies
- Python **3.9+**
- `requests` **>= 2.31.0**
- `pandas` **>= 2.0.0** (optional; for tabular outputs)
## Example Usage
The following script is a complete, runnable example that:
1) fetches associations for an EFO trait,
2) filters by genome-wide significance,
3) returns a tidy table.
```python
import time

import requests
import pandas as pd

GWAS_REST_BASE = "https://www.ebi.ac.uk/gwas/rest/api"


def fetch_trait_associations(efo_id: str, page_size: int = 100, max_pages: int = 50):
    """
    Fetch associations for a given EFO trait ID from the GWAS Catalog REST API.
    Returns a list of association JSON objects.
    """
    url = f"{GWAS_REST_BASE}/efoTraits/{efo_id}/associations"
    headers = {"Accept": "application/json"}
    all_assocs = []
    for page in range(max_pages):
        params = {"page": page, "size": page_size}
        r = requests.get(url, params=params, headers=headers, timeout=60)
        r.raise_for_status()
        data = r.json()
        assocs = data.get("_embedded", {}).get("associations", [])
        if not assocs:
            break  # an empty page means there is nothing left to fetch
        all_assocs.extend(assocs)
        time.sleep(0.1)  # be polite to the public API
    return all_assocs


def to_table(assocs, p_threshold: float = 5e-8) -> pd.DataFrame:
    """Keep associations with p <= p_threshold and return them sorted by p-value."""
    rows = []
    for a in assocs:
        # Parse the p-value defensively; skip records where it is missing or non-numeric.
        p = a.get("pvalue")
        try:
            p_float = float(p) if p is not None else None
        except (TypeError, ValueError):
            p_float = None
        if p_float is None or p_float > p_threshold:
            continue
        # .get() yields None for any key that is absent from a given record.
        rows.append({
            "rsId": a.get("rsId"),
            "trait": a.get("efoTrait") or a.get("mappedLabel"),
            "pvalue": p_float,
            "strongestAllele": a.get("strongestAllele"),
            "orPerCopyNum": a.get("orPerCopyNum"),
            "betaNum": a.get("betaNum"),
            "pubmedId": a.get("pubmedId"),
            "studyAccession": a.get("studyAccession"),
        })
    df = pd.DataFrame(rows).drop_duplicates()
    if not df.empty:
        df = df.sort_values("pvalue", ascending=True).reset_index(drop=True)
    return df


if __name__ == "__main__":
    # Example: Type 2 diabetes (EFO_0001360)
    efo_id = "EFO_0001360"
    assocs = fetch_trait_associations(efo_id)
    df = to_table(assocs, p_threshold=5e-8)
    print(df.head(20).to_string(index=False))
    print(f"\nSignificant associations: {len(df)}")
    if not df.empty:
        print(f"Unique variants: {df['rsId'].nunique()}")
```
## Implementation Details
### Data Model and Identifiers
- **Study accession**: `GCST...` (e.g., `GCST001234`)
- **Variant identifier**: `rs...` (e.g., `rs7903146`)
- **Trait identifier**: **EFO** term (e.g., `EFO_0001360`)
- **Gene symbol**: HGNC-approved symbol (e.g., `APOE`, `TCF7L2`)
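A query string can be routed to the right lookup by checking which identifier shape it has. The sketch below is one way to do that; the regular expressions are assumptions inferred from the examples above (`GCST001234`, `rs7903146`, `EFO_0001360`, `APOE`), not an official format specification, and `classify_identifier` is a hypothetical helper name.

```python
import re

# Patterns inferred from the example identifiers in this section.
# They are assumptions for illustration, not an official specification.
ID_PATTERNS = {
    "study": re.compile(r"GCST\d+"),          # e.g. GCST001234
    "variant": re.compile(r"rs\d+"),          # e.g. rs7903146
    "trait": re.compile(r"EFO_\d+"),          # e.g. EFO_0001360
    "gene": re.compile(r"[A-Z][A-Z0-9-]*"),   # e.g. APOE, TCF7L2
}


def classify_identifier(value: str) -> str:
    """Return 'study', 'variant', 'trait', 'gene', or 'unknown'."""
    text = value.strip()
    # Dict order matters: the specific patterns are tried before the
    # permissive gene-symbol pattern.
    for kind, pattern in ID_PATTERNS.items():
        if pattern.fullmatch(text):
            return kind
    return "unknown"


if __name__ == "__main__":
    for query in ["GCST001234", "rs7903146", "EFO_0001360", "TCF7L2", "type 2 diabetes"]:
        print(query, "->", classify_identifier(query))
```

Free-text trait names fall through to `unknown`, which a caller could treat as a signal to run a text search instead of a direct lookup.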
### Core Endpoints (REST API)
- Study details: `GET /studies/{GCST}`
- Variant details: `GET /singleNucleotidePolymorphisms/{rsId}`
- Variant associations: `GET /singleNucleotidePolymorphisms/{rsId}/associations`
- Trait associations: `GET /efoTraits/{EFO}/associations`
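To keep endpoint paths in one place, the list above can be turned into a small lookup table. This is a minimal sketch: the path templates are copied from this section, while `ENDPOINTS`, `build_url`, and the key names are hypothetical and chosen for illustration.

```python
from urllib.parse import quote

GWAS_REST_BASE = "https://www.ebi.ac.uk/gwas/rest/api"

# Path templates taken from the endpoint list in this section.
# The dictionary keys are illustrative names, not part of the API.
ENDPOINTS = {
    "study": "/studies/{id}",
    "variant": "/singleNucleotidePolymorphisms/{id}",
    "variant_associations": "/singleNucleotidePolymorphisms/{id}/associations",
    "trait_associations": "/efoTraits/{id}/associations",
}


def build_url(kind: str, identifier: str) -> str:
    """Build the full REST URL for one of the endpoints listed above."""
    if kind not in ENDPOINTS:
        raise ValueError(f"Unknown endpoint kind: {kind!r}")
    # Percent-encode the identifier so unexpected characters cannot alter the path.
    return GWAS_REST_BASE + ENDPOINTS[kind].format(id=quote(identifier, safe=""))


if __name__ == "__main__":
    print(build_url("study", "GCST001234"))
    print(build_url("variant_associations", "rs7903146"))
```

The resulting string can be passed directly to `requests.get(url, headers={"Accept": "application/json"}, timeout=60)`, as in the example script.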
### Pagination Strategy
- Most list endpoints are paginated.
- Use query parameters:
  - `size`: number of records per page (commonly 20–100)
  - `page`: zero-based page index
- Stop conditions:
  - `_embedded.associations` is empty, or
  - you reach a predefined `max_pages` safety limit.
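The two stop conditions above can be separated from the HTTP call so the loop is easy to test. The sketch below assumes a caller-supplied `fetch_page(page, size)` function that returns the parsed JSON for one page; `iter_pages` and `fake_fetch` are hypothetical names used only for this illustration.

```python
from typing import Callable, Iterator


def iter_pages(
    fetch_page: Callable[[int, int], dict],
    size: int = 100,
    max_pages: int = 50,
    key: str = "associations",
) -> Iterator[dict]:
    """Yield records page by page until a page is empty or max_pages is reached."""
    for page in range(max_pages):  # zero-based page index, capped by max_pages
        data = fetch_page(page, size)
        records = data.get("_embedded", {}).get(key, [])
        if not records:  # stop condition: empty _embedded.<key>
            return
        yield from records


if __name__ == "__main__":
    # Stand-in for the real API: five records served in pages of `size`.
    def fake_fetch(page: int, size: int) -> dict:
        items = [{"n": i} for i in range(5)]
        return {"_embedded": {"associations": items[page * size:(page + 1) * size]}}

    print([record["n"] for record in iter_pages(fake_fetch, size=2)])
```

In real use, `fetch_page` would wrap a `requests.get` call that passes `page` and `size` as query parameters and returns `r.json()`.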
### Significance Thresholds and Filtering
- A common GWAS threshold is **p ≤ 5×10⁻⁸** (genome-wide significance).
- Filtering should be applied after parsing `pvalue` into a numeric type; handle missing or non-numeric values safely.
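The parse-then-filter rule can be isolated into two small helpers. This is a sketch of one defensive approach; `parse_pvalue` and `is_significant` are hypothetical names, and rejecting values outside the range 0 to 1 is an extra sanity check assumed here rather than something the source prescribes.

```python
import math
from typing import Any, Optional

GENOME_WIDE_P = 5e-8  # common genome-wide significance threshold


def parse_pvalue(raw: Any) -> Optional[float]:
    """Convert a raw p-value (number or string) to float, or None if unusable."""
    if raw is None:
        return None
    try:
        p = float(raw)
    except (TypeError, ValueError):
        return None
    # Assumed sanity check: NaN and values outside [0, 1] are treated as missing.
    if math.isnan(p) or not 0.0 <= p <= 1.0:
        return None
    return p


def is_significant(raw: Any, threshold: float = GENOME_WIDE_P) -> bool:
    """True only when the value parses cleanly and is at or below the threshold."""
    p = parse_pvalue(raw)
    return p is not None and p <= threshold


if __name__ == "__main__":
    for value in ["3e-12", 5e-8, "1e-5", "NA", None]:
        print(repr(value), "->", is_significant(value))
```

Treating unparseable values as not significant keeps bad records out of the result table without raising in the middle of a bulk extraction.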
### Summary Statistics Access (when available)
- Summary Statistics API base: `https://www.ebi.ac.uk/gwas/summary-statistics/api`
- Typical filters include chromosome/position ranges and p-value bounds (endpoint availability and parameters may vary by resource version).
- For bulk downloads, the Catalog also provides an FTP directory:
  - `http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/`
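Once a summary statistics file has been downloaded, region and p-value filtering can be done locally. The sketch below assumes a tab-separated file with columns named `chromosome`, `base_pair_location`, and `p_value`; these column names are assumptions and should be checked against the header of the actual file. `filter_region` is a hypothetical helper.

```python
import csv
import io
from typing import Iterator, TextIO


def filter_region(
    handle: TextIO,
    chrom: str,
    start: int,
    end: int,
    p_max: float = 1.0,
    chrom_col: str = "chromosome",        # assumed column name
    pos_col: str = "base_pair_location",  # assumed column name
    p_col: str = "p_value",               # assumed column name
) -> Iterator[dict]:
    """Yield rows on `chrom` within [start, end] whose p-value is <= p_max."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {chrom_col, pos_col, p_col} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    for row in reader:
        try:
            pos = int(row[pos_col])
            p = float(row[p_col])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        if row[chrom_col] == chrom and start <= pos <= end and p <= p_max:
            yield row


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a downloaded file.
    sample = (
        "chromosome\tbase_pair_location\tp_value\n"
        "10\t100\t1e-9\n"
        "10\t500\t0.2\n"
        "11\t100\t1e-9\n"
    )
    for hit in filter_region(io.StringIO(sample), "10", 50, 600, p_max=5e-8):
        print(hit)
```

Streaming rows with the standard `csv` module avoids loading a genome-wide file into memory; for compressed downloads, the handle could come from `gzip.open(path, "rt")`.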
### Practical Notes for Robust Use
- Respect public API usage (add small delays; cache results for iterative workflows).
- Always interpret associations in context:
  - ancestry/cohort metadata,
  - sample size,
  - replication status,
  - effect size harmonization needs across studies.

Related Skills
uspto-database
Access USPTO data (Patent Search, PEDS, TSDR, assignments) when you need to query patents/trademarks and retrieve prosecution or status information programmatically.
zinc-database
Access the ZINC (230M+ purchasable compounds) database when you need to look up compounds by ZINC ID/SMILES, run similarity/analog searches, or download 3D ready-to-dock structures for virtual screening and drug discovery.
uniprot-database
Direct REST API access to UniProt for protein search, entry retrieval, and identifier mapping; use when you need programmatic UniProtKB queries or cross-database ID conversion.
string-database
Access the STRING database to map identifiers, retrieve protein–protein interaction networks, and run functional/PPI enrichment when you need interaction context for a gene/protein set.
semantic-scholar-database
Access the Semantic Scholar Graph API to search papers and retrieve paper/author/citation data when you need literature discovery or citation graph exploration.
scite-database
Access Scite.ai Smart Citations to classify how a paper is cited (supporting, contrasting, mentioning) and assess scientific claims; use it when you need to evaluate a paper’s reliability or its acceptance in the literature.
pubchem-database-skill
Programmatic access to the PubChem database (via PUG-REST API and PubChemPy) for searching chemical compounds, retrieving physicochemical properties, performing structure similarity/substructure searches, and obtaining bioactivity data.
pdb-database
Access the RCSB Protein Data Bank (PDB) to search, download, and programmatically retrieve 3D macromolecular structures and metadata; use when you need structure discovery (text/sequence/3D similarity) or automated structural data ingestion for structural biology and drug discovery workflows.
kegg-database
Direct access to KEGG via the REST API for academic-only pathway/gene/compound/drug queries; use when you need precise HTTP-level control or targeted KEGG ID mapping.
hmdb-database
Access the Human Metabolome Database (HMDB) to search metabolites by name/structure/ID and extract chemical/biological/clinical fields when you need metabolomics research data or automated HMDB XML mining.
gene-database
Query the NCBI Gene database via E-utilities and the NCBI Datasets API; use it when you need to search genes by symbol/ID and retrieve annotations (RefSeq, GO, location, phenotype) for single or batch gene lists.
fda-database
Query the openFDA API to retrieve FDA regulatory datasets (drugs, devices, adverse events, recalls, submissions, UNII) when you need programmatic safety/regulatory evidence for analysis or research.