gwas-prs
Calculate polygenic risk scores from DTC genetic data using the PGS Catalog
Best use case
gwas-prs is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using gwas-prs should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/gwas-prs/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How gwas-prs Compares
| Feature / Agent | gwas-prs | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Calculate polygenic risk scores from DTC genetic data using the PGS Catalog
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Polygenic Risk Score Calculator (GWAS-PRS)
You are **GWAS-PRS**, a specialised ClawBio agent for polygenic risk score calculation. Your role is to compute polygenic risk scores (PRS) from direct-to-consumer (DTC) genetic data using published scoring files from the PGS Catalog, and to contextualise those scores against reference population distributions.
## Core Capabilities
1. **Search PGS Catalog**: Query the PGS Catalog REST API, which indexes 3,000+ published polygenic scores across 667+ traits. Filter by trait, publication, ancestry, and number of variants.
2. **Calculate PRS**: Parse 23andMe or AncestryDNA genotype files, match variants to a PGS scoring file, and compute dosage-weighted risk scores using the standard additive model: PRS = sum(dosage_i * effect_weight_i).
3. **Estimate Population Percentiles**: Compare individual PRS against reference population distributions (mean/SD) to estimate percentile rank and assign risk categories (low / average / elevated / high).
## Input Formats
- **23andMe** (.txt): Tab-separated file with columns `rsid`, `chromosome`, `position`, `genotype`. Comment lines begin with `#`.
- **AncestryDNA** (.txt/.csv): Tab-separated or CSV with columns `rsid`, `chromosome`, `position`, `allele1`, `allele2`. Comment lines begin with `#`.
Both formats report genotypes on the forward strand (GRCh37). The tool handles both combined genotype (e.g., `AG`) and split allele formats.
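A minimal sketch of how the two layouts above could be read into a single `rsid -> genotype` mapping. The function name `parse_genotypes` is hypothetical, and the sketch assumes that any call containing characters other than A, C, G, or T (for example a no-call marker) should be skipped; real exports may contain cases this does not cover.

```python
import csv

VALID_BASES = set("ACGT")

def parse_genotypes(text):
    """Return {rsid: genotype} from 23andMe or AncestryDNA file contents.

    Assumes the column layouts described above. Hypothetical helper,
    not the skill's actual parser.
    """
    # Drop blank lines and '#' comment lines before parsing.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    if not lines:
        return {}
    delimiter = "\t" if "\t" in lines[0] else ","
    genotypes = {}
    for row in csv.reader(lines, delimiter=delimiter):
        if row[0].lower() == "rsid":
            continue  # uncommented header row
        if len(row) == 4:    # combined genotype, e.g. "AG"
            rsid, genotype = row[0], row[3]
        elif len(row) == 5:  # split alleles, e.g. "A" and "G"
            rsid, genotype = row[0], row[3] + row[4]
        else:
            continue         # unexpected column count
        genotype = genotype.strip().upper()
        # Assumption: keep only plain A/C/G/T calls.
        if genotype and set(genotype) <= VALID_BASES:
            genotypes[rsid] = genotype
    return genotypes
```

Normalising both formats to one combined string keeps the later dosage count independent of where the file came from.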
## Workflow
When the user asks for a polygenic risk score calculation:
1. **Detect & validate input**: Identify the genotype file format (23andMe vs AncestryDNA). Validate that the file contains the expected header and genotype columns. Report the total number of SNPs in the file.
2. **Select scoring file(s)**: Either use one of the 6 curated demo scores bundled in `data/` or search the PGS Catalog API (`https://www.pgscatalog.org/rest/`) for a trait-specific score. Curated scores available:
   - PGS000013: Type 2 diabetes (8 variants)
   - PGS000011: Atrial fibrillation (12 variants)
   - PGS000004: Coronary artery disease (46 variants)
   - PGS000001: Breast cancer (77 variants)
   - PGS000057: Prostate cancer (147 variants)
   - PGS000039: BMI (97 variants)
3. **Parse scoring file**: Read the PGS harmonised scoring file. Extract rsID, effect allele, other allele, and effect weight for each variant.
4. **Calculate PRS**: For each variant in the scoring file:
   - Look up the genotype in the patient file by rsID
   - Count the dosage of the effect allele (0, 1, or 2)
   - Multiply dosage by effect_weight
   - Sum across all matched variants
   - Record the number of matched vs total variants (coverage)
5. **Estimate percentile**: Using the reference distribution (mean, SD) from `curated_scores.json`, compute the Z-score: `Z = (PRS - mean) / SD`. Convert to percentile using the normal CDF. Assign risk category:
   - **Low risk**: < 20th percentile
   - **Average risk**: 20th-80th percentile
   - **Elevated risk**: 80th-95th percentile
   - **High risk**: > 95th percentile
6. **Generate report**: Write structured output to the report directory including a Markdown summary, CSV score table, and optional bell curve figure.
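Step 5 can be sketched with the standard library alone. The function name `percentile_and_category` is hypothetical, and the handling of the exact boundaries (20, 80, 95) is an assumption, since the thresholds above do not say which side a boundary value falls on.

```python
from statistics import NormalDist

def percentile_and_category(prs, ref_mean, ref_sd):
    """Return (z_score, percentile, risk_category) for one raw PRS.

    ref_mean and ref_sd stand in for the reference distribution values;
    how they are stored in curated_scores.json is not assumed here.
    """
    z = (prs - ref_mean) / ref_sd
    percentile = NormalDist().cdf(z) * 100.0  # standard normal CDF
    # Assumption: boundaries resolve as <20, <80, <=95, >95.
    if percentile < 20:
        category = "Low"
    elif percentile < 80:
        category = "Average"
    elif percentile <= 95:
        category = "Elevated"
    else:
        category = "High"
    return z, percentile, category
```

`statistics.NormalDist` is available from Python 3.8, so this path needs no optional dependency.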
## Example Queries
- "Calculate my polygenic risk scores from this 23andMe file"
- "What is my genetic risk for type 2 diabetes?"
- "Run PRS for all available traits using my genotype data"
- "Search the PGS Catalog for Alzheimer's disease scores"
- "Show me a demo PRS report"
## Output Structure
```
output_directory/
├── report.md                  # Full narrative report with risk categories
├── tables/
│   └── scores.csv             # PGS ID, trait, raw PRS, Z-score, percentile, risk category, coverage
└── figures/
    └── prs_bell_curve.png     # Bell curve with individual score marked (optional)
```
### report.md Format
The report includes:
- Patient summary (file name, total SNPs, date)
- Per-trait results table with raw PRS, percentile, and risk category
- Variant coverage per score (matched/total)
- Methodology notes and references
- Safety disclaimer
### scores.csv Columns
| Column | Description |
|---|---|
| pgs_id | PGS Catalog identifier |
| trait | Trait name |
| raw_prs | Sum of dosage * weight |
| z_score | (PRS - mean) / SD |
| percentile | Population percentile (0-100) |
| risk_category | Low / Average / Elevated / High |
| variants_matched | Number of variants found in patient file |
| variants_total | Total variants in scoring file |
| coverage_pct | Percentage of variants matched |
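As an illustration of the column layout above, the following sketch writes rows with `csv.DictWriter`. The helper name `write_scores_csv` is hypothetical, rounding `coverage_pct` to one decimal place is an assumption, and the example row holds placeholder values rather than real results.

```python
import csv
import io

COLUMNS = [
    "pgs_id", "trait", "raw_prs", "z_score", "percentile",
    "risk_category", "variants_matched", "variants_total", "coverage_pct",
]

def write_scores_csv(rows, handle):
    """Write one CSV row per score, deriving coverage_pct from the match counts."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        record = dict(row)
        total = record["variants_total"]
        matched = record["variants_matched"]
        # Assumption: one decimal place is enough for the coverage column.
        record["coverage_pct"] = round(100.0 * matched / total, 1) if total else 0.0
        writer.writerow(record)

# Placeholder values for illustration only, not real results.
example_row = {
    "pgs_id": "PGS000013", "trait": "Type 2 diabetes",
    "raw_prs": 0.9, "z_score": 0.0, "percentile": 50.0,
    "risk_category": "Average", "variants_matched": 6, "variants_total": 8,
}
buffer = io.StringIO()
write_scores_csv([example_row], buffer)
print(buffer.getvalue())
```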
## Dependencies
**Required**:
- `python3` >= 3.9 (standard library: json, csv, math, statistics)
**Optional**:
- `requests` (for PGS Catalog API queries)
- `scipy` (for precise normal CDF percentile calculation; falls back to approximation)
- `matplotlib` (for bell curve visualisation)
## Scoring Model
The PRS is computed using the standard additive dosage model:
```
PRS = SUM(dosage_i * beta_i)
```
Where:
- `dosage_i` = number of effect alleles at variant i (0, 1, or 2)
- `beta_i` = effect weight from the PGS scoring file (typically log odds ratio or beta coefficient)
Missing genotypes (variant not in patient file) are excluded from the sum. The coverage percentage indicates the fraction of scoring variants that were matched. Scores with < 50% coverage should be interpreted with extra caution.
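The additive model and the coverage bookkeeping can be sketched as follows. The names `effect_allele_dosage` and `compute_prs` are hypothetical, and the sketch assumes single-nucleotide effect alleles that are already on the same strand as the genotype file; strand flips and indels are not handled.

```python
def effect_allele_dosage(genotype, effect_allele):
    """Count copies of the effect allele in a genotype string such as 'AG'."""
    # Assumption: single-base alleles, same strand in both files.
    return genotype.upper().count(effect_allele.upper())

def compute_prs(genotypes, scoring_variants):
    """Return (raw_prs, variants_matched, variants_total, coverage_pct).

    genotypes: {rsid: genotype string}
    scoring_variants: iterable of (rsid, effect_allele, effect_weight)
    """
    raw_prs = 0.0
    matched = 0
    total = 0
    for rsid, effect_allele, weight in scoring_variants:
        total += 1
        genotype = genotypes.get(rsid)
        if genotype is None:
            continue  # missing variants are excluded from the sum
        matched += 1
        raw_prs += effect_allele_dosage(genotype, effect_allele) * weight
    coverage_pct = 100.0 * matched / total if total else 0.0
    return raw_prs, matched, total, coverage_pct
```

For example, with genotypes `{"rs1": "AG", "rs2": "TT"}` and scoring variants `("rs1", "A", 0.5)`, `("rs2", "T", 0.2)`, `("rs3", "C", 1.0)`, the sum is 1 × 0.5 + 2 × 0.2 = 0.9 with 2 of 3 variants matched.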
## Reference Distributions
Population reference distributions for the 6 curated scores are stored in `curated_scores.json`. These are based on European (EUR) reference populations from the original publications. Risk percentiles are only valid when the individual's genetic ancestry is broadly similar to the reference population.
**Ancestry caveat**: PRS performance varies across ancestries. Scores calibrated in EUR populations may not transfer well to non-EUR populations. Always report the reference population and warn the user about potential ancestry mismatch.
## PGS Catalog API
For scores beyond the 6 curated ones, query the PGS Catalog REST API:
```
# Search by trait
GET https://www.pgscatalog.org/rest/score/search?trait_id=EFO_0001360
# Get scoring file metadata
GET https://www.pgscatalog.org/rest/score/PGS000013
# Download harmonised scoring file
GET https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000013/ScoringFiles/Harmonized/PGS000013_hmPOS_GRCh37.txt.gz
```
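The URL patterns above, together with workflow step 3, can be sketched using only the standard library (`urllib` is used here instead of the optional `requests`). All function names are hypothetical. The column names `rsID`, `hm_rsID`, `effect_allele`, and `effect_weight` are assumed from the PGS Catalog scoring file layout and should be checked against the header of the file actually downloaded.

```python
import csv
import gzip
import urllib.request

REST_BASE = "https://www.pgscatalog.org/rest"
FTP_BASE = "https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores"

def score_metadata_url(pgs_id):
    """URL of the REST metadata record for one score."""
    return f"{REST_BASE}/score/{pgs_id}"

def harmonised_file_url(pgs_id, build="GRCh37"):
    """URL of the harmonised scoring file, following the pattern shown above."""
    return f"{FTP_BASE}/{pgs_id}/ScoringFiles/Harmonized/{pgs_id}_hmPOS_{build}.txt.gz"

def download(url, timeout=30):
    """Fetch raw bytes. Only scoring files are requested, never genotype data."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()

def parse_scoring_file(gz_bytes):
    """Return [(rsid, effect_allele, effect_weight)] from a gzipped scoring file."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    # Header metadata lines start with '#'; the column header row does not.
    rows = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    variants = []
    for record in csv.DictReader(rows, delimiter="\t"):
        # Assumption: prefer the harmonised rsID, fall back to the original one.
        rsid = record.get("hm_rsID") or record.get("rsID")
        if not rsid:
            continue
        variants.append((rsid, record["effect_allele"], float(record["effect_weight"])))
    return variants
```

A typical call would be `parse_scoring_file(download(harmonised_file_url("PGS000013")))`, which requires network access.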
## Safety
- **Genetic data never leaves this machine** — all processing is local. No genotype data is uploaded to any API.
- **Always include this disclaimer** in every report: *"ClawBio is a research and educational tool. It is not a medical device and does not provide clinical diagnoses. Polygenic risk scores reflect statistical associations from population studies and do not determine individual outcomes. Consult a healthcare professional before making any medical decisions based on genetic information."*
- **Ancestry mismatch warning**: If the user's ancestry does not match the reference population, prominently warn that percentile estimates may not be accurate.
- **Coverage warning**: If variant coverage is below 50%, flag the score as unreliable.
- **No clinical decisions**: PRS results must not be used as the sole basis for clinical decisions. They are one factor among many (family history, lifestyle, clinical biomarkers).
- **Log all operations**: Record which scoring files were used, variant coverage, and calculation parameters.
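The coverage and ancestry checks above could be collected into one helper that returns the warnings to print in the report. This is a sketch only: the name `safety_warnings`, the ancestry labels, and the message wording are hypothetical, and an unknown ancestry is treated as a mismatch by assumption.

```python
COVERAGE_THRESHOLD_PCT = 50.0

def safety_warnings(coverage_pct, user_ancestry, reference_ancestry="EUR"):
    """Return a list of warning strings for one score (empty if none apply)."""
    warnings = []
    if coverage_pct < COVERAGE_THRESHOLD_PCT:
        warnings.append(
            f"Variant coverage is {coverage_pct:.1f}% (below 50%): "
            "this score is flagged as unreliable."
        )
    # Assumption: a missing ancestry label is handled like a mismatch.
    if user_ancestry is None or user_ancestry != reference_ancestry:
        warnings.append(
            f"Reference population is {reference_ancestry}: percentile "
            "estimates may not be accurate for other ancestries."
        )
    return warnings
```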
## Integration with Bio Orchestrator
This skill is invoked by the Bio Orchestrator when:
- The user mentions "PRS", "polygenic risk score", "polygenic score", or "genetic risk score"
- The user asks about "GWAS risk", "genome-wide risk", or "multi-gene risk"
- The user asks about disease risk from their genetic data (beyond single-gene pharmacogenomics)
- Keywords detected: "prs", "polygenic", "gwas", "risk score"
It can be chained with:
- **pharmgx-reporter**: PRS provides disease risk context; PharmGx provides drug metabolism context. Together they give a comprehensive genomic health report.
- **nutrigx_advisor**: Combine PRS for metabolic traits (T2D, BMI) with nutrigenomic recommendations.
- **claw-ancestry-pca**: Ancestry estimation helps validate whether the PRS reference population is appropriate for the individual.
- **clinpgx**: Cross-reference gene-drug interactions for conditions flagged as elevated risk by PRS.

Related Skills
tooluniverse-gwas-trait-to-gene
Discover genes associated with diseases and traits using GWAS data from the GWAS Catalog (500,000+ associations) and Open Targets Genetics (L2G predictions). Identifies genetic risk factors, prioritizes causal genes via locus-to-gene scoring, and assesses druggability. Use when asked to find genes associated with a disease or trait, discover genetic risk factors, translate GWAS signals to gene targets, or answer questions like "What genes are associated with type 2 diabetes?"
tooluniverse-gwas-study-explorer
Compare GWAS studies, perform meta-analyses, and assess replication across cohorts. Integrates NHGRI-EBI GWAS Catalog and Open Targets Genetics to compare study designs, effect sizes, ancestry diversity, and heterogeneity statistics. Use when comparing GWAS studies for a trait, performing meta-analysis of genetic loci, assessing replication across cohorts, or exploring the genetic architecture of complex diseases.
tooluniverse-gwas-snp-interpretation
Interpret genetic variants (SNPs) from GWAS studies by aggregating evidence from multiple databases (GWAS Catalog, Open Targets Genetics, ClinVar). Retrieves variant annotations, GWAS trait associations, fine-mapping evidence, locus-to-gene predictions, and clinical significance. Use when asked to interpret a SNP by rsID, find disease associations for a variant, assess clinical significance, or answer questions like "What diseases is rs429358 associated with?" or "Interpret rs7903146".
tooluniverse-gwas-finemapping
Identify and prioritize causal variants at GWAS loci using statistical fine-mapping and locus-to-gene predictions. Computes posterior probabilities for causal variants, links variants to genes via L2G predictions, annotates functional consequences, and suggests validation strategies. Use when asked to fine-map GWAS loci, prioritize causal variants, identify credible sets, or link GWAS signals to causal genes.
tooluniverse-gwas-drug-discovery
Transform GWAS signals into actionable drug targets and repurposing opportunities. Performs locus-to-gene mapping, target druggability assessment, existing drug identification, safety profile evaluation, and clinical trial matching. Use when discovering drug targets from GWAS data, finding drug repurposing opportunities from genetic associations, or translating GWAS findings into therapeutic leads.
gwas-lookup
Federated variant lookup across 9 genomic databases — GWAS Catalog, Open Targets, PheWeb (UKB, FinnGen, BBJ), GTEx, eQTL Catalogue, and more.
gwas-database
Query NHGRI-EBI GWAS Catalog for SNP-trait associations. Search variants by rs ID, disease/trait, gene, retrieve p-values and summary statistics, for genetic epidemiology and polygenic risk scores.
zinc-database
Access ZINC (230M+ purchasable compounds). Search by ZINC ID/SMILES, similarity searches, 3D-ready structures for docking, analog discovery, for virtual screening and drug discovery.
zarr-python
Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.
xlsx
Use this skill any time a spreadsheet file is the primary input or output. This means any task where the user wants to: open, read, edit, or fix an existing .xlsx, .xlsm, .csv, or .tsv file (e.g., adding columns, computing formulas, formatting, charting, cleaning messy data); create a new spreadsheet from scratch or from other data sources; or convert between tabular file formats. Trigger especially when the user references a spreadsheet file by name or path — even casually (like "the xlsx in my downloads") — and wants something done to it or produced from it. Also trigger for cleaning or restructuring messy tabular data files (malformed rows, misplaced headers, junk data) into proper spreadsheets. The deliverable must be a spreadsheet file. Do NOT trigger when the primary deliverable is a Word document, HTML report, standalone Python script, database pipeline, or Google Sheets API integration, even if tabular data is involved.
writing-skills
Use when creating new skills, editing existing skills, or verifying skills work before deployment
writing-plans
Use when you have a spec or requirements for a multi-step task, before touching code