Skill: Population Genetics Analysis

by mims-harvard

Best use case

Skill: Population Genetics Analysis is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using Skill: Population Genetics Analysis should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/tooluniverse-population-genetics/SKILL.md --create-dirs "https://raw.githubusercontent.com/mims-harvard/ToolUniverse/main/skills/tooluniverse-population-genetics/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/tooluniverse-population-genetics/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How Skill: Population Genetics Analysis Compares

Feature / Agent	Skill: Population Genetics Analysis	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

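What does this skill do?

It analyzes population-level genetic variation, allele frequencies, GWAS associations, clinical significance, and evolutionary constraints using ToolUniverse tools.
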
Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Skill: Population Genetics Analysis

**MC Strategy**: Population genetics MC questions often test whether you know a specific theorem or result. COMPUTE the answer first (use popgen_calculator.py or write Python), then match to options. Don't try to reason about which option "sounds right."

Analyze population-level genetic variation, allele frequencies, GWAS associations, clinical significance, and evolutionary constraints using ToolUniverse tools.

## When to Use

Activate this skill when the user asks about:
- Allele frequencies across populations (gnomAD, 1000 Genomes)
- GWAS associations for diseases/traits
- Clinical variant interpretation (ClinVar, VEP)
- Gene-level constraint metrics (pLI, LOEUF, o/e ratios)
- Selection, drift, linkage disequilibrium, or population structure
- Variant annotation and functional consequences

## LOOK UP, DON'T GUESS

Query gnomAD/1000Genomes/GWAS Catalog FIRST for allele frequencies and associations. Preferred: use the `PopGen_hwe_test`, `PopGen_fst`, `PopGen_inbreeding`, and `PopGen_haplotype_count` tools for HWE, Fst, inbreeding, and haplotype calculations. Fallback: run `popgen_calculator.py` directly. For theoretical problems (delta-q, drift, LD decay), apply the formulas in the Theoretical Reasoning section below.

---

## Tool Quick Reference

| Tool | Key Parameters | Notes |
|------|---------------|-------|
| `gnomad_search_variants` | `query` (REQUIRED) | Resolve rsID to variant_id format "CHR-POS-REF-ALT" |
| `gnomad_get_variant` | `variant_id` (REQUIRED), `dataset` | Population frequencies. Default dataset: gnomad_r3; use gnomad_r4 for latest |
| `gnomad_get_gene_constraints` | `gene_symbol` (REQUIRED) | pLI, o/e ratios. May timeout -- retry once |
| `MyVariant_query_variants` | `query` (REQUIRED) | Aggregated: ClinVar + dbSNP + gnomAD + CADD. Uses hg19 coordinates |
| `EnsemblVEP_annotate_rsid` | `variant_id` (REQUIRED) | Functional consequence, SIFT, PolyPhen. Param is "variant_id" NOT "rsid" |
| `EnsemblVEP_variant_recoder` | `variant_id` (REQUIRED) | Convert between rsID/HGVS/VCF/SPDI |
| `gwas_get_snps_for_gene` | `gene_symbol` (REQUIRED) | All GWAS SNPs for a gene |
| `gwas_search_associations` | `query` (REQUIRED) | GWAS for a disease/trait (NOT gene name -- use gwas_get_snps_for_gene for genes) |
| `gwas_get_variants_for_trait` | `trait` (REQUIRED) | Variants associated with a trait |
| `ClinVar_search_variants` | `gene`, `condition`, `significance` | At least one filter required |
| `RegulomeDB_query_variant` | `rsid` (REQUIRED) | Regulatory scoring (1a=strongest to 7=minimal) |

### Critical Gotchas

1. **gnomAD variant_id**: Format is `"CHR-POS-REF-ALT"` (no "chr" prefix). Always resolve rsIDs via `gnomad_search_variants` first.
2. **gwas_search_associations**: Takes disease/trait names ONLY. Gene names will fail. Use `gwas_get_snps_for_gene` for gene-based lookups.
3. **gwas_search_snps**: BROKEN (HTTP 500). Use `gwas_get_snps_for_gene` instead.
4. **VEP/ClinVar responses**: Format is variable (list, `{data, metadata}`, or `{error}`). Handle all three.
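Gotcha 4 can be handled with a small normalizer. The sketch below is an assumption about how one might do it, not part of ToolUniverse: `normalize_response` is a hypothetical helper, and the only thing taken from the text above is the set of three shapes (list, `{data, metadata}`, `{error}`).

```python
from typing import Any


def normalize_response(resp: Any) -> list:
    """Return a list of records from any of the three documented shapes.

    Hypothetical helper: not a ToolUniverse function.
    """
    if isinstance(resp, list):
        # Shape 1: already a list of records.
        return resp
    if isinstance(resp, dict):
        if "error" in resp:
            # Shape 3: surface the error instead of silently returning nothing.
            raise RuntimeError(f"tool returned error: {resp['error']}")
        if "data" in resp:
            # Shape 2: {data, metadata}; wrap a single record in a list.
            data = resp["data"]
            return data if isinstance(data, list) else [data]
    raise TypeError(f"unexpected response shape: {type(resp).__name__}")
```

Raising on `{error}` rather than returning an empty list keeps a failed lookup from being mistaken for "no variants found".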

---

## Workflow Patterns

**Variant frequency**: `gnomad_search_variants` -> `gnomad_get_variant(dataset="gnomad_r4")` -> `MyVariant_query_variants` (1000G pop breakdowns) -> `EnsemblVEP_annotate_rsid`

**GWAS for disease**: `gwas_search_associations` -> `gwas_get_variants_for_trait` -> `gnomad_get_variant` for top hits -> `EuropePMC_search_articles`

**Gene characterization**: `gnomad_get_gene_constraints` -> `gwas_get_snps_for_gene` -> `ClinVar_search_variants` -> `PubMed_search_articles`

**Pathogenicity assessment**: `EnsemblVEP_annotate_rsid` -> `MyVariant_query_variants` (CADD, ClinVar) -> `gnomad_get_variant` (frequency) -> `RegulomeDB_query_variant` (if non-coding)

---

## Theoretical Reasoning (CRITICAL for computation problems)

These formulas are needed for quantitative population genetics problems. Work through step by step, showing intermediate values.

### Allele Frequency Change Under Selection (delta-q)

For a recessive deleterious allele (fitness: AA=1, Aa=1, aa=1-s):
```
delta_q = -s * q^2 * p / (1 - s * q^2)
```
where p = freq(A), q = freq(a), s = selection coefficient.

For dominant deleterious (AA=1, Aa=1-s, aa=1-s):
```
delta_q = -s * q * p^2 / (1 - s * q * (2 - q))
```

For heterozygote advantage (AA=1-s1, Aa=1, aa=1-s2):
```
equilibrium: q_hat = s1 / (s1 + s2)
```
Example: with s1 = 0.1 and s2 = 0.3, q_hat = 0.1 / (0.1 + 0.3) = 0.25.

**Selection against recessives is slow at low q** because most a alleles hide in heterozygotes. Time to reduce q from q0 to qt: t ~ (1/qt - 1/q0) / s generations.
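A quick way to check any selection problem is to compute the one-generation change directly from the three genotype fitnesses instead of relying on a memorized closed form. The sketch below assumes random mating and viability selection; the function name `delta_q` is illustrative.

```python
def delta_q(q: float, w_AA: float, w_Aa: float, w_aa: float) -> float:
    """One-generation change in freq(a) under viability selection.

    Assumes random mating, so genotypes start at p^2, 2pq, q^2.
    """
    p = 1.0 - q
    # Mean fitness of the population.
    w_bar = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    # Frequency of allele a after selection.
    q_next = (p * q * w_Aa + q * q * w_aa) / w_bar
    return q_next - q


# Recessive deleterious allele: AA=1, Aa=1, aa=1-s
print(delta_q(0.1, 1.0, 1.0, 0.5))

# Heterozygote advantage: at q_hat = s1/(s1+s2) the change should be ~0
s1, s2 = 0.1, 0.3
print(delta_q(s1 / (s1 + s2), 1 - s1, 1.0, 1 - s2))
```

Iterating `q += delta_q(...)` gives the full trajectory when a question asks for the frequency after several generations.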

### Genetic Drift in Small Populations

**Variance in allele frequency per generation**: Var(delta_p) = p*q / (2*Ne)

**Probability of fixation** of a new neutral mutation: 1/(2*N), its initial frequency (equal to 1/(2*Ne) when N = Ne)

**Time to fixation** (given it fixes): ~4*Ne generations for neutral alleles

**Heterozygosity decay**: H_t = H_0 * (1 - 1/(2*Ne))^t

After t generations, fraction of heterozygosity lost ~ 1 - e^(-t/(2*Ne))

**Effective population size (Ne)** adjustments:
- Unequal sex ratio: Ne = 4*Nf*Nm / (Nf + Nm)
- Fluctuating size: Ne = harmonic mean of N across generations
- Bottleneck: dominated by the smallest generation size

**Drift vs selection**: Drift dominates when |s| < 1/(2*Ne). A variant with s=0.01 behaves neutrally in a population of Ne < 50.
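The drift formulas above translate directly into a few helper functions. This is a minimal sketch with illustrative function names; it implements only the expressions stated in this section.

```python
def het_after(h0: float, ne: float, t: int) -> float:
    """Expected heterozygosity after t generations: H_0 * (1 - 1/(2Ne))^t."""
    return h0 * (1 - 1 / (2 * ne)) ** t


def ne_sex_ratio(nf: float, nm: float) -> float:
    """Effective size with unequal sex ratio: 4*Nf*Nm / (Nf + Nm)."""
    return 4 * nf * nm / (nf + nm)


def ne_harmonic(sizes: list) -> float:
    """Effective size under fluctuating N: harmonic mean across generations."""
    return len(sizes) / sum(1 / n for n in sizes)


print(ne_sex_ratio(10, 90))            # 10 females, 90 males
print(ne_harmonic([1000, 10, 1000]))   # one bottleneck generation dominates
print(het_after(0.5, 50, 10))
```

The harmonic-mean call shows why a single bottleneck generation pulls Ne far below the arithmetic mean of the census sizes.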

### Linkage Disequilibrium (LD) Decay

**D** = freq(AB) - freq(A)*freq(B), where A and B are alleles at two loci.

**Decay with recombination**: D_t = D_0 * (1 - r)^t, where r = recombination fraction, t = generations.

**Half-life of LD**: t_half = -ln(2) / ln(1-r) ~ 0.693/r generations (for small r).

**r-squared** (normalized LD): r^2 = D^2 / (pA * pa * pB * pb). Range 0-1.

**Expected r^2 in finite population at equilibrium**: E[r^2] = 1 / (1 + 4*Ne*r) (for drift-recombination balance).

**Practical implications**:
- Tightly linked loci (r < 0.01): LD persists for hundreds of generations
- Loosely linked (r = 0.5, independent assortment): LD halves every generation
- GWAS tag SNPs work because LD extends over blocks; block size depends on Ne and recombination rate
- African populations have shorter LD blocks (larger historical Ne) -> need denser SNP arrays
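The LD expressions in this section can be sketched as follows. Function names are illustrative; `r_squared` takes the AB haplotype frequency and the two allele frequencies as inputs.

```python
import math


def ld_after(d0: float, r: float, t: int) -> float:
    """D after t generations of recombination: D_0 * (1 - r)^t."""
    return d0 * (1 - r) ** t


def ld_half_life(r: float) -> float:
    """Generations for D to halve: -ln(2) / ln(1 - r)."""
    return -math.log(2) / math.log(1 - r)


def r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Normalized LD: D^2 / (pA * pa * pB * pb)."""
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


print(ld_half_life(0.01))          # close to 0.693 / 0.01
print(ld_after(0.25, 0.5, 2))      # unlinked loci: D halves each generation
print(r_squared(0.5, 0.5, 0.5))    # complete LD
```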

### Hardy-Weinberg Equilibrium

For alleles A (freq p) and a (freq q=1-p): expected genotypes AA=p^2, Aa=2pq, aa=q^2.

**Chi-square test**: df=1 (2 alleles). Preferred: use `PopGen_hwe_test` tool. Fallback: `popgen_calculator.py --type hwe --AA N1 --Aa N2 --aa N3`.

**Causes of HWE departure**: non-random mating, selection, migration, drift, genotyping error. Excess homozygotes -> inbreeding or population structure (Wahlund effect). Excess heterozygotes -> overdominant selection or negative assortative mating.
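When neither the tool nor the script is available, the HWE chi-square test can be computed by hand. The sketch below is a standalone illustration, not the implementation behind `PopGen_hwe_test` or `popgen_calculator.py`; it uses the identity that the upper tail of a chi-square with df=1 equals `erfc(sqrt(x/2))`, so no SciPy is needed.

```python
import math


def hwe_chi_square(n_AA: int, n_Aa: int, n_aa: int):
    """Chi-square test of HWE from genotype counts (df = 1).

    Assumes both alleles are present; a monomorphic sample would give
    zero expected counts and a division by zero.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # allele frequency estimated from the data
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival, df = 1
    return chi2, p_value


print(hwe_chi_square(25, 50, 25))   # exactly at HWE proportions
print(hwe_chi_square(50, 0, 50))    # no heterozygotes at all
```

The degrees of freedom are 1 rather than 2 because one parameter (p) is estimated from the same counts.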

### Heritability

- **H^2 (broad-sense)** = V_G / V_P; **h^2 (narrow-sense)** = V_A / V_P
- V_G includes ALL genetic variance: additive + dominance + epistasis. Trap: "broad-sense" is not just additive.
- Under HWE with two alleles (p, q): genotype frequencies are p^2, 2pq, q^2
- Phenotype frequency from genotype: sum(genotype_freq * penetrance) for each genotype class
- For quantitative traits: V_P = V_G + V_E (no covariance assumed)
- With dominance: assign genotypic values (e.g., AA=a, Aa=d, aa=-a), compute mean, then V_G from freq-weighted squared deviations
- **PGS vs SNP-h² trap**: PGS R² is NOT necessarily ≤ h²_SNP. With large GWAS, PGS can exceed SNP-h² by tagging rare causal variants through LD with common SNPs. The word "necessarily" makes this claim False. h²_SNP is estimated from common variants; PGS can capture additional variance.
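The genotypic-value recipe above (AA=a, Aa=d, aa=-a) can be checked numerically. This sketch assumes HWE genotype frequencies; the split into additive and dominance variance uses the standard single-locus results V_A = 2pq[a + d(q - p)]^2 and V_D = (2pqd)^2, which are not stated in this document and are included here only to show that V_G is more than the additive part.

```python
def genetic_variance(p: float, a: float, d: float):
    """Mean and total genetic variance for one locus under HWE."""
    q = 1 - p
    freqs = (p * p, 2 * p * q, q * q)
    values = (a, d, -a)                      # genotypic values of AA, Aa, aa
    mean = sum(f * v for f, v in zip(freqs, values))
    v_g = sum(f * (v - mean) ** 2 for f, v in zip(freqs, values))
    return mean, v_g


def additive_and_dominance(p: float, a: float, d: float):
    """Standard single-locus decomposition (assumption: textbook formulas)."""
    q = 1 - p
    v_a = 2 * p * q * (a + d * (q - p)) ** 2
    v_d = (2 * p * q * d) ** 2
    return v_a, v_d


mean, v_g = genetic_variance(0.3, 1.0, 0.5)
v_a, v_d = additive_and_dominance(0.3, 1.0, 0.5)
v_e = 1.0                                     # illustrative environmental variance
print("H^2 =", v_g / (v_g + v_e), " h^2 =", v_a / (v_g + v_e))
```

With d != 0 the broad-sense value is strictly larger than the narrow-sense one, which is the trap flagged above.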

### Path Analysis (Causal Diagrams)

- Trace ALL paths from cause to effect through the diagram (direct + indirect)
- Each path's contribution = product of path coefficients along that path
- Total effect (correlation) = sum of contributions from all paths
- Indirect effects can mask (suppression) or inflate (confounding) the direct effect
- Unanalyzed correlations (double-headed arrows) count as valid path segments
- **Never ignore indirect paths** — the total is rarely just the direct arrow
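The tracing rule (multiply along each path, then sum over paths) is short enough to code directly. In this sketch each path is supplied by hand as a list of coefficients; finding the paths in the diagram is still up to the reader, and the numbers are invented for illustration.

```python
from math import prod


def total_effect(paths: list) -> float:
    """Sum over paths of the product of coefficients along each path."""
    return sum(prod(path) for path in paths)


# Illustrative diagram: X -> Y directly (0.5), and X -> M -> Y (0.4, then -0.5)
direct = [0.5]
indirect = [0.4, -0.5]
print(total_effect([direct, indirect]))  # indirect path partly cancels the direct one
```

Here the negative indirect path suppresses the direct effect, so reading off only the direct arrow would overstate the total.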

### Genetic Combinatorics (F2 crosses, haplotype counting)

For n SNPs between two inbred (homozygous) strains:
- F1 is heterozygous at all n loci
- F2 distinct haplotypes = 2^n (each SNP contributes parental A or B allele)
- F2 distinct diploid genotypes = 3^n (AA, AB, BB at each locus)
- Novel (non-parental) F2 haplotypes = 2^n - 2 (e.g., 5 SNPs → 2^5 = 32 distinct haplotypes, of which 30 are novel once the 2 parental haplotypes are subtracted)
- **ALWAYS write and run Python code** (`python3 -c "..."`) for these counts. Never enumerate by hand.
- For specimens/counting from field data: parse the data into a structure and compute programmatically.
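In line with the "always run Python" rule, the counts can be enumerated rather than derived. This sketch assumes recombination can occur between every adjacent pair of SNPs, so all allele combinations are reachable; `f2_counts` is an illustrative name.

```python
from itertools import product


def f2_counts(n_snps: int) -> dict:
    """Enumerate F2 haplotypes and diploid genotypes for n biallelic SNPs."""
    # Each SNP carries the allele from parental strain A or strain B.
    haplotypes = set(product("AB", repeat=n_snps))
    parental = {("A",) * n_snps, ("B",) * n_snps}
    # Each locus is AA, AB, or BB (phase ignored).
    genotypes = set(product(("AA", "AB", "BB"), repeat=n_snps))
    return {
        "haplotypes": len(haplotypes),
        "novel_haplotypes": len(haplotypes - parental),
        "diploid_genotypes": len(genotypes),
    }


print(f2_counts(5))
```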

### Mutation-Selection Balance

Equilibrium frequency of a deleterious allele:
- Recessive deleterious: q_hat = sqrt(mu/s), which reduces to sqrt(mu) for a lethal (s=1)
- Dominant deleterious: q_hat = mu/s, which reduces to mu for a lethal (s=1)
- Example: mu=1e-5, s=1 (recessive lethal) -> q_hat = sqrt(1e-5) ~ 0.0032 (carrier freq 2pq ~ 0.0063)
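The equilibrium expressions are easy to evaluate in code, which avoids slips with square roots of small numbers. Function names below are illustrative.

```python
import math


def q_hat_recessive(mu: float, s: float) -> float:
    """Mutation-selection equilibrium for a recessive deleterious allele."""
    return math.sqrt(mu / s)


def q_hat_dominant(mu: float, s: float) -> float:
    """Mutation-selection equilibrium for a dominant deleterious allele."""
    return mu / s


q = q_hat_recessive(1e-5, 1.0)
carriers = 2 * (1 - q) * q      # heterozygote (carrier) frequency under HWE
print(q, carriers)
```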

### F-statistics and Population Structure

- **Fis**: Inbreeding within subpopulations (heterozygote deficit within demes)
- **Fst**: Differentiation between subpopulations. Fst = Var(p) / (p_bar * q_bar)
- **Fit**: Total inbreeding. (1-Fit) = (1-Fis)(1-Fst)
- Fst interpretation: <0.05 little, 0.05-0.15 moderate, 0.15-0.25 great, >0.25 very great differentiation
- Preferred: use `PopGen_fst` tool. Fallback: `popgen_calculator.py --type fst --p1 X --p2 Y --n1 N1 --n2 N2`
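For a quick hand check, the variance-based definition above can be computed from subpopulation allele frequencies. Note the assumption: this is the simple Var(p) / (p_bar * q_bar) form with equally weighted demes, not the Weir-Cockerham estimator that the tool and script are described as using, so values may differ from theirs.

```python
def fst_from_freqs(freqs: list) -> float:
    """Fst = Var(p) / (p_bar * q_bar), equal weights, population variance."""
    p_bar = sum(freqs) / len(freqs)
    var_p = sum((p - p_bar) ** 2 for p in freqs) / len(freqs)
    return var_p / (p_bar * (1 - p_bar))


def fit_from(fis: float, fst: float) -> float:
    """Total inbreeding from (1 - Fit) = (1 - Fis)(1 - Fst)."""
    return 1 - (1 - fis) * (1 - fst)


print(fst_from_freqs([0.2, 0.8]))   # strongly differentiated pair of demes
print(fit_from(0.1, 0.2))
```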

---

## Mendelian Genetics Reasoning Framework

For any genetics cross problem, follow these steps IN ORDER. Do not skip steps.

### Step 1: Identify genes, locations, and allele relationships
- List every gene involved in the cross
- Determine chromosomal location: autosomal vs X-linked (X-linked genes show different inheritance in males vs females)
- Determine allele relationships: dominant/recessive, codominant, incomplete dominance
- Note any epistasis, suppressor, or modifier interactions between genes

### Step 2: Write parental genotypes explicitly
- Use standard notation (e.g., Aa Bb for autosomal; X^w X^+ for X-linked)
- For X-linked genes, males are hemizygous (X^w Y), not homozygous
- If parental genotypes are not given, deduce them from phenotypes and pedigree context

### Step 3: Draw Punnett square(s) for each gene
- For multi-gene crosses, handle each gene independently (if unlinked) then combine
- For linked genes, use recombination frequency to adjust gamete ratios
- For X-linked genes, remember that fathers pass X to all daughters and Y to all sons

### Step 4: Calculate expected phenotypic ratios
- Multiply independent gene ratios (e.g., 3:1 x 3:1 = 9:3:3:1)
- For X-linked: calculate male and female ratios separately, then combine or report separately as required

### Step 5: Verify ratios sum to 1.0
- Convert all ratios to fractions and confirm they sum to 1
- If they don't sum to 1, there is an error in the Punnett square or gamete calculation

### Step 6: Apply phenotype modification rules AFTER computing genotypic ratios
- For epistasis: first compute the full genotypic ratios (e.g., 9:3:3:1), then collapse genotype classes that produce the same phenotype
- For suppressor genes: a suppressor homozygote (su/su) restores wild-type in an otherwise mutant background. Apply suppression AFTER determining which individuals carry the mutant allele
- Example: 9 A_B_ : 3 A_bb : 3 aaB_ : 1 aabb with recessive epistasis (aa masks B) becomes 9:3:4
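Steps 3 to 6 can be verified by enumerating the 16 gamete combinations of an AaBb x AaBb cross and applying the phenotype rule last. The sketch assumes two unlinked autosomal genes with complete dominance; the phenotype labels are illustrative.

```python
from collections import Counter
from itertools import product


def f2_phenotype_counts(phenotype_of) -> dict:
    """Count phenotypes among the 16 equally likely F2 gamete pairings."""
    gametes = list(product("Aa", "Bb"))          # AB, Ab, aB, ab
    counts = Counter()
    for g1, g2 in product(gametes, repeat=2):
        has_A = "A" in (g1[0], g2[0])            # at least one dominant A allele
        has_B = "B" in (g1[1], g2[1])            # at least one dominant B allele
        counts[phenotype_of(has_A, has_B)] += 1
    return dict(counts)


def recessive_epistasis(has_A: bool, has_B: bool) -> str:
    """aa masks the B locus, so aaB_ and aabb look the same."""
    if not has_A:
        return "aa (B masked)"
    return "A_B_" if has_B else "A_bb"


print(f2_phenotype_counts(recessive_epistasis))
```

Swapping in a different `phenotype_of` function gives other modified ratios (for example duplicate dominant or complementary gene action) without redoing the Punnett square.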

---

## E. coli Hfr Mapping Framework

For bacterial conjugation and Hfr mapping problems:

### Core Principles
- In Hfr x F- crosses, the Hfr chromosome is transferred linearly starting from the origin of transfer (oriT)
- **Gene transfer order = chromosomal order from the origin**
- Early markers (entering first) are closest to the origin of transfer
- Late markers (entering last) are farthest from the origin

### Interrupted Mating Experiments
- Genes that appear in recombinants at earlier time points are closer to oriT
- The time of entry gives the order and approximate distance between genes
- Recombinants require integration by homologous recombination (double crossover)

### Recombination Frequency Between Markers
- **KEY TRAP**: Highest recombination frequency occurs between markers that are FARTHEST APART on the transferred segment
- This is because more time elapses between entry of distant markers, providing more opportunity for recombination events between them
- Conversely, markers that enter close together in time show LOW recombination between them
- Do NOT confuse "highest recombination frequency" with "first markers to enter" -- these are opposite concepts

### Ordering Markers from Hfr Data
1. Use time-of-entry data to establish gene order relative to oriT
2. Use recombination frequency data between pairs of selected markers to confirm/refine order
3. Multiple Hfr strains with different origins can be used to build a circular map
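Step 1 amounts to sorting markers by time of entry. The sketch below uses invented marker names and times purely for illustration; the gaps between adjacent entry times give approximate map distances in minutes.

```python
def order_markers(entry_minutes: dict):
    """Order markers by time of entry and report gaps between neighbours."""
    ordered = sorted(entry_minutes.items(), key=lambda kv: kv[1])
    order = [name for name, _ in ordered]
    # Distance (in minutes) between each adjacent pair along the transferred segment.
    gaps = [(a[0], b[0], b[1] - a[1]) for a, b in zip(ordered, ordered[1:])]
    return order, gaps


# Hypothetical interrupted-mating data: marker -> first appearance (minutes)
data = {"geneC": 25, "geneA": 8, "geneB": 17}
print(order_markers(data))
```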

---

## MCQ Elimination Strategy for Genetics

### General MCQ Protocol
1. **ALWAYS evaluate ALL options** before choosing an answer
2. Never select the first option that seems correct -- there may be a better or more precise answer
3. Read the question stem carefully for qualifiers: "MOST likely", "LEAST likely", "NOT true", "ALWAYS", "NEVER"

### "Which is NOT true" Questions
- Evaluate EACH statement independently as True or False
- Mark each option with T or F before selecting
- The answer is the statement marked F
- Double-check: verify the "false" statement is genuinely false, not just misleadingly worded

### "Which mechanism" Questions
- Test each proposed mechanism against ALL observations given in the question
- A correct mechanism must explain every observation, not just some
- Eliminate mechanisms that contradict even one observation

### Specific Traps to Watch For
- **Subfunctionalization vs neofunctionalization**: Subfunctionalization = partitioning of EXISTING ancestral functions between duplicates (both copies needed to perform original function). Neofunctionalization = one copy acquires a genuinely NEW function not present in the ancestor
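- **Copy-neutral LOH**: In somatic tissue this is typically caused by mitotic recombination (segmental, affects part of a chromosome). Whole-chromosome uniparental disomy (UPD) from trisomy or monosomy rescue is a different mechanism, although segmental copy-neutral LOH is sometimes labeled "acquired UPD". The question may try to conflate these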
- **Penetrance vs expressivity**: Penetrance = fraction of individuals with genotype who show ANY phenotype. Expressivity = degree/severity of phenotype among those who show it. These are distinct concepts
- **Complementation vs recombination**: Complementation = two mutations in DIFFERENT genes restore wild-type in trans. Recombination = exchange between two mutations in the SAME or different genes. Complementation is tested in F1 (heterozygote); recombination is tested in progeny

---

## Common Genetics Reasoning Traps

These are specific patterns that have caused reasoning failures in hard genetics questions. Review before answering genetics MCQs.

### Suppressor Genetics
- A suppressor mutation, when homozygous, restores wild-type phenotype in an otherwise mutant background
- In F2 crosses involving both the original mutation and an autosomal recessive suppressor:
  - Treat as a dihybrid cross — the primary mutation and the suppressor segregate independently
  - Only 1/4 of F2 are homozygous for the suppressor
  - The suppressor only acts in individuals that are also homozygous for the primary mutation
  - Use a Punnett square to enumerate all genotypic classes, then apply the suppression rule to determine phenotypes
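The dihybrid enumeration described above can be sketched as follows, assuming an F1 that is heterozygous at both loci, an unlinked autosomal recessive suppressor, and a recessive primary mutation. Allele labels `m` and `su` are illustrative.

```python
from fractions import Fraction
from itertools import product


def f2_mutant_fraction() -> Fraction:
    """Fraction of F2 showing the mutant phenotype (m/m and not su/su)."""
    gametes = list(product(("+", "m"), ("+", "su")))   # 4 equally likely gametes
    mutant = 0
    for g1, g2 in product(gametes, repeat=2):
        homozygous_m = g1[0] == g2[0] == "m"
        homozygous_su = g1[1] == g2[1] == "su"
        # Suppression is applied last: su/su rescues an m/m individual.
        if homozygous_m and not homozygous_su:
            mutant += 1
    return Fraction(mutant, len(gametes) ** 2)


print(f2_mutant_fraction())   # mutant : wild-type = 3 : 13
```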

### Non-disjunction (Bridges' Experiments)
- Bridges used non-disjunction to prove the chromosome theory of inheritance
- X0 males arise from female meiosis non-disjunction events
- **Meiosis I non-disjunction**: both X chromosomes go to one pole -> XX egg + O egg (nullo-X)
- **Meiosis II non-disjunction**: sister chromatids fail to separate -> XX egg from one secondary oocyte
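- The classic Bridges result: exceptional white-eyed females (X^w X^w Y, from XX eggs + Y sperm) and exceptional red-eyed males (X^+ 0, from nullo-X eggs + X-bearing sperm carrying the father's X^+; these X0 males are sterile). Nullo-X eggs + Y sperm give Y0, which is lethal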
- Key distinction: know which type of non-disjunction (MI vs MII) produces which specific gamete types

### GWAS LD Blocks
- SNPs WITHIN the same LD block are correlated and can inflate false positive associations (one causal SNP drags along non-causal tag SNPs)
- SNPs ACROSS different LD blocks are largely independent and do NOT create misleading cross-locus associations
- LD block structure varies by population (shorter in African populations due to larger historical Ne)
- Fine-mapping within an LD block is needed to distinguish the causal variant from hitchhiking tag SNPs

### Gene Retention After Whole-Genome Duplication
- **Neofunctionalization**: One copy acquires a NEW function -> most commonly cited reason for gene RETENTION after duplication (preserves both copies because each is now essential)
- **Subfunctionalization**: Ancestral functions are PARTITIONED between copies -> explains DIVERGENCE of duplicate copies, but both copies must be retained to maintain the full ancestral function
- **Dosage balance**: Some genes are retained in duplicate to maintain stoichiometric balance in protein complexes
- Trap: Questions may ask "what explains retention" vs "what explains divergence" -- these have different best answers
- For retention: neofunctionalization (new function makes both copies essential)
- For divergence of expression/function: subfunctionalization (partitioning of ancestral roles)

---

## Advanced Genetics Traps v2

### PGS vs Heritability: "Necessarily True" Logic
For "necessarily true" questions about PGS and heritability: a statement is necessarily true only if it holds when V_D=0 AND when V_D=V_G. Test the extremes.

### Path Diagram Sign Assignment Protocol
Do NOT guess path signs from general knowledge. Signs may differ from well-known systems. Follow this protocol:
1. **Establish reference direction**: What varies? What is increasing?
2. **For each path X→Y**: Ask ONLY "when X increases, does Y increase (+) or decrease (-)?"
3. **Use the question's experimental context** (knockout/control comparisons, provided data) to determine signs — not intuition
4. **Expect negative paths**: Path diagrams test your ability to identify negative relationships. All-positive is almost always wrong. Direct residual paths (e) often have opposite sign from expectation.

### Chi-Square: "Most Likely to Reject" Protocol
Compute chi-square from the expected ratio given in the question. Compare to chi-square-critical at df = (number of phenotype classes - 1). Pick the answer choice with the highest chi-square, but also check which pattern is biologically diagnostic of the alternative hypothesis.
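The protocol can be sketched as a goodness-of-fit calculation against the stated ratio. The function name is illustrative; the critical values are the standard chi-square cutoffs at alpha = 0.05.

```python
# Standard chi-square critical values at alpha = 0.05, keyed by df.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square_vs_ratio(observed: list, ratio: list):
    """Chi-square of observed counts against an expected ratio; returns (chi2, df)."""
    n = sum(observed)
    total = sum(ratio)
    expected = [n * r / total for r in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1


chi2, df = chi_square_vs_ratio([80, 40], [3, 1])
print(chi2, df, "reject" if chi2 > CRITICAL_05[df] else "fail to reject")
```

Running this for every answer choice and ranking by chi-square makes the "most likely to reject" comparison mechanical.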

### LD and Misleading GWAS Associations
LD block boundaries at recombination hotspots are a source of GWAS false localization — strong signal in the block does not guarantee the causal variant is in the block.

### Low-Frequency Allele Detection
Duplex sequencing (unique molecular identifiers + double-strand consensus) detects alleles at 0.01% frequency — far below standard NGS even at 80X depth. Simply increasing read depth does NOT help for ultra-rare variants because the Illumina error rate (~0.1%) masks variants rarer than ~1% regardless of depth. Error correction methods (UMIs, duplex consensus) are needed to distinguish true rare variants from sequencing errors.

---

## Bundled Computation Script

**Script**: `skills/tooluniverse-population-genetics/scripts/popgen_calculator.py`

**Preferred**: Use ToolUniverse tools (via MCP/SDK) instead of the script when possible:
- `PopGen_hwe_test` tool -- HWE chi-square test. Fallback: `popgen_calculator.py --type hwe`
- `PopGen_fst` tool -- Weir-Cockerham Fst. Fallback: `popgen_calculator.py --type fst`
- `PopGen_inbreeding` tool -- Inbreeding coefficient from pedigree. Fallback: `popgen_calculator.py --type inbreeding`
- `PopGen_haplotype_count` tool -- Expected haplotype diversity. Fallback: `popgen_calculator.py --type haplotypes`

**Fallback script** modes (all require `--type`):
- `hwe`: `--AA N --Aa N --aa N` -- chi-square HWE test with p-value
- `fst`: `--p1 F --p2 F --n1 N --n2 N` -- Weir-Cockerham Fst
- `inbreeding`: `--pedigree TYPE --generations G` -- F from pedigree (self, full-sib, half-sib, first-cousin, etc.)
- `haplotypes`: `--snps N --generations G --recomb_rate R` -- expected haplotype diversity

---

## Key Concepts

- **MAF**: Minor allele frequency. Common: >5%. Rare: <1%. Ultra-rare: <0.01%.
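- **pLI**: Probability of being loss-of-function (LoF) intolerant. >0.9 = LoF-intolerant, likely haploinsufficient gene.
- **LOEUF**: LoF observed/expected upper bound fraction (upper bound of the o/e confidence interval). <0.35 = highly constrained.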
- **CADD PHRED**: >=10 top 10%, >=20 top 1%, >=30 top 0.1% most deleterious.
- **Genome-wide significance**: GWAS p < 5e-8 (Bonferroni for ~1M independent tests).
- **Effect size**: OR > 1 = risk allele, < 1 = protective. Beta > 0 = increases trait.

## Evidence Grading

- **T1**: ClinVar pathogenic/likely pathogenic, FDA pharmacogenomics
- **T2**: gnomAD frequencies, GTEx eQTLs, GWAS genome-wide significant
- **T3**: CADD/SIFT/PolyPhen predictions, RegulomeDB, constraint metrics
- **T4**: VEP consequence terms, dbSNP annotations, literature mentions
