claw-ancestry-pca

Ancestry decomposition PCA against the Simons Genome Diversity Project

1,802 stars

by FreedomIntelligence

Best use case

claw-ancestry-pca is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using claw-ancestry-pca should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/claw-ancestry-pca/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/claw-ancestry-pca/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/claw-ancestry-pca/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How claw-ancestry-pca Compares

Feature / Agent	claw-ancestry-pca	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Ancestry decomposition PCA against the Simons Genome Diversity Project

Where can I find the source code?

The source code is in the FreedomIntelligence/OpenClaw-Medical-Skills repository on GitHub, under skills/claw-ancestry-pca.

SKILL.md Source

# 🦖 Ancestry Decomposition PCA

Place your study cohort in global genetic context by computing a joint PCA against the Simons Genome Diversity Project (SGDP) — 345 samples from 164 populations spanning every inhabited continent.

## What it does

1. Takes your VCF + population map as input
2. Finds common variants between your cohort and the SGDP reference panel (bundled)
3. Runs PLINK PCA on the merged dataset
4. Separates your cohort from SGDP reference samples
5. Matches SGDP samples to their population labels (164 populations)
6. Generates a publication-quality multi-panel figure:
   - **Panel A**: PC1 vs PC2 — main population structure of your cohort
   - **Panel B**: PC3 vs PC2 with regional groupings and confidence ellipses
   - **Panel C**: PC3 vs PC1 with language/cultural groupings
   - **Panel D**: Global context — your samples (circles) vs SGDP (triangles)
7. Produces a markdown report with variance explained, population assignments, and reproducibility bundle
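
Steps 4 and 5 can be sketched as follows. This is a minimal illustration, not the skill's actual code: it assumes PLINK 1.9-style `.eigenvec` rows (`FID IID PC1 PC2 ...`, whitespace separated), and the function name, sample IDs, and population labels are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, List[float]]  # (sample ID, principal-component coordinates)


def split_eigenvec(
    lines: Iterable[str], reference_labels: Dict[str, str]
) -> Tuple[List[Row], Dict[str, List[Row]]]:
    """Split PCA rows into cohort rows and reference rows grouped by population.

    Assumption: each row is 'FID IID PC1 PC2 ...'. Any sample whose IID appears
    in reference_labels is treated as a reference sample; all others are cohort.
    """
    cohort: List[Row] = []
    reference: Dict[str, List[Row]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("#"):
            continue  # skip blank lines and optional header lines
        iid = fields[1]
        pcs = [float(value) for value in fields[2:]]
        if iid in reference_labels:
            reference.setdefault(reference_labels[iid], []).append((iid, pcs))
        else:
            cohort.append((iid, pcs))
    return cohort, reference


# Hypothetical sample IDs and labels, for illustration only.
demo_lines = [
    "FAM1 COHORT_001 0.0123 -0.0456 0.0078",
    "FAM2 COHORT_002 0.0101 -0.0399 0.0061",
    "REF REF_A_1 -0.0850 0.0210 -0.0033",
    "REF REF_B_1 0.0420 0.0630 0.0150",
]
demo_labels = {"REF_A_1": "PopulationA", "REF_B_1": "PopulationB"}

cohort_rows, reference_rows = split_eigenvec(demo_lines, demo_labels)
print(len(cohort_rows), "cohort samples;", len(reference_rows), "reference populations")
```

Keying the split on a label lookup rather than on row order means the result does not depend on how the merged dataset was sorted.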

## Why this exists

If you ask ChatGPT to "run a PCA against a global reference panel," it will:
- Not know which reference panel to use
- Hallucinate PLINK flags for merging datasets with different variant sets
- Skip IBD removal (related individuals distort PCA)
- Not normalise contig names between your VCF and the reference
- Produce a single scatter plot with no population labels

This skill encodes the correct methodological decisions:
- Uses SGDP (the gold-standard reference for global diversity)
- Handles contig normalisation (chr1 vs 1)
- Filters to common biallelic SNPs shared between datasets
- Removes related individuals via IBD checks
- Produces publication-quality multi-panel figures with confidence ellipses
- Differentiates your samples (circles) from reference (triangles)
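
To make the contig-normalisation and shared-SNP decisions concrete, here is a minimal sketch. It assumes variants are already available as `(contig, position, REF, ALT)` tuples; the function names and demo records are hypothetical and not taken from the skill's source.

```python
from typing import Iterable, List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (contig, position, REF, ALT)


def normalise_contig(name: str) -> str:
    """Strip a leading 'chr' so that 'chr1' and '1' compare equal."""
    return name[3:] if name.lower().startswith("chr") else name


def is_biallelic_snp(ref: str, alt: str) -> bool:
    """True for a single-base REF with exactly one single-base ALT allele."""
    bases = {"A", "C", "G", "T"}
    return ref.upper() in bases and alt.upper() in bases


def variant_keys(records: Iterable[Variant]) -> Set[Variant]:
    """Build comparable keys, keeping only biallelic SNPs."""
    return {
        (normalise_contig(chrom), pos, ref.upper(), alt.upper())
        for chrom, pos, ref, alt in records
        if is_biallelic_snp(ref, alt)
    }


# Hypothetical records: the cohort uses 'chr1', the reference uses '1'.
cohort_variants: List[Variant] = [
    ("chr1", 1000, "A", "G"),    # SNP present in both datasets
    ("chr1", 2000, "C", "CT"),   # insertion, excluded
    ("chr2", 500, "G", "A,T"),   # multi-allelic site, excluded
    ("chr2", 750, "T", "C"),     # SNP present in both datasets
]
reference_variants: List[Variant] = [
    ("1", 1000, "A", "G"),
    ("1", 2000, "C", "CT"),
    ("2", 750, "T", "C"),
    ("3", 10, "G", "A"),         # SNP absent from the cohort
]

shared = sorted(variant_keys(cohort_variants) & variant_keys(reference_variants))
print(shared)
```

This sketch matches alleles exactly, so it does not resolve strand flips or swapped REF/ALT alleles, and it does not reconcile names such as `chrM` versus `MT`. A production merge would need to handle those cases separately.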

## Reference Panel

The skill bundles the SGDP v4 dataset (Mallick et al., 2016, Nature):
- 345 samples from 164 populations
- Whole-genome sequencing at high coverage
- MAF > 0.1% filter applied
- Populations span: Africa, Americas, Central/South Asia, East Asia, Europe, Middle East, Oceania

## Usage

```bash
python ancestry_pca.py \
    --vcf your_cohort.vcf.gz \
    --pop-map your_populations.tsv \
    --output ancestry_report
```

### Demo (works out of the box)

```bash
python ancestry_pca.py --demo --output demo_report
```

The demo uses pre-computed PCA results from the Peruvian Genome Project (736 samples, 28 populations), so it generates the full 4-panel figure without rerunning the PCA.

## Example Output

```
Ancestry Decomposition PCA
==========================
Cohort: 736 samples, 28 populations
Reference: SGDP (345 samples, 164 populations)
Common variants: 42,831 biallelic SNPs

Variance explained:
  PC1: 51.44%  PC2: 21.70%  PC3: 6.70%

Panel D — Global Context:
  Cohort samples cluster between European and East Asian
  reference populations, with Amazonian groups showing
  distinct positioning from Highland and Coastal groups.

Figures saved to: ancestry_report/
  Figure3_PCA_composite.png (300 dpi)
  Figure3_PCA_composite.pdf (vector)

Reproducibility:
  commands.sh | environment.yml | checksums.sha256
```
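
The "Variance explained" percentages are typically derived from the PCA eigenvalues. The sketch below shows one common convention, dividing each eigenvalue by the sum of the eigenvalues supplied. Whether the skill uses exactly this convention is an assumption, and the eigenvalues here are hypothetical.

```python
from typing import List, Sequence


def variance_explained(eigenvalues: Sequence[float]) -> List[float]:
    """Return each eigenvalue as a percentage of the sum of all supplied eigenvalues."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [100.0 * value / total for value in eigenvalues]


# Hypothetical eigenvalues, for illustration only.
demo_eigenvalues = [8.0, 4.0, 2.0, 1.0, 1.0]
percentages = variance_explained(demo_eigenvalues)
print("  ".join(f"PC{i}: {p:.2f}%" for i, p in enumerate(percentages, start=1)))
```

If only the top few eigenvalues are available, as in a PLINK `.eigenval` file, the percentages are relative to those components rather than to the total genetic variance. The two conventions can give very different numbers.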

## Interpretation Guide

- **PC1** typically captures the largest axis of global differentiation (often Africa vs non-Africa)
- **PC2** separates major continental groups (Europe, East Asia, Americas)
- **PC3** often reveals finer substructure within continental groups
- Confidence ellipses show 2.5 standard deviations around each population cluster
- Your samples are shown as **circles**, SGDP reference samples as **triangles**
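
One standard way to draw an ellipse at 2.5 standard deviations is to take the eigen-decomposition of the 2×2 sample covariance of a population's PC coordinates. The sketch below is an illustration of that approach and is not confirmed to be the skill's implementation; the function name and demo points are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def sd_ellipse(
    points: Sequence[Point], n_std: float = 2.5
) -> Tuple[Point, float, float, float]:
    """Return (centre, semi-major axis, semi-minor axis, rotation in degrees).

    The axes are n_std times the square roots of the eigenvalues of the
    2x2 sample covariance matrix; the rotation is the major-axis direction.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to estimate a covariance")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # Closed-form eigenvalues of the symmetric matrix [[var_x, cov_xy], [cov_xy, var_y]].
    half_trace = (var_x + var_y) / 2.0
    spread = math.hypot((var_x - var_y) / 2.0, cov_xy)
    major_var = half_trace + spread
    minor_var = max(half_trace - spread, 0.0)  # guard against tiny negative rounding

    angle = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    return (
        (mean_x, mean_y),
        n_std * math.sqrt(major_var),
        n_std * math.sqrt(minor_var),
        math.degrees(angle),
    )


# Hypothetical (PC1, PC2) coordinates for one population.
demo_points = [(-2.0, 0.0), (2.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
centre, semi_major, semi_minor, rotation = sd_ellipse(demo_points)
print(centre, round(semi_major, 3), round(semi_minor, 3), round(rotation, 1))
```

When plotting with `matplotlib.patches.Ellipse`, note that `width` and `height` are full axis lengths, so each semi-axis returned here would be doubled.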

## Citation

If you use this skill in a publication, please cite:

- Mallick, S. et al. (2016). The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature, 538, 201-206.
- Corpas, M. (2026). ClawBio. https://github.com/ClawBio/ClawBio

Related Skills

claw-semantic-sim

from FreedomIntelligence/OpenClaw-Medical-Skills

Semantic Similarity Index for disease research literature using PubMedBERT embeddings

claw-metagenomics

from FreedomIntelligence/OpenClaw-Medical-Skills

Shotgun metagenomics profiling — taxonomy, resistome, and functional pathways

zinc-database

from FreedomIntelligence/OpenClaw-Medical-Skills

Access ZINC (230M+ purchasable compounds). Search by ZINC ID/SMILES, similarity searches, 3D-ready structures for docking, analog discovery, for virtual screening and drug discovery.

zarr-python

from FreedomIntelligence/OpenClaw-Medical-Skills

Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.

xlsx

from FreedomIntelligence/OpenClaw-Medical-Skills

Use this skill any time a spreadsheet file is the primary input or output. This means any task where the user wants to: open, read, edit, or fix an existing .xlsx, .xlsm, .csv, or .tsv file (e.g., adding columns, computing formulas, formatting, charting, cleaning messy data); create a new spreadsheet from scratch or from other data sources; or convert between tabular file formats. Trigger especially when the user references a spreadsheet file by name or path — even casually (like "the xlsx in my downloads") — and wants something done to it or produced from it. Also trigger for cleaning or restructuring messy tabular data files (malformed rows, misplaced headers, junk data) into proper spreadsheets. The deliverable must be a spreadsheet file. Do NOT trigger when the primary deliverable is a Word document, HTML report, standalone Python script, database pipeline, or Google Sheets API integration, even if tabular data is involved.

writing-skills

from FreedomIntelligence/OpenClaw-Medical-Skills

Use when creating new skills, editing existing skills, or verifying skills work before deployment

writing-plans

from FreedomIntelligence/OpenClaw-Medical-Skills

Use when you have a spec or requirements for a multi-step task, before touching code

wikipedia-search

from FreedomIntelligence/OpenClaw-Medical-Skills

Search and fetch structured content from Wikipedia using the MediaWiki API for reliable, encyclopedic information

wellally-tech

from FreedomIntelligence/OpenClaw-Medical-Skills

Integrate digital health data sources (Apple Health, Fitbit, Oura Ring) and connect to WellAlly.tech knowledge base. Import external health device data, standardize to local format, and recommend relevant WellAlly.tech knowledge base articles based on health data. Support generic CSV/JSON import, provide intelligent article recommendations, and help users better manage personal health data.

weightloss-analyzer

from FreedomIntelligence/OpenClaw-Medical-Skills

Analyze weight-loss data, calculate metabolic rate, track energy deficit, and manage weight-loss phases

verification-before-completion

from FreedomIntelligence/OpenClaw-Medical-Skills

Use when about to claim work is complete, fixed, or passing, before committing or creating PRs - requires running verification commands and confirming output before making any success claims; evidence before assertions always