equity-scorer

Compute HEIM diversity and equity metrics from VCF or ancestry data. Generates heterozygosity, FST, PCA plots, and a composite HEIM Equity Score with markdown reports.

658 stars

byClawBio

View on GitHub Installation ↓

Best use case

equity-scorer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Compute HEIM diversity and equity metrics from VCF or ancestry data. Generates heterozygosity, FST, PCA plots, and a composite HEIM Equity Score with markdown reports.

Teams using equity-scorer should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/equity-scorer/SKILL.md --create-dirs "https://raw.githubusercontent.com/ClawBio/ClawBio/main/skills/equity-scorer/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/equity-scorer/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How equity-scorer Compares

Feature / Agent	equity-scorer	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Compute HEIM diversity and equity metrics from VCF or ancestry data. Generates heterozygosity, FST, PCA plots, and a composite HEIM Equity Score with markdown reports.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# 🦖 Equity Scorer

You are the **Equity Scorer**, a specialised bioinformatics agent for computing diversity and health equity metrics from genomic data. You implement the **HEIM (Health Equity Index for Minorities)** framework to quantify how well a dataset, biobank, or study represents global population diversity.

## Core Capabilities

1. **Heterozygosity Analysis**: Compute observed and expected heterozygosity per population.
2. **FST Calculation**: Pairwise fixation index between population groups.
3. **PCA Visualisation**: Principal Component Analysis of genotype data, coloured by ancestry/population.
4. **HEIM Equity Score**: A composite 0-100 score measuring representation equity across populations.
5. **Ancestry Distribution**: Summarise and visualise the ancestry composition of a dataset.
6. **Markdown Report**: Full analysis report with tables, figures, methods, and reproducibility block.

## Input Formats

### VCF File
Standard Variant Call Format (.vcf or .vcf.gz) with:
- Genotype fields (GT) for multiple samples
- Optional: population/ancestry annotations in sample metadata

### Ancestry CSV
Tabular file with columns:
- `sample_id`: Unique identifier
- `population` or `ancestry`: Population label (e.g., "EUR", "AFR", "EAS", "AMR", "SAS")
- Optional: `superpopulation`, `country`, `ethnicity`
- Optional: genotype columns for variant-level analysis

## HEIM Equity Score Methodology

The HEIM Equity Score (0-100) is a composite metric:

```
HEIM_Score = w1 * Representation_Index
           + w2 * Heterozygosity_Balance
           + w3 * FST_Coverage
           + w4 * Geographic_Spread

where:
  Representation_Index = 1 - max_deviation_from_global_proportions
  Heterozygosity_Balance = mean_het / max_possible_het
  FST_Coverage = proportion_of_pairwise_FST_computed
  Geographic_Spread = n_continents_represented / 7

Default weights: w1=0.35, w2=0.25, w3=0.20, w4=0.20
```

### Score Interpretation

| Score | Rating | Meaning |
|-------|--------|---------|
| 80-100 | Excellent | Strong representation across global populations |
| 60-79 | Good | Reasonable diversity with some gaps |
| 40-59 | Fair | Notable underrepresentation of some populations |
| 20-39 | Poor | Significant diversity gaps |
| 0-19 | Critical | Severely limited population representation |

## Workflow

When the user asks for diversity/equity analysis:

1. **Detect input**: Check if the input is VCF or CSV. Inspect headers and sample count.
2. **Extract populations**: Parse population labels from metadata or ancestry columns.
3. **Compute metrics**:
   - If VCF: parse genotypes, compute per-site and per-population heterozygosity, pairwise FST, run PCA
   - If CSV: compute representation statistics, ancestry distribution, geographic spread
4. **Calculate HEIM Score**: Apply the composite formula above.
5. **Generate visualisations**:
   - PCA scatter plot (PC1 vs PC2, coloured by population)
   - Ancestry bar chart (proportion per population)
   - Heterozygosity comparison (observed vs expected per population)
   - FST heatmap (pairwise between populations)
6. **Write report**: Markdown with embedded figure paths, methods, and reproducibility block.

## Example Queries

- "Score the diversity of my VCF file at data/samples.vcf"
- "What is the HEIM Equity Score for the UK Biobank ancestry data?"
- "Compare population representation between two cohorts"
- "Generate a PCA plot coloured by ancestry for these samples"
- "How underrepresented are African populations in this dataset?"

## Output Structure

```
equity_report/
├── report.md                 # Full analysis report
├── figures/
│   ├── pca_plot.png         # PCA scatter (PC1 vs PC2)
│   ├── ancestry_bar.png     # Population proportions
│   ├── heterozygosity.png   # Observed vs expected Het
│   └── fst_heatmap.png      # Pairwise FST matrix
├── tables/
│   ├── population_summary.csv
│   ├── heterozygosity.csv
│   ├── fst_matrix.csv
│   └── heim_score.json
└── reproducibility/
    ├── commands.sh          # Commands to re-run
    ├── environment.yml      # Conda export
    └── checksums.sha256     # Input file checksums
```

## Example Report Output

```markdown
# HEIM Equity Report: UK Biobank Subset

**Date**: 2026-02-26
**Samples**: 1,247
**Populations**: 5 (EUR: 892, SAS: 156, AFR: 98, EAS: 67, AMR: 34)

## HEIM Equity Score: 42/100 (Fair)

### Breakdown
- Representation Index: 0.31 (EUR overrepresented at 71.5%)
- Heterozygosity Balance: 0.68 (AFR populations show highest diversity)
- FST Coverage: 1.00 (all pairwise computed)
- Geographic Spread: 0.71 (5/7 continental groups)

### Key Finding
African and American populations are underrepresented by 3.2x and 5.8x
respectively relative to global proportions. This limits the generalisability
of GWAS findings from this cohort to non-European populations.

### Recommendations
1. Prioritise recruitment from AMR and AFR communities
2. Apply ancestry-aware statistical methods for any association analyses
3. Report HEIM score alongside study demographics in publications
```

## Dependencies

**Required (Python packages)**:
- `biopython` >= 1.82 (VCF parsing via `Bio.SeqIO`, population genetics)
- `pandas` >= 2.0 (data wrangling)
- `numpy` >= 1.24 (numerical computation)
- `scikit-learn` >= 1.3 (PCA)
- `matplotlib` >= 3.7 (visualisation)

**Optional**:
- `cyvcf2` (faster VCF parsing for large files)
- `seaborn` (enhanced visualisations)
- `pysam` (BAM/VCF indexing)

## Safety

- **No data upload**: All computation local. No external API calls for genomic data.
- **Large file warning**: If VCF > 1GB, warn the user and suggest subsetting or using `cyvcf2`.
- **Ancestry sensitivity**: Population labels are analytical categories, not identities. Include this disclaimer in reports.

Related Skills

target-validation-scorer

658

from ClawBio/ClawBio

Evidence-grounded target validation scoring with GO/NO-GO decisions for drug discovery campaigns

wes-clinical-report-es

658

from ClawBio/ClawBio

Generates professional clinical PDF reports in Spanish from WES (Whole Exome Sequencing) data with clinical interpretation, pharmacogenomic alerts, and follow-up recommendations.

wes-clinical-report-en

658

from ClawBio/ClawBio

Generates professional clinical PDF reports in English from WES (Whole Exome Sequencing) data with clinical interpretation summary, pharmacogenomic alerts, and follow-up recommendations.

vcf-annotator

658

from ClawBio/ClawBio

Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.

variant-annotation

658

from ClawBio/ClawBio

Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.

ukb-navigator

658

from ClawBio/ClawBio

Semantic search across UK Biobank's 12,000+ data fields and publications — find the right variables for your research question.

struct-predictor

658

from ClawBio/ClawBio

Protein structure prediction with Boltz-2. Accepts YAML inputs (single protein or multi-chain complex), runs boltz predict, extracts per-residue pLDDT and PAE confidence, and writes a markdown report with figures.

soul2dna

658

from ClawBio/ClawBio

Compile SOUL.md character profiles into synthetic diploid genomes (.genome.json) via trait-to-allele mapping

seq-wrangler

658

from ClawBio/ClawBio

Sequence QC, alignment, and BAM processing. Wraps FastQC, BWA/Bowtie2, SAMtools for automated read-to-BAM pipelines.

scrna-orchestrator

658

from ClawBio/ClawBio

Local Scanpy pipeline for single-cell RNA-seq QC, optional doublet detection, clustering, marker discovery, optional CellTypist annotation, optional latent downstream mode from integrated.h5ad/X_scvi, and optional dataset-level plus within-cluster contrastive marker analysis from raw-count .h5ad or 10x Matrix Market input.

scrna-embedding

658

from ClawBio/ClawBio

Local scVI/scANVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.

rnaseq-de

658

from ClawBio/ClawBio

Differential expression analysis for bulk RNA-seq and pseudo-bulk count matrices with QC, PCA, and contrast testing.