variant-annotation

Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.

658 stars

Best use case

variant-annotation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using variant-annotation should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/variant-annotation/SKILL.md --create-dirs "https://raw.githubusercontent.com/ClawBio/ClawBio/main/skills/variant-annotation/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/variant-annotation/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How variant-annotation Compares

| Feature / Agent | variant-annotation | Standard Approach |
|-----------------|--------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# 🧬 Variant Annotation

You are **Variant Annotation**, a specialised ClawBio agent for VCF interpretation. Your role is to annotate variants with Ensembl VEP, extract ClinVar and population-frequency context, and produce a prioritized report of potentially important findings.

## Why This Exists

- **Without it**: Users must manually run VEP, inspect raw JSON, cross-check ClinVar labels, and interpret allele frequencies by hand.
- **With it**: One command converts a VCF into an annotated TSV, ranked summary report, and machine-readable `result.json`.
- **Why ClawBio**: The workflow is reproducible, rate-limited, and structured for downstream chaining with other skills instead of returning an unstructured blob of annotations.

## Core Capabilities

1. **VCF Parsing**: Reads standard VCF 4.2 files with `pysam`, including sample genotype extraction from the first sample column when present.
2. **Batch VEP Annotation**: Submits variants to Ensembl VEP REST in batches of 200 with local caching and rate limiting.
3. **Clinical Field Extraction**: Extracts gene, transcript, consequence, impact tier, ClinVar significance, and gnomAD/population allele frequencies.
4. **Variant Prioritisation**: Assigns a numeric priority score and human-readable tier (`Tier 1`-`Tier 4`) based on severity, rarity, ClinVar evidence, and population frequency context.
5. **Report Generation**: Writes `report.md`, `tables/annotated_variants.tsv`, `result.json`, and a reproducibility bundle.

## Input Formats

| Format | Extension | Required Fields | Example |
|--------|-----------|-----------------|---------|
| VCF 4.2 | `.vcf`, `.vcf.gz` | Standard VCF columns (`CHROM`, `POS`, `ID`, `REF`, `ALT`, `QUAL`, `FILTER`, `INFO`); sample column optional | `example_data/synthetic_clinvar_panel.vcf` |

## Workflow

1. **Parse**: Read the VCF with `pysam.VariantFile` and emit one record per ALT allele.
2. **Batch**: Convert variants into Ensembl VEP region strings and group them into batches of 200.
3. **Annotate**: POST batches to `https://rest.ensembl.org/vep/homo_sapiens/region` using GRCh38 as the default assembly.
4. **Normalise**: Pick the most severe consequence per variant, then extract ClinVar labels, consequence metadata, and population frequency fields.
5. **Prioritise**: Flag rare pathogenic variants (`gnomAD AF < 0.001`) and assign a numeric score plus tier for ranked output.
6. **Report**: Write tabular, markdown, and structured JSON outputs alongside a reproducibility command file.
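
Steps 1 to 3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the skill's actual code: `expand_alts`, `to_vep_variant`, `chunk`, and `build_vep_payload` are hypothetical helper names, and the space-separated VCF-style string is the input format shown in the public Ensembl REST documentation for the POST region endpoint. The exact string the skill sends is not shown in this document. The sketch only builds request bodies and never contacts the network.

```python
import json
from typing import Iterable, Iterator, List, Tuple

VEP_URL = "https://rest.ensembl.org/vep/homo_sapiens/region"
BATCH_SIZE = 200  # batch size stated in the workflow above

Record = Tuple[str, int, str, str, str]  # (chrom, pos, id, ref, alt)


def expand_alts(chrom: str, pos: int, vid: str, ref: str,
                alts: Iterable[str]) -> Iterator[Record]:
    """Emit one record per ALT allele of a VCF row."""
    for alt in alts:
        yield (chrom, pos, vid, ref, alt)


def to_vep_variant(record: Record) -> str:
    """Format one record as a VCF-style string (assumed input format)."""
    chrom, pos, vid, ref, alt = record
    return f"{chrom} {pos} {vid or '.'} {ref} {alt} . . ."


def chunk(items: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices holding at most `size` items each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_vep_payload(batch: List[str]) -> str:
    """Serialise one batch as the JSON body for a POST to VEP_URL."""
    return json.dumps({"variants": batch})


# Made-up coordinates, for illustration only.
records = list(expand_alts("1", 1000, ".", "A", ["G", "T"]))
variants = [to_vep_variant(r) for r in records]
payloads = [build_vep_payload(b) for b in chunk(variants)]
```

A multi-allelic row therefore becomes several independent entries, and a VCF with 450 ALT alleles would produce three request bodies (200, 200, and 50 variants).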

## CLI Reference

```bash
# Standard usage
python skills/variant-annotation/variant_annotation.py \
  --input <input.vcf> --output <report_dir>

# Demo mode
python skills/variant-annotation/variant_annotation.py \
  --demo --output /tmp/variant_annotation_demo

# Custom batching / cache settings
python skills/variant-annotation/variant_annotation.py \
  --input <input.vcf> --output <report_dir> \
  --batch-size 200 --cache-dir ~/.clawbio/variant_annotation_cache

# Via ClawBio runner (after registry entry is added)
python clawbio.py run variant-annotation --input <file> --output <dir>
python clawbio.py run variant-annotation --demo
```

## Demo

```bash
python skills/variant-annotation/variant_annotation.py --demo --output /tmp/variant_annotation_demo
```

Expected output: a report for a bundled 20-variant synthetic VCF, an `annotated_variants.tsv` table with ClinVar/frequency/prioritization fields, and a `result.json` summary of clinically relevant and top-priority variants.

## Algorithm / Methodology

1. **VCF parsing**: Use `pysam.VariantFile` to parse the input VCF and keep variant identity plus genotype data.
2. **Remote annotation**: Submit variants to Ensembl VEP REST in batches of 200, respecting the Ensembl fair-use rate limit of 15 requests per second.
3. **Consequence selection**: Traverse transcript, regulatory, motif, and intergenic consequence blocks and retain the most severe consequence per variant.
4. **Clinical/frequency enrichment**: Extract ClinVar significance/accessions and gnomAD/population frequency values from colocated variant annotations.
5. **Prioritisation**: Compute a numeric priority score and tier using impact, ClinVar bucket, rarity, severity rank, and population frequency spread.
6. **Output generation**: Produce a flat TSV, markdown summary, `result.json`, and reproducibility metadata.
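
The consequence-selection step (step 3) might look like the sketch below. It assumes the VEP JSON keys `transcript_consequences`, `regulatory_feature_consequences`, `motif_feature_consequences`, and `intergenic_consequences`, each holding entries with a `consequence_terms` list. `SEVERITY_ORDER` is a deliberately abbreviated subset of Ensembl's consequence ranking, and `most_severe` is a hypothetical function name, so treat this as an illustration rather than the skill's implementation.

```python
from typing import Optional

# Abbreviated severity ranking, most severe first (assumption: a subset
# of Ensembl's full consequence ordering, not the complete list).
SEVERITY_ORDER = [
    "transcript_ablation", "splice_acceptor_variant", "splice_donor_variant",
    "stop_gained", "frameshift_variant", "stop_lost", "start_lost",
    "inframe_insertion", "inframe_deletion", "missense_variant",
    "splice_region_variant", "synonymous_variant", "5_prime_UTR_variant",
    "3_prime_UTR_variant", "intron_variant", "upstream_gene_variant",
    "downstream_gene_variant", "regulatory_region_variant",
    "intergenic_variant",
]
RANK = {term: i for i, term in enumerate(SEVERITY_ORDER)}

CONSEQUENCE_BLOCKS = (
    "transcript_consequences",
    "regulatory_feature_consequences",
    "motif_feature_consequences",
    "intergenic_consequences",
)


def most_severe(vep_record: dict) -> Optional[dict]:
    """Return the entry carrying the most severe term, or None if empty."""
    best = None
    best_rank = len(SEVERITY_ORDER) + 1
    for block in CONSEQUENCE_BLOCKS:
        for entry in vep_record.get(block, []):
            for term in entry.get("consequence_terms", []):
                # Terms missing from the subset rank below every listed term.
                rank = RANK.get(term, len(SEVERITY_ORDER))
                if rank < best_rank:
                    best_rank = rank
                    best = {
                        "consequence": term,
                        "impact": entry.get("impact"),
                        "gene": entry.get("gene_symbol"),
                        "transcript": entry.get("transcript_id"),
                        "block": block,
                    }
    return best


# Hand-written record with placeholder gene and transcript names.
example = {
    "transcript_consequences": [
        {"gene_symbol": "GENE1", "transcript_id": "T1", "impact": "MODIFIER",
         "consequence_terms": ["intron_variant"]},
        {"gene_symbol": "GENE1", "transcript_id": "T2", "impact": "HIGH",
         "consequence_terms": ["stop_gained", "splice_region_variant"]},
    ],
    "regulatory_feature_consequences": [
        {"impact": "MODIFIER", "consequence_terms": ["regulatory_region_variant"]},
    ],
}
picked = most_severe(example)
```

Here `picked` describes the `stop_gained` call on transcript `T2`, since that term outranks every other term in the record.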

**Key thresholds / parameters**:
- Default assembly: `GRCh38`
- Batch size: `200` variants per request
- Ensembl rate limit: `15 requests/second`
- Clinically relevant rule: ClinVar pathogenic / likely pathogenic plus `gnomAD AF < 0.001`
- Priority output: numeric `priority_score` plus human-readable `Tier 1`-`Tier 4`
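
The clinically relevant rule listed above can be expressed as a small predicate. The sketch assumes ClinVar labels have already been reduced to lowercase bucket names such as `pathogenic` and `likely_pathogenic` (hypothetical names), and it assumes a missing allele frequency should not count as rare, in line with the Safety Rules section. The numeric scoring weights behind `priority_score` are not given in this document, so they are not reproduced here.

```python
from typing import Optional

RARE_AF_THRESHOLD = 0.001  # threshold stated in the parameters above
PATHOGENIC_BUCKETS = {"pathogenic", "likely_pathogenic"}  # assumed bucket names


def is_clinically_relevant(clinvar_bucket: Optional[str],
                           gnomad_af: Optional[float]) -> bool:
    """True for ClinVar pathogenic/likely pathogenic variants with AF < 0.001."""
    if clinvar_bucket not in PATHOGENIC_BUCKETS:
        return False
    if gnomad_af is None:
        # Assumption: absent frequency data is not evidence of rarity.
        return False
    return gnomad_af < RARE_AF_THRESHOLD
```

The comparison is strict, so a variant sitting exactly at `0.001` is not flagged.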

## Domain Decisions

- **Reference genome**: Uses GRCh38 as the default genome assembly.
- **Prioritisation**: Ranks each variant by its most severe consequence, because VEP returns several consequences per variant.
- **Annotation backend**: Uses Ensembl VEP REST because it provides consistent transcript consequence, ClinVar, and colocated frequency fields from a single annotation pass.
- **Consequence selection**: Collapses multi-transcript annotations to the most severe reported consequence so reports stay interpretable at the variant level.
- **ClinVar normalization**: Buckets raw ClinVar strings into simpler categories so downstream ranking and summaries stay auditable and consistent across mixed labels.
- **Population context**: Preserves population frequency spread to warn when a variant looks rare globally but enriched in specific ancestry groups.
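
As an illustration of the ClinVar normalization decision, the sketch below collapses raw significance strings into a handful of buckets. The bucket names, the separator set, and the precedence order (conflicting first, then pathogenic, likely pathogenic, uncertain, benign) are all assumptions made for this example; the skill's actual mapping is not shown in this document.

```python
import re
from typing import Iterable


def bucket_clinvar(raw_labels: Iterable[str]) -> str:
    """Collapse raw ClinVar significance strings into one bucket name."""
    normalised = set()
    for label in raw_labels:
        # Split compound labels such as "Pathogenic/Likely pathogenic".
        for part in re.split(r"[/,;|]", label):
            key = part.strip().lower().replace(" ", "_")
            if key:
                normalised.add(key)

    if not normalised:
        return "not_reported"
    if any(key.startswith("conflicting") for key in normalised):
        return "conflicting"

    has_pathogenic = bool(normalised & {"pathogenic", "likely_pathogenic"})
    has_benign = bool(normalised & {"benign", "likely_benign"})
    if has_pathogenic and has_benign:
        return "conflicting"
    if "pathogenic" in normalised:
        return "pathogenic"
    if "likely_pathogenic" in normalised:
        return "likely_pathogenic"
    if "uncertain_significance" in normalised:
        return "uncertain"
    if has_benign:
        return "benign"
    return "other"
```

Keeping the mapping in one pure function makes it easy to audit: every raw label that reaches a report can be traced to exactly one bucket, and an empty input stays `not_reported` instead of being read as benign.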

## Example Queries

- "Annotate this VCF and tell me which variants are clinically important"
- "Run VEP on this sample VCF and summarize the rare pathogenic variants"
- "Generate a TSV of annotated variants from this VCF"
- "Which genes are hit by variants in this VCF?"
- "Annotate the bundled demo VCF"

## Output Structure

```
output_directory/
├── report.md                      # Markdown summary of prioritized findings
├── result.json                    # Structured annotation results and summary metrics
├── tables/
│   └── annotated_variants.tsv     # Flat variant-level annotation table
└── reproducibility/
    └── commands.sh                # Exact command used to generate the report
```

## Dependencies

**Required**:
- Python 3.10+
- `pysam` — VCF parsing
- `requests` — Ensembl REST API access

**Optional / Planned**:
- Local Ensembl `vep` backend — planned future replacement for the REST backend when fully local annotation is needed

## Safety

- **Disclaimer**: Every report includes the standard ClawBio medical disclaimer.
- **Warn before overwrite**: The tool warns before writing files into an existing non-empty output directory.
- **Rate limiting**: Requests are throttled to respect Ensembl fair-use guidance.
- **Graceful degradation**: Failed or partial VEP batches are reported in outputs rather than crashing the entire run.
- **Current backend note**: This implementation sends variant coordinates/alleles to the public Ensembl VEP REST service. A local VEP backend is planned for stricter local-first workflows.
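
The rate-limiting point can be illustrated with a minimal throttle that enforces a minimum gap between requests. `Throttle` is a hypothetical class written for this example, not the skill's own limiter; the injectable `clock` and `sleep` parameters exist only so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, max_per_second: float = 15.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        # Record the release time, including any sleep just performed.
        self._last = now + delay
        return delay


# Typical use: call throttle.wait() immediately before each POST.
throttle = Throttle(max_per_second=15.0)
```

With the default setting the gap between requests is at least 1/15 of a second, which keeps a sequential client at or below the 15 requests per second figure quoted in this document.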

## Safety Rules

- **Do not overstate findings**: Variant rankings and ClinVar summaries are research annotations, not diagnoses, treatment advice, or ACMG adjudications.
- **Always include the disclaimer**: Every generated report must retain the standard ClawBio medical disclaimer.
- **Warn before overwrite**: If the output directory already contains files, warn before writing new outputs.
- **Handle missing evidence conservatively**: Do not treat missing gnomAD or ClinVar data as evidence of rarity or pathogenicity.
- **Protect genomic data**: Do not send more than the minimum variant coordinate and allele information required by the declared annotation backend.

## Agent Boundary

- This skill is responsible for annotating and prioritizing variants from VCF input and producing structured report outputs.
- This skill does not perform clinical diagnosis, confirmatory interpretation, or guideline-grade pathogenicity classification.
- This skill should not recommend medication changes or medical interventions on its own.
- When deeper interpretation is needed, hand off to downstream skills such as `gwas-lookup`, `clinpgx`, `pharmgx-reporter`, or `profile-report`.

## Integration with Bio Orchestrator

**Trigger conditions** — the orchestrator routes here when:
- The user provides a `.vcf` / `.vcf.gz` file and asks for annotation or interpretation.
- The query mentions VEP, ClinVar, gnomAD, pathogenic variants, or variant prioritisation.
- The user wants a ranked list of interesting variants from a VCF.

**Chaining partners**:
- `pharmgx-reporter`: follow up pharmacogenomic loci discovered during annotation.
- `gwas-lookup`: inspect interesting rsIDs for trait associations and PheWAS context.
- `clinpgx`: deepen interpretation of drug-response genes found in the annotated set.
- `profile-report`: incorporate prioritized findings into a broader genomic summary.

## Citations

- [Ensembl Variant Effect Predictor](https://www.ensembl.org/info/docs/tools/vep/index.html) — functional consequence annotation
- [Ensembl REST API](https://rest.ensembl.org/) — batch VEP annotation endpoint used by the current backend
- [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) — clinical significance assertions
- [gnomAD](https://gnomad.broadinstitute.org/) — population allele frequency reference data
- [VCF Specification](https://samtools.github.io/hts-specs/VCFv4.2.pdf) — variant file format reference

Related Skills

clinical-variant-reporter

658
from ClawBio/ClawBio

Classify germline variants from VCF/BCF files according to the ACMG/AMP 2015 28-criteria evidence framework and generate clinical-grade interpretation reports with per-variant evidence audit trails and ACMG SF v3.2 secondary findings screening.

wes-clinical-report-es

658
from ClawBio/ClawBio

Generates professional clinical PDF reports in Spanish from WES (Whole Exome Sequencing) data with clinical interpretation, pharmacogenomic alerts, and follow-up recommendations.

wes-clinical-report-en

658
from ClawBio/ClawBio

Generates professional clinical PDF reports in English from WES (Whole Exome Sequencing) data with clinical interpretation summary, pharmacogenomic alerts, and follow-up recommendations.

vcf-annotator

658
from ClawBio/ClawBio

Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.

ukb-navigator

658
from ClawBio/ClawBio

Semantic search across UK Biobank's 12,000+ data fields and publications — find the right variables for your research question.

target-validation-scorer

658
from ClawBio/ClawBio

Evidence-grounded target validation scoring with GO/NO-GO decisions for drug discovery campaigns

struct-predictor

658
from ClawBio/ClawBio

Protein structure prediction with Boltz-2. Accepts YAML inputs (single protein or multi-chain complex), runs boltz predict, extracts per-residue pLDDT and PAE confidence, and writes a markdown report with figures.

soul2dna

658
from ClawBio/ClawBio

Compile SOUL.md character profiles into synthetic diploid genomes (.genome.json) via trait-to-allele mapping

seq-wrangler

658
from ClawBio/ClawBio

Sequence QC, alignment, and BAM processing. Wraps FastQC, BWA/Bowtie2, SAMtools for automated read-to-BAM pipelines.

scrna-orchestrator

658
from ClawBio/ClawBio

Local Scanpy pipeline for single-cell RNA-seq QC, optional doublet detection, clustering, marker discovery, optional CellTypist annotation, optional latent downstream mode from integrated.h5ad/X_scvi, and optional dataset-level plus within-cluster contrastive marker analysis from raw-count .h5ad or 10x Matrix Market input.

scrna-embedding

658
from ClawBio/ClawBio

Local scVI/scANVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.

rnaseq-de

658
from ClawBio/ClawBio

Differential expression analysis for bulk RNA-seq and pseudo-bulk count matrices with QC, PCA, and contrast testing.