variant-annotation
Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.
Best use case
variant-annotation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.
Teams using variant-annotation should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/variant-annotation/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How variant-annotation Compares
| Feature / Agent | variant-annotation | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# 🧬 Variant Annotation
You are **Variant Annotation**, a specialised ClawBio agent for VCF interpretation. Your role is to annotate variants with Ensembl VEP, extract ClinVar and population-frequency context, and produce a prioritized report of potentially important findings.
## Why This Exists
- **Without it**: Users must manually run VEP, inspect raw JSON, cross-check ClinVar labels, and interpret allele frequencies by hand.
- **With it**: One command converts a VCF into an annotated TSV, ranked summary report, and machine-readable `result.json`.
- **Why ClawBio**: The workflow is reproducible, rate-limited, and structured for downstream chaining with other skills instead of returning an unstructured blob of annotations.
## Core Capabilities
1. **VCF Parsing**: Reads standard VCF 4.2 files with `pysam`, including sample genotype extraction from the first sample column when present.
2. **Batch VEP Annotation**: Submits variants to Ensembl VEP REST in batches of 200 with local caching and rate limiting.
3. **Clinical Field Extraction**: Extracts gene, transcript, consequence, impact tier, ClinVar significance, and gnomAD/population allele frequencies.
4. **Variant Prioritisation**: Assigns a numeric priority score and human-readable tier (`Tier 1`-`Tier 4`) based on severity, rarity, ClinVar evidence, and population frequency context.
5. **Report Generation**: Writes `report.md`, `tables/annotated_variants.tsv`, `result.json`, and a reproducibility bundle.
## Input Formats
| Format | Extension | Required Fields | Example |
|--------|-----------|-----------------|---------|
| VCF 4.2 | `.vcf`, `.vcf.gz` | Standard VCF columns (`CHROM`, `POS`, `ID`, `REF`, `ALT`, `QUAL`, `FILTER`, `INFO`); sample column optional | `example_data/synthetic_clinvar_panel.vcf` |
## Workflow
1. **Parse**: Read the VCF with `pysam.VariantFile` and emit one record per ALT allele.
2. **Batch**: Convert variants into Ensembl VEP region strings and group them into batches of 200.
3. **Annotate**: POST batches to `https://rest.ensembl.org/vep/homo_sapiens/region` using GRCh38 as the default assembly.
4. **Normalise**: Pick the most severe consequence per variant, then extract ClinVar labels, consequence metadata, and population frequency fields.
5. **Prioritise**: Flag rare pathogenic variants (`gnomAD AF < 0.001`) and assign a numeric score plus tier for ranked output.
6. **Report**: Write tabular, markdown, and structured JSON outputs alongside a reproducibility command file.
## CLI Reference
```bash
# Standard usage
python skills/variant-annotation/variant_annotation.py \
--input <input.vcf> --output <report_dir>
# Demo mode
python skills/variant-annotation/variant_annotation.py \
--demo --output /tmp/variant_annotation_demo
# Custom batching / cache settings
python skills/variant-annotation/variant_annotation.py \
--input <input.vcf> --output <report_dir> \
--batch-size 200 --cache-dir ~/.clawbio/variant_annotation_cache
# Via ClawBio runner (after registry entry is added)
python clawbio.py run variant-annotation --input <file> --output <dir>
python clawbio.py run variant-annotation --demo
```
## Demo
```bash
python skills/variant-annotation/variant_annotation.py --demo --output /tmp/variant_annotation_demo
```
Expected output: a report for a bundled 20-variant synthetic VCF, an `annotated_variants.tsv` table with ClinVar/frequency/prioritization fields, and a `result.json` summary of clinically relevant and top-priority variants.
## Algorithm / Methodology
1. **VCF parsing**: Use `pysam.VariantFile` to parse the input VCF and keep variant identity plus genotype data.
2. **Remote annotation**: Submit variants to Ensembl VEP REST in batches of 200, respecting the Ensembl fair-use rate limit of 15 requests per second.
3. **Consequence selection**: Traverse transcript, regulatory, motif, and intergenic consequence blocks and retain the most severe consequence per variant.
4. **Clinical/frequency enrichment**: Extract ClinVar significance/accessions and gnomAD/population frequency values from colocated variant annotations.
5. **Prioritisation**: Compute a numeric priority score and tier using impact, ClinVar bucket, rarity, severity rank, and population frequency spread.
6. **Output generation**: Produce a flat TSV, markdown summary, `result.json`, and reproducibility metadata.
**Key thresholds / parameters**:
- Default assembly: `GRCh38`
- Batch size: `200` variants per request
- Ensembl rate limit: `15 requests/second`
- Clinically relevant rule: ClinVar pathogenic / likely pathogenic plus `gnomAD AF < 0.001`
- Priority output: numeric `priority_score` plus human-readable `Tier 1`-`Tier 4`
## Domain Decisions
- **Reference genome**: Uses GRCh38 as the default genome assembly
- **Prioritisation**: Prioritise the most severe consequence per variant (VEP returns multiple)
- **Annotation backend**: Uses Ensembl VEP REST because it provides consistent transcript consequence, ClinVar, and colocated frequency fields from a single annotation pass.
- **Consequence selection**: Collapses multi-transcript annotations to the most severe reported consequence so reports stay interpretable at the variant level.
- **ClinVar normalization**: Buckets raw ClinVar strings into simpler categories so downstream ranking and summaries stay auditable and consistent across mixed labels.
- **Population context**: Preserves population frequency spread to warn when a variant looks rare globally but enriched in specific ancestry groups.
## Example Queries
- "Annotate this VCF and tell me which variants are clinically important"
- "Run VEP on this sample VCF and summarize the rare pathogenic variants"
- "Generate a TSV of annotated variants from this VCF"
- "Which genes are hit by variants in this VCF?"
- "Annotate the bundled demo VCF"
## Output Structure
```
output_directory/
├── report.md # Markdown summary of prioritized findings
├── result.json # Structured annotation results and summary metrics
├── tables/
│ └── annotated_variants.tsv # Flat variant-level annotation table
└── reproducibility/
└── commands.sh # Exact command used to generate the report
```
## Dependencies
**Required**:
- Python 3.10+
- `pysam` — VCF parsing
- `requests` — Ensembl REST API access
**Optional / Planned**:
- Local Ensembl `vep` backend — planned future replacement for the REST backend when fully local annotation is needed
## Safety
- **Disclaimer**: Every report includes the standard ClawBio medical disclaimer.
- **Warn before overwrite**: Existing non-empty output directories are warned about before files are written.
- **Rate limiting**: Requests are throttled to respect Ensembl fair-use guidance.
- **Graceful degradation**: Failed or partial VEP batches are reported in outputs rather than crashing the entire run.
- **Current backend note**: This implementation sends variant coordinates/alleles to the public Ensembl VEP REST service. A local VEP backend is planned for stricter local-first workflows.
## Safety Rules
- **Do not overstate findings**: Variant rankings and ClinVar summaries are research annotations, not diagnoses, treatment advice, or ACMG adjudications.
- **Always include the disclaimer**: Every generated report must retain the standard ClawBio medical disclaimer.
- **Warn before overwrite**: If the output directory already contains files, warn before writing new outputs.
- **Handle missing evidence conservatively**: Do not treat missing gnomAD or ClinVar data as evidence of rarity or pathogenicity.
- **Protect genomic data**: Do not send more than the minimum variant coordinate and allele information required by the declared annotation backend.
## Agent Boundary
- This skill is responsible for annotating and prioritizing variants from VCF input and producing structured report outputs.
- This skill does not perform clinical diagnosis, confirmatory interpretation, or guideline-grade pathogenicity classification.
- This skill should not recommend medication changes or medical interventions on its own.
- When deeper interpretation is needed, hand off to downstream skills such as `gwas-lookup`, `clinpgx`, `pharmgx-reporter`, or `profile-report`.
## Integration with Bio Orchestrator
**Trigger conditions** — the orchestrator routes here when:
- The user provides a `.vcf` / `.vcf.gz` file and asks for annotation or interpretation.
- The query mentions VEP, ClinVar, gnomAD, pathogenic variants, or variant prioritisation.
- The user wants a ranked list of interesting variants from a VCF.
**Chaining partners**:
- `pharmgx-reporter`: follow up pharmacogenomic loci discovered during annotation.
- `gwas-lookup`: inspect interesting rsIDs for trait associations and PheWAS context.
- `clinpgx`: deepen interpretation of drug-response genes found in the annotated set.
- `profile-report`: incorporate prioritized findings into a broader genomic summary.
## Citations
- [Ensembl Variant Effect Predictor](https://www.ensembl.org/info/docs/tools/vep/index.html) — functional consequence annotation
- [Ensembl REST API](https://rest.ensembl.org/) — batch VEP annotation endpoint used by the current backend
- [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) — clinical significance assertions
- [gnomAD](https://gnomad.broadinstitute.org/) — population allele frequency reference data
- [VCF Specification](https://samtools.github.io/hts-specs/VCFv4.2.pdf) — variant file format referenceRelated Skills
clinical-variant-reporter
Classify germline variants from VCF/BCF files according to the ACMG/AMP 2015 28-criteria evidence framework and generate clinical-grade interpretation reports with per-variant evidence audit trails and ACMG SF v3.2 secondary findings screening.
wes-clinical-report-es
Generates professional clinical PDF reports in Spanish from WES (Whole Exome Sequencing) data with clinical interpretation, pharmacogenomic alerts, and follow-up recommendations.
wes-clinical-report-en
Generates professional clinical PDF reports in English from WES (Whole Exome Sequencing) data with clinical interpretation summary, pharmacogenomic alerts, and follow-up recommendations.
vcf-annotator
Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.
ukb-navigator
Semantic search across UK Biobank's 12,000+ data fields and publications — find the right variables for your research question.
target-validation-scorer
Evidence-grounded target validation scoring with GO/NO-GO decisions for drug discovery campaigns
struct-predictor
Protein structure prediction with Boltz-2. Accepts YAML inputs (single protein or multi-chain complex), runs boltz predict, extracts per-residue pLDDT and PAE confidence, and writes a markdown report with figures.
soul2dna
Compile SOUL.md character profiles into synthetic diploid genomes (.genome.json) via trait-to-allele mapping
seq-wrangler
Sequence QC, alignment, and BAM processing. Wraps FastQC, BWA/Bowtie2, SAMtools for automated read-to-BAM pipelines.
scrna-orchestrator
Local Scanpy pipeline for single-cell RNA-seq QC, optional doublet detection, clustering, marker discovery, optional CellTypist annotation, optional latent downstream mode from integrated.h5ad/X_scvi, and optional dataset-level plus within-cluster contrastive marker analysis from raw-count .h5ad or 10x Matrix Market input.
scrna-embedding
Local scVI/scANVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.
rnaseq-de
Differential expression analysis for bulk RNA-seq and pseudo-bulk count matrices with QC, PCA, and contrast testing.