bio-variant-normalization

Normalize indel representation and split multiallelic variants using bcftools norm. Use when comparing variants from different callers or preparing VCF for downstream analysis.

1,802 stars

byFreedomIntelligence

View on GitHub Installation ↓

Best use case

bio-variant-normalization is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Normalize indel representation and split multiallelic variants using bcftools norm. Use when comparing variants from different callers or preparing VCF for downstream analysis.

Teams using bio-variant-normalization should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/bio-variant-normalization/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/bio-variant-normalization/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/bio-variant-normalization/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How bio-variant-normalization Compares

Feature / Agent	bio-variant-normalization	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Normalize indel representation and split multiallelic variants using bcftools norm. Use when comparing variants from different callers or preparing VCF for downstream analysis.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

## Version Compatibility

Reference examples tested with: bcftools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Variant Normalization

Left-align indels and split multiallelic sites using bcftools norm.

## Why Normalize?

The same variant can be represented multiple ways:

```
# Same deletion, different representations
chr1  100  ATCG  A      (right-aligned)
chr1  100  ATC   A      (left-aligned, normalized)
chr1  101  TCG   T      (different position)
```

Normalization ensures consistent representation for:
- Comparing variants from different callers
- Database lookups (dbSNP, ClinVar)
- Merging VCF files

## bcftools norm

**Goal:** Left-align indels and check reference allele consistency.

**Approach:** Use bcftools norm with a reference FASTA to shift indels to the leftmost position and optionally fix/exclude REF mismatches.

**"Normalize my VCF before comparing callers"** → Left-align indel representations and split multiallelic sites for consistent variant comparison.

### Left-Align Indels

```bash
bcftools norm -f reference.fa input.vcf.gz -Oz -o normalized.vcf.gz
```

Requires reference FASTA to determine left-most representation.

### Check for Normalization Issues

```bash
bcftools norm -f reference.fa -c s input.vcf.gz > /dev/null
# Reports REF allele mismatches
```

Check modes (`-c`):
- `w` - Warn on mismatch (default)
- `e` - Error on mismatch
- `x` - Exclude mismatches
- `s` - Set correct REF from reference

## Multiallelic Sites

**Goal:** Convert multiallelic sites to biallelic records or vice versa.

**Approach:** Use bcftools norm -m flags to split (decompose) or join (merge) multiallelic records.

### Split Multiallelic to Biallelic

```bash
bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz
```

Before:
```
chr1  100  .  A  G,T  30  PASS  .  GT  1/2
```

After:
```
chr1  100  .  A  G  30  PASS  .  GT  1/0
chr1  100  .  A  T  30  PASS  .  GT  0/1
```

### Split SNPs Only

```bash
bcftools norm -m-snps input.vcf.gz -Oz -o split_snps.vcf.gz
```

### Split Indels Only

```bash
bcftools norm -m-indels input.vcf.gz -Oz -o split_indels.vcf.gz
```

### Join Biallelic to Multiallelic

```bash
bcftools norm -m+any input.vcf.gz -Oz -o merged.vcf.gz
```

## Split Options

| Option | Description |
|--------|-------------|
| `-m-any` | Split all multiallelic sites |
| `-m-snps` | Split multiallelic SNPs only |
| `-m-indels` | Split multiallelic indels only |
| `-m-both` | Split SNPs and indels separately |
| `-m+any` | Join biallelic sites into multiallelic |
| `-m+snps` | Join biallelic SNPs |
| `-m+indels` | Join biallelic indels |
| `-m+both` | Join SNPs and indels separately |

## Combined Normalization

**Goal:** Left-align indels and split multiallelic sites in a single pass.

**Approach:** Combine -f (reference) and -m-any (split) flags in one bcftools norm invocation.

### Standard Normalization Pipeline

```bash
bcftools norm -f reference.fa -m-any input.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz
```

This:
1. Left-aligns indels
2. Splits multiallelic sites

### Remove Duplicates After Splitting

```bash
bcftools norm -f reference.fa -m-any -d exact input.vcf.gz -Oz -o normalized.vcf.gz
```

Duplicate removal options (`-d`):
- `exact` - Remove exact duplicates
- `snps` - Remove duplicate SNPs
- `indels` - Remove duplicate indels
- `both` - Remove duplicate SNPs and indels
- `all` - Remove all duplicates
- `none` - Keep duplicates (default)

## Fixing Reference Alleles

**Goal:** Correct or remove variants whose REF allele does not match the reference genome.

**Approach:** Use bcftools norm -c with mode s (set correct REF) or x (exclude mismatches).

### Fix Mismatches from Reference

```bash
bcftools norm -f reference.fa -c s input.vcf.gz -Oz -o fixed.vcf.gz
```

This sets REF alleles to match the reference genome.

### Exclude Mismatches

```bash
bcftools norm -f reference.fa -c x input.vcf.gz -Oz -o clean.vcf.gz
```

Removes variants where REF doesn't match reference.

## Atomize Complex Variants

**Goal:** Decompose multi-nucleotide polymorphisms (MNPs) into individual SNP records.

**Approach:** Use bcftools norm --atomize to break complex substitutions into atomic single-base changes.

### Split MNPs to SNPs

```bash
bcftools norm --atomize input.vcf.gz -Oz -o atomized.vcf.gz
```

Before:
```
chr1  100  .  ATG  GCA  30  PASS
```

After:
```
chr1  100  .  A  G  30  PASS
chr1  101  .  T  C  30  PASS
chr1  102  .  G  A  30  PASS
```

### Atomize and Left-Align

```bash
bcftools norm -f reference.fa --atomize input.vcf.gz -Oz -o atomized.vcf.gz
```

## Old to New Format

### Update VCF Version

```bash
bcftools norm --old-rec-tag OLD input.vcf.gz -Oz -o updated.vcf.gz
```

Tags original record for reference.

## Common Workflows

**Goal:** Apply normalization as a preprocessing step for downstream analyses.

**Approach:** Normalize both VCFs identically before comparison, annotation, or GWAS preparation.

### Before Comparing Callers

```bash
# Normalize both VCFs the same way
for vcf in caller1.vcf.gz caller2.vcf.gz; do
    base=$(basename "$vcf" .vcf.gz)
    bcftools norm -f reference.fa -m-any "$vcf" -Oz -o "${base}.norm.vcf.gz"
    bcftools index "${base}.norm.vcf.gz"
done

# Now compare
bcftools isec -p comparison caller1.norm.vcf.gz caller2.norm.vcf.gz
```

### Before Database Annotation

```bash
bcftools norm -f reference.fa -m-any variants.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz
# Now annotate against dbSNP, ClinVar, etc.
```

### Prepare for GWAS

```bash
bcftools norm -f reference.fa -m-any -d exact input.vcf.gz | \
    bcftools view -v snps -Oz -o gwas_ready.vcf.gz
bcftools index gwas_ready.vcf.gz
```

## cyvcf2 Normalization Check

**Goal:** Assess how many variants require normalization before running bcftools norm.

**Approach:** Iterate with cyvcf2 and count multiallelic sites and complex (MNP) variants.

### Check if Variants Need Normalization

```python
from cyvcf2 import VCF

def needs_normalization(variant):
    # Check for multiallelic
    if len(variant.ALT) > 1:
        return True

    # Check for complex variants (potential MNPs)
    ref, alt = variant.REF, variant.ALT[0]
    if len(ref) > 1 and len(alt) > 1 and len(ref) == len(alt):
        return True

    return False

count = 0
for variant in VCF('input.vcf.gz'):
    if needs_normalization(variant):
        count += 1

print(f'Variants needing normalization: {count}')
```

### Count Multiallelic Sites

```python
from cyvcf2 import VCF

multiallelic = 0
total = 0

for variant in VCF('input.vcf.gz'):
    total += 1
    if len(variant.ALT) > 1:
        multiallelic += 1

print(f'Total variants: {total}')
print(f'Multiallelic sites: {multiallelic}')
print(f'Percentage: {multiallelic/total*100:.1f}%')
```

## Quick Reference

| Task | Command |
|------|---------|
| Left-align indels | `bcftools norm -f ref.fa in.vcf.gz` |
| Split multiallelic | `bcftools norm -m-any in.vcf.gz` |
| Join to multiallelic | `bcftools norm -m+any in.vcf.gz` |
| Full normalization | `bcftools norm -f ref.fa -m-any in.vcf.gz` |
| Fix REF alleles | `bcftools norm -f ref.fa -c s in.vcf.gz` |
| Remove duplicates | `bcftools norm -d exact in.vcf.gz` |
| Atomize MNPs | `bcftools norm --atomize in.vcf.gz` |

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `REF does not match` | Wrong reference | Use same reference as caller |
| `not sorted` | Unsorted input | Run `bcftools sort` first |
| `duplicate records` | Same position twice | Use `-d` to remove |

## Related Skills

- variant-calling - Generate VCF files
- filtering-best-practices - Filter after normalization
- vcf-manipulation - Compare normalized VCFs
- variant-annotation - Annotate normalized variants

Related Skills

tooluniverse-variant-interpretation

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Systematic clinical variant interpretation from raw variant calls to ACMG-classified recommendations with structural impact analysis. Aggregates evidence from ClinVar, gnomAD, CIViC, UniProt, and PDB across ACMG criteria. Produces pathogenicity scores (0-100), clinical recommendations, and treatment implications. Use when interpreting genetic variants, classifying variants of uncertain significance (VUS), performing ACMG variant classification, or translating variant calls to clinical actionability.

tooluniverse-variant-analysis

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Production-ready VCF processing, variant annotation, mutation analysis, and structural variant (SV/CNV) interpretation for bioinformatics questions. Parses VCF files (streaming, large files), classifies mutation types (missense, nonsense, synonymous, frameshift, splice, intronic, intergenic) and structural variants (deletions, duplications, inversions, translocations), applies VAF/depth/quality/consequence filters, annotates with ClinVar/dbSNP/gnomAD/CADD via ToolUniverse, interprets SV/CNV clinical significance using ClinGen dosage sensitivity scores, computes variant statistics, and generates reports. Solves questions like "What fraction of variants with VAF < 0.3 are missense?", "How many non-reference variants remain after filtering intronic/intergenic?", "What is the pathogenicity of this deletion affecting BRCA1?", or "Which dosage-sensitive genes overlap this CNV?". Use when processing VCF files, annotating variants, filtering by VAF/depth/consequence, classifying mutations, interpreting structural variants, assessing CNV pathogenicity, comparing cohorts, or answering variant analysis questions.

tooluniverse-structural-variant-analysis

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Comprehensive structural variant (SV) analysis skill for clinical genomics. Classifies SVs (deletions, duplications, inversions, translocations), assesses pathogenicity using ACMG-adapted criteria, evaluates gene disruption and dosage sensitivity, and provides clinical interpretation with evidence grading. Use when analyzing CNVs, large deletions/duplications, chromosomal rearrangements, or any structural variants requiring clinical interpretation.

tooluniverse-cancer-variant-interpretation

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Provide comprehensive clinical interpretation of somatic mutations in cancer. Given a gene symbol + variant (e.g., EGFR L858R, BRAF V600E) and optional cancer type, performs multi-database analysis covering clinical evidence (CIViC), mutation prevalence (cBioPortal), therapeutic associations (OpenTargets, ChEMBL, FDA), resistance mechanisms, clinical trials, prognostic impact, and pathway context. Generates an evidence-graded markdown report with actionable recommendations for precision oncology. Use when oncologists, molecular tumor boards, or researchers ask about treatment options for specific cancer mutations, resistance mechanisms, or clinical trial matching.

bio-variant-calling

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Call SNPs and indels from aligned reads using bcftools mpileup and call. Use when detecting variants from BAM files or generating VCF from alignments.

bio-variant-calling-structural-variant-calling

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Call structural variants (SVs) from short-read sequencing using Manta, Delly, and LUMPY. Detects deletions, insertions, inversions, duplications, and translocations that are too large for standard SNV callers. Use when detecting structural variants from short-read data.

bio-variant-calling-joint-calling

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Joint genotype calling across multiple samples using GATK CombineGVCFs and GenotypeGVCFs. Essential for cohort studies, population genetics, and leveraging VQSR. Use when performing joint genotyping across multiple samples.

bio-variant-calling-filtering-best-practices

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Comprehensive variant filtering including GATK VQSR, hard filters, bcftools expressions, and quality metric interpretation for SNPs and indels. Use when filtering variants using GATK best practices.

bio-variant-calling-deepvariant

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Deep learning-based variant calling with Google DeepVariant. Provides high accuracy for germline SNPs and indels from Illumina, PacBio, and ONT data. Use when calling variants with DeepVariant deep learning caller.

bio-variant-calling-clinical-interpretation

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Clinical variant interpretation using ClinVar, ACMG guidelines, and pathogenicity predictors. Prioritize variants for diagnostic and research applications. Use when interpreting clinical significance of variants.

bio-variant-annotation

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Comprehensive variant annotation using bcftools annotate/csq, VEP, SnpEff, and ANNOVAR. Add database annotations, predict functional consequences, and assess clinical significance. Use when annotating variants with functional and clinical information.

bio-metabolomics-normalization-qc

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Quality control and normalization for metabolomics data. Covers QC-based correction, batch effect removal, and data transformation methods. Use when correcting technical variation in metabolomics data before statistical analysis.