bio-alignment-validation

Validate alignment quality with insert size distribution, proper pairing rates, GC bias, strand balance, and other post-alignment metrics. Use when verifying alignment data quality before variant calling or quantification.

16 stars

bydiegosouzapw

View on GitHub Installation ↓

Best use case

bio-alignment-validation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using bio-alignment-validation should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/bio-alignment-validation/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/bio-alignment-validation/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/bio-alignment-validation/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How bio-alignment-validation Compares

Feature / Agent	bio-alignment-validation	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Alignment Validation

Post-alignment quality control to verify alignment quality and identify issues.

## Insert Size Distribution

Insert size should match library preparation protocol.

### samtools stats

```bash
samtools stats input.bam > stats.txt
grep "^IS" stats.txt | cut -f2,3 > insert_sizes.txt
```

### Picard CollectInsertSizeMetrics

```bash
java -jar picard.jar CollectInsertSizeMetrics \
    I=input.bam \
    O=insert_metrics.txt \
    H=insert_histogram.pdf
```

### Expected Insert Sizes

| Library Type | Expected Size |
|--------------|---------------|
| Standard WGS | 300-500 bp |
| PCR-free | 350-550 bp |
| RNA-seq | 150-300 bp |
| ChIP-seq | 150-300 bp |
| ATAC-seq | Multimodal |

### Python Insert Size Analysis

```python
import pysam
import numpy as np
import matplotlib.pyplot as plt

def get_insert_sizes(bam_file, max_reads=100000):
    sizes = []
    bam = pysam.AlignmentFile(bam_file, 'rb')
    for i, read in enumerate(bam.fetch()):
        if i >= max_reads:
            break
        if read.is_proper_pair and not read.is_secondary and read.template_length > 0:
            sizes.append(read.template_length)
    bam.close()
    return sizes

sizes = get_insert_sizes('sample.bam')
print(f'Median insert size: {np.median(sizes):.0f}')
print(f'Mean insert size: {np.mean(sizes):.0f}')
print(f'Std dev: {np.std(sizes):.0f}')

plt.hist(sizes, bins=100, range=(0, 1000))
plt.xlabel('Insert Size')
plt.ylabel('Count')
plt.savefig('insert_size_dist.pdf')
```

## Proper Pairing Rate

Percentage of reads correctly paired.

### samtools flagstat

```bash
samtools flagstat input.bam

samtools flagstat input.bam | grep "properly paired"
```

### Calculate Pairing Rate

```bash
proper=$(samtools view -c -f 2 input.bam)
mapped=$(samtools view -c -F 4 input.bam)
rate=$(echo "scale=4; $proper / $mapped * 100" | bc)
echo "Proper pairing rate: ${rate}%"
```

### Expected Rates

| Metric | Good | Marginal | Poor |
|--------|------|----------|------|
| Proper pair | > 90% | 80-90% | < 80% |
| Mapped | > 95% | 90-95% | < 90% |
| Singletons | < 5% | 5-10% | > 10% |

## GC Bias

GC content correlation with coverage.

### Picard CollectGcBiasMetrics

```bash
java -jar picard.jar CollectGcBiasMetrics \
    I=input.bam \
    O=gc_bias_metrics.txt \
    CHART=gc_bias_chart.pdf \
    S=gc_summary.txt \
    R=reference.fa
```

### deepTools computeGCBias

```bash
computeGCBias \
    -b input.bam \
    --effectiveGenomeSize 2913022398 \
    -g hg38.2bit \
    -o gc_bias.txt \
    --biasPlot gc_bias.pdf
```

### Interpret GC Bias

| Issue | Symptom |
|-------|---------|
| Under-representation | Low GC coverage drops |
| Over-representation | High GC coverage elevated |
| PCR bias | Strong correlation |

## Strand Balance

Forward and reverse strand should be balanced.

### Calculate Strand Ratio

```bash
forward=$(samtools view -c -F 16 input.bam)
reverse=$(samtools view -c -f 16 input.bam)
echo "Forward: $forward"
echo "Reverse: $reverse"
ratio=$(echo "scale=4; $forward / $reverse" | bc)
echo "F/R ratio: $ratio"
```

### Check Strand Bias per Chromosome

```bash
for chr in chr1 chr2 chr3; do
    fwd=$(samtools view -c -F 16 input.bam $chr)
    rev=$(samtools view -c -f 16 input.bam $chr)
    echo "$chr: F=$fwd R=$rev ratio=$(echo "scale=2; $fwd/$rev" | bc)"
done
```

## Mapping Quality Distribution

### Extract MAPQ Distribution

```bash
samtools view input.bam | cut -f5 | sort -n | uniq -c | sort -k2 -n
```

### Calculate Mean MAPQ

```bash
samtools view input.bam | awk '{sum+=$5; count++} END {print "Mean MAPQ:", sum/count}'
```

### MAPQ Thresholds

| MAPQ | Meaning |
|------|---------|
| 0 | Multi-mapper |
| 1-10 | Low confidence |
| 20-30 | Moderate |
| 40+ | High confidence |
| 60 | Unique (BWA) |

## Chromosome Coverage Balance

### Calculate Per-Chromosome Coverage

```bash
samtools idxstats input.bam | awk '{print $1, $3/$2}' | head -25
```

### Check for Aneuploidy/Contamination

```bash
samtools idxstats input.bam | awk '$3 > 0 {
    sum += $3
    len[$1] = $2
    reads[$1] = $3
} END {
    for (chr in reads) {
        expected = len[chr] / sum * reads[chr]
        ratio = reads[chr] / expected
        if (ratio < 0.8 || ratio > 1.2) print chr, ratio
    }
}'
```

## Mismatch Rate

### Picard CollectAlignmentSummaryMetrics

```bash
java -jar picard.jar CollectAlignmentSummaryMetrics \
    I=input.bam \
    R=reference.fa \
    O=alignment_summary.txt
```

### Key Metrics

| Metric | Description | Good Value |
|--------|-------------|------------|
| PCT_PF_READS_ALIGNED | Mapped % | > 95% |
| PF_MISMATCH_RATE | Mismatches | < 1% |
| PF_INDEL_RATE | Indels | < 0.1% |
| STRAND_BALANCE | Strand ratio | ~0.5 |

## Comprehensive Validation Script

```bash
#!/bin/bash
BAM=$1
REF=$2
NAME=$(basename $BAM .bam)
OUTDIR=${3:-qc}

mkdir -p $OUTDIR

echo "=== Alignment Validation: $NAME ===" | tee $OUTDIR/report.txt

echo -e "\n--- Flagstat ---" | tee -a $OUTDIR/report.txt
samtools flagstat $BAM | tee -a $OUTDIR/report.txt

echo -e "\n--- Mapping Rate ---" | tee -a $OUTDIR/report.txt
mapped=$(samtools view -c -F 4 $BAM)
total=$(samtools view -c $BAM)
rate=$(echo "scale=2; $mapped / $total * 100" | bc)
echo "Mapping rate: ${rate}%" | tee -a $OUTDIR/report.txt

echo -e "\n--- Proper Pairing ---" | tee -a $OUTDIR/report.txt
proper=$(samtools view -c -f 2 $BAM)
pair_rate=$(echo "scale=2; $proper / $mapped * 100" | bc)
echo "Proper pairing: ${pair_rate}%" | tee -a $OUTDIR/report.txt

echo -e "\n--- Insert Size ---" | tee -a $OUTDIR/report.txt
samtools stats $BAM | grep "insert size average" | tee -a $OUTDIR/report.txt

echo -e "\n--- Strand Balance ---" | tee -a $OUTDIR/report.txt
fwd=$(samtools view -c -F 16 $BAM)
rev=$(samtools view -c -f 16 $BAM)
strand_ratio=$(echo "scale=3; $fwd / $rev" | bc)
echo "Forward: $fwd, Reverse: $rev, Ratio: $strand_ratio" | tee -a $OUTDIR/report.txt

echo -e "\n--- Chromosome Coverage ---" | tee -a $OUTDIR/report.txt
samtools idxstats $BAM | head -25 | tee -a $OUTDIR/report.txt

echo -e "\nReport: $OUTDIR/report.txt"
```

## Python Validation Module

```python
import pysam
import numpy as np
from collections import Counter

class AlignmentValidator:
    def __init__(self, bam_file):
        self.bam = pysam.AlignmentFile(bam_file, 'rb')

    def mapping_rate(self):
        stats = self.bam.get_index_statistics()
        mapped = sum(s.mapped for s in stats)
        unmapped = sum(s.unmapped for s in stats)
        return mapped / (mapped + unmapped) * 100

    def proper_pair_rate(self, sample_size=100000):
        proper = 0
        paired = 0
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if read.is_paired:
                paired += 1
                if read.is_proper_pair:
                    proper += 1
        return proper / paired * 100 if paired > 0 else 0

    def mapq_distribution(self, sample_size=100000):
        mapqs = []
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if not read.is_unmapped:
                mapqs.append(read.mapping_quality)
        return Counter(mapqs)

    def strand_balance(self, sample_size=100000):
        forward = 0
        reverse = 0
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if not read.is_unmapped:
                if read.is_reverse:
                    reverse += 1
                else:
                    forward += 1
        return forward / (forward + reverse) if (forward + reverse) > 0 else 0.5

    def report(self):
        print(f'Mapping rate: {self.mapping_rate():.1f}%')
        print(f'Proper pairing: {self.proper_pair_rate():.1f}%')
        print(f'Strand balance: {self.strand_balance():.3f}')

        mapq_dist = self.mapq_distribution()
        high_qual = sum(v for k, v in mapq_dist.items() if k >= 30)
        total = sum(mapq_dist.values())
        print(f'High MAPQ (>=30): {high_qual/total*100:.1f}%')

    def close(self):
        self.bam.close()

validator = AlignmentValidator('sample.bam')
validator.report()
validator.close()
```

## Quality Thresholds Summary

| Metric | Good | Warning | Fail |
|--------|------|---------|------|
| Mapping rate | > 95% | 90-95% | < 90% |
| Proper pairing | > 90% | 80-90% | < 80% |
| Duplicate rate | < 10% | 10-20% | > 20% |
| Strand balance | 0.48-0.52 | 0.45-0.55 | Outside |
| Mean MAPQ | > 40 | 30-40 | < 30 |
| GC bias | < 1.2x | 1.2-1.5x | > 1.5x |

## Related Skills

- bam-statistics - Basic flagstat and depth
- duplicate-handling - Mark/remove duplicates
- alignment-filtering - Filter low-quality reads
- chipseq-qc - ChIP-specific metrics

Related Skills

date-validation

from diegosouzapw/awesome-omni-skill

Use when editing Planning Hubs, timelines, calendars, or any file with day-name + date combinations (Wed Nov 12), relative dates (tomorrow), or countdowns (18 days until) - validates day-of-week accuracy, relative date calculations, and countdown math with two-source ground truth verification before allowing edits

spring-validation

from diegosouzapw/awesome-omni-skill

Bean Validation (Jakarta Validation) with Spring Boot. Custom validators, validation groups, cross-field validation, and internationalized error messages.

fullstack-validation

from diegosouzapw/awesome-omni-skill

Comprehensive validation methodology for multi-component applications including backend, frontend, database, and infrastructure

bio-alignment-io

from diegosouzapw/awesome-omni-skill

Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.

api-validation

from diegosouzapw/awesome-omni-skill

Apply when validating API request inputs: body, query params, path params, and headers. This skill covers Zod v4 patterns.

api-request-validation

from diegosouzapw/awesome-omni-skill

A skill for implementing robust API request validation in Python web frameworks like FastAPI using Pydantic. Covers Pydantic models, custom validators (email, password), field-level and cross-field validation, query/file validation, and structured error responses. Use when you need to validate incoming API requests.

api-contracts-and-zod-validation

from diegosouzapw/awesome-omni-skill

Generate Zod schemas and TypeScript types for forms, API routes, and Server Actions with runtime validation. Use this skill when creating API contracts, validating request/response payloads, generating form schemas, adding input validation to Server Actions or route handlers, or ensuring type safety across client-server boundaries. Trigger terms include zod, schema, validation, API contract, form validation, type inference, runtime validation, parse, safeParse, input validation, request validation, Server Action validation.

api-contracts-and-validation

from diegosouzapw/awesome-omni-skill

Define and validate API contracts using Zod

api-contract-validation

from diegosouzapw/awesome-omni-skill

Detect breaking changes in API contracts (OpenAPI/Swagger specs)

android-playstore-api-validation

from diegosouzapw/awesome-omni-skill

Create and run validation script to test Play Store API connection

agent-ops-validation

from diegosouzapw/awesome-omni-skill

Pre-commit and pre-merge validation checks. Use before committing changes or declaring work complete to ensure all quality gates pass.

bgo

from diegosouzapw/awesome-omni-skill

Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.

Coding & Development