bio-alignment-msa-statistics

Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.

1,802 stars

byFreedomIntelligence

View on GitHub Installation ↓

Best use case

bio-alignment-msa-statistics is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using bio-alignment-msa-statistics should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/bio-alignment-msa-statistics/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/bio-alignment-msa-statistics/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/bio-alignment-msa-statistics/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How bio-alignment-msa-statistics Compares

Feature / Agent	bio-alignment-msa-statistics	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

## Version Compatibility

Reference examples tested with: BioPython 1.83+, numpy 1.26+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# MSA Statistics

Calculate sequence identity, conservation scores, substitution counts, and other alignment metrics.

## Required Import

**Goal:** Load modules for alignment I/O, substitution scoring, and statistical calculations.

**Approach:** Import AlignIO for reading alignments, Counter for column analysis, numpy for matrix operations, and math for entropy calculations.

```python
from Bio import AlignIO
from Bio.Align import substitution_matrices
from collections import Counter
import numpy as np
import math
```

## Pairwise Identity

**"Calculate percent identity"** → Compute the fraction of identical aligned residues between sequence pairs.

**Goal:** Measure sequence similarity as percent identity for individual pairs or across all sequences in an alignment.

**Approach:** Count matching non-gap positions divided by total aligned positions; optionally compute a full N-by-N identity matrix.

### Calculate Identity Between Two Sequences
```python
def pairwise_identity(seq1, seq2):
    matches = sum(a == b and a != '-' for a, b in zip(seq1, seq2))
    aligned_positions = sum(a != '-' or b != '-' for a, b in zip(seq1, seq2))
    return matches / aligned_positions if aligned_positions > 0 else 0

alignment = AlignIO.read('alignment.fasta', 'fasta')
seq1, seq2 = str(alignment[0].seq), str(alignment[1].seq)
identity = pairwise_identity(seq1, seq2)
print(f'Identity: {identity * 100:.1f}%')
```

### Identity Matrix for All Sequences
```python
def identity_matrix(alignment):
    n = len(alignment)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            seq_i = str(alignment[i].seq)
            seq_j = str(alignment[j].seq)
            ident = pairwise_identity(seq_i, seq_j)
            matrix[i, j] = matrix[j, i] = ident
    return matrix

alignment = AlignIO.read('alignment.fasta', 'fasta')
mat = identity_matrix(alignment)
seq_ids = [r.id for r in alignment]
print('Pairwise Identity Matrix:')
print(f'{"":>10}', ' '.join(f'{s[:8]:>8}' for s in seq_ids))
for i, row in enumerate(mat):
    print(f'{seq_ids[i][:10]:>10}', ' '.join(f'{v*100:>7.1f}%' for v in row))
```

## Conservation Score

**Goal:** Quantify per-column and overall alignment conservation to identify conserved and variable regions.

**Approach:** Calculate the fraction of the most common residue at each column, optionally ignoring gaps, and smooth with a sliding window.

### Per-Column Conservation
```python
def column_conservation(alignment, col_idx, ignore_gaps=True):
    column = alignment[:, col_idx]
    if ignore_gaps:
        column = column.replace('-', '')
    if not column:
        return 0.0
    counts = Counter(column)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(column)

alignment = AlignIO.read('alignment.fasta', 'fasta')
for i in range(min(20, alignment.get_alignment_length())):
    cons = column_conservation(alignment, i)
    print(f'Column {i}: {cons*100:.0f}% conserved')
```

### Average Conservation Across Alignment
```python
def average_conservation(alignment, ignore_gaps=True):
    scores = []
    for col_idx in range(alignment.get_alignment_length()):
        scores.append(column_conservation(alignment, col_idx, ignore_gaps))
    return sum(scores) / len(scores)

avg_cons = average_conservation(alignment)
print(f'Average conservation: {avg_cons*100:.1f}%')
```

### Conservation Profile
```python
def conservation_profile(alignment, window=10):
    profile = []
    for i in range(alignment.get_alignment_length()):
        start = max(0, i - window // 2)
        end = min(alignment.get_alignment_length(), i + window // 2)
        scores = [column_conservation(alignment, j) for j in range(start, end)]
        profile.append(sum(scores) / len(scores))
    return profile

profile = conservation_profile(alignment, window=10)
```

## Substitution Counts

**Goal:** Tabulate observed substitution frequencies from the alignment for evolutionary analysis or custom scoring matrices.

**Approach:** Enumerate all pairwise non-gap character comparisons at each column and tally substitution pairs.

### Count Substitutions from Alignment
```python
def substitution_counts(alignment):
    from collections import defaultdict
    counts = defaultdict(int)
    for col_idx in range(alignment.get_alignment_length()):
        column = alignment[:, col_idx]
        chars = [c for c in column if c != '-']
        for i, c1 in enumerate(chars):
            for c2 in chars[i+1:]:
                if c1 != c2:
                    pair = tuple(sorted([c1, c2]))
                    counts[pair] += 1
    return dict(counts)

subs = substitution_counts(alignment)
print('Substitution counts:')
for pair, count in sorted(subs.items(), key=lambda x: -x[1])[:10]:
    print(f'  {pair[0]}<->{pair[1]}: {count}')
```

### Build Substitution Matrix from MSA
```python
def build_substitution_matrix(alignment):
    from collections import defaultdict
    matrix = defaultdict(lambda: defaultdict(int))

    for col_idx in range(alignment.get_alignment_length()):
        column = alignment[:, col_idx]
        chars = [c for c in column if c != '-']
        for c1 in chars:
            for c2 in chars:
                matrix[c1][c2] += 1

    return {k: dict(v) for k, v in matrix.items()}

sub_matrix = build_substitution_matrix(alignment)
```

### Using Alignment.substitutions (Pairwise Alignments)
For pairwise alignments created with `PairwiseAligner`, use the built-in `.substitutions` property:

```python
from Bio.Align import PairwiseAligner

aligner = PairwiseAligner(mode='global', match_score=1, mismatch_score=-1)
alignments = aligner.align(seq1, seq2)
substitutions = alignments[0].substitutions

# Returns Array with substitution counts
print(substitutions)
```

## Information Content

**Goal:** Measure column variability using Shannon entropy and derive information content for identifying functionally important positions.

**Approach:** Compute Shannon entropy from character frequencies per column; information content is max entropy minus observed entropy.

### Shannon Entropy Per Column
```python
import math

def shannon_entropy(column, ignore_gaps=True):
    if ignore_gaps:
        column = column.replace('-', '')
    if not column:
        return 0.0
    counts = Counter(column)
    total = len(column)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

alignment = AlignIO.read('alignment.fasta', 'fasta')
for i in range(min(20, alignment.get_alignment_length())):
    column = alignment[:, i]
    ent = shannon_entropy(column)
    print(f'Column {i}: entropy = {ent:.2f} bits')
```

### Information Content (Max Entropy - Observed Entropy)
```python
def information_content(column, alphabet_size=4):
    max_entropy = math.log2(alphabet_size)  # 4 for DNA, 20 for protein
    observed_entropy = shannon_entropy(column)
    return max_entropy - observed_entropy

# DNA alignment
for i in range(min(20, alignment.get_alignment_length())):
    column = alignment[:, i]
    ic = information_content(column, alphabet_size=4)
    print(f'Column {i}: IC = {ic:.2f} bits')
```

## Gap Statistics

**Goal:** Summarize gap distribution across the alignment to assess alignment quality and identify problematic regions.

**Approach:** Calculate gap fractions per column and aggregate statistics including total gaps, gap-free columns, and gappiest sequence/column.

### Gap Fraction Per Column
```python
def gap_profile(alignment):
    profile = []
    for col_idx in range(alignment.get_alignment_length()):
        column = alignment[:, col_idx]
        gap_fraction = column.count('-') / len(alignment)
        profile.append(gap_fraction)
    return profile

gaps = gap_profile(alignment)
avg_gaps = sum(gaps) / len(gaps)
print(f'Average gap fraction: {avg_gaps*100:.1f}%')
```

### Gap Statistics Summary
```python
def gap_statistics(alignment):
    num_seqs = len(alignment)
    num_cols = alignment.get_alignment_length()

    total_positions = num_seqs * num_cols
    total_gaps = sum(str(r.seq).count('-') for r in alignment)

    gaps_per_seq = [str(r.seq).count('-') for r in alignment]
    gaps_per_col = [alignment[:, i].count('-') for i in range(num_cols)]

    return {
        'total_gaps': total_gaps,
        'gap_fraction': total_gaps / total_positions,
        'gappiest_seq': max(range(num_seqs), key=lambda i: gaps_per_seq[i]),
        'gappiest_col': max(range(num_cols), key=lambda i: gaps_per_col[i]),
        'gap_free_cols': sum(1 for g in gaps_per_col if g == 0),
    }

stats = gap_statistics(alignment)
print(f"Total gaps: {stats['total_gaps']}")
print(f"Gap fraction: {stats['gap_fraction']*100:.1f}%")
print(f"Gap-free columns: {stats['gap_free_cols']}")
```

## Alignment Quality Metrics

**Goal:** Score alignment quality using sum-of-pairs or simple match/mismatch/gap scoring across all columns.

**Approach:** For each column, score all pairwise residue comparisons and sum across the alignment.

### Overall Alignment Score
```python
def alignment_score(alignment, match=1, mismatch=-1, gap=-2):
    total_score = 0
    for col_idx in range(alignment.get_alignment_length()):
        column = alignment[:, col_idx]
        for i, c1 in enumerate(column):
            for c2 in column[i+1:]:
                if c1 == '-' or c2 == '-':
                    total_score += gap
                elif c1 == c2:
                    total_score += match
                else:
                    total_score += mismatch
    return total_score

score = alignment_score(alignment)
print(f'Alignment score: {score}')
```

### Sum of Pairs Score
```python
def sum_of_pairs(alignment, substitution_matrix=None):
    if substitution_matrix is None:
        substitution_matrix = substitution_matrices.load('BLOSUM62')

    total = 0
    for col_idx in range(alignment.get_alignment_length()):
        column = alignment[:, col_idx]
        for i, c1 in enumerate(column):
            for c2 in column[i+1:]:
                if c1 != '-' and c2 != '-':
                    total += substitution_matrix.get((c1, c2), 0)
    return total
```

## Position-Specific Score Matrix (PSSM)

**Goal:** Build a position-specific score matrix (PSSM) from the alignment for motif analysis or sequence scoring.

**Approach:** Count non-gap character frequencies at each column, producing a list of per-position dictionaries.

```python
def position_specific_score_matrix(alignment):
    pssm = []
    for col_idx in range(alignment.get_alignment_length()):
        column = alignment[:, col_idx]
        counts = Counter(column)
        if '-' in counts:
            del counts['-']
        pssm.append(dict(counts))
    return pssm

alignment = AlignIO.read('alignment.fasta', 'fasta')
pssm = position_specific_score_matrix(alignment)
for i, row in enumerate(pssm[:10]):
    print(f'Position {i}: {row}')
```

## Note on Bio.Align.AlignInfo

The `AlignInfo.SummaryInfo` class is **deprecated** in recent Biopython versions. Use the custom functions in this skill instead:
- For PSSM: use `position_specific_score_matrix()` above
- For information content: use `information_content()` function earlier in this skill
- For consensus: see msa-parsing skill

## Quick Reference: Metrics

| Metric | Description | Range |
|--------|-------------|-------|
| Identity | Fraction of identical residues | 0-1 |
| Conservation | Most common residue frequency | 0-1 |
| Shannon Entropy | Variability measure | 0 to log2(alphabet) |
| Information Content | Max entropy - observed entropy | 0 to log2(alphabet) |
| Gap Fraction | Proportion of gaps | 0-1 |

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `ZeroDivisionError` | Empty column after gap removal | Check for gap-only columns |
| `KeyError` | Character not in substitution matrix | Handle gaps separately |
| Negative IC | Wrong alphabet size | Use 4 for DNA, 20 for protein |

## Related Skills

- msa-parsing - Parse and manipulate alignments
- alignment-io - Read/write alignment files
- pairwise-alignment - Create and score pairwise alignments
- sequence-manipulation/sequence-properties - Sequence-level statistics

Related Skills

bio-vcf-statistics

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Generate variant statistics, sample concordance, and quality metrics using bcftools stats and gtcheck. Use when evaluating variant quality, comparing samples, or summarizing VCF contents.

bio-spatial-transcriptomics-spatial-statistics

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Compute spatial statistics for spatial transcriptomics data using Squidpy. Calculate Moran's I, Geary's C, spatial autocorrelation, co-occurrence analysis, and neighborhood enrichment. Use when computing spatial autocorrelation or co-occurrence statistics.

bio-sequence-statistics

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Calculate sequence statistics (N50, length distribution, GC content, summary reports) using Biopython. Use when analyzing sequence datasets, generating QC reports, or comparing assemblies.

bio-methylation-bismark-alignment

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Bisulfite sequencing read alignment using Bismark with bowtie2/hisat2. Handles genome preparation and produces BAM files with methylation information. Use when aligning WGBS, RRBS, or other bisulfite-converted sequencing reads to a reference genome.

bio-longread-alignment

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Align long reads using minimap2 for Oxford Nanopore and PacBio data. Supports various presets for different read types and applications. Use when aligning ONT or PacBio reads to a reference genome for variant calling, SV detection, or coverage analysis.

bio-alignment-pairwise

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.

bio-alignment-msa-parsing

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.

bio-alignment-io

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.

zinc-database

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Access ZINC (230M+ purchasable compounds). Search by ZINC ID/SMILES, similarity searches, 3D-ready structures for docking, analog discovery, for virtual screening and drug discovery.

zarr-python

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.

xlsx

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Use this skill any time a spreadsheet file is the primary input or output. This means any task where the user wants to: open, read, edit, or fix an existing .xlsx, .xlsm, .csv, or .tsv file (e.g., adding columns, computing formulas, formatting, charting, cleaning messy data); create a new spreadsheet from scratch or from other data sources; or convert between tabular file formats. Trigger especially when the user references a spreadsheet file by name or path — even casually (like "the xlsx in my downloads") — and wants something done to it or produced from it. Also trigger for cleaning or restructuring messy tabular data files (malformed rows, misplaced headers, junk data) into proper spreadsheets. The deliverable must be a spreadsheet file. Do NOT trigger when the primary deliverable is a Word document, HTML report, standalone Python script, database pipeline, or Google Sheets API integration, even if tabular data is involved.

writing-skills

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Use when creating new skills, editing existing skills, or verifying skills work before deployment