biopython-alignment

Sequence alignment and alignment-file processing with Biopython (Bio.Align/Bio.AlignIO). Use it when you need global or local pairwise alignment, MSA reading, writing, and format conversion, or alignment statistics and filtering.

by aipoch

Best use case

biopython-alignment is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using biopython-alignment should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/biopython-alignment/SKILL.md --create-dirs "https://raw.githubusercontent.com/aipoch/medical-research-skills/main/scientific-skills/Data%20Analysis/biopython-alignment/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/biopython-alignment/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

SKILL.md Source

> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)

# biopython-alignment

## When to Use

- You need **global alignment** between two protein (or nucleotide) sequences and want a reproducible score and aligned strings.
- You need **local alignment** to find the best matching fragment/subsequence between two DNA/RNA/protein sequences.
- You need to **read, write, or convert** multiple sequence alignment (MSA) files (e.g., FASTA/Clustal/Stockholm) using Biopython I/O.
- You want to compute **alignment statistics** (e.g., identity, coverage, conservation per column) and filter alignments by thresholds.
- You need to apply **substitution matrices** (e.g., BLOSUM62) and tune gap penalties for biologically meaningful scoring.
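
The statistics-and-filtering case above can be sketched as follows. This is a minimal illustration, not part of the skill itself: the helper names `identity_and_coverage` and `passes_thresholds`, the scoring values, and the thresholds are all assumptions, and "coverage" is taken here to mean the fraction of `seq_a` residues placed opposite a residue of `seq_b` (other definitions are equally valid).

```python
from __future__ import annotations

from Bio.Align import PairwiseAligner


def identity_and_coverage(seq_a: str, seq_b: str) -> tuple[float, float]:
    """Return (identity, coverage of seq_a) for one optimal global alignment."""
    aligner = PairwiseAligner()
    aligner.mode = "global"
    # Illustrative nucleotide scoring; tune for real data.
    aligner.match_score = 1.0
    aligner.mismatch_score = -1.0
    aligner.open_gap_score = -2.0
    aligner.extend_gap_score = -0.5
    best = aligner.align(seq_a, seq_b)[0]

    matches = 0
    aligned_cols = 0  # columns where both sequences have a residue (no gap)
    a_blocks, b_blocks = best.aligned
    for (a_start, a_end), (b_start, b_end) in zip(a_blocks, b_blocks):
        aligned_cols += int(a_end - a_start)
        matches += sum(
            x == y for x, y in zip(seq_a[a_start:a_end], seq_b[b_start:b_end])
        )

    identity = matches / aligned_cols if aligned_cols else 0.0
    coverage = aligned_cols / len(seq_a) if seq_a else 0.0
    return identity, coverage


def passes_thresholds(
    seq_a: str, seq_b: str, min_identity: float = 0.8, min_coverage: float = 0.9
) -> bool:
    """Hypothetical filter: keep pairs that meet both thresholds."""
    identity, coverage = identity_and_coverage(seq_a, seq_b)
    return identity >= min_identity and coverage >= min_coverage


query = "ACGTACGT"
candidates = ["ACGTTCGT", "TTTTTTTT"]
kept = [c for c in candidates if passes_thresholds(query, c)]
print(kept)
```

Identity is computed over gap-free columns only, so a short but perfect match can score high identity with low coverage; checking both values together avoids keeping such hits.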

## Key Features

- Pairwise alignment via `Bio.Align.PairwiseAligner` (global and local modes).
- Alignment scoring with configurable match/mismatch and gap penalties.
- Protein substitution matrices via `Bio.Align.substitution_matrices` (e.g., BLOSUM/PAM).
- MSA parsing and serialization via `Bio.AlignIO` (read/write/format conversion).
- Basic alignment statistics: identity, aligned length, coverage, and MSA column conservation.
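
Format conversion with `Bio.AlignIO` might look like the sketch below. It reads a FASTA-formatted MSA from memory, writes it as Clustal, and parses the result back. The helper name `convert_msa` and the toy sequences are assumptions for illustration; for files on disk, `AlignIO.convert(in_file, in_format, out_file, out_format)` does the same in one call.

```python
from io import StringIO

from Bio import AlignIO

# Toy alignment: three sequences of equal length (an MSA requires equal lengths).
FASTA_MSA = ">s1\nACGTACGT\n>s2\nACGTTCGT\n>s3\nACGTACGA\n"


def convert_msa(text: str, in_fmt: str, out_fmt: str) -> str:
    """Parse one alignment from `text` and serialize it in another format."""
    msa = AlignIO.read(StringIO(text), in_fmt)  # exactly one alignment expected
    out = StringIO()
    AlignIO.write(msa, out, out_fmt)
    return out.getvalue()


clustal_text = convert_msa(FASTA_MSA, "fasta", "clustal")
print(clustal_text)

# Round trip: the Clustal text parses back to the same records.
round_trip = AlignIO.read(StringIO(clustal_text), "clustal")
print(len(round_trip), round_trip.get_alignment_length())
```

`AlignIO.read` raises an error if the input holds zero or several alignments; use `AlignIO.parse` when a file may contain more than one.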

## Dependencies

- `biopython>=1.81`
- `numpy>=1.21`

## Example Usage

```python
# -*- coding: utf-8 -*-
"""
Runnable examples for:
1) Global protein alignment
2) Local DNA alignment (best fragment)
3) MSA parsing + column conservation

Requires: biopython, numpy
"""

from __future__ import annotations

from io import StringIO
import numpy as np

from Bio.Align import PairwiseAligner
from Bio.Align import substitution_matrices
from Bio import AlignIO


def global_protein_alignment(seq_a: str, seq_b: str) -> None:
    matrix = substitution_matrices.load("BLOSUM62")

    aligner = PairwiseAligner()
    aligner.mode = "global"
    aligner.substitution_matrix = matrix
    aligner.open_gap_score = -10.0
    aligner.extend_gap_score = -0.5

    alignments = aligner.align(seq_a, seq_b)
    best = alignments[0]

    print("=== Global protein alignment (best) ===")
    print("Score:", best.score)
    print(best)


def local_dna_alignment_best_fragment(seq_a: str, seq_b: str) -> None:
    aligner = PairwiseAligner()
    aligner.mode = "local"
    aligner.match_score = 2.0
    aligner.mismatch_score = -1.0
    aligner.open_gap_score = -2.0
    aligner.extend_gap_score = -0.5

    best = aligner.align(seq_a, seq_b)[0]

    # best.aligned holds one (start, end) pair per ungapped block, per sequence.
    # A gapped alignment has several blocks, so the full local fragment runs
    # from the start of the first block to the end of the last block.
    a_blocks, b_blocks = best.aligned
    a_start, a_end = a_blocks[0][0], a_blocks[-1][1]
    b_start, b_end = b_blocks[0][0], b_blocks[-1][1]

    print("=== Local DNA alignment (best) ===")
    print("Score:", best.score)
    print(best)
    print("Best fragment in seq_a:", seq_a[a_start:a_end], f"(coords {a_start}:{a_end})")
    print("Best fragment in seq_b:", seq_b[b_start:b_end], f"(coords {b_start}:{b_end})")


def msa_column_conservation(fasta_text: str) -> None:
    handle = StringIO(fasta_text)
    msa = AlignIO.read(handle, "fasta")  # MultipleSeqAlignment

    # Convert to a 2D array of characters: shape (n_seqs, aln_len)
    arr = np.array([list(str(rec.seq)) for rec in msa], dtype="U1")
    n_seqs, aln_len = arr.shape

    # Conservation per column: fraction of the most common non-gap character.
    # Treat '-' as gap; ignore gaps when computing the most common residue.
    conservation = []
    for j in range(aln_len):
        col = arr[:, j]
        col = col[col != "-"]
        if col.size == 0:
            conservation.append(0.0)
            continue
        values, counts = np.unique(col, return_counts=True)
        conservation.append(float(counts.max() / counts.sum()))

    print("=== MSA column conservation ===")
    print("n_seqs:", n_seqs, "aln_len:", aln_len)
    print("conservation:", [round(x, 3) for x in conservation])


def main() -> None:
    # 1) Global alignment (protein)
    seq_a = "MKTAYIAKQRQISFVKSHFSRQDILD"
    seq_b = "MKLAYIAKQRQISFVKSHFTRQDILN"
    global_protein_alignment(seq_a, seq_b)

    # 2) Local alignment (DNA)
    seq_a = "ATGCGTACGTTAGC"
    seq_b = "GGGATGCGTACGAAAC"
    local_dna_alignment_best_fragment(seq_a, seq_b)

    # 3) MSA conservation (FASTA)
    fasta_text = ">s1\nACGTACGT\n>s2\nACGTTCGT\n>s3\nACGTACGA\n"
    msa_column_conservation(fasta_text)


if __name__ == "__main__":
    main()
```
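
A local alignment that contains gaps is reported as several ungapped blocks, so the overall fragment runs from the start of the first block to the end of the last one. The sketch below illustrates this with a pair of sequences chosen so that the best local alignment needs a two-base gap. The helper name `local_fragment_span`, the sequences, and the scoring values are assumptions for illustration.

```python
from __future__ import annotations

from Bio.Align import PairwiseAligner


def local_fragment_span(seq_a: str, seq_b: str) -> dict | None:
    """Return score, block count, and full spans of one optimal local alignment."""
    aligner = PairwiseAligner()
    aligner.mode = "local"
    aligner.match_score = 2.0
    aligner.mismatch_score = -1.0
    aligner.open_gap_score = -2.0
    aligner.extend_gap_score = -0.5

    best = aligner.align(seq_a, seq_b)[0]
    a_blocks, b_blocks = best.aligned
    if len(a_blocks) == 0:
        return None  # no residues were aligned

    return {
        "score": best.score,
        "n_blocks": len(a_blocks),
        # First block start to last block end covers gaps in between.
        "a_span": (int(a_blocks[0][0]), int(a_blocks[-1][1])),
        "b_span": (int(b_blocks[0][0]), int(b_blocks[-1][1])),
    }


# seq_a carries two extra bases ("TT") relative to seq_b.
seq_a = "ACGTACGTTTGACCA"
seq_b = "ACGTACGTGACCA"
result = local_fragment_span(seq_a, seq_b)
print(result)
print("fragment in seq_a:", seq_a[result["a_span"][0]:result["a_span"][1]])
print("fragment in seq_b:", seq_b[result["b_span"][0]:result["b_span"][1]])
```

Slicing with only the first block would return just the segment before the gap, which is why the span is taken across all blocks.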

## Implementation Details

- **Pairwise alignment engine**: uses `Bio.Align.PairwiseAligner`, which performs dynamic programming alignment under the selected mode:
  - `mode="global"`: aligns full-length sequences end-to-end.
  - `mode="local"`: finds the highest-scoring matching region (best subsequence pair).
- **Scoring configuration**:
  - For proteins, prefer `substitution_matrix` (e.g., `BLOSUM62`) plus gap penalties (`open_gap_score`, `extend_gap_score`).
  - For nucleotides, a simple scheme is common: `match_score`, `mismatch_score`, and gap penalties.
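- **Selecting the best alignment**: `aligner.align(a, b)` returns a `PairwiseAlignments` object that lazily yields the optimal alignments, all of which share the same best score; `[0]` picks the first of them, and `aligner.score(a, b)` returns that score without building any alignment.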
- **Local “best fragment” extraction**:
  - `alignment.aligned` returns, for each sequence, the `(start, end)` coordinates of every ungapped aligned block, in alignment order.
  - An ungapped local alignment has a single block; a gapped one has several, so slice each original sequence from the start of its first block to the end of its last block to obtain the full fragment.
- **MSA I/O and statistics**:
  - `Bio.AlignIO.read(handle, fmt)` parses an alignment into a `MultipleSeqAlignment`.
  - Column conservation can be computed as:  
    `max_count(non-gap residues in column) / total_non_gap_count(column)`.
- **Operational conventions (recommended)**:
  - Store runtime configuration in `config/task_config.json` and invoke scripts as `python scripts/<task_name>.py`.
  - Avoid stacking many CLI `--` parameters; keep parameters in the config file.
  - Always specify `encoding="utf-8"` for file I/O; for JSON output use `ensure_ascii=False`.
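
The operational conventions above could be applied roughly as follows. This is a sketch under stated assumptions: the config keys (`mode`, `match_score`, `seq_a`, and so on), the helper names, and the output layout are hypothetical, and a temporary directory stands in for `config/task_config.json` so the example runs anywhere.

```python
import json
import tempfile
from pathlib import Path

from Bio.Align import PairwiseAligner

# Hypothetical config schema; the skill does not prescribe these keys.
EXAMPLE_CONFIG = {
    "mode": "global",
    "match_score": 1.0,
    "mismatch_score": -1.0,
    "open_gap_score": -2.0,
    "extend_gap_score": -0.5,
    "seq_a": "ACGTACGT",
    "seq_b": "ACGTTCGT",
}


def load_config(path: Path) -> dict:
    """Read the JSON config with an explicit encoding."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def build_aligner(cfg: dict) -> PairwiseAligner:
    """Configure a PairwiseAligner from the config values."""
    aligner = PairwiseAligner()
    aligner.mode = cfg["mode"]
    aligner.match_score = cfg["match_score"]
    aligner.mismatch_score = cfg["mismatch_score"]
    aligner.open_gap_score = cfg["open_gap_score"]
    aligner.extend_gap_score = cfg["extend_gap_score"]
    return aligner


def run(config_path: Path, output_path: Path) -> dict:
    """Score the configured pair and write the result as UTF-8 JSON."""
    cfg = load_config(config_path)
    score = build_aligner(cfg).score(cfg["seq_a"], cfg["seq_b"])
    result = {"mode": cfg["mode"], "score": score}
    with output_path.open("w", encoding="utf-8") as fh:
        json.dump(result, fh, ensure_ascii=False, indent=2)
    return result


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "task_config.json"
    output_path = Path(tmp) / "result.json"
    config_path.write_text(
        json.dumps(EXAMPLE_CONFIG, ensure_ascii=False), encoding="utf-8"
    )
    result = run(config_path, output_path)
    saved = json.loads(output_path.read_text(encoding="utf-8"))

print(result)
```

Keeping every tunable value in one file means a run can be reproduced by archiving the config alongside the output, with no command-line flags to reconstruct.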
