phylogenetics
Build and analyze phylogenetic trees using MAFFT (multiple alignment), IQ-TREE 2 (maximum likelihood), and FastTree (fast NJ/ML). Visualize with ETE3 or FigTree. For evolutionary analysis, microbial genomics, viral phylodynamics, protein family analysis, and molecular clock studies.
Best use case
phylogenetics is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using phylogenetics should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/phylogenetics/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How phylogenetics Compares
| Feature / Agent | phylogenetics | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Build and analyze phylogenetic trees using MAFFT (multiple alignment), IQ-TREE 2 (maximum likelihood), and FastTree (fast NJ/ML). Visualize with ETE3 or FigTree. For evolutionary analysis, microbial genomics, viral phylodynamics, protein family analysis, and molecular clock studies.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Phylogenetics
## Overview
Phylogenetic analysis reconstructs the evolutionary history of biological sequences (genes, proteins, genomes) by inferring the branching pattern of descent. This skill covers the standard pipeline:
1. **MAFFT** — Multiple sequence alignment
2. **IQ-TREE 2** — Maximum likelihood tree inference with model selection
3. **FastTree** — Fast approximate maximum likelihood (for large datasets)
4. **ETE3** — Python library for tree manipulation and visualization
**Installation:**
```bash
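# Conda (recommended for CLI tools); trimal is needed for the trimming step below
conda install -c conda-forge -c bioconda mafft iqtree fasttree trimal
# ETE3 rendering (TreeStyle, Tree.render) additionally requires PyQt5
pip install ete3 PyQt5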
```
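Before running the pipeline it can help to confirm that the command-line tools are actually on `PATH`. The sketch below is one way to do that with the standard library; the executable names are assumptions taken from the commands used in this document, and `find_missing_tools` is a hypothetical helper, not part of any of these packages.

```python
import shutil

# Executable names assumed from the commands in this document; adjust them if
# your installation differs (for example `iqtree` instead of `iqtree2`).
REQUIRED_TOOLS = ["mafft", "iqtree2", "FastTree", "trimal"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All command-line tools found.")
```

An empty list means every tool was found; anything else names the executables still to be installed.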
## When to Use This Skill
Use phylogenetics when:
- **Evolutionary relationships**: Which organism/gene is most closely related to my sequence?
- **Viral phylodynamics**: Trace outbreak spread and estimate transmission dates
- **Protein family analysis**: Infer evolutionary relationships within a gene family
- **Horizontal gene transfer detection**: Identify genes with discordant species/gene trees
- **Ancestral sequence reconstruction**: Infer ancestral protein sequences
- **Molecular clock analysis**: Estimate divergence dates using temporal sampling
- **GWAS companion**: Place variants in evolutionary context (e.g., SARS-CoV-2 variants)
- **Microbiology**: Species phylogeny from 16S rRNA or core genome phylogeny
## Standard Workflow
### 1. Multiple Sequence Alignment with MAFFT
```python
import subprocess
import os


def run_mafft(input_fasta: str, output_fasta: str, method: str = "auto",
              n_threads: int = 4) -> str:
    """
    Align sequences with MAFFT.

    Args:
        input_fasta: Path to unaligned FASTA file
        output_fasta: Path for aligned output
        method: 'auto' (auto-select), 'einsi' (accurate), 'linsi' (accurate, slow),
            'fftnsi' (medium), 'fftns' (fast), 'retree2' (fast).
            Unknown values fall back to 'auto'.
        n_threads: Number of CPU threads

    Returns:
        Path to aligned FASTA file
    """
    # MAFFT has no --fftnsi/--fftns flags; those strategies are selected
    # through --retree and --maxiterate (see the MAFFT manual).
    methods = {
        "auto": ["--auto"],
        "einsi": ["--genafpair", "--maxiterate", "1000"],
        "linsi": ["--localpair", "--maxiterate", "1000"],
        "fftnsi": ["--retree", "2", "--maxiterate", "2"],
        "fftns": ["--retree", "2", "--maxiterate", "0"],
        "retree2": ["--retree", "2"],
    }
    # Build a fresh list so the entries in `methods` are never mutated
    cmd = ["mafft", *methods.get(method, methods["auto"])]
    cmd += ["--thread", str(n_threads), "--inputorder", input_fasta]

    # MAFFT writes the alignment to stdout and progress messages to stderr
    with open(output_fasta, "w") as out:
        result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"MAFFT failed:\n{result.stderr}")

    # Count aligned sequences
    with open(output_fasta) as f:
        n_seqs = sum(1 for line in f if line.startswith(">"))
    print(f"MAFFT: aligned {n_seqs} sequences → {output_fasta}")
    return output_fasta


# MAFFT method selection guide:
# Few sequences (<200), accurate: linsi or einsi
# Many sequences (<1000), moderate: fftnsi
# Large datasets (>1000): fftns or auto
# Ultra-fast (>10000): mafft --retree 1
```
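The selection guide in the comments above can be turned into a small helper that picks a `method` value from the number of sequences. This is only a sketch of that rule of thumb: `count_fasta_records` and `pick_mafft_method` are hypothetical names, and the thresholds are simply the ones quoted in the guide.

```python
def count_fasta_records(path: str) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def pick_mafft_method(n_seqs: int) -> str:
    """Map a sequence count to a run_mafft() method name.

    Thresholds follow the guide above: <200 -> linsi, <1000 -> fftnsi,
    otherwise fftns.
    """
    if n_seqs < 200:
        return "linsi"
    if n_seqs < 1000:
        return "fftnsi"
    return "fftns"
```

A call such as `run_mafft(path, out, method=pick_mafft_method(count_fasta_records(path)))` would then choose the strategy automatically; passing `method="auto"` leaves the same decision to MAFFT itself.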
### 2. Trim Alignment (Optional but Recommended)
```python
import shutil
import subprocess


def trim_alignment_trimal(aligned_fasta: str, output_fasta: str,
                          method: str = "automated1") -> str:
    """
    Trim poorly aligned columns with TrimAl.

    Methods:
        - 'automated1': Automatic heuristic (recommended)
        - 'gappyout': Remove gappy columns
        - 'strict': Strict gap threshold
    """
    cmd = ["trimal", f"-{method}", "-in", aligned_fasta, "-out", output_fasta, "-fasta"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"TrimAl warning: {result.stderr}")
        # Fall back to using the untrimmed alignment
        shutil.copy(aligned_fasta, output_fasta)
    return output_fasta
```
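To get a feel for what trimming removes, it can be useful to look at the gap content of each alignment column. The following sketch computes per-column gap fractions in plain Python. It is a diagnostic under simple assumptions (gaps are written as `-`, all sequences have equal length) and is not TrimAl's algorithm; the function names and the 0.5 cutoff are illustrative choices.

```python
def column_gap_fractions(sequences):
    """Return the fraction of '-' characters in each alignment column."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    n = len(sequences)
    return [sum(seq[i] == "-" for seq in sequences) / n for i in range(length)]


def gappy_columns(sequences, max_gap_fraction=0.5):
    """Return 0-based indices of columns whose gap fraction exceeds the cutoff."""
    fractions = column_gap_fractions(sequences)
    return [i for i, frac in enumerate(fractions) if frac > max_gap_fraction]


# Toy alignment: only column 3 is gapped in more than half of the sequences
alignment = ["ATG-CA", "AT--CA", "ATGTCA", "A---CA"]
print(gappy_columns(alignment))  # → [3]
```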
### 3. IQ-TREE 2 — Maximum Likelihood Tree
```python
import os
import subprocess
from typing import Optional, Union


def run_iqtree(aligned_fasta: str, output_prefix: str,
               model: str = "TEST", bootstrap: int = 1000,
               n_threads: Union[int, str] = 4,
               extra_args: Optional[list] = None) -> dict:
    """
    Build a maximum likelihood tree with IQ-TREE 2.

    Args:
        aligned_fasta: Aligned FASTA file
        output_prefix: Prefix for output files
        model: 'TEST' for automatic model selection, or specify (e.g., 'GTR+G' for DNA,
            'LG+G4' for proteins, 'JTT+G' for proteins)
        bootstrap: Number of ultrafast bootstrap replicates (1000 recommended)
        n_threads: Number of threads, or the string 'AUTO' to auto-detect
        extra_args: Additional IQ-TREE arguments

    Returns:
        Dict with paths to output files
    """
    cmd = [
        "iqtree2",
        "-s", aligned_fasta,
        "--prefix", output_prefix,
        "-m", model,
        "-B", str(bootstrap),   # Ultrafast bootstrap
        "-T", str(n_threads),
        "--redo",               # Overwrite existing results
    ]
    if extra_args:
        cmd.extend(extra_args)

    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # IQ-TREE reports many errors on stdout, so fall back to it
        raise RuntimeError(f"IQ-TREE failed:\n{result.stderr or result.stdout}")

    # Print model selection result
    log_file = f"{output_prefix}.log"
    if os.path.exists(log_file):
        with open(log_file) as f:
            for line in f:
                if "Best-fit model" in line:
                    print(f"IQ-TREE: {line.strip()}")

    output_files = {
        "tree": f"{output_prefix}.treefile",
        "log": f"{output_prefix}.log",
        "iqtree": f"{output_prefix}.iqtree",  # Full report
        "model": f"{output_prefix}.model.gz",
    }
    print(f"IQ-TREE: Tree saved to {output_files['tree']}")
    return output_files


# IQ-TREE model selection guide:
# DNA: TEST → GTR+G, HKY+G, TrN+G
# Protein: TEST → LG+G4, WAG+G, JTT+G, Q.pfam+G
# Codon: TEST → MG+F3X4
# For temporal (molecular clock) analysis, add:
# extra_args = ["--date", "dates.txt", "--date-ci", "100"]
# (--date-ci takes a number of replicates for the confidence interval, not a percentage)
```
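The `--date` option mentioned above expects a plain-text file with one taxon name and its sampling date per line. The sketch below writes such a file from FASTA headers. It assumes a hypothetical naming convention in which each sequence name ends in `|YYYY-MM-DD`; both helper functions are illustrative, and the names written must match the taxon names IQ-TREE reads from the alignment.

```python
import re

# Assumed convention (not required by IQ-TREE): names end in "|YYYY-MM-DD"
DATE_SUFFIX = re.compile(r"\|(\d{4}-\d{2}-\d{2})$")


def dates_from_headers(headers):
    """Map sequence name -> date for headers whose name ends in '|YYYY-MM-DD'."""
    dates = {}
    for header in headers:
        parts = header.lstrip(">").split()
        if not parts:
            continue  # skip empty header lines
        name = parts[0]  # the name is the text up to the first whitespace
        match = DATE_SUFFIX.search(name)
        if match:
            dates[name] = match.group(1)
    return dates


def write_date_file(dates, path):
    """Write one 'name<TAB>date' line per taxon."""
    with open(path, "w") as out:
        for name, date in dates.items():
            out.write(f"{name}\t{date}\n")
```

Sequences without a recognizable date are skipped rather than guessed, so it is worth comparing the number of lines written against the number of sequences in the alignment.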
### 4. FastTree — Fast Approximate ML
For large datasets (thousands of sequences) where IQ-TREE is too slow:
```python
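import subprocess


def run_fasttree(aligned_fasta: str, output_tree: str,
                 sequence_type: str = "nt", model: str = "gtr",
                 n_threads: int = 4) -> str:
    """
    Build a fast approximate ML tree with FastTree.

    Args:
        aligned_fasta: Aligned FASTA file
        output_tree: Path for the Newick output
        sequence_type: 'nt' for nucleotide or 'aa' for amino acid
        model: For nt: 'gtr' (recommended) or 'jc'; for aa: 'lg', 'wag', 'jtt'
        n_threads: Unused; the standard FastTree binary is single-threaded
    """
    cmd = ["FastTree"]
    if sequence_type == "nt":
        cmd.append("-nt")
        if model == "gtr":
            cmd.append("-gtr")  # Without -gtr, FastTree uses Jukes-Cantor
    elif model in ("lg", "wag"):
        cmd.append(f"-{model}")  # JTT is the protein default and takes no flag
    cmd.append(aligned_fasta)

    # FastTree writes the Newick tree to stdout and its log to stderr
    with open(output_tree, "w") as out:
        result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"FastTree failed:\n{result.stderr}")
    print(f"FastTree: Tree saved to {output_tree}")
    return output_tree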
```
### 5. Tree Analysis and Visualization with ETE3
```python
from itertools import combinations

from ete3 import Tree, TreeStyle, NodeStyle


def load_tree(tree_file: str) -> Tree:
    """Load a Newick tree file."""
    t = Tree(tree_file)
    # len(t) is the number of leaves in ETE3
    print(f"Tree: {len(t)} leaves, {len(list(t.traverse()))} nodes")
    return t


def basic_tree_stats(t: Tree) -> dict:
    """Compute basic tree statistics."""
    leaves = t.get_leaves()
    # Pairwise distances are computed on at most the first 50 leaves to keep
    # this fast, so the max/mean values are estimates on large trees.
    sample = leaves[:50]
    distances = [t.get_distance(l1, l2) for l1, l2 in combinations(sample, 2)]
    stats = {
        "n_leaves": len(leaves),
        "n_internal_nodes": sum(1 for n in t.traverse() if not n.is_leaf()),
        # Skip the root, which has no branch above it
        "total_branch_length": sum(n.dist for n in t.traverse() if not n.is_root()),
        "max_leaf_distance": max(distances) if distances else 0,
        "mean_leaf_distance": sum(distances) / len(distances) if distances else 0,
    }
    return stats


def find_mrca(t: Tree, leaf_names: list) -> Tree:
    """Find the most recent common ancestor of a set of leaves."""
    return t.get_common_ancestor(*leaf_names)


def visualize_tree(t: Tree, output_file: str = "tree.png",
                   show_branch_support: bool = True,
                   color_groups: dict = None,
                   width: int = 800) -> None:
    """
    Render phylogenetic tree to image.

    Args:
        t: ETE3 Tree object
        output_file: Image path; the extension selects the format
        show_branch_support: Show bootstrap values
        color_groups: Dict mapping leaf_name → color (for coloring taxa)
        width: Image width in pixels
    """
    ts = TreeStyle()
    ts.show_leaf_name = True
    ts.show_branch_support = show_branch_support
    ts.mode = "r"  # 'r' = rectangular, 'c' = circular

    if color_groups:
        for node in t.traverse():
            if node.is_leaf() and node.name in color_groups:
                nstyle = NodeStyle()
                nstyle["fgcolor"] = color_groups[node.name]
                nstyle["size"] = 8
                node.set_style(nstyle)

    t.render(output_file, tree_style=ts, w=width, units="px")
    print(f"Tree saved to: {output_file}")


def midpoint_root(t: Tree) -> Tree:
    """Root tree at midpoint (use when outgroup unknown)."""
    t.set_outgroup(t.get_midpoint_outgroup())
    return t


def prune_tree(t: Tree, keep_leaves: list) -> Tree:
    """Prune tree to keep only specified leaves."""
    t.prune(keep_leaves, preserve_branch_length=True)
    return t
```
### 6. Complete Analysis Script
```python
import os

from ete3 import Tree

# Relies on run_mafft, trim_alignment_trimal, run_iqtree, run_fasttree,
# midpoint_root, basic_tree_stats and visualize_tree defined in the sections above.


def full_phylogenetic_analysis(
    input_fasta: str,
    output_dir: str = "phylo_results",
    sequence_type: str = "nt",
    n_threads: int = 4,
    bootstrap: int = 1000,
    use_fasttree: bool = False
) -> dict:
    """
    Complete phylogenetic pipeline: align → trim → tree → visualize.

    Args:
        input_fasta: Unaligned FASTA
        output_dir: Directory for all output files
        sequence_type: 'nt' (nucleotide) or 'aa' (amino acid/protein)
        n_threads: Threads for MAFFT and IQ-TREE
        bootstrap: Ultrafast bootstrap replicates (IQ-TREE only)
        use_fasttree: Use FastTree instead of IQ-TREE (faster for large datasets)
    """
    os.makedirs(output_dir, exist_ok=True)
    prefix = os.path.join(output_dir, "phylo")

    print("=" * 50)
    print("Step 1: Multiple Sequence Alignment (MAFFT)")
    aligned = run_mafft(input_fasta, f"{prefix}_aligned.fasta",
                        method="auto", n_threads=n_threads)

    print("\nStep 2: Alignment Trimming (TrimAl)")
    trimmed = trim_alignment_trimal(aligned, f"{prefix}_trimmed.fasta")

    print("\nStep 3: Tree Inference")
    if use_fasttree:
        tree_file = run_fasttree(
            trimmed, f"{prefix}.tree",
            sequence_type=sequence_type,
            model="gtr" if sequence_type == "nt" else "lg"
        )
    else:
        # 'TEST' lets IQ-TREE select the model for both DNA and protein data
        iqtree_files = run_iqtree(
            trimmed, prefix,
            model="TEST",
            bootstrap=bootstrap,
            n_threads=n_threads
        )
        tree_file = iqtree_files["tree"]

    print("\nStep 4: Tree Analysis")
    t = Tree(tree_file)
    t = midpoint_root(t)
    stats = basic_tree_stats(t)
    print(f"Tree statistics: {stats}")

    print("\nStep 5: Visualization")
    visualize_tree(t, f"{prefix}_tree.png", show_branch_support=True)

    # Save rooted tree; format=0 keeps the branch support values
    rooted_tree_file = f"{prefix}_rooted.nwk"
    t.write(format=0, outfile=rooted_tree_file)

    results = {
        "aligned_fasta": aligned,
        "trimmed_fasta": trimmed,
        "tree_file": tree_file,
        "rooted_tree": rooted_tree_file,
        "visualization": f"{prefix}_tree.png",
        "stats": stats
    }
    print("\n" + "=" * 50)
    print("Phylogenetic analysis complete!")
    print(f"Results in: {output_dir}/")
    return results
```
## IQ-TREE Model Guide
### DNA Models
| Model | Description | Use case |
|-------|-------------|---------|
| `GTR+G4` | General Time Reversible + Gamma | Most general reversible DNA model; safe default |
| `HKY+G4` | Hasegawa-Kishino-Yano + Gamma | Separate transition and transversion rates |
| `TrN+G4` | Tamura-Nei + Gamma | Two distinct transition rates |
| `JC` | Jukes-Cantor | Simplest; all rates equal |
### Protein Models
| Model | Description | Use case |
|-------|-------------|---------|
| `LG+G4` | Le-Gascuel + Gamma | Good general-purpose default |
| `WAG+G4` | Whelan-Goldman + Gamma | Widely used |
| `JTT+G4` | Jones-Taylor-Thornton + Gamma | Classical model |
| `Q.pfam+G4` | Pfam-trained + Gamma | For Pfam-like protein families |
| `Q.bird+G4` | Bird-trained + Gamma | Avian proteins |
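**Tip:** Use `-m TEST` to let IQ-TREE automatically select the best model. `-m MFP` (ModelFinder Plus, the IQ-TREE 2 default) additionally tests FreeRate heterogeneity models.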
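After automatic model selection it is often handy to capture the chosen model programmatically, for example to reuse it in a later run. The sketch below pulls it out of log text with a regular expression. The line layout it matches ("Best-fit model: ... chosen according to ...") is an assumption about how IQ-TREE phrases the result in its `.log` file, and `parse_best_fit_model` is a hypothetical helper; the sample line is illustrative, not real output.

```python
import re

# Assumed log line layout: "Best-fit model: <MODEL> chosen according to <CRITERION>"
BEST_FIT = re.compile(r"Best-fit model:\s*(\S+)\s+chosen according to\s+(\S+)")


def parse_best_fit_model(log_text):
    """Return (model, criterion) from log text, or None if no such line is found."""
    match = BEST_FIT.search(log_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Illustrative sample line (hand-written for this example)
sample = "Best-fit model: GTR+F+G4 chosen according to BIC\n"
print(parse_best_fit_model(sample))  # → ('GTR+F+G4', 'BIC')
```

Returning `None` instead of raising keeps the helper usable on logs from runs where a model was specified explicitly and no selection took place.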
## Best Practices
- **Alignment quality first**: Poor alignment → unreliable trees; check alignment manually
- **Use `linsi` for small (<200 seq), `fftns` or `auto` for large alignments**
- **Model selection**: Always use `-m TEST` for IQ-TREE unless you have a specific reason
- **Bootstrap**: Use ≥1000 ultrafast bootstraps (`-B 1000`) for branch support
- **Root the tree**: Unrooted trees can be misleading; use outgroup or midpoint rooting
- **FastTree for >5000 sequences**: IQ-TREE becomes slow; FastTree is 10–100× faster
- **Trim long alignments**: TrimAl removes unreliable columns; improves tree accuracy
- **Check for recombination** in viral/bacterial sequences before building trees (`RDP4`, `GARD`)
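For the bootstrap recommendation above, a quick summary of branch support can show how much of a tree is well resolved. The sketch below reads support values straight from a Newick string with a regular expression, so it needs no extra libraries. It assumes supports are written as plain numbers directly after a closing parenthesis and that names are unquoted; the 95 cutoff follows the IQ-TREE documentation's guidance for ultrafast bootstrap, and the function names and toy tree are illustrative.

```python
import re

# A number directly after ")" and before ":", ",", ")" or ";" is read as support
SUPPORT = re.compile(r"\)(\d+(?:\.\d+)?)(?=[:,);])")


def support_values(newick):
    """Return internal-branch support values found in a Newick string."""
    return [float(value) for value in SUPPORT.findall(newick)]


def summarize_support(newick, threshold=95.0):
    """Count branches at or above the support threshold."""
    values = support_values(newick)
    strong = sum(1 for value in values if value >= threshold)
    return {
        "n_branches": len(values),
        "n_strong": strong,
        "fraction_strong": strong / len(values) if values else 0.0,
    }


# Toy tree with two supported internal branches (98 and 72)
toy = "((A:0.1,B:0.2)98:0.05,(C:0.1,D:0.3)72:0.04,E:0.5);"
print(summarize_support(toy))
# → {'n_branches': 2, 'n_strong': 1, 'fraction_strong': 0.5}
```

For anything beyond a quick check, loading the tree with ETE3 and reading each node's `support` attribute is the more robust route.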
## Additional Resources
- **MAFFT**: https://mafft.cbrc.jp/alignment/software/
- **IQ-TREE 2**: http://www.iqtree.org/ | Tutorial: https://www.iqtree.org/workshop/molevol2022
- **FastTree**: http://www.microbesonline.org/fasttree/
- **ETE3**: http://etetoolkit.org/
- **FigTree** (GUI visualization): https://tree.bio.ed.ac.uk/software/figtree/
- **iTOL** (web visualization): https://itol.embl.de/
- **MUSCLE** (alternative aligner): https://www.drive5.com/muscle/
- **TrimAl** (alignment trimming): https://vicfero.github.io/trimal/