bio-metagenomics-metaphlan
Marker gene-based taxonomic profiling using MetaPhlAn 4. Provides accurate species-level relative abundances using clade-specific markers. Use when accurate taxonomic profiling is needed and computational resources are limited, or for comparison with HMP/other MetaPhlAn studies.
Best use case
bio-metagenomics-metaphlan is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using bio-metagenomics-metaphlan can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/bio-metagenomics-metaphlan/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How bio-metagenomics-metaphlan Compares
| Feature / Agent | bio-metagenomics-metaphlan | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
## Version Compatibility
Reference examples tested with: Bowtie2 2.5.3+, MetaPhlAn 4.1+, minimap2 2.26+, pandas 2.2+, scanpy 1.10+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
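The CLI check described above can be scripted. The following is a minimal sketch, not part of the skill itself: `tool_version` is a hypothetical helper, and it assumes each tool accepts a `--version` flag and prints its version on the first line of stdout or stderr.

```python
import shutil
import subprocess


def tool_version(tool, flag="--version"):
    """Return the first line a CLI tool prints for its version flag.

    Returns None when the tool is not on PATH.
    """
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, flag], capture_output=True, text=True)
    # Some tools print version information to stderr instead of stdout
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


for tool in ("metaphlan", "bowtie2", "minimap2"):
    version = tool_version(tool)
    print(tool, "->", "not found" if version is None else version)
```

Compare the printed versions against the list above before reusing any of the command patterns.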
# MetaPhlAn 4 Profiling
**"Profile the species composition of my metagenome"** → Determine species-level relative abundances from shotgun metagenomic reads using clade-specific marker gene alignment.
- CLI: `metaphlan sample.fastq --input_type fastq -o profile.txt`
MetaPhlAn 4 uses ~5M clade-specific markers from 26,970 species-level genome bins. Supports both short reads (bowtie2) and long reads (minimap2).
## Basic Profiling
```bash
# Profile single sample
metaphlan sample.fastq.gz \
--input_type fastq \
--output_file profile.txt
```
## Paired-End Reads
```bash
# Pairing information is not used: pass both mate files as a comma-separated list
metaphlan reads_R1.fastq.gz,reads_R2.fastq.gz \
--input_type fastq \
--output_file profile.txt \
--mapout sample.map.bz2
```
## Save Mapping Output for Reuse
```bash
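# First run - save intermediate mapping
# (older releases name these --bowtie2out and --input_type bowtie2out; confirm with metaphlan --help)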
metaphlan sample.fastq.gz \
--input_type fastq \
--mapout sample.map.bz2 \
--output_file profile.txt
# Rerun with different settings without realigning
metaphlan sample.map.bz2 \
--input_type mapout \
--output_file profile_v2.txt
```
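When profiling is driven from a script, the choice between the two invocations above can be made automatically. This is a rough sketch under stated assumptions: `metaphlan_command` is a hypothetical helper, it only uses the flags shown in this section, and it treats any existing file at the mapout path as a valid saved mapping.

```python
from pathlib import Path


def metaphlan_command(fastq, mapout, profile, nproc=4):
    """Build a metaphlan argument list, reusing a saved mapping when one exists."""
    if Path(mapout).exists():
        # Mapping already saved: skip alignment and profile from it directly
        return ["metaphlan", str(mapout),
                "--input_type", "mapout",
                "--output_file", str(profile)]
    # First run: align the reads and save the mapping for later reruns
    return ["metaphlan", str(fastq),
            "--input_type", "fastq",
            "--mapout", str(mapout),
            "--nproc", str(nproc),
            "--output_file", str(profile)]


print(" ".join(metaphlan_command("sample.fastq.gz", "sample.map.bz2", "profile.txt")))
```

The returned list can be passed to `subprocess.run(..., check=True)`; building a list rather than a shell string avoids quoting problems with unusual file names.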
## Long-Read Support (MetaPhlAn 4+)
```bash
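# Long reads are aligned with minimap2 instead of bowtie2. Depending on the release,
# long-read mode may need an explicit option; check `metaphlan --help` before relying on it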
metaphlan long_reads.fastq.gz \
--input_type fastq \
--output_file profile.txt
```
## Common Options
```bash
# --nproc       CPU threads
# --tax_lev     Taxonomic level (a,k,p,c,o,f,g,s,t)
# --min_cu_len  Min total nucleotide length
# --stat_q      Quantile for robust average
metaphlan sample.fastq.gz \
    --input_type fastq \
    --nproc 8 \
    --tax_lev s \
    --min_cu_len 2000 \
    --stat_q 0.2 \
    --output_file profile.txt \
    --mapout sample.map.bz2
```
## Install Database
```bash
# Download database (done automatically on first run)
metaphlan --install
# Or specify database location
metaphlan --install --db_dir /path/to/db
```
## Analysis Types
```bash
# Relative abundances (default)
metaphlan sample.fastq.gz --input_type fastq -t rel_ab
# Relative abundances with read counts
metaphlan sample.fastq.gz --input_type fastq -t rel_ab_w_read_stats
# Marker presence/absence
metaphlan sample.fastq.gz --input_type fastq -t marker_pres_table
# Marker abundances
metaphlan sample.fastq.gz --input_type fastq -t marker_ab_table
```
## Multiple Samples
```bash
# Process each sample (create output directories first)
mkdir -p profiles mapout
for fq in samples/*.fastq.gz; do
    sample=$(basename "$fq" .fastq.gz)
    metaphlan "$fq" \
        --input_type fastq \
        --nproc 4 \
        --output_file "profiles/${sample}_profile.txt" \
        --mapout "mapout/${sample}.map.bz2"
done
# Merge profiles
merge_metaphlan_tables.py profiles/*_profile.txt > merged_abundance.txt
```
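If the per-sample profiles need to be combined inside Python rather than with `merge_metaphlan_tables.py`, a pandas outer join gives a similar clade-by-sample table. This is an illustrative sketch, not the script's actual implementation: `read_profile` and `merge_profiles` are hypothetical helpers, and the column positions are an assumption noted in the code.

```python
import io

import pandas as pd


def read_profile(source, sample):
    """Read one rel_ab profile into a Series of abundances indexed by clade."""
    table = pd.read_csv(source, sep="\t", comment="#", header=None)
    # Assumption: clade name is column 0; relative abundance is column 2 in a
    # four-column layout, or column 1 in a two-column layout.
    abundance_col = 2 if table.shape[1] >= 3 else 1
    return table.set_index(0)[abundance_col].astype(float).rename(sample)


def merge_profiles(profiles):
    """Outer-join per-sample Series; clades absent from a sample become 0.0."""
    merged = pd.concat(profiles, axis=1).fillna(0.0)
    merged.index.name = "clade_name"
    return merged


# Tiny in-memory profiles standing in for files under profiles/
header = "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species\n"
sample_a = io.StringIO(
    header
    + "k__Bacteria\t2\t100.0\t\n"
    + "k__Bacteria|p__Proteobacteria\t2|1224\t100.0\t\n"
)
sample_b = io.StringIO(
    header
    + "k__Bacteria\t2\t100.0\t\n"
    + "k__Bacteria|p__Firmicutes\t2|1239\t100.0\t\n"
)
merged = merge_profiles([read_profile(sample_a, "a"), read_profile(sample_b, "b")])
print(merged)
```

For real data, pass file paths instead of the `StringIO` objects and use the sample name derived from each file name.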
## Filter by Taxonomic Level
```bash
# Species only
metaphlan sample.fastq.gz --input_type fastq --tax_lev s -o species.txt
# Genus only
metaphlan sample.fastq.gz --input_type fastq --tax_lev g -o genus.txt
# All levels (default)
metaphlan sample.fastq.gz --input_type fastq --tax_lev a -o all_levels.txt
```
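A profile produced with all levels can also be filtered after the fact, which avoids rerunning MetaPhlAn once per rank. The sketch below assumes the profile is already loaded into a DataFrame with a `clade_name` column; `filter_rank` is a hypothetical helper, not a MetaPhlAn function.

```python
import pandas as pd

RANK_PREFIXES = ["k", "p", "c", "o", "f", "g", "s", "t"]


def filter_rank(profile, rank):
    """Keep rows whose deepest taxonomic label is at the requested rank."""
    if rank not in RANK_PREFIXES:
        raise ValueError(f"unknown rank {rank!r}")
    # The last '|'-separated label determines the rank of the row
    deepest = profile["clade_name"].str.split("|").str[-1]
    return profile[deepest.str.startswith(f"{rank}__")].copy()


profile = pd.DataFrame({
    "clade_name": [
        "k__Bacteria",
        "k__Bacteria|p__Proteobacteria",
        "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria",
    ],
    "relative_abundance": [100.0, 65.23, 62.15],
})
print(filter_rank(profile, "p"))
```

Matching on the deepest label, rather than on any occurrence of the prefix, keeps lower-rank rows from leaking into the result.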
## Output Format
```
#SampleID	sample
#clade_name	NCBI_tax_id	relative_abundance	additional_species
k__Bacteria	2	100.0
k__Bacteria|p__Proteobacteria	2|1224	65.23
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria	2|1224|1236	62.15
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales	2|1224|1236|91347	58.42
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae	2|1224|1236|91347|543	55.21
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia	2|1224|1236|91347|543|561	52.33
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli	2|1224|1236|91347|543|561|562	52.33
```
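Each clade name encodes the full lineage as `|`-separated labels with a rank prefix. A small sketch of splitting one into its ranks follows; `parse_lineage` is a hypothetical helper introduced here for illustration.

```python
def parse_lineage(clade_name):
    """Split 'k__A|p__B|...' into a dict mapping rank prefix to taxon name."""
    ranks = {}
    for label in clade_name.split("|"):
        prefix, sep, name = label.partition("__")
        if sep:  # skip labels without a rank prefix, such as UNCLASSIFIED
            ranks[prefix] = name
    return ranks


lineage = parse_lineage("k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria")
print(lineage["p"])  # prints: Proteobacteria
```

The deepest rank present in the returned dict tells you which taxonomic level a given row reports.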
## Parse Output in Python
```python
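import pandas as pd

# Column names come from the '#clade_name' header line; other '#' lines are metadata
with open('profile.txt') as fh:
    cols = next(l for l in fh if l.startswith('#clade_name'))[1:].rstrip('\n').split('\t')
profile = pd.read_csv('profile.txt', sep='\t', comment='#', header=None, names=cols)
# Keep rows whose deepest label is a species (this excludes t__ SGB rows)
deepest = profile['clade_name'].str.split('|').str[-1]
species = profile[deepest.str.startswith('s__')].copy()
species['species'] = deepest[species.index].str.replace('s__', '', regex=False)
species.sort_values('relative_abundance', ascending=False).head(20)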
```
## Extract SGBs (t__ Level)
```bash
# Report species-level genome bins (SGBs), which appear as t__ entries
metaphlan sample.fastq.gz \
    --input_type fastq \
    --tax_lev t \
    --output_file profile_with_sgb.txt
```
## Sample Metadata in Output
```bash
# Add sample ID to output
metaphlan sample.fastq.gz \
--input_type fastq \
--sample_id sample_name \
--output_file profile.txt
```
## Key Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| --input_type | required | Input format (e.g. fastq, fasta, mapout) |
| --nproc | 4 | CPU threads |
| --tax_lev | a | Taxonomic level (a=all) |
| --stat_q | 0.2 | Quantile value |
| --min_cu_len | 2000 | Min clade length |
| -t | rel_ab | Analysis type |
| --mapout | none | Save mapping output |
| --db_dir | default | Database directory |
Note: Unknown species estimation is now enabled by default in MetaPhlAn 4.2+
## Analysis Types (-t)
| Type | Description |
|------|-------------|
| rel_ab | Relative abundances (%) |
| rel_ab_w_read_stats | With read statistics |
| marker_pres_table | Marker presence/absence |
| marker_ab_table | Marker abundances |
| clade_specific_strain_tracker | Strain tracking |
## Related Skills
- kraken-classification - Alternative k-mer based classification
- abundance-estimation - Bracken for Kraken2 abundances
- metagenome-visualization - Visualize profiles
Related Skills
claw-metagenomics
Shotgun metagenomics profiling — taxonomy, resistome, and functional pathways
bio-metagenomics-visualization
Visualize metagenomic profiles using R (phyloseq, microbiome) and Python (matplotlib, seaborn). Create stacked bar plots, heatmaps, PCA plots, and diversity analyses. Use when creating publication-quality figures from MetaPhlAn, Bracken, or other taxonomic profiling output.
bio-metagenomics-strain-tracking
Track bacterial strains using MASH, sourmash, fastANI, and inStrain. Compare genomes, detect contamination, and monitor strain-level variation. Use when needing sub-species resolution for outbreak tracking, transmission analysis, or within-host strain dynamics.
bio-metagenomics-kraken
Taxonomic classification of metagenomic reads using Kraken2. Fast k-mer based classification against RefSeq database. Use when performing initial taxonomic classification of shotgun metagenomic reads before abundance estimation with Bracken.
bio-metagenomics-functional-profiling
Profile functional potential of metagenomes using HUMAnN3 and similar tools. Use when obtaining pathway abundances, gene family counts, or functional annotations from metagenomic data.
bio-metagenomics-amr-detection
Detect antimicrobial resistance genes using AMRFinderPlus, ResFinder, and CARD. Screen isolates and metagenomes for resistance determinants. Use when characterizing resistance profiles in clinical isolates, surveillance samples, or metagenomic data.
bio-metagenomics-abundance
Species abundance estimation using Bracken with Kraken2 output. Redistributes reads from higher taxonomic levels to species for more accurate estimates. Use when accurate species-level abundances are needed from Kraken2 classification output.