claw-metagenomics

Shotgun metagenomics profiling — taxonomy, resistome, and functional pathways

1,802 stars

Best use case

claw-metagenomics is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using claw-metagenomics should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/claw-metagenomics/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/claw-metagenomics/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/claw-metagenomics/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How claw-metagenomics Compares

| Feature / Agent | claw-metagenomics | Standard Approach |
|-----------------|-------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It profiles shotgun metagenomes from paired-end FASTQ files: taxonomic classification, antimicrobial resistance gene detection, and optional functional pathway profiling.

Where can I find the source code?

The source code is on GitHub in the FreedomIntelligence/OpenClaw-Medical-Skills repository, under skills/claw-metagenomics/.

SKILL.md Source

# Shotgun Metagenomics Profiler

Comprehensive shotgun metagenomics analysis combining taxonomic classification, antimicrobial resistance gene detection, and functional pathway profiling from paired-end FASTQ files.

## What it does

1. Takes paired-end FASTQ files (R1, R2) or a single concatenated FASTQ as input
2. Runs **Kraken2** taxonomic classification against a standard database (e.g., Standard-8, PlusPF)
3. Refines abundances with **Bracken** at species level (read re-estimation)
4. Detects antimicrobial resistance genes with **RGI** against the **CARD** database
5. Classifies detected ARGs by **WHO critical priority pathogen** association
6. Optionally runs **HUMAnN3** for functional pathway profiling (MetaCyc + UniRef)
7. Generates three publication-quality figures:
   - **Figure 1**: Taxonomy bar chart — top 20 species by relative abundance
   - **Figure 2**: Resistome heatmap — ARG families by drug class with abundance
   - **Figure 3**: WHO-critical ARG summary — priority-tier breakdown of detected resistance genes
8. Produces a full reproducibility bundle (commands.sh, environment.yml, checksums.sha256)
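
The `checksums.sha256` file in the reproducibility bundle could be produced along the following lines. This is a minimal sketch, not the skill's actual code: the function names and the `sha256sum`-style line layout (digest, two spaces, relative path) are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(output_dir: Path, manifest_name: str = "checksums.sha256") -> Path:
    """Write one 'DIGEST  RELATIVE_PATH' line per file under output_dir.

    The manifest name matches the bundle described above; the line layout
    is an assumption chosen to be compatible with `sha256sum -c`.
    """
    manifest = output_dir / manifest_name
    lines = []
    for path in sorted(output_dir.rglob("*")):
        # Skip directories and the manifest itself.
        if path.is_file() and path != manifest:
            relative = path.relative_to(output_dir).as_posix()
            lines.append(f"{sha256_of(path)}  {relative}")
    manifest.write_text("\n".join(lines) + "\n")
    return manifest
```

Hashing in chunks keeps memory use flat even for multi-gigabyte FASTQ or report files.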

## Why this exists

If you ask a general AI to "analyse a metagenome," it will:
- Not know which Kraken2 database to use or how to set confidence thresholds
- Hallucinate Bracken parameters for read-length and taxonomic level
- Miss the connection between detected ARGs and WHO priority pathogen lists
- Skip HUMAnN3 entirely (or misconfigure its database paths)
- Produce a single bar chart with no resistance context
- Not provide a reproducibility bundle

This skill encodes the correct methodological decisions:
- Kraken2 confidence threshold of 0.2 (reduces false positives in environmental samples)
- Bracken re-estimation at species level with minimum 10 reads
- RGI MAIN with "Perfect" and "Strict" hit criteria only (no "Loose" hits)
- WHO Critical Priority Pathogen list mapped to detected ARG families
- HUMAnN3 with MetaCyc stratification for pathway-level functional context
- Thread count auto-detected from available CPUs
- Full reproducibility bundle for every run
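
As an illustration of how the taxonomy settings above translate into command lines, here is a hedged sketch that builds Kraken2 and Bracken invocations as argument lists. The function name and the `kraken2_output.txt` file name are hypothetical, and the skill's real implementation may assemble its commands differently. The flags used (`--confidence`, `--paired`, `--report`, and Bracken's `-r`, `-l`, `-t`) are standard options of those tools.

```python
import os


def build_taxonomy_commands(r1, r2, kraken2_db, out_dir, read_length=150,
                            confidence=0.2, min_reads=10, threads=None):
    """Return (kraken2_cmd, bracken_cmd) as argument lists; nothing is executed."""
    # Auto-detect the thread count when the caller does not set it.
    threads = threads or os.cpu_count() or 1
    report = f"{out_dir}/kraken2_report.txt"

    kraken2_cmd = [
        "kraken2", "--db", kraken2_db,
        "--threads", str(threads),
        "--confidence", str(confidence),
        "--paired",
        "--report", report,
        "--output", f"{out_dir}/kraken2_output.txt",  # hypothetical file name
    ]
    if r1.endswith(".gz"):
        kraken2_cmd.append("--gzip-compressed")
    kraken2_cmd += [r1, r2]

    bracken_cmd = [
        "bracken", "-d", kraken2_db, "-i", report,
        "-o", f"{out_dir}/bracken_species.tsv",
        "-r", str(read_length),  # must match a read length the DB was built for
        "-l", "S",               # species level
        "-t", str(min_reads),    # minimum reads before re-estimation
    ]
    return kraken2_cmd, bracken_cmd
```

Returning lists rather than shell strings means the commands can be passed to `subprocess.run` without quoting problems, and written verbatim into a `commands.sh` log.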

## Validated On

The skill works with any shotgun metagenome but has been validated on:
- **Peru sewage metagenomics study** (6 samples, 3 collection sites: Lima, Cusco, Iquitos)
- Environmental sewage samples with mixed microbial communities
- Read depths ranging from 2M to 15M paired-end reads per sample

## WHO-Critical ARG Detection

A key feature is the classification of detected resistance genes by WHO priority tier. The tiers below correspond to the 2017 WHO priority pathogens list:

| Priority | Pathogen | Resistance |
|----------|----------|------------|
| Critical | *Acinetobacter baumannii* | Carbapenem-resistant |
| Critical | *Pseudomonas aeruginosa* | Carbapenem-resistant |
| Critical | *Enterobacteriaceae* | Carbapenem-resistant, 3rd-gen cephalosporin-resistant |
| High | *Enterococcus faecium* | Vancomycin-resistant |
| High | *Staphylococcus aureus* | Methicillin-resistant, vancomycin-resistant |
| High | *Helicobacter pylori* | Clarithromycin-resistant |
| High | *Campylobacter* | Fluoroquinolone-resistant |
| High | *Salmonella* spp. | Fluoroquinolone-resistant |
| High | *Neisseria gonorrhoeae* | 3rd-gen cephalosporin-resistant, fluoroquinolone-resistant |
| Medium | *Streptococcus pneumoniae* | Penicillin-non-susceptible |
| Medium | *Haemophilus influenzae* | Ampicillin-resistant |
| Medium | *Shigella* spp. | Fluoroquinolone-resistant |
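
The table above can be encoded as a small lookup structure. The sketch below is an assumption about one way to do it: the resistance labels are taken from the table, while `WHO_PRIORITY`, `TIER_RANK`, and `who_tier` are hypothetical names. How the skill maps CARD gene families onto these labels is not described here, so that step is left out.

```python
# Rows transcribed from the table: (tier, pathogen, resistance labels).
WHO_PRIORITY = [
    ("Critical", "Acinetobacter baumannii", ["carbapenem"]),
    ("Critical", "Pseudomonas aeruginosa", ["carbapenem"]),
    ("Critical", "Enterobacteriaceae", ["carbapenem", "3rd-gen cephalosporin"]),
    ("High", "Enterococcus faecium", ["vancomycin"]),
    ("High", "Staphylococcus aureus", ["methicillin", "vancomycin"]),
    ("High", "Helicobacter pylori", ["clarithromycin"]),
    ("High", "Campylobacter", ["fluoroquinolone"]),
    ("High", "Salmonella spp.", ["fluoroquinolone"]),
    ("High", "Neisseria gonorrhoeae", ["3rd-gen cephalosporin", "fluoroquinolone"]),
    ("Medium", "Streptococcus pneumoniae", ["penicillin"]),
    ("Medium", "Haemophilus influenzae", ["ampicillin"]),
    ("Medium", "Shigella spp.", ["fluoroquinolone"]),
]

TIER_RANK = {"Critical": 0, "High": 1, "Medium": 2}


def who_tier(resistance):
    """Return (highest tier, pathogens in that tier) for a resistance label.

    Returns None when the label is not on the list.
    """
    key = resistance.strip().lower()
    matches = [(tier, pathogen)
               for tier, pathogen, labels in WHO_PRIORITY if key in labels]
    if not matches:
        return None
    best = min((tier for tier, _ in matches), key=TIER_RANK.__getitem__)
    return best, [pathogen for tier, pathogen in matches if tier == best]
```

Reporting only the highest matching tier keeps the summary conservative: a fluoroquinolone resistance gene is flagged as High even though it also matches a Medium-tier entry.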

## Usage

```bash
# Full pipeline (taxonomy + resistome + functional)
python metagenomics_profiler.py \
    --r1 sample_R1.fastq.gz \
    --r2 sample_R2.fastq.gz \
    --output metagenomics_report

# Skip HUMAnN3 (faster — taxonomy + resistome only)
python metagenomics_profiler.py \
    --r1 sample_R1.fastq.gz \
    --r2 sample_R2.fastq.gz \
    --output metagenomics_report \
    --skip-functional

# Single concatenated FASTQ
python metagenomics_profiler.py \
    --input combined.fastq.gz \
    --output metagenomics_report

# Specify Kraken2 database path
python metagenomics_profiler.py \
    --r1 sample_R1.fastq.gz \
    --r2 sample_R2.fastq.gz \
    --output metagenomics_report \
    --kraken2-db /path/to/kraken2_db \
    --read-length 150
```
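
For multi-sample studies, the per-sample invocation can be generated rather than typed by hand. The sketch below only builds argument lists using the documented `--r1`, `--r2`, and `--output` flags. The `_R1`/`_R2` naming convention is inferred from the example file names above, and both helper names and the per-sample output layout are hypothetical.

```python
import re
from pathlib import Path

# Matches SAMPLE_R1.fastq(.gz) and SAMPLE_R2.fastq(.gz); group 1 is the
# sample name, group 2 is the mate number.
_MATE_PATTERN = re.compile(r"^(.+)_R([12])\.fastq(?:\.gz)?$")


def pair_fastqs(paths):
    """Group FASTQ paths into {sample: (r1, r2)}, dropping unpaired files."""
    mates_by_sample = {}
    for original in paths:
        match = _MATE_PATTERN.match(Path(original).name)
        if match:
            sample, mate = match.group(1), match.group(2)
            mates_by_sample.setdefault(sample, {})[mate] = original
    return {
        sample: (mates["1"], mates["2"])
        for sample, mates in sorted(mates_by_sample.items())
        if len(mates) == 2
    }


def profiler_commands(paths, output_root="metagenomics_report"):
    """Build one metagenomics_profiler.py command per complete read pair."""
    return [
        ["python", "metagenomics_profiler.py",
         "--r1", r1, "--r2", r2,
         "--output", f"{output_root}/{sample}"]
        for sample, (r1, r2) in pair_fastqs(paths).items()
    ]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`. Samples missing a mate are skipped silently in this sketch, so a production version would likely want to warn about them.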

### Demo (works out of the box)

```bash
python metagenomics_profiler.py --demo --output demo_report
```

The demo uses pre-computed results from the Peru sewage metagenomics study (6 samples, 3 sites) and generates all figures and reports instantly without requiring external tools.

## Example Output

```
Metagenomics Profiler — ClawBio
================================
Mode: demo (pre-computed Peru sewage data)
Samples: 6 (3 sites: Lima, Cusco, Iquitos)

Taxonomy (Kraken2 + Bracken):
  Total classified: 94.2%
  Top species: Escherichia coli (12.3%), Klebsiella pneumoniae (8.7%),
               Pseudomonas aeruginosa (5.1%), Acinetobacter baumannii (3.9%)

Resistome (RGI/CARD):
  Total ARG hits: 247 (Perfect: 89, Strict: 158)
  Drug classes: 14
  WHO-Critical ARGs detected: 23
    - Carbapenem resistance: NDM-1, OXA-48, KPC-3
    - 3rd-gen cephalosporin resistance: CTX-M-15, CTX-M-27

Functional Pathways (HUMAnN3):
  Total pathways: 312
  Top: PWY-7219 (adenosine ribonucleotides de novo biosynthesis)

Figures saved to: demo_report/figures/
  taxonomy_barplot.png (300 dpi)
  resistome_heatmap.png (300 dpi)
  who_critical_args.png (300 dpi)

Reproducibility:
  commands.sh | environment.yml | checksums.sha256
```
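
The Perfect/Strict tallies and drug-class counts in the resistome section could be derived from RGI's tab-delimited output roughly as follows. This is a sketch under assumptions: it relies on the `Cut_Off` and `Drug Class` columns of the `rgi main` results table, it assumes multiple drug classes in one cell are separated by semicolons, and `summarize_rgi` is a hypothetical name.

```python
import csv
from collections import Counter


def summarize_rgi(handle, keep=("Perfect", "Strict")):
    """Count RGI hits by cut-off and by drug class, dropping other cut-offs.

    `handle` is any open text file (or file-like object) containing the
    tab-delimited RGI results with a header row.
    """
    by_cutoff = Counter()
    by_drug_class = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        cutoff = row["Cut_Off"]
        if cutoff not in keep:
            continue  # e.g. "Loose" hits are excluded
        by_cutoff[cutoff] += 1
        # A single hit may confer resistance to several drug classes.
        for drug_class in row["Drug Class"].split(";"):
            if drug_class.strip():
                by_drug_class[drug_class.strip()] += 1
    return by_cutoff, by_drug_class
```

With a real results file this would be called as `summarize_rgi(open("rgi_results.txt"))`; the number of keys in the second counter corresponds to the "Drug classes" figure in the report.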

## Pipeline Architecture

```
FASTQ R1 + R2
     |
     v
[Kraken2] --> kraken2_report.txt
     |
     v
[Bracken] --> bracken_species.tsv   --> Figure 1: Taxonomy bar chart
     |
     v
[RGI MAIN] --> rgi_results.txt      --> Figure 2: Resistome heatmap
     |                                --> Figure 3: WHO-critical ARG summary
     v
[HUMAnN3] --> pathabundance.tsv     (optional, --skip-functional to omit)
     |
     v
[Report] --> report.md + figures/ + reproducibility/
```
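
To make the Bracken-to-Figure-1 step concrete, the sketch below selects the most abundant species from a Bracken output table. It assumes the standard Bracken columns `name` and `fraction_total_reads`; the function name is hypothetical and the plotting itself is omitted.

```python
import csv


def top_species(handle, n=20):
    """Return the n most abundant (species, fraction) pairs from Bracken output.

    `handle` is an open text file containing Bracken's tab-delimited table
    with its header row; fractions are returned as floats in [0, 1].
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(row["name"], float(row["fraction_total_reads"])) for row in reader]
    # Highest relative abundance first.
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows[:n]
```

The resulting list can be passed directly to a bar-chart call, with any remaining fraction grouped into an "Other" category if the chart should sum to 100%.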

## Database Requirements

| Tool | Database | Size | Notes |
|------|----------|------|-------|
| Kraken2 | Standard-8 or PlusPF | 8-70 GB | Set via `--kraken2-db` or `$KRAKEN2_DB` |
| Bracken | (built from Kraken2 DB) | included | Read-length specific (default: 150 bp) |
| RGI | CARD | ~500 MB | Auto-downloaded via `rgi auto_load` |
| HUMAnN3 | ChocoPhlAn + UniRef90 | ~15 GB | Set via `--humann-db` or `$HUMANN_DB` |

## Citations

If you use this skill in a publication, please cite:

- Wood, D.E., Lu, J. & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20, 257.
- Lu, J. et al. (2017). Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science, 3, e104.
- Alcock, B.P. et al. (2023). CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Research, 51(D1), D690-D699.
- Beghini, F. et al. (2021). Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife, 10, e65088.
- Corpas, M. (2026). ClawBio. https://github.com/ClawBio/ClawBio

Related Skills

claw-semantic-sim

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Semantic Similarity Index for disease research literature using PubMedBERT embeddings

claw-ancestry-pca

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Ancestry decomposition PCA against the Simons Genome Diversity Project

bio-metagenomics-visualization

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Visualize metagenomic profiles using R (phyloseq, microbiome) and Python (matplotlib, seaborn). Create stacked bar plots, heatmaps, PCA plots, and diversity analyses. Use when creating publication-quality figures from MetaPhlAn, Bracken, or other taxonomic profiling output.

bio-metagenomics-strain-tracking

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Track bacterial strains using MASH, sourmash, fastANI, and inStrain. Compare genomes, detect contamination, and monitor strain-level variation. Use when needing sub-species resolution for outbreak tracking, transmission analysis, or within-host strain dynamics.

bio-metagenomics-metaphlan

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Marker gene-based taxonomic profiling using MetaPhlAn 4. Provides accurate species-level relative abundances using clade-specific markers. Use when accurate taxonomic profiling is needed and computational resources are limited, or for comparison with HMP/other MetaPhlAn studies.

bio-metagenomics-kraken

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Taxonomic classification of metagenomic reads using Kraken2. Fast k-mer based classification against RefSeq database. Use when performing initial taxonomic classification of shotgun metagenomic reads before abundance estimation with Bracken.

bio-metagenomics-functional-profiling

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Profile functional potential of metagenomes using HUMAnN3 and similar tools. Use when obtaining pathway abundances, gene family counts, or functional annotations from metagenomic data.

bio-metagenomics-amr-detection

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Detect antimicrobial resistance genes using AMRFinderPlus, ResFinder, and CARD. Screen isolates and metagenomes for resistance determinants. Use when characterizing resistance profiles in clinical isolates, surveillance samples, or metagenomic data.

bio-metagenomics-abundance

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Species abundance estimation using Bracken with Kraken2 output. Redistributes reads from higher taxonomic levels to species for more accurate estimates. Use when accurate species-level abundances are needed from Kraken2 classification output.

zinc-database

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Access ZINC (230M+ purchasable compounds). Search by ZINC ID/SMILES, similarity searches, 3D-ready structures for docking, analog discovery, for virtual screening and drug discovery.

zarr-python

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.

xlsx

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Use this skill any time a spreadsheet file is the primary input or output. This means any task where the user wants to: open, read, edit, or fix an existing .xlsx, .xlsm, .csv, or .tsv file (e.g., adding columns, computing formulas, formatting, charting, cleaning messy data); create a new spreadsheet from scratch or from other data sources; or convert between tabular file formats. Trigger especially when the user references a spreadsheet file by name or path — even casually (like "the xlsx in my downloads") — and wants something done to it or produced from it. Also trigger for cleaning or restructuring messy tabular data files (malformed rows, misplaced headers, junk data) into proper spreadsheets. The deliverable must be a spreadsheet file. Do NOT trigger when the primary deliverable is a Word document, HTML report, standalone Python script, database pipeline, or Google Sheets API integration, even if tabular data is involved.