bio-microbiome-taxonomy-assignment

Taxonomic classification of ASVs using reference databases like SILVA, GTDB, or UNITE. Covers naive Bayes classifiers (DADA2, IDTAXA) and exact matching approaches. Use when assigning taxonomy to ASVs after DADA2 amplicon processing.

1,802 stars

Best use case

bio-microbiome-taxonomy-assignment is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using bio-microbiome-taxonomy-assignment should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/bio-microbiome-taxonomy-assignment/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/bio-microbiome-taxonomy-assignment/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/bio-microbiome-taxonomy-assignment/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How bio-microbiome-taxonomy-assignment Compares

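| Feature / Agent | bio-microbiome-taxonomy-assignment | Standard Approach |
|-----------------|------------------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |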

Frequently Asked Questions

What does this skill do?

Taxonomic classification of ASVs using reference databases like SILVA, GTDB, or UNITE. Covers naive Bayes classifiers (DADA2, IDTAXA) and exact matching approaches. Use when assigning taxonomy to ASVs after DADA2 amplicon processing.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

## Version Compatibility

Reference examples tested with: DADA2 1.30+, QIIME2 2024.2+, phyloseq 1.46+, scanpy 1.10+, scikit-learn 1.4+

Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
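
The version check described above can be sketched in R roughly as follows. `check_min_version` is a hypothetical helper introduced here for illustration (it is not part of DADA2 or any other package); it relies only on base R's `requireNamespace()` and `packageVersion()`.

```r
# Hypothetical helper (an assumption, not part of any package):
# returns TRUE/FALSE if the package is installed, NA if it is missing.
check_min_version <- function(pkg, min_version) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        return(NA)  # package not installed
    }
    packageVersion(pkg) >= min_version
}

# 'stats' ships with base R, so this call works in any R session
stats_ok <- check_min_version('stats', '3.0.0')
print(stats_ok)

# For the packages used in this skill (only meaningful once installed):
# check_min_version('dada2', '1.30.0')
# check_min_version('phyloseq', '1.46.0')
```

An `NA` result signals that the package is absent, which is worth handling separately from an outdated install.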

# Taxonomy Assignment

**"Assign taxonomy to my ASVs"** → Classify amplicon sequence variants against reference databases (SILVA, GTDB, UNITE) using naive Bayes or exact-matching approaches for taxonomic annotation.
- R: `dada2::assignTaxonomy()` with SILVA/GTDB reference
- CLI: `qiime feature-classifier classify-sklearn` for QIIME2 workflows

## DADA2 Naive Bayes Classifier

```r
library(dada2)

seqtab_nochim <- readRDS('seqtab_nochim.rds')

# SILVA for 16S (download from https://zenodo.org/record/4587955)
taxa <- assignTaxonomy(seqtab_nochim, 'silva_nr99_v138.1_train_set.fa.gz',
                       multithread = TRUE)

# Add species-level (exact matching)
taxa <- addSpecies(taxa, 'silva_species_assignment_v138.1.fa.gz')

# Check results
head(taxa)
```
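
Beyond `head(taxa)`, it is often useful to see how far down the ranks the classifier got. The sketch below uses a small invented matrix (`taxa_toy`) as a stand-in for the character matrix that `assignTaxonomy()` returns, where rows are ASVs, columns are ranks, and `NA` marks an unassigned rank. The taxon names and the reduced set of ranks are illustrative assumptions.

```r
# Toy stand-in for the assignTaxonomy() output (values invented for illustration)
taxa_toy <- matrix(
    c('Bacteria', 'Firmicutes',     'Bacilli',             'Lactobacillus',
      'Bacteria', 'Proteobacteria', 'Gammaproteobacteria', NA,
      'Bacteria', NA,               NA,                    NA),
    nrow = 3, byrow = TRUE,
    dimnames = list(c('ASV1', 'ASV2', 'ASV3'),
                    c('Kingdom', 'Phylum', 'Class', 'Genus'))
)

# Fraction of ASVs that received an assignment at each rank
assigned_frac <- colMeans(!is.na(taxa_toy))
print(round(assigned_frac, 2))

# Number of ASVs left unassigned at each rank
unassigned_n <- colSums(is.na(taxa_toy))
print(unassigned_n)
```

With real data, substitute `taxa` for `taxa_toy`. A sharp drop between two adjacent ranks can hint at a reference database that is a poor match for the amplicon region.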

## GTDB for 16S

```r
# GTDB-formatted database (genome-based taxonomy; often preferred for environmental samples)
taxa_gtdb <- assignTaxonomy(seqtab_nochim, 'GTDB_bac120_arc53_ssu_r220_fullTaxo.fa.gz',
                            multithread = TRUE)
```

## UNITE for ITS (Fungi)

```r
# UNITE database for fungal ITS
taxa_its <- assignTaxonomy(seqtab_nochim, 'sh_general_release_dynamic_25.07.2023.fasta',
                           multithread = TRUE)
```

## QIIME2 Feature Classifier

```bash
# Train classifier (one-time)
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-138-99-seqs.qza \
    --i-reference-taxonomy silva-138-99-tax.qza \
    --o-classifier silva-138-99-nb-classifier.qza

# Classify ASVs
qiime feature-classifier classify-sklearn \
    --i-classifier silva-138-99-nb-classifier.qza \
    --i-reads rep-seqs.qza \
    --o-classification taxonomy.qza
```

## VSEARCH Exact Matching

```bash
# Faster, but only reports hits at or above the --id threshold
# (97% identity here; use --id 1.0 to require strictly exact matches)
vsearch --usearch_global asv_seqs.fasta \
    --db silva_138_SSURef_NR99.fasta \
    --id 0.97 \
    --blast6out taxonomy_vsearch.tsv \
    --top_hits_only
```
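
The `--blast6out` file lists reference IDs, not taxonomy strings, and `--top_hits_only` can still return several tied hits per ASV. A minimal sketch of reading such a file in base R and keeping one hit per query is shown below. The three input lines are invented, and the column names are labels chosen here for the standard 12-column BLAST tabular layout, not names written by VSEARCH.

```r
# Toy blast6out content (tab-separated, 12 columns; values invented for illustration)
b6_text <- paste(
    'ASV1\tRefA\t100.0\t250\t0\t0\t1\t250\t1\t250\t-1\t0',
    'ASV1\tRefB\t100.0\t250\t0\t0\t1\t250\t1\t250\t-1\t0',
    'ASV2\tRefC\t98.4\t250\t4\t0\t1\t250\t1\t250\t-1\t0',
    sep = '\n'
)
b6_cols <- c('query', 'target', 'pident', 'alnlen', 'mismatch', 'gapopen',
             'qstart', 'qend', 'tstart', 'tend', 'evalue', 'bits')

# With a real file, use read.table('taxonomy_vsearch.tsv', ...) instead of text =
hits <- read.table(text = b6_text, sep = '\t', col.names = b6_cols,
                   stringsAsFactors = FALSE)

# Keep the first listed hit per query and flag queries with tied top hits
n_top <- table(hits$query)
best <- hits[!duplicated(hits$query), c('query', 'target', 'pident')]
best$tied <- as.integer(n_top[best$query]) > 1
best$exact <- best$pident == 100
print(best)
```

Tied hits that map to different taxa are ambiguous. Turning `target` into a lineage needs a separate ID-to-taxonomy lookup from the reference release.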

## RDP Classifier

```r
library(dada2)

# RDP training set (less detailed than SILVA)
taxa_rdp <- assignTaxonomy(seqtab_nochim, 'rdp_train_set_18.fa.gz',
                           multithread = TRUE)
```

## IDTAXA (DECIPHER) - Often More Accurate

**Goal:** Classify ASVs using DECIPHER's tree-based IDTAXA classifier, which provides more conservative and often more accurate assignments than naive Bayes.

**Approach:** Convert ASV sequences to DNAStringSet, classify against a pre-trained IDTAXA model, and convert the hierarchical output to a standard taxonomy matrix.

```r
library(DECIPHER)
library(dada2)  # provides getSequences()

# Load IDTAXA training set (download from http://www2.decipher.codes/Downloads.html)
load('SILVA_SSU_r138_2019.RData')  # Creates 'trainingSet' object

# Convert ASV sequences to DNAStringSet
dna <- DNAStringSet(getSequences(seqtab_nochim))

# Classify with IDTAXA
ids <- IdTaxa(dna, trainingSet, strand = 'top', processors = NULL, verbose = TRUE)

# Convert to matrix format like assignTaxonomy
ranks <- c('domain', 'phylum', 'class', 'order', 'family', 'genus', 'species')
taxa_idtaxa <- t(sapply(ids, function(x) {
    m <- match(ranks, x$rank)
    taxa <- x$taxon[m]
    taxa[startsWith(taxa, 'unclassified_')] <- NA
    taxa
}))
colnames(taxa_idtaxa) <- ranks
```
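
To see what the conversion step does without downloading a training set, the same `sapply()` logic can be run on a hand-built list. `ids_toy` below is an invented stand-in that mimics only the `taxon` and `rank` elements used above; real `IdTaxa()` results carry more fields (such as `confidence`), and the rank set here is shortened for brevity.

```r
# Toy stand-in for IdTaxa() output (structure simplified; values invented)
ids_toy <- list(
    ASV1 = list(taxon = c('Root', 'Bacteria', 'Firmicutes', 'unclassified_Firmicutes'),
                rank  = c('rootrank', 'domain', 'phylum', 'class')),
    ASV2 = list(taxon = c('Root', 'Archaea'),
                rank  = c('rootrank', 'domain'))
)

ranks_toy <- c('domain', 'phylum', 'class')

# Same conversion as above: align each result to the fixed rank vector,
# then blank out 'unclassified_*' placeholders
taxa_idtaxa_toy <- t(sapply(ids_toy, function(x) {
    m <- match(ranks_toy, x$rank)   # NA where a rank is missing for this ASV
    taxa <- x$taxon[m]
    taxa[startsWith(taxa, 'unclassified_')] <- NA
    taxa
}))
colnames(taxa_idtaxa_toy) <- ranks_toy
print(taxa_idtaxa_toy)
```

Ranks that IDTAXA did not reach and ranks it labeled `unclassified_*` both end up as `NA`, which matches how `assignTaxonomy()` reports unassigned ranks.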

## Confidence Filtering

```r
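# assignTaxonomy returns bootstrap confidences only when outputBootstraps = TRUE
# (the result is then a list with $tax and $boot matrices)
res <- assignTaxonomy(seqtab_nochim, 'silva_nr99_v138.1_train_set.fa.gz',
                      outputBootstraps = TRUE, multithread = TRUE)

# Filter low-confidence assignments after the fact
taxa_filtered <- res$tax
taxa_filtered[res$boot < 80] <- NA

# Or apply the threshold during assignment (default minBoot = 50)
taxa <- assignTaxonomy(seqtab_nochim, 'silva_nr99_v138.1_train_set.fa.gz',
                       minBoot = 80, multithread = TRUE)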
```
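
The masking step can be tried on toy data. The sketch below assumes the classifier was run with `outputBootstraps = TRUE`, so that a taxonomy matrix and a bootstrap matrix of identical shape are available. Both matrices and the threshold variable `min_boot` are invented for illustration.

```r
# Toy taxonomy and bootstrap matrices with matching dimensions (values invented)
tax_toy <- matrix(
    c('Bacteria', 'Firmicutes',     'Lactobacillus',
      'Bacteria', 'Proteobacteria', 'Escherichia'),
    nrow = 2, byrow = TRUE,
    dimnames = list(c('ASV1', 'ASV2'), c('Kingdom', 'Phylum', 'Genus'))
)
boot_toy <- matrix(
    c(100, 95, 62,
      100, 88, 91),
    nrow = 2, byrow = TRUE,
    dimnames = dimnames(tax_toy)
)

min_boot <- 80  # illustrative threshold

# Blank out every rank whose bootstrap support falls below the threshold
tax_filtered_toy <- tax_toy
tax_filtered_toy[boot_toy < min_boot] <- NA
print(tax_filtered_toy)
```

Filtering after the fact keeps the raw bootstrap values available, so several thresholds can be compared without re-running the classifier.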

## Combine into phyloseq

```r
library(phyloseq)

# Create phyloseq object
ps <- phyloseq(otu_table(seqtab_nochim, taxa_are_rows = FALSE),
               tax_table(taxa))

# Add sample metadata
sample_data(ps) <- read.csv('sample_metadata.csv', row.names = 1)

# Rename ASVs for readability
taxa_names(ps) <- paste0('ASV', seq(ntaxa(ps)))
```
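
Renaming ASVs replaces the sequence-based names, so it can help to save a lookup table that pairs each short ID with its sequence and taxonomy before the sequences are dropped. The following base R sketch shows one way to do that. The sequences, taxa, and column names (`asv_id`, `sequence`) are illustrative assumptions, and the file goes to a temporary path.

```r
# Toy taxonomy matrix with sequences as row names, as assignTaxonomy() produces
seqs_toy <- c('ACGTACGT', 'TTGGCCAA')  # invented, far shorter than real ASVs
taxa_toy <- matrix(
    c('Bacteria', 'Lactobacillus',
      'Bacteria', NA),
    nrow = 2, byrow = TRUE,
    dimnames = list(seqs_toy, c('Kingdom', 'Genus'))
)

# Build short IDs in the same order as the taxonomy rows
asv_ids <- paste0('ASV', seq_len(nrow(taxa_toy)))

# One row per ASV: short ID, full sequence, then the taxonomy columns
tax_df <- data.frame(asv_id = asv_ids,
                     sequence = rownames(taxa_toy),
                     taxa_toy,
                     row.names = NULL,
                     stringsAsFactors = FALSE)

# Write to a temporary file; choose a real path in an actual analysis
out_file <- tempfile(fileext = '.csv')
write.csv(tax_df, out_file, row.names = FALSE)
print(tax_df)
```

The short IDs match the phyloseq names only if the rows are in the same order as `taxa_names(ps)` was before renaming.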

## Database Comparison

| Database | Organisms | Taxonomy | Updated |
|----------|-----------|----------|---------|
| SILVA 138.1 | Bacteria, Archaea, Eukaryotes | 7 ranks | 2020 |
| GTDB R220 | Bacteria, Archaea | 7 ranks (genome-based) | 2024 |
| RDP 18 | Bacteria, Archaea | 6 ranks | 2016 |
| UNITE 10.0 | Fungi | 7 ranks | 2024 |
| PR2 5.0 | Protists | 8 ranks | 2024 |

## Related Skills

- amplicon-processing - Generate ASV table for classification
- diversity-analysis - Analyze classified communities
- metagenomics/kraken-classification - Read-level taxonomic classification

Related Skills

bio-microbiome-qiime2-workflow

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

QIIME2 command-line workflow for 16S/ITS amplicon analysis. Alternative to DADA2/phyloseq R workflow with built-in provenance tracking. Use when preferring CLI over R, needing reproducible provenance, or working within QIIME2 ecosystem.

bio-microbiome-functional-prediction

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Predict metagenome functional content from 16S rRNA marker gene data using PICRUSt2. Infer KEGG, MetaCyc, and EC abundances from ASV tables. Use when functional profiling is needed from 16S data without shotgun metagenomics sequencing.

bio-microbiome-diversity-analysis

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Alpha and beta diversity analysis for microbiome data. Calculate within-sample richness, evenness, and between-sample dissimilarity with phyloseq and vegan. Use when comparing community composition across samples or testing for group differences in microbiome structure.

bio-microbiome-differential-abundance

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Differential abundance testing for microbiome data using compositionally-aware methods like ALDEx2, ANCOM-BC2, and MaAsLin2. Use when identifying taxa that differ between experimental groups while accounting for the compositional nature of microbiome data.

bio-microbiome-amplicon-processing

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Amplicon sequence variant (ASV) inference from 16S rRNA or ITS amplicon sequencing using DADA2. Covers quality filtering, error learning, denoising, and chimera removal. Use when processing demultiplexed amplicon FASTQ files to generate an ASV table for downstream analysis.
