bio-epidemiological-genomics-pathogen-typing

Perform multi-locus sequence typing (MLST), core genome MLST, and SNP-based strain typing for bacterial isolate characterization using mlst and chewBBACA. Use when identifying strain types, tracking outbreak clones, or characterizing bacterial isolates.

1,802 stars

Best use case

bio-epidemiological-genomics-pathogen-typing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using bio-epidemiological-genomics-pathogen-typing should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/bio-epidemiological-genomics-pathogen-typing/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/bio-epidemiological-genomics-pathogen-typing/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/bio-epidemiological-genomics-pathogen-typing/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How bio-epidemiological-genomics-pathogen-typing Compares

| Feature / Agent | bio-epidemiological-genomics-pathogen-typing | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Perform multi-locus sequence typing (MLST), core genome MLST, and SNP-based strain typing for bacterial isolate characterization using mlst and chewBBACA. Use when identifying strain types, tracking outbreak clones, or characterizing bacterial isolates.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

## Version Compatibility

Reference examples tested with: mlst 2.23+, numpy 1.26+, pandas 2.2+, scanpy 1.10+, scipy 1.12+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
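
The version checks above can also be scripted. The following is a minimal sketch, not part of the original skill: the helper names `installed_version`, `cli_version`, and `report_versions` are hypothetical, and it assumes the CLI tool prints its version in response to `--version`.

```python
import shutil
import subprocess
from importlib import metadata


def installed_version(package):
    """Return the installed version of a Python package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def cli_version(tool):
    """Return the first line printed by `<tool> --version`, or None if not on PATH."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, '--version'], capture_output=True, text=True)
    # Some tools print their version to stderr rather than stdout
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


def report_versions(packages=('numpy', 'pandas', 'scipy'), tools=('mlst',)):
    """Collect versions for the given Python packages and CLI tools."""
    report = {name: installed_version(name) for name in packages}
    report.update({name: cli_version(name) for name in tools})
    return report


if __name__ == '__main__':
    for name, version in report_versions().items():
        print(f'{name}: {version or "not found"}')
```

A `None` entry means the package or tool was not found, which is a cue to install it before adapting any of the examples.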

# Pathogen Typing

**"Type my bacterial isolates by MLST"** → Assign multi-locus sequence types to bacterial genomes for isolate characterization, outbreak clone identification, and strain tracking.
- CLI: `mlst assembly.fasta` for 7-gene MLST typing
- CLI: `chewBBACA.py AlleleCall` for core genome MLST (cgMLST)

## MLST with mlst Tool

```bash
# Install mlst
conda install -c bioconda mlst

# Basic MLST typing
mlst genome.fasta
# Output (tab-separated; the ST column is the number only):
# genome.fasta  ecoli  131  adk(53) fumC(40) gyrB(47) ...

# Batch typing
mlst *.fasta > typing_results.tsv

# Specify scheme
mlst --scheme senterica genome.fasta

# List available schemes
mlst --list

# Write comma-separated output instead of the default tab-separated format
mlst --csv genome.fasta > results.csv
```

## Parse MLST Results

```python
import pandas as pd
import subprocess

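def run_mlst(fasta_files, scheme=None):
    '''Run mlst on one or more assemblies

    Returns DataFrame with:
    - file: assembly path as passed to mlst
    - scheme: auto-detected or specified scheme
    - ST: sequence type, or '-' when none could be assigned
    - locus1..locusN: allele calls such as adk(53)

    ST interpretation:
    - Known ST: allele combination matches a type in the database
    - '-' with every locus called: combination absent from the database,
      possibly an unreported ST
    - '-' with missing or partial calls: typing failed (poor assembly or wrong scheme)
    '''
    cmd = ['mlst']
    if scheme:
        cmd.extend(['--scheme', scheme])
    cmd.extend(fasta_files)

    # check=True raises CalledProcessError if mlst exits with an error
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)

    rows = [line.split('\t') for line in result.stdout.splitlines() if line.strip()]

    # Rows differ in length when schemes differ or typing fails, so pad to the widest
    width = max((len(row) for row in rows), default=3)
    rows = [row + [None] * (width - len(row)) for row in rows]

    columns = ['file', 'scheme', 'ST'] + [f'locus{i}' for i in range(1, width - 2)]
    return pd.DataFrame(rows, columns=columns)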
```
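
The allele columns arrive as strings such as `adk(53)`, so it can help to split them into locus and call. The sketch below is illustrative only: `parse_mlst_line` and `is_exact_profile` are hypothetical helpers, and it assumes the annotations described in the mlst README (`~n` for a novel allele similar to n, `n?` for a partial match, `-` for a missing locus, `n,m` for multiple hits); verify these against your installed version.

```python
import re

# Matches fields such as 'adk(53)', 'fumC(~40)' or 'gyrB(-)'
ALLELE_FIELD = re.compile(r'^(?P<locus>[^()]+)\((?P<call>[^()]*)\)$')


def parse_mlst_line(line):
    """Split one mlst TSV line into file, scheme, ST and a locus -> call dict."""
    fields = line.rstrip('\n').split('\t')
    filename, scheme, st = fields[:3]
    alleles = {}
    for field in fields[3:]:
        match = ALLELE_FIELD.match(field)
        if match:
            alleles[match['locus']] = match['call']
    return {'file': filename, 'scheme': scheme, 'ST': st, 'alleles': alleles}


def is_exact_profile(alleles):
    """True when every locus has one exact allele number (no ~, ?, - or commas)."""
    return bool(alleles) and all(call.isdigit() for call in alleles.values())


# Made-up example lines in the same layout as mlst output
typed = parse_mlst_line('genome.fasta\tecoli\t131\tadk(53)\tfumC(40)\tgyrB(47)')
untyped = parse_mlst_line('draft.fasta\tecoli\t-\tadk(~53)\tfumC(40)\tgyrB(-)')
print(typed['alleles'], is_exact_profile(typed['alleles']))
print(untyped['alleles'], is_exact_profile(untyped['alleles']))
```

An ST of `-` combined with an exact profile points to a possible new allele combination, while inexact calls suggest assembly or scheme problems.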

## Core Genome MLST (cgMLST)

```bash
# chewBBACA for cgMLST
pip install chewbbaca

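# Download a schema from Chewie-NS, or build one with CreateSchema
# Replace <schema_id> with the identifier of the schema you need for the species
chewBBACA.py DownloadSchema -sp "Salmonella enterica" -sc <schema_id> -o schema_dir

# Run allele calling; -g must point at the schema folder itself,
# which may be a subdirectory of schema_dir after download
chewBBACA.py AlleleCall -i genomes/ -g schema_dir -o results/

# Keep loci present in at least 95% of genomes; -o is an output directory
chewBBACA.py ExtractCgMLST -i results/results_alleles.tsv \
    -o cgmlst_results/ --threshold 0.95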
```
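
Downstream distance calculations need a purely numeric allele matrix. The following sketch shows one way to load an allele-call table into that form. The function name `load_allele_profiles` is hypothetical, and the handling of `INF-` and `*` prefixes and of codes such as `LNF` or `NIPH` is an assumption about the chewBBACA output format; check it against the files your version writes.

```python
import io

import pandas as pd


def load_allele_profiles(path_or_buffer):
    """Read an allele-call TSV and return an integer matrix with 0 for missing loci."""
    raw = pd.read_csv(path_or_buffer, sep='\t', index_col=0, dtype=str)
    # Assumed prefixes: 'INF-' for newly inferred alleles, '*' for local novel alleles
    cleaned = raw.apply(lambda col: col.str.replace(r'^(?:INF-)?\*?', '', regex=True))
    # Anything still non-numeric (assumed codes such as LNF, NIPH, ASM) becomes NaN
    numeric = cleaned.apply(pd.to_numeric, errors='coerce')
    return numeric.fillna(0).astype(int)


# Made-up table in the assumed layout: first column holds the isolate name
example_tsv = (
    'FILE\tlocus1\tlocus2\tlocus3\n'
    'isoA\t1\tINF-4\tLNF\n'
    'isoB\t2\t4\t*7\n'
)
profiles = load_allele_profiles(io.StringIO(example_tsv))
print(profiles)
```

Coding missing calls as 0 matches the convention used by the distance function in the next section.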

## cgMLST Distance Analysis

**Goal:** Compute pairwise allelic distances between isolates and cluster them to identify potential outbreak groups.

**Approach:** Count allelic differences between each pair of isolate profiles (ignoring missing data), then apply single-linkage hierarchical clustering with a pathogen-specific distance threshold.

```python
import pandas as pd
import numpy as np

def calculate_cgmlst_distance(profiles):
    '''Calculate allelic distances between isolates

    Distance interpretation (typical thresholds):
    - 0-5 allele differences: Same cluster (likely recent transmission)
    - 6-15 differences: Related (possible epidemiological link)
    - >15 differences: Different clones

    Note: Thresholds are pathogen-specific. Consult literature.
    '''
    # Allele IDs as integers; 0 marks a missing or uncalled locus
    alleles = profiles.to_numpy(dtype=int)
    called = alleles != 0
    n = len(alleles)
    distances = np.zeros((n, n), dtype=int)

    for i in range(n):
        # Compare isolate i with every isolate, only at loci called in both
        both_called = called[i] & called
        distances[i] = ((alleles[i] != alleles) & both_called).sum(axis=1)

    return pd.DataFrame(distances, index=profiles.index, columns=profiles.index)


def identify_clusters(distance_matrix, threshold=5):
    '''Identify cgMLST clusters

    Threshold values by organism:
    - E. coli: 10 alleles
    - Salmonella: 7 alleles
    - Listeria: 7 alleles
    - S. aureus: 24 alleles
    '''
    from scipy.cluster.hierarchy import linkage, fcluster

    # Convert to condensed distance matrix
    condensed = distance_matrix.values[np.triu_indices(len(distance_matrix), k=1)]

    # Hierarchical clustering
    Z = linkage(condensed, method='single')
    clusters = fcluster(Z, t=threshold, criterion='distance')

    return dict(zip(distance_matrix.index, clusters))
```
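
To make the approach concrete, here is a small worked example on made-up profiles. It is a self-contained sketch that uses NumPy broadcasting in place of the loop, and the threshold of 2 is chosen only to suit the toy data, not any real pathogen.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Made-up allele profiles; 0 marks a missing locus
profiles = pd.DataFrame(
    [[1, 1, 1, 1, 1],
     [1, 1, 1, 1, 2],
     [1, 1, 0, 1, 2],
     [9, 9, 9, 9, 9]],
    index=['A', 'B', 'C', 'D'],
    columns=[f'locus{i}' for i in range(1, 6)],
)

alleles = profiles.to_numpy()
called = alleles != 0

# Compare every pair of isolates at once, counting only loci called in both
differs = alleles[:, None, :] != alleles[None, :, :]
both_called = called[:, None, :] & called[None, :, :]
distances = (differs & both_called).sum(axis=2)

# Single linkage: an isolate joins a cluster if it is within the
# threshold of at least one member
condensed = squareform(distances.astype(float), checks=False)
Z = linkage(condensed, method='single')
labels = fcluster(Z, t=2, criterion='distance')
clusters = dict(zip(profiles.index, labels))

print(pd.DataFrame(distances, index=profiles.index, columns=profiles.index))
print(clusters)
```

Isolates A, B, and C differ by at most one allele and fall into one cluster, while D is at least four alleles from every other isolate and stays separate. Because single linkage chains isolates together, a cluster can contain pairs that exceed the threshold when intermediate isolates bridge them.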

## SNP-Based Typing

```python
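def snp_typing_from_vcf(vcf_file, reference_positions):
    '''Extract SNP profile for typing

    Some organisms use canonical SNP positions for typing
    (e.g., Mycobacterium tuberculosis lineages)

    reference_positions: iterable of 'chrom:pos' strings (1-based).
    Region queries need a bgzipped, indexed VCF (.vcf.gz with .tbi or .csi).
    Positions with no VCF record are left out of the returned profile.
    '''
    from cyvcf2 import VCF

    vcf = VCF(vcf_file)
    profile = {}

    for pos in reference_positions:
        chrom, position = pos.rsplit(':', 1)
        for variant in vcf(f'{chrom}:{position}-{position}'):
            # Skip records that only overlap the site (e.g. upstream deletions)
            if variant.POS != int(position):
                continue
            profile[pos] = variant.ALT[0] if variant.ALT else variant.REF

    vcf.close()
    return profile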
```
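
Once a SNP profile is extracted, assigning a lineage amounts to checking it against a table of marker alleles. The sketch below illustrates the idea with placeholder data: the marker positions, alleles, and lineage names are hypothetical and do not come from any published scheme, and `assign_lineage` and `most_specific` are illustrative helper names.

```python
def assign_lineage(profile, marker_table):
    """Return the lineages whose marker alleles are all present in the profile."""
    hits = []
    for lineage, markers in marker_table.items():
        if all(profile.get(pos) == allele for pos, allele in markers.items()):
            hits.append(lineage)
    return hits


def most_specific(hits, marker_table):
    """Pick the matching lineage defined by the most markers, or None if no hits."""
    if not hits:
        return None
    return max(hits, key=lambda lineage: len(marker_table[lineage]))


# Hypothetical markers: placeholders only, not a real typing scheme
EXAMPLE_MARKERS = {
    'lineage_A': {'chr1:1000': 'T'},
    'lineage_A.1': {'chr1:1000': 'T', 'chr1:2500': 'G'},
    'lineage_B': {'chr1:4200': 'C'},
}

# Profile in the same 'chrom:pos' -> allele form returned by snp_typing_from_vcf
profile = {'chr1:1000': 'T', 'chr1:2500': 'G'}
hits = assign_lineage(profile, EXAMPLE_MARKERS)
print(hits, most_specific(hits, EXAMPLE_MARKERS))
```

For real typing, replace `EXAMPLE_MARKERS` with a published marker set for the organism and keep the reference genome and coordinates consistent with the VCF.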

## Enterobase Integration

```python
import requests

def query_enterobase(st, organism='ecoli'):
    '''Query Enterobase for ST metadata

    Enterobase provides:
    - Geographic distribution
    - Temporal trends
    - Associated serotypes
    - Virulence gene profiles
    '''
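    # Note: Requires an EnteroBase API token.
    # The path below is illustrative; check the EnteroBase API documentation
    # for the exact route (it may also need a scheme name) before relying on it.
    url = f'https://enterobase.warwick.ac.uk/api/v2.0/{organism}/sts/{st}'

    # EnteroBase documents HTTP Basic auth, with the token as the username
    # and an empty password:
    # response = requests.get(url, auth=(token, ''))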

    print(f'Query Enterobase for ST{st}: {url}')
    return None  # Placeholder - requires authentication
```

## Related Skills

- epidemiological-genomics/phylodynamics - Time-scaled trees from typed isolates
- epidemiological-genomics/transmission-inference - Outbreak investigation
- metagenomics/kraken-classification - Species identification

Related Skills

tooluniverse-epigenomics

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Production-ready genomics and epigenomics data processing for BixBench questions. Handles methylation array analysis (CpG filtering, differential methylation, age-related CpG detection, chromosome-level density), ChIP-seq peak analysis (peak calling, motif enrichment, coverage stats), ATAC-seq chromatin accessibility, multi-omics integration (expression + methylation correlation), and genome-wide statistics. Pure Python computation (pandas, scipy, numpy, pysam, statsmodels) plus ToolUniverse annotation tools (Ensembl, ENCODE, SCREEN, JASPAR, ReMap, RegulomeDB, ChIPAtlas). Supports BED, BigWig, methylation beta-value matrices, Illumina manifest files, and multi-sample clinical data. Use when processing methylation data, ChIP-seq peaks, ATAC-seq signals, or answering questions about CpG sites, differential methylation, chromatin accessibility, histone marks, or epigenomic statistics.

claw-metagenomics

Shotgun metagenomics profiling — taxonomy, resistome, and functional pathways

bio-metagenomics-visualization

Visualize metagenomic profiles using R (phyloseq, microbiome) and Python (matplotlib, seaborn). Create stacked bar plots, heatmaps, PCA plots, and diversity analyses. Use when creating publication-quality figures from MetaPhlAn, Bracken, or other taxonomic profiling output.

bio-metagenomics-strain-tracking

Track bacterial strains using MASH, sourmash, fastANI, and inStrain. Compare genomes, detect contamination, and monitor strain-level variation. Use when needing sub-species resolution for outbreak tracking, transmission analysis, or within-host strain dynamics.

bio-metagenomics-metaphlan

Marker gene-based taxonomic profiling using MetaPhlAn 4. Provides accurate species-level relative abundances using clade-specific markers. Use when accurate taxonomic profiling is needed and computational resources are limited, or for comparison with HMP/other MetaPhlAn studies.

bio-metagenomics-kraken

Taxonomic classification of metagenomic reads using Kraken2. Fast k-mer based classification against RefSeq database. Use when performing initial taxonomic classification of shotgun metagenomic reads before abundance estimation with Bracken.

bio-metagenomics-functional-profiling

Profile functional potential of metagenomes using HUMAnN3 and similar tools. Use when obtaining pathway abundances, gene family counts, or functional annotations from metagenomic data.

bio-metagenomics-amr-detection

Detect antimicrobial resistance genes using AMRFinderPlus, ResFinder, and CARD. Screen isolates and metagenomes for resistance determinants. Use when characterizing resistance profiles in clinical isolates, surveillance samples, or metagenomic data.

bio-metagenomics-abundance

Species abundance estimation using Bracken with Kraken2 output. Redistributes reads from higher taxonomic levels to species for more accurate estimates. Use when accurate species-level abundances are needed from Kraken2 classification output.

bio-imaging-mass-cytometry-phenotyping

Cell type assignment from marker expression in IMC data. Covers manual gating, clustering, and automated classification approaches. Use when assigning cell types to segmented IMC cells based on protein marker expression or when phenotyping cells in multiplexed imaging data.

bio-flow-cytometry-clustering-phenotyping

Unsupervised clustering and cell type identification for flow/mass cytometry. Covers FlowSOM, Phenograph, and CATALYST workflows. Use when discovering cell populations in high-dimensional cytometry data without predefined gates.

bio-epidemiological-genomics-variant-surveillance

Assign pathogen lineages and track variants using Nextclade and pangolin for viral surveillance. Monitor variant prevalence and identify emerging variants of concern. Use when classifying viral sequences, tracking lineage dynamics, or monitoring for variants of concern.