bio-epidemiological-genomics-pathogen-typing
Perform multi-locus sequence typing (MLST), core genome MLST, and SNP-based strain typing for bacterial isolate characterization using mlst and chewBBACA. Use when identifying strain types, tracking outbreak clones, or characterizing bacterial isolates.
Best use case
bio-epidemiological-genomics-pathogen-typing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using bio-epidemiological-genomics-pathogen-typing should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/bio-epidemiological-genomics-pathogen-typing/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How bio-epidemiological-genomics-pathogen-typing Compares
| Feature / Agent | bio-epidemiological-genomics-pathogen-typing | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Perform multi-locus sequence typing (MLST), core genome MLST, and SNP-based strain typing for bacterial isolate characterization using mlst and chewBBACA. Use when identifying strain types, tracking outbreak clones, or characterizing bacterial isolates.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
## Version Compatibility
Reference examples tested with: mlst 2.23+, numpy 1.26+, pandas 2.2+, scanpy 1.10+, scipy 1.12+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
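The version checks above can also be scripted. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `report_environment` are illustrative and not part of this skill, and the default package and tool lists are assumptions to adjust.

```python
import shutil
from importlib import metadata


def installed_version(package):
    '''Return the installed version string of a Python package, or None if absent.'''
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_environment(packages=('numpy', 'pandas', 'scipy'),
                       tools=('mlst', 'chewBBACA.py')):
    '''Collect Python package versions and CLI tool locations in one dict.'''
    report = {pkg: installed_version(pkg) for pkg in packages}
    # shutil.which returns the executable path, or None when it is not on PATH
    report.update({tool: shutil.which(tool) for tool in tools})
    return report


if __name__ == '__main__':
    for name, value in report_environment().items():
        print(f'{name}: {value or "not found"}')
```

A `None` entry means the package or tool is missing. For CLI tools this only confirms presence on `PATH`, so `<tool> --version` is still needed to compare against the versions listed above.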
# Pathogen Typing
**"Type my bacterial isolates by MLST"** → Assign multi-locus sequence types to bacterial genomes for isolate characterization, outbreak clone identification, and strain tracking.
- CLI: `mlst assembly.fasta` for 7-gene MLST typing
- CLI: `chewBBACA.py AlleleCall` for core genome MLST (cgMLST)
## MLST with mlst Tool
```bash
# Install mlst
conda install -c bioconda mlst
# Basic MLST typing
mlst genome.fasta
# Output (tab-separated: file, scheme, ST, alleles):
# genome.fasta ecoli 131 adk(53) fumC(40) gyrB(47) ...
# Batch typing
mlst *.fasta > typing_results.tsv
# Specify scheme (scheme names differ between mlst releases; confirm with `mlst --list`)
mlst --scheme senterica genome.fasta
# List available schemes
mlst --list
# Write comma-separated (CSV) output instead of tab-separated
mlst --csv genome.fasta > results.csv
```
## Parse MLST Results
```python
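import subprocess

import pandas as pd


def run_mlst(fasta_files, scheme=None):
    '''Run MLST on multiple genomes

    Returns DataFrame with:
    - Sample name
    - Scheme (auto-detected or specified)
    - Sequence type (ST)
    - Allele profiles

    ST interpretation:
    - Known ST: Allele combination matches an existing type in the database
    - Novel: New allele or new allele combination (mlst reports the ST as '-');
      may be an unreported ST
    - Failed: Unable to determine (poor assembly or wrong scheme)
    '''
    cmd = ['mlst']
    if scheme:
        cmd.extend(['--scheme', scheme])
    cmd.extend(fasta_files)
    # check=True raises CalledProcessError if mlst exits with an error
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    rows = [line.split('\t') for line in result.stdout.splitlines() if line.strip()]
    # Rows differ in length when isolates match schemes with different locus
    # counts, so pad every row to the widest one before building the frame.
    width = max([3] + [len(row) for row in rows])
    columns = ['file', 'scheme', 'ST'] + [f'locus{i}' for i in range(1, width - 2)]
    rows = [row + [None] * (width - len(row)) for row in rows]
    return pd.DataFrame(rows, columns=columns)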
```
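Each allele column in the `mlst` output combines the locus name and the call, for example `adk(53)`. The sketch below splits those fields and flags calls that are not exact matches. It assumes the allele notation described in the `mlst` README (`~n` novel allele, `n?` partial match, `-` missing, `n,m` multiple alleles). The function names and the example line are hypothetical.

```python
import re

# Matches fields such as "adk(53)", "fumC(~40)" or "icd(-)"
ALLELE_FIELD = re.compile(r'^(?P<locus>[^()]+)\((?P<call>[^()]*)\)$')


def parse_allele_field(field):
    '''Split one allele field into (locus, call, status).'''
    match = ALLELE_FIELD.match(field.strip())
    if match is None:
        raise ValueError(f'Unrecognised allele field: {field!r}')
    locus, call = match.group('locus'), match.group('call')
    if call == '-':
        status = 'missing'
    elif ',' in call:
        status = 'multiple'
    elif call.startswith('~'):
        status = 'novel'
    elif call.endswith('?'):
        status = 'partial'
    elif call.isdigit():
        status = 'exact'
    else:
        status = 'unknown'
    return locus, call, status


def parse_mlst_line(line):
    '''Parse one tab-separated mlst output line into a dict.'''
    fields = line.rstrip('\n').split('\t')
    filename, scheme, st = fields[:3]
    alleles = [parse_allele_field(f) for f in fields[3:]]
    return {
        'file': filename,
        'scheme': scheme,
        'ST': st,
        'alleles': {locus: call for locus, call, _ in alleles},
        # Loci whose call is anything other than an exact match
        'needs_review': [locus for locus, _, status in alleles if status != 'exact'],
    }


# Hypothetical output line with one novel, one partial and one missing allele
example = 'genome.fasta\tecoli\t-\tadk(53)\tfumC(~40)\tgyrB(47?)\ticd(-)'
record = parse_mlst_line(example)
print(record['needs_review'])  # ['fumC', 'gyrB', 'icd']
```

Isolates with a non-empty `needs_review` list are the ones to check for assembly quality or a mismatched scheme before treating the ST as final.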
## Core Genome MLST (cgMLST)
```bash
# chewBBACA for cgMLST
pip install chewbbaca
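# Download a schema from Chewie-NS (or build one with CreateSchema)
# DownloadSchema also expects a schema identifier (-sc); SCHEMA_ID is a
# placeholder to set from the Chewie-NS listing for the species
chewBBACA.py DownloadSchema -sp "Salmonella enterica" -sc "$SCHEMA_ID" -o schema_dir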
# Run cgMLST
chewBBACA.py AlleleCall -i genomes/ -g schema_dir -o results/
# Determine the cgMLST loci (present in at least 95% of genomes)
# -o names an output directory, not a single file
chewBBACA.py ExtractCgMLST -i results/results_alleles.tsv \
    -o cgmlst_results/ --threshold 0.95
```
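The distance function in the next section treats `0` as missing data, so allele calls must be numeric before profiles are compared. The sketch below shows one way to mask raw calls by hand. It assumes that inferred alleles carry an `INF-` prefix, that locally novel alleles may carry a leading `*`, and that any other non-numeric code (such as `LNF` or `NIPH`) means no usable call. Check these assumptions against the output of the installed chewBBACA version. The function names are illustrative.

```python
def mask_allele_call(call):
    '''Convert one raw allele call to an integer, using 0 for missing data.'''
    cleaned = call.strip()
    # Assumption: "INF-" marks a newly inferred allele whose number is still usable
    if cleaned.startswith('INF-'):
        cleaned = cleaned[len('INF-'):]
    # Assumption: a leading "*" marks an allele not yet in the remote schema
    cleaned = cleaned.lstrip('*')
    # Anything that is not a plain integer now is treated as missing
    return int(cleaned) if cleaned.isdigit() else 0


def mask_profiles(rows):
    '''Map {sample: [raw calls]} to {sample: [integer calls]}.'''
    return {sample: [mask_allele_call(c) for c in calls]
            for sample, calls in rows.items()}


# Hypothetical raw calls for two isolates across four loci
raw = {
    'isolate_A': ['1', 'INF-12', 'LNF', '4'],
    'isolate_B': ['1', '12', 'NIPH', '*7'],
}
masked = mask_profiles(raw)
print(masked)
# {'isolate_A': [1, 12, 0, 4], 'isolate_B': [1, 12, 0, 7]}
```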
## cgMLST Distance Analysis
**Goal:** Compute pairwise allelic distances between isolates and cluster them to identify potential outbreak groups.
**Approach:** Count allelic differences between each pair of isolate profiles (ignoring missing data), then apply single-linkage hierarchical clustering with a pathogen-specific distance threshold.
```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage


def calculate_cgmlst_distance(profiles):
    '''Calculate allelic distances between isolates

    profiles: DataFrame with one row per isolate and one column per locus;
    allele calls are integers and 0 marks a missing call.

    Distance interpretation (typical thresholds):
    - 0-5 allele differences: Same cluster (likely recent transmission)
    - 6-15 differences: Related (possible epidemiological link)
    - >15 differences: Different clones

    Note: Thresholds are pathogen-specific. Consult literature.
    '''
    alleles = profiles.to_numpy()
    called = alleles != 0
    n = len(profiles)
    distances = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            # Count loci called in both isolates whose alleles differ
            both_called = called[i] & called[j]
            diff = int(np.sum(both_called & (alleles[i] != alleles[j])))
            distances[i, j] = distances[j, i] = diff
    return pd.DataFrame(distances, index=profiles.index, columns=profiles.index)


def identify_clusters(distance_matrix, threshold=5):
    '''Identify cgMLST clusters

    Threshold values by organism:
    - E. coli: 10 alleles
    - Salmonella: 7 alleles
    - Listeria: 7 alleles
    - S. aureus: 24 alleles

    The default of 5 is a placeholder; pass the value for your organism.
    '''
    # Convert the square matrix to SciPy's condensed form (upper triangle, row-major)
    condensed = distance_matrix.values[np.triu_indices(len(distance_matrix), k=1)]
    # Single-linkage clustering, cut so that cophenetic distances stay <= threshold
    Z = linkage(condensed, method='single')
    clusters = fcluster(Z, t=threshold, criterion='distance')
    return dict(zip(distance_matrix.index, clusters))
```
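Cutting a single-linkage tree at a distance threshold gives the same groups as joining every pair of isolates whose distance is at or below that threshold and taking the connected components. The sketch below shows this with the standard library only, on made-up profiles. It is an illustration of the idea, not a replacement for the SciPy version above, and the function names are hypothetical.

```python
def allelic_distance(profile_a, profile_b, missing=0):
    '''Count loci where both isolates have a call and the calls differ.'''
    return sum(1 for a, b in zip(profile_a, profile_b)
               if a != b and a != missing and b != missing)


def single_linkage_clusters(profiles, threshold):
    '''Group isolates connected by chains of pairs at or below the threshold.'''
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent pointers to the cluster representative, compressing the path
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if allelic_distance(profiles[first], profiles[second]) <= threshold:
                # Merge the two clusters
                parent[find(first)] = find(second)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Made-up profiles over six loci; 0 marks a missing call
profiles = {
    'iso1': [1, 2, 3, 4, 5, 6],
    'iso2': [1, 2, 3, 4, 5, 7],   # 1 difference from iso1
    'iso3': [1, 2, 3, 4, 8, 7],   # 1 difference from iso2, 2 from iso1
    'iso4': [9, 9, 9, 9, 9, 9],   # 6 differences from iso1, iso2 and iso3
    'iso5': [9, 9, 0, 9, 9, 9],   # missing locus ignored: 0 differences from iso4
}
print(single_linkage_clusters(profiles, threshold=1))
# [['iso1', 'iso2', 'iso3'], ['iso4', 'iso5']]
```

Note the chaining effect: `iso1` and `iso3` differ at two loci yet share a cluster at threshold 1 because `iso2` links them. This is inherent to single linkage and worth keeping in mind when reading large clusters.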
## SNP-Based Typing
```python
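from cyvcf2 import VCF


def snp_typing_from_vcf(vcf_file, reference_positions):
    '''Extract SNP profile for typing

    Some organisms use canonical SNP positions for typing
    (e.g., Mycobacterium tuberculosis lineages)

    reference_positions: iterable of 'chrom:pos' strings (1-based).
    Region queries require a bgzip-compressed, tabix-indexed VCF.
    Assumes a single-sample VCF in which a record means the isolate
    carries the ALT allele. Positions without a record map to None.
    '''
    vcf = VCF(vcf_file)
    profile = {}
    for pos in reference_positions:
        chrom, position = pos.rsplit(':', 1)
        profile[pos] = None
        for variant in vcf(f'{chrom}:{position}-{position}'):
            # Skip records that only overlap the site (e.g. upstream deletions)
            if variant.POS != int(position):
                continue
            profile[pos] = variant.ALT[0] if variant.ALT else variant.REF
    return profile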
```
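Once a SNP profile is extracted, typing means comparing it against a table of marker alleles. The sketch below assumes a profile keyed by `'chrom:pos'` strings, as returned above. The barcode positions, alleles, and lineage names are invented for illustration and do not correspond to any published scheme.

```python
# Hypothetical barcode: positions, alleles and lineage names are invented
LINEAGE_BARCODE = {
    'lineage_A': {'chr1:1000': 'T', 'chr1:2500': 'G'},
    'lineage_B': {'chr1:4000': 'A'},
}


def assign_lineages(profile, barcode=LINEAGE_BARCODE):
    '''Return the lineages whose marker alleles are all present in the profile.'''
    return sorted(
        lineage for lineage, markers in barcode.items()
        if all(profile.get(pos) == allele for pos, allele in markers.items())
    )


# Made-up profile: both lineage_A markers match, the lineage_B marker does not
profile = {'chr1:1000': 'T', 'chr1:2500': 'G', 'chr1:4000': 'C'}
print(assign_lineages(profile))  # ['lineage_A']
```

Requiring every marker to match is a conservative choice. An isolate with an uncalled marker position gets no assignment rather than a possibly wrong one.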
## Enterobase Integration
```python
import requests


def query_enterobase(st, organism='ecoli'):
    '''Query Enterobase for ST metadata

    Enterobase provides:
    - Geographic distribution
    - Temporal trends
    - Associated serotypes
    - Virulence gene profiles
    '''
    # Note: Requires an API token. The endpoint path below is illustrative;
    # confirm the route and authentication scheme in the Enterobase API docs.
    url = f'https://enterobase.warwick.ac.uk/api/v2.0/{organism}/sts/{st}'
    # Would need authentication headers
    # response = requests.get(url, headers={'Authorization': f'Bearer {token}'})
    print(f'Query Enterobase for ST{st}: {url}')
    return None  # Placeholder - requires authentication
```
## Related Skills
- epidemiological-genomics/phylodynamics - Time-scaled trees from typed isolates
- epidemiological-genomics/transmission-inference - Outbreak investigation
- metagenomics/kraken-classification - Species identification
Related Skills
tooluniverse-epigenomics
Production-ready genomics and epigenomics data processing for BixBench questions. Handles methylation array analysis (CpG filtering, differential methylation, age-related CpG detection, chromosome-level density), ChIP-seq peak analysis (peak calling, motif enrichment, coverage stats), ATAC-seq chromatin accessibility, multi-omics integration (expression + methylation correlation), and genome-wide statistics. Pure Python computation (pandas, scipy, numpy, pysam, statsmodels) plus ToolUniverse annotation tools (Ensembl, ENCODE, SCREEN, JASPAR, ReMap, RegulomeDB, ChIPAtlas). Supports BED, BigWig, methylation beta-value matrices, Illumina manifest files, and multi-sample clinical data. Use when processing methylation data, ChIP-seq peaks, ATAC-seq signals, or answering questions about CpG sites, differential methylation, chromatin accessibility, histone marks, or epigenomic statistics.
claw-metagenomics
Shotgun metagenomics profiling — taxonomy, resistome, and functional pathways
bio-metagenomics-visualization
Visualize metagenomic profiles using R (phyloseq, microbiome) and Python (matplotlib, seaborn). Create stacked bar plots, heatmaps, PCA plots, and diversity analyses. Use when creating publication-quality figures from MetaPhlAn, Bracken, or other taxonomic profiling output.
bio-metagenomics-strain-tracking
Track bacterial strains using MASH, sourmash, fastANI, and inStrain. Compare genomes, detect contamination, and monitor strain-level variation. Use when needing sub-species resolution for outbreak tracking, transmission analysis, or within-host strain dynamics.
bio-metagenomics-metaphlan
Marker gene-based taxonomic profiling using MetaPhlAn 4. Provides accurate species-level relative abundances using clade-specific markers. Use when accurate taxonomic profiling is needed and computational resources are limited, or for comparison with HMP/other MetaPhlAn studies.
bio-metagenomics-kraken
Taxonomic classification of metagenomic reads using Kraken2. Fast k-mer based classification against RefSeq database. Use when performing initial taxonomic classification of shotgun metagenomic reads before abundance estimation with Bracken.
bio-metagenomics-functional-profiling
Profile functional potential of metagenomes using HUMAnN3 and similar tools. Use when obtaining pathway abundances, gene family counts, or functional annotations from metagenomic data.
bio-metagenomics-amr-detection
Detect antimicrobial resistance genes using AMRFinderPlus, ResFinder, and CARD. Screen isolates and metagenomes for resistance determinants. Use when characterizing resistance profiles in clinical isolates, surveillance samples, or metagenomic data.
bio-metagenomics-abundance
Species abundance estimation using Bracken with Kraken2 output. Redistributes reads from higher taxonomic levels to species for more accurate estimates. Use when accurate species-level abundances are needed from Kraken2 classification output.
bio-imaging-mass-cytometry-phenotyping
Cell type assignment from marker expression in IMC data. Covers manual gating, clustering, and automated classification approaches. Use when assigning cell types to segmented IMC cells based on protein marker expression or when phenotyping cells in multiplexed imaging data.
bio-flow-cytometry-clustering-phenotyping
Unsupervised clustering and cell type identification for flow/mass cytometry. Covers FlowSOM, Phenograph, and CATALYST workflows. Use when discovering cell populations in high-dimensional cytometry data without predefined gates.
bio-epidemiological-genomics-variant-surveillance
Assign pathogen lineages and track variants using Nextclade and pangolin for viral surveillance. Monitor variant prevalence and identify emerging variants of concern. Use when classifying viral sequences, tracking lineage dynamics, or monitoring for variants of concern.