scikit-bio

A Python bioinformatics toolkit for sequence, phylogeny, and microbiome/community-ecology analysis; use it when you need to compute diversity/ordination/statistics from biological data and standard formats (FASTA/FASTQ/Newick/BIOM).

53 stars

byaipoch

View on GitHub Installation ↓

Best use case

scikit-bio is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using scikit-bio should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/scikit-bio/SKILL.md --create-dirs "https://raw.githubusercontent.com/aipoch/medical-research-skills/main/scientific-skills/Data Analysis/scikit-bio/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/scikit-bio/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How scikit-bio Compares

Feature / Agent	scikit-bio	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)

## When to Use

- You need to parse, validate, and manipulate biological sequences (DNA/RNA/protein) and their metadata.
- You are running microbiome/community-ecology workflows (alpha/beta diversity, UniFrac, ordination, PERMANOVA).
- You need to build, transform, or compare phylogenetic trees (Newick I/O, pruning/rerooting, patristic distances).
- You want to compute and work with distance matrices and downstream multivariate analyses (PCoA, Mantel, ANOSIM).
- You need to read/write common bioinformatics formats (FASTA/FASTQ, Newick, BIOM) and convert between them.

## Key Features

- **Sequence objects**: `DNA`, `RNA`, `Protein`, and generic `Sequence` with validation, slicing, motif search, reverse complement, transcription/translation, and metadata handling.
- **Alignment utilities**: pairwise local alignment (SSW-based) and multiple sequence alignment containers (`TabularMSA`) with consensus support.
- **Phylogenetics**: `TreeNode` manipulation, tree construction from distance matrices (e.g., Neighbor Joining), and tree distance/metrics.
- **Diversity**: alpha diversity (e.g., Shannon, Faith’s PD) and beta diversity (e.g., Bray-Curtis, UniFrac) returning `Series`/`DistanceMatrix`.
- **Ordination & stats**: PCoA and ecological hypothesis tests (PERMANOVA, ANOSIM, Mantel) operating on distance matrices.
- **I/O ecosystem**: FASTA/FASTQ and Newick reading/writing; BIOM table support via `Table`.

## Dependencies

- `scikit-bio>=0.6.0`
- `numpy>=1.23`
- `pandas>=1.5`

## Example Usage

```python
# pip install scikit-bio numpy pandas

import numpy as np
import pandas as pd

import skbio
from skbio import DNA, TreeNode
from skbio.diversity import alpha_diversity, beta_diversity
from skbio.stats.ordination import pcoa
from skbio.stats.distance import permanova

# ----------------------------
# 1) Sequence manipulation
# ----------------------------
seq = DNA("ACGTACGTNN--ACGT", metadata={"id": "seq1"})
seq_clean = seq.degap()
rc = seq_clean.reverse_complement()
motif_hits = seq_clean.find_with_regex("ACG[TA]")

print("Original:", str(seq))
print("Degapped:", str(seq_clean))
print("Reverse complement:", str(rc))
print("Motif hits:", list(motif_hits))

# ----------------------------
# 2) Microbiome-style counts
# ----------------------------
# rows = samples, cols = features/OTUs/ASVs
counts = np.array([
    [10,  0,  3,  1],
    [ 0,  8,  2,  0],
    [ 5,  1,  0,  4],
], dtype=int)

sample_ids = ["S1", "S2", "S3"]
feature_ids = ["F1", "F2", "F3", "F4"]

# Alpha diversity (Shannon)
shannon = alpha_diversity("shannon", counts, ids=sample_ids)
print("\nAlpha diversity (Shannon):")
print(shannon)

# Beta diversity (Bray-Curtis) -> DistanceMatrix
dm = beta_diversity("braycurtis", counts, ids=sample_ids)
print("\nBeta diversity (Bray-Curtis) distance matrix:")
print(dm)

# ----------------------------
# 3) Ordination (PCoA)
# ----------------------------
ord_res = pcoa(dm)
print("\nPCoA sample coordinates (first 2 axes):")
print(ord_res.samples[["PC1", "PC2"]])

# ----------------------------
# 4) PERMANOVA on the distance matrix
# ----------------------------
grouping = pd.Series(["A", "A", "B"], index=sample_ids)
perma = permanova(dm, grouping=grouping, permutations=99)
print("\nPERMANOVA result:")
print(perma)

# ----------------------------
# 5) Tree I/O (Newick) + basic manipulation
# ----------------------------
newick = "((F1:0.1,F2:0.2):0.3,(F3:0.2,F4:0.4):0.1);"
tree = TreeNode.read([newick])
subtree = tree.shear(["F1", "F2", "F3"])
print("\nSheared tree (tips F1,F2,F3):")
print(subtree.ascii_art())
```

## Implementation Details

- **Sequence model**
  - Use `DNA`/`RNA`/`Protein` for alphabet-aware validation and biological operations (e.g., `reverse_complement`, `transcribe`, `translate`).
  - Use `Sequence` when you need a generic container without strict alphabet constraints.
  - FASTQ quality scores (when read via scikit-bio I/O) are stored as positional metadata.

- **Diversity computations**
  - `alpha_diversity(metric, counts, ids=...)` returns a per-sample vector (typically a pandas `Series`).
  - `beta_diversity(metric, counts, ids=...)` returns a `DistanceMatrix` suitable for ordination and hypothesis tests.
  - Count inputs should be **non-negative integers** representing abundances (not relative frequencies). Phylogenetic metrics (e.g., Faith’s PD, UniFrac) additionally require a tree and feature/OTU IDs.

- **Distance matrices**
  - `DistanceMatrix` enforces symmetry and a zero diagonal; IDs are used for consistent alignment with metadata and group labels.
  - Many downstream methods (PCoA, PERMANOVA, ANOSIM, Mantel) operate directly on `DistanceMatrix`.

- **Ordination**
  - `pcoa(dm)` performs eigen-decomposition on a transformed distance matrix and returns `OrdinationResults` containing eigenvalues and sample coordinates.

- **Permutation-based statistics**
  - `permanova(dm, grouping, permutations=N)` estimates significance by permuting group labels; increase `permutations` (e.g., 999+) for more stable p-values in real analyses.

Related Skills

scikit-survival

from aipoch/medical-research-skills

A comprehensive toolkit for survival analysis and time-to-event modeling in Python using scikit-survival; use it when you need to model censored time-to-event outcomes, fit Cox/RSF/GB models or Survival SVMs, evaluate with C-index/Brier score, or handle competing risks.

skill-auditor

from aipoch/medical-research-skills

A comprehensive auditor for any agent skill — including Manus, OpenClaw/ClawHub, Claude, LobeHub, or custom SKILL.md-based skills. Use this skill whenever a user wants to evaluate, audit, review, score, or quality-check an agent skill before publishing, updating, or deploying. Covers two hard veto gates (structural redlines + research integrity redlines), static quality scoring across 25 criteria (ISO 25010 + OpenSSF + Agent), dynamic test input generation, multi-mode execution testing, multi-layer output evaluation with five specialized category rubrics (Evidence Insight / Protocol Design / Data Analysis / Academic Writing / Other), a Research Veto that applies to all four research categories, human eval viewer generation, actionable P0/P1/P2 optimization recommendations, and automatic skill improvement that outputs a polished, production-ready SKILL.md. Also use whenever a user says "audit my skill", "evaluate my skill", "improve my skill", or wants a corrected version after evaluation.

two-sample-mr-research-planner

from aipoch/medical-research-skills

Generates complete two-sample Mendelian randomization (MR) research designs from a user-provided research direction. Use when users want to design, plan, or build a study using two-sample MR to test causal relationships. Triggers:"design a two-sample MR study", "build a publishable MR paper", "test whether this biomarker causally affects this disease", "generate Lite/Standard/Advanced MR plans", "screen multiple exposures with MR", "bidirectional MR design", "causal inference using GWAS summary statistics", or "I want to study X and Y using MR". Always outputs four workload configurations (Lite / Standard / Advanced / Publication+) with a recommended primary plan, step-by-step workflow, figure plan, validation strategy, minimal executable version, and publication upgrade path.

research-proposal-generator

from aipoch/medical-research-skills

Generates a comprehensive research proposal design based on input literature, including hypothesis, mechanism verification, and budget. Use when the user wants to design a research project from a paper.

research-grants

from aipoch/medical-research-skills

Write competitive research proposals for NSF, NIH, DOE, DARPA, and Taiwan's NSTC when you need agency-compliant narratives, budgets, and review-criteria alignment for a specific solicitation/FOA/BAA.

protocol-standardization

from aipoch/medical-research-skills

Standardize fragmented experimental steps into reproducible protocol documents when you need method organization, lab SOP drafting, or cross-operator reproducibility; missing parameters must be explicitly marked as "To be supplemented/Not provided".

prospero-registration-helper

from aipoch/medical-research-skills

Assists researchers in generating PROSPERO registration content for meta-analyses from a title and optional protocol. Use when the user wants to draft a PROSPERO registration form.

non-tumor-ml-research-planner

from aipoch/medical-research-skills

Generates complete non-tumor biomedical machine learning research designs from a user-provided research direction. Always use this skill when users want to plan bioinformatics + ML papers for non-cancer diseases (metabolic, cardiovascular, kidney, inflammatory, autoimmune, infectious, neurological, endocrine, wound healing, chronic multifactor), design diagnostic biomarker studies, combine GEO datasets with feature selection and ML modeling, or generate Lite/Standard/Advanced/Publication+ workload plans. Trigger for:"non-tumor ML study", "bioinformatics paper outside oncology", "key genes and diagnostic model for a disease", "pyroptosis/ferroptosis/senescence/autophagy + disease", "GEO datasets + machine learning", "RF + LASSO diagnostic model", "DEG + feature selection + validation", "immune infiltration + biomarker", "non-cancer biomarker paper". Trigger even for casual phrasings like "I want to study X using machine learning", "help me design a non-tumor bioinformatics paper", or "how do I build a diagnostic model for disease Y".

network-tox-docking-research-planner

from aipoch/medical-research-skills

Generates complete network toxicology + molecular docking research designs from a user-provided toxicant and disease/phenotype. Always use this skill when users want to investigate how an environmental toxicant, endocrine disruptor, heavy metal, food contaminant, pharmaceutical residue, or consumer product chemical may contribute to a disease through shared molecular targets, hub genes, pathways, and docking evidence. Trigger for:"network toxicology study", "toxicology mechanism paper", "target prediction + PPI + docking", "environmental pollutant and disease mechanism", "hub genes and docking for toxicant", "Lite/Standard/Advanced toxicology plan", "CTD + SwissTargetPrediction + GeneCards + STRING", "CB-Dock2 docking study", "triclosan/BPA/cadmium/PFAS + disease". Also triggers for Chinese phrasings:"网络毒理学研究设计"、"毒物机制论文"、"靶点预测+PPI+对接"、"环境污染物与疾病机制". Trigger even for casual phrasings like "I want to study how chemical X affects disease Y" or "help me design a toxicology paper". Always output four workload configurations (Lite / Standard / Advanced / Publication+) with a recommended primary plan, step-by-step workflow, figure plan, validation strategy, minimal executable version, and publication upgrade path.

meta-protocol-writer

from aipoch/medical-research-skills

Generates a PROSPERO-compliant Meta-analysis protocol based on Title and PICOS. Use when the user wants to write a protocol for a systematic review or meta-analysis.

hypothesis-generation

from aipoch/medical-research-skills

Structured scientific hypothesis formulation from observations; use when you have experimental observations or preliminary data and need testable hypotheses with predictions, mechanisms, and validation experiments.

hypogenic

from aipoch/medical-research-skills

Automated LLM-driven hypothesis generation and testing for tabular datasets; use when you need systematic exploration of empirical patterns (e.g., fraud detection, content analysis) and want to combine literature insights with data-driven hypothesis evaluation.