tooluniverse-single-cell

Production-ready single-cell and expression matrix analysis using scanpy, anndata, and scipy. Performs scRNA-seq QC, normalization, PCA, UMAP, Leiden/Louvain clustering, differential expression (Wilcoxon, t-test, DESeq2), cell type annotation, per-cell-type statistical analysis, gene-expression correlation, batch correction (Harmony), trajectory inference, and cell-cell communication analysis. NEW: Analyzes ligand-receptor interactions between cell types using OmniPath (CellPhoneDB, CellChatDB), scores communication strength, identifies signaling cascades, and handles multi-subunit receptor complexes. Integrates with ToolUniverse gene annotation tools (HPA, Ensembl, MyGene, UniProt) and enrichment tools (gseapy, PANTHER, STRING). Supports h5ad, 10X, CSV/TSV count matrices, and pre-annotated datasets. Use when analyzing single-cell RNA-seq data, studying cell-cell interactions, performing cell type differential expression, computing gene-expression correlations by cell type, analyzing tumor-immune communication, or answering questions about scRNA-seq datasets.

1,202 stars

by mims-harvard

Best use case

tooluniverse-single-cell is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using tooluniverse-single-cell should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/tooluniverse-single-cell/SKILL.md --create-dirs "https://raw.githubusercontent.com/mims-harvard/ToolUniverse/main/skills/tooluniverse-single-cell/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/tooluniverse-single-cell/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How tooluniverse-single-cell Compares

Feature / Agent	tooluniverse-single-cell	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Single-Cell Genomics and Expression Matrix Analysis

Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse.

---

## LOOK UP, DON'T GUESS
When uncertain about any scientific fact, SEARCH databases first (PubMed, UniProt, ChEMBL, ClinVar, etc.) rather than reasoning from memory. A database-verified answer is always more reliable than a guess.

---

## When to Use This Skill

Apply when users:
- Have scRNA-seq data (h5ad, 10X, CSV count matrices) and want analysis
- Ask about cell type identification, clustering, or annotation
- Need differential expression analysis by cell type or condition
- Want gene-expression correlation analysis (e.g., gene length vs expression by cell type)
- Ask about PCA, UMAP, t-SNE for expression data
- Need Leiden/Louvain clustering on expression matrices
- Want statistical comparisons between cell types (t-test, ANOVA, fold change)
- Ask about marker genes, batch correction, trajectory, or cell-cell communication

**BixBench Coverage**: 18+ questions across 5 projects (bix-22, bix-27, bix-31, bix-33, bix-36)

**NOT for** (use other skills instead):
- Bulk RNA-seq DESeq2 only -> `tooluniverse-rnaseq-deseq2`
- Gene enrichment only -> `tooluniverse-gene-enrichment`
- VCF/variant analysis -> `tooluniverse-variant-analysis`

---

## Core Principles

1. **Data-first** - Load, inspect, validate before analysis
2. **AnnData-centric** - All data flows through anndata objects
3. **Cell type awareness** - Per-cell-type subsetting when needed
4. **Statistical rigor** - Normalization, multiple testing correction, effect sizes
5. **Question-driven** - Parse what the user is actually asking

---

## Required Packages

```python
import scanpy as sc, anndata as ad, pandas as pd, numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA
from statsmodels.stats.multitest import multipletests
import gseapy as gp  # enrichment
import harmonypy     # batch correction (optional)
```

Install: `pip install scanpy anndata leidenalg umap-learn harmonypy gseapy pandas numpy scipy scikit-learn statsmodels` (add `pydeseq2` if you plan to run pseudo-bulk DE)

---

## Workflow Decision Tree

```
START: User question about scRNA-seq data
|
+-- FULL PIPELINE (raw counts -> annotated clusters)
|   Workflow: QC -> Normalize -> HVG -> PCA -> Cluster -> Annotate -> DE
|   See: references/scanpy_workflow.md
|
+-- DIFFERENTIAL EXPRESSION (per-cell-type comparison)
|   Most common BixBench pattern (bix-33)
|   See: analysis_patterns.md "Pattern 1"
|
+-- CORRELATION ANALYSIS (gene property vs expression)
|   Pattern: Gene length vs expression (bix-22)
|   See: analysis_patterns.md "Pattern 2"
|
+-- CLUSTERING & PCA (expression matrix analysis)
|   See: references/clustering_guide.md
|
+-- CELL COMMUNICATION (ligand-receptor interactions)
|   See: references/cell_communication.md
|
+-- TRAJECTORY ANALYSIS (pseudotime)
    See: references/trajectory_analysis.md
```

**Data format handling**:
- h5ad -> `sc.read_h5ad()`
- 10X -> `sc.read_10x_mtx()` or `sc.read_10x_h5()`
- CSV/TSV -> `pd.read_csv()` -> Convert to AnnData (check orientation!)

---

## Data Loading

AnnData expects: **cells/samples as rows (obs), genes as columns (var)**

```python
adata = sc.read_h5ad("data.h5ad")  # h5ad already oriented

# CSV/TSV: check orientation
df = pd.read_csv("counts.csv", index_col=0)
if df.shape[0] > df.shape[1] * 5:  # genes > samples by 5x => transpose
    df = df.T
adata = ad.AnnData(df)

# Load metadata
meta = pd.read_csv("metadata.csv", index_col=0)
common = adata.obs_names.intersection(meta.index)
adata = adata[common].copy()
for col in meta.columns:
    adata.obs[col] = meta.loc[common, col]
```

---

## Quality Control

```python
adata.var['mt'] = adata.var_names.str.startswith(('MT-', 'mt-'))
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
sc.pp.filter_cells(adata, min_genes=200)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_genes(adata, min_cells=3)
```

See: references/scanpy_workflow.md for details

---

## Complete Pipeline (Quick Reference)

```python
import scanpy as sc

adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")

# QC
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Normalize + HVG + PCA
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy()
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata, n_comps=50)

# Cluster + UMAP
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)

# Find marker genes per cluster (annotation and per-cell-type DE build on these results)
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')
```

---

## Differential Expression Decision Tree

```
Single-Cell DE (many cells per condition):
  Use: sc.tl.rank_genes_groups(), methods: wilcoxon, t-test, logreg
  Best for: Per-cell-type DE, marker gene finding

Pseudo-Bulk DE (aggregate counts by sample):
  Use: DESeq2 via PyDESeq2
  Best for: Sample-level comparisons with replicates

Statistical Tests Only:
  Use: scipy.stats (ttest_ind, f_oneway, pearsonr)
  Best for: Correlation, ANOVA, t-tests on summaries
```
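
The pseudo-bulk branch above depends on aggregating raw counts per sample before any DESeq2-style test. The following is a minimal sketch of that aggregation step only, using plain pandas. The function name `pseudobulk`, the column names `sample` and `cell_type`, and the `min_cells` cutoff are illustrative assumptions, not part of the skill's documented API.

```python
import numpy as np
import pandas as pd


def pseudobulk(counts, obs, sample_col="sample", celltype_col="cell_type", min_cells=10):
    """Sum raw counts over all cells sharing a (sample, cell type) pair.

    counts: cells x genes DataFrame of raw (unnormalized) counts.
    obs:    per-cell metadata indexed by the same cell barcodes.
    Groups with fewer than `min_cells` cells are dropped (hypothetical cutoff).
    """
    obs = obs.loc[counts.index]  # align metadata to the count matrix
    keys = [obs[sample_col].to_numpy(), obs[celltype_col].to_numpy()]
    grouped = counts.groupby(keys)
    summed = grouped.sum()
    keep = grouped.size() >= min_cells
    summed = summed.loc[keep.to_numpy()]
    # Flatten the (sample, cell type) MultiIndex into readable row labels
    summed.index = [f"{s}|{c}" for s, c in summed.index]
    return summed


# Toy data: 8 cells, 3 genes, 2 samples x 2 cell types
rng = np.random.default_rng(0)
cells = [f"c{i}" for i in range(8)]
counts = pd.DataFrame(rng.poisson(2, size=(8, 3)), index=cells, columns=["G1", "G2", "G3"])
obs = pd.DataFrame(
    {"sample": ["s1"] * 4 + ["s2"] * 4, "cell_type": ["T", "T", "B", "B"] * 2},
    index=cells,
)
pb = pseudobulk(counts, obs, min_cells=2)
print(pb)
```

Each row of `pb` is one sample-by-cell-type profile. To compare conditions within a single cell type, subset the rows for that cell type and pass them, with sample-level metadata, to a count-based DE tool.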

---

## Statistical Tests (Quick Reference)

```python
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Pearson correlation (swap in stats.spearmanr for a rank-based Spearman correlation)
r, p = stats.pearsonr(gene_lengths, mean_expression)

# Welch's t-test
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

# ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)

# Multiple testing correction (BH)
reject, pvals_adj, _, _ = multipletests(pvals, method='fdr_bh')
```
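
As a sketch of how these tests combine for the "gene length vs expression by cell type" pattern, the example below correlates gene length with mean expression within each cell type, then adjusts the p-values across cell types. The helper names and the synthetic data are assumptions for illustration. The Benjamini-Hochberg step is written out in NumPy only to keep the block self-contained; `multipletests(..., method='fdr_bh')` is the equivalent library call.

```python
import numpy as np
import pandas as pd
from scipy import stats


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (NumPy-only re-implementation)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.clip(scaled, 0, 1)
    return adj


def length_expression_correlation(expr, cell_types, gene_lengths):
    """Pearson r between gene length and mean expression, per cell type.

    expr: cells x genes array; cell_types: per-cell labels; gene_lengths: per-gene.
    """
    rows = []
    for ct in pd.unique(cell_types):
        mean_expr = expr[cell_types == ct].mean(axis=0)
        r, p = stats.pearsonr(gene_lengths, mean_expr)
        rows.append({"cell_type": ct, "pearson_r": r, "p_value": p})
    out = pd.DataFrame(rows)
    out["p_adj"] = bh_adjust(out["p_value"])
    return out


# Synthetic example: expression tracks gene length in "T" cells only
rng = np.random.default_rng(0)
gene_lengths = rng.uniform(500, 5000, size=40)
cell_types = np.array(["T"] * 30 + ["B"] * 30)
expr = rng.normal(1.0, 0.1, size=(60, 40))
expr[:30] += gene_lengths / 1000
result = length_expression_correlation(expr, cell_types, gene_lengths)
print(result)
```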

---

## Batch Correction (Harmony)

```python
import harmonypy
sc.tl.pca(adata, n_comps=50)
ho = harmonypy.run_harmony(adata.obsm['X_pca'][:, :30], adata.obs, 'batch', random_state=0)
adata.obsm['X_pca_harmony'] = ho.Z_corr.T
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
```

---

## ToolUniverse Integration

### Data Discovery (before analysis)
- **CxGDisc_search_datasets**: Search CELLxGENE Discover for scRNA-seq datasets by disease, tissue, organism. Use broad disease terms (e.g., "breast cancer" not "triple-negative").
- **GEO_search_rnaseq_datasets** / **geo_search_datasets**: Search GEO for scRNA-seq studies
- **NCBI_SRA_search_runs**: Search SRA for sequencing runs (query="single cell RNA-seq [disease]")
- **OmicsDI_search_datasets**: Cross-repository dataset search

### Cell Type Markers
- **CellMarker_search_by_cell_type**: Tissue-specific cell markers (use `CellMarker_list_cell_types` first — exact names required, e.g., "Regulatory T(Treg) cell" not "Regulatory T cell")
- **CellMarker_search_cancer_markers**: Cancer-context markers with experimental evidence
- **CellMarker_search_by_gene**: Reverse lookup — which cell types express a gene?
- **HPA_search_genes_by_query**: Cell-type marker gene search

### Gene Annotation
- **MyGene_query_genes** / **MyGene_batch_query**: Gene ID conversion
- **ensembl_lookup_gene**: Ensembl gene details
- **UniProt_get_function_by_accession**: Protein function

### Cell-Cell Communication
- **OmniPath_get_ligand_receptor_interactions**: L-R pairs (CellPhoneDB, CellChatDB)
- **OmniPath_get_signaling_interactions**: Downstream signaling
- **OmniPath_get_complexes**: Multi-subunit receptors

### Enrichment (Post-DE)
- **PANTHER_enrichment**: GO enrichment (BP, MF, CC)
- **STRING_functional_enrichment**: Network-based enrichment
- **ReactomeAnalysis_pathway_enrichment**: Reactome pathways

### Clinical Context (for tumor immunology)
- **DGIdb_get_drug_gene_interactions**: Drug interactions for immune checkpoint targets (genes=["CD274"] for PD-L1)
- **civic_search_evidence_items**: Clinical evidence for mutations/biomarkers
- **TIMER2_immune_estimation**: TCGA immune infiltration correlation
- **search_clinical_trials**: Clinical trial matching
- **GTEx_get_expression_summary**: Normal tissue baseline expression
- **PubMed_search_articles**: Literature context

---

## Scanpy vs Seurat Equivalents

| Operation | Seurat (R) | Scanpy (Python) |
|-----------|------------|-----------------|
| Load data | `Read10X()` | `sc.read_10x_mtx()` |
| Normalize | `NormalizeData()` | `sc.pp.normalize_total() + sc.pp.log1p()` |
| Find HVGs | `FindVariableFeatures()` | `sc.pp.highly_variable_genes()` |
| PCA | `RunPCA()` | `sc.tl.pca()` |
| Cluster | `FindClusters()` | `sc.tl.leiden()` |
| UMAP | `RunUMAP()` | `sc.tl.umap()` |
| Find markers | `FindMarkers()` | `sc.tl.rank_genes_groups()` |
| Batch correction | `RunHarmony()` | `harmonypy.run_harmony()` |

---

## Reasoning Framework for Result Interpretation

### Evidence Grading

| Grade | Criteria | Example |
|-------|----------|---------|
| **High confidence** | Marker padj < 0.01, log2FC > 1, expressed in > 25% of cluster cells | CD3D as T-cell marker with padj = 1e-50, log2FC = 3.2, pct = 0.85 |
| **Moderate confidence** | padj < 0.05, log2FC > 0.5, or expressed in 10-25% of cluster | FOXP3 in Treg cluster with padj = 0.001, pct = 0.18 |
| **Low confidence** | padj < 0.05 but log2FC < 0.5 or low pct_diff between clusters | Ubiquitously expressed gene with marginal enrichment |
| **Unreliable** | Fewer than 20 cells in cluster, or QC metrics suggest doublets | Cluster with mean nGenes > 6000 and high doublet score |
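
The grading table can be encoded as a small function, shown here as one possible reading rather than a definitive rule set. The function name, argument names, and the extra `"not significant"` label are assumptions. The "Moderate" row is ambiguous about how its conditions combine; this sketch requires padj < 0.05, log2FC > 0.5, and at least 10% of cluster cells expressing the gene.

```python
def grade_marker(padj, log2fc, pct_in_cluster, n_cells_in_cluster, suspected_doublet=False):
    """Assign an evidence grade to a candidate marker gene.

    pct_in_cluster is a fraction in [0, 1]. Thresholds follow the table above;
    how they combine for the middle grades is an interpretation.
    """
    # Cluster-level problems override any gene-level statistics
    if n_cells_in_cluster < 20 or suspected_doublet:
        return "unreliable"
    if padj < 0.01 and log2fc > 1 and pct_in_cluster > 0.25:
        return "high"
    if padj < 0.05 and log2fc > 0.5 and pct_in_cluster >= 0.10:
        return "moderate"
    if padj < 0.05:
        return "low"
    return "not significant"  # label not in the table; added for completeness


# The CD3D values come from the table; the FOXP3 log2FC is a made-up placeholder
cd3d_grade = grade_marker(padj=1e-50, log2fc=3.2, pct_in_cluster=0.85, n_cells_in_cluster=500)
foxp3_grade = grade_marker(padj=0.001, log2fc=1.5, pct_in_cluster=0.18, n_cells_in_cluster=200)
print(cd3d_grade, foxp3_grade)
```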

### Interpretation Guidance

- **QC metric thresholds**: Standard filters are nGenes > 200 (remove empty droplets), nGenes < 5000-6000 (remove doublets), pct_counts_mt < 20% (remove dying cells). These thresholds are tissue-dependent: immune cells tolerate stricter nGene filters; neurons may have higher mitochondrial content naturally. Always visualize distributions before setting cutoffs.
- **Cluster resolution guidance**: Leiden resolution 0.3-0.5 yields broad cell types (T cells, B cells, myeloid). Resolution 0.8-1.2 resolves subtypes (CD4 naive, CD4 memory, Treg). Resolution > 2.0 risks over-clustering (splitting biologically homogeneous populations). Validate by checking that each cluster has distinct marker genes.
- **Marker gene confidence levels**: A strong marker is highly specific (high pct_diff between cluster and rest) and highly expressed (high log2FC). Genes expressed in many clusters with small fold changes are poor markers. Cross-reference with known markers from CellMarker or HPA databases.
- **Pseudo-bulk vs single-cell DE**: For comparing conditions (treatment vs control), pseudo-bulk DE (aggregate by sample, then DESeq2) is more statistically valid than single-cell DE, which inflates significance due to non-independence of cells from the same sample.
- **Batch effects**: If samples cluster by batch rather than biology on UMAP, apply Harmony or other correction before biological interpretation.

### Synthesis Questions

1. Do the identified clusters correspond to known cell types based on canonical markers, or do some clusters lack clear biological identity (potentially doublets or low-quality cells)?
2. At the chosen clustering resolution, are there clusters that merge when resolution is lowered, suggesting they may be a single cell type split by technical noise?
3. For differential expression between conditions, are the results consistent between single-cell and pseudo-bulk approaches, and do the top DE genes have known biological relevance?
4. Do QC-flagged cells (high mito, extreme gene counts) concentrate in specific clusters, and does removing them change the clustering structure?
5. If batch correction was applied, do post-correction clusters still maintain expected cell-type-specific marker expression?
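
Question 4 can be checked with a per-cluster tally of QC-flagged cells. The sketch below assumes the per-cell table has the `n_genes_by_counts` and `pct_counts_mt` columns that `sc.pp.calculate_qc_metrics` writes to `adata.obs`, plus a `leiden` cluster column. The function name and default thresholds are illustrative and should be tuned per tissue.

```python
import pandas as pd


def qc_flag_fraction(obs, cluster_col="leiden", min_genes=200, max_genes=6000, max_pct_mt=20.0):
    """Fraction of cells per cluster that fail any of the QC thresholds."""
    flagged = (
        (obs["n_genes_by_counts"] < min_genes)
        | (obs["n_genes_by_counts"] > max_genes)
        | (obs["pct_counts_mt"] >= max_pct_mt)
    )
    grouped = flagged.groupby(obs[cluster_col].astype(str))
    summary = pd.DataFrame({"n_cells": grouped.size(), "n_flagged": grouped.sum()})
    summary["frac_flagged"] = summary["n_flagged"] / summary["n_cells"]
    return summary.sort_values("frac_flagged", ascending=False)


# Toy per-cell table standing in for adata.obs
obs_demo = pd.DataFrame(
    {
        "leiden": ["0", "0", "0", "1", "1", "1"],
        "n_genes_by_counts": [1500, 2500, 3000, 150, 7000, 2000],
        "pct_counts_mt": [5.0, 8.0, 3.0, 30.0, 10.0, 25.0],
    }
)
qc_summary = qc_flag_fraction(obs_demo)
print(qc_summary)
```

A cluster whose `frac_flagged` is far above the rest is a candidate low-quality or doublet cluster. Run this on the unfiltered object, since filtering first removes the cells being counted.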

---

## Troubleshooting

| Issue | Solution |
|-------|----------|
| `ModuleNotFoundError: leidenalg` | `pip install leidenalg` |
| Sparse matrix errors | Densify with `.toarray()`: `from scipy.sparse import issparse`, then `X = adata.X.toarray() if issparse(adata.X) else adata.X` |
| Wrong matrix orientation | More genes than samples? Transpose |
| NaN in correlation | Filter: `valid = ~np.isnan(x) & ~np.isnan(y)` |
| Too few cells for DE | Need >= 3 cells per condition per cell type |
| Memory error | Use `sc.pp.highly_variable_genes()` to reduce features |
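
Two of the fixes above, densifying a sparse matrix and filtering NaNs before a correlation, can be wrapped as small helpers. This is a minimal sketch. The names `to_dense` and `safe_pearson`, and the minimum of three valid pairs, are assumptions.

```python
import numpy as np
from scipy import sparse, stats


def to_dense(X):
    """Return a dense ndarray whether X is sparse or already dense."""
    return X.toarray() if sparse.issparse(X) else np.asarray(X)


def safe_pearson(x, y, min_pairs=3):
    """Pearson correlation over the positions where both inputs are non-NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = ~np.isnan(x) & ~np.isnan(y)
    n = int(valid.sum())
    if n < min_pairs:
        return float("nan"), float("nan"), n
    r, p = stats.pearsonr(x[valid], y[valid])
    return float(r), float(p), n


dense = to_dense(sparse.csr_matrix([[0, 1], [2, 0]]))
r, p, n_used = safe_pearson([1, 2, np.nan, 4, 5], [2, 4, 6, np.nan, 10])
print(dense, r, n_used)
```

Densifying a full single-cell matrix can exhaust memory, so subset to the genes or cells of interest before calling `to_dense`.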

---

## Reference Documentation

**Detailed Analysis Patterns**: analysis_patterns.md (per-cell-type DE, correlation, PCA, ANOVA, cell communication)

**Core Workflows**:
- references/scanpy_workflow.md - Complete scanpy pipeline
- references/seurat_workflow.md - Seurat to Scanpy translation
- references/clustering_guide.md - Clustering methods
- references/marker_identification.md - Marker genes, annotation
- references/trajectory_analysis.md - Pseudotime
- references/cell_communication.md - OmniPath/CellPhoneDB workflow
- references/troubleshooting.md - Detailed error solutions

Related Skills

tooluniverse

1202

from mims-harvard/ToolUniverse

Router skill for ToolUniverse tasks. First checks if specialized tooluniverse skills (105+ skills covering disease/drug/target research, gene-disease associations, clinical decision support, genomics, epigenomics, proteomics, comparative genomics, chemical safety, toxicology, systems biology, and more) can solve the problem, then falls back to general strategies for using 2300+ scientific tools. Covers tool discovery, multi-hop queries, comprehensive research workflows, disambiguation, evidence grading, and report generation. Use when users need to research any scientific topic, find biological data, or explore drug/target/disease relationships. ALSO USE for any biology, medicine, chemistry, pharmacology, or life science question — even simple factoid questions like "how many X in protein Y", "what drug interacts with Z", "what gene causes disease W", or "translate this sequence". These questions benefit from database lookups (UniProt, PubMed, ChEMBL, ClinVar, GWAS Catalog, etc.) rather than answering from memory alone. When in doubt about a scientific fact, USE THIS SKILL to verify against real databases.

tooluniverse-variant-to-mechanism

1202

from mims-harvard/ToolUniverse

End-to-end variant-to-mechanism analysis: given a genetic variant (rsID or coordinates), trace its functional impact from regulatory context (GWAS, eQTL, RegulomeDB, ENCODE) through target gene identification (GTEx, OpenTargets L2G) to downstream pathway and disease biology (STRING, Reactome, GO enrichment, disease associations). Produces an evidence-graded mechanistic narrative linking genotype to phenotype. Use when asked "how does this variant cause disease?", "what is the mechanism of rs7903146?", "trace variant to pathway", or "connect this GWAS hit to biology".

tooluniverse-variant-interpretation

1202

from mims-harvard/ToolUniverse

Systematic clinical variant interpretation from raw variant calls to ACMG-classified recommendations with structural impact analysis. Aggregates evidence from ClinVar, gnomAD, CIViC, UniProt, and PDB across ACMG criteria. Produces pathogenicity scores (0-100), clinical recommendations, and treatment implications. Use when interpreting genetic variants, classifying variants of uncertain significance (VUS), performing ACMG variant classification, or translating variant calls to clinical actionability.

tooluniverse-variant-functional-annotation

1202

from mims-harvard/ToolUniverse

Comprehensive functional annotation of protein variants — pathogenicity, population frequency, structural context, and clinical significance. Integrates ProtVar (map_variant, get_function, get_population) for protein-level mapping and structural context, ClinVar for clinical classifications, gnomAD for population frequency with ancestry data, CADD for deleteriousness scores, and ClinGen for gene-disease validity. Produces a structured variant annotation report with evidence grading. Use when asked about protein variant impact, missense variant pathogenicity, ProtVar annotation, variant functional context, or combining population and structural evidence for a variant.

tooluniverse-variant-analysis

1202

from mims-harvard/ToolUniverse

Production-ready VCF processing, variant annotation, mutation analysis, and structural variant (SV/CNV) interpretation for bioinformatics questions. Parses VCF files (streaming, large files), classifies mutation types (missense, nonsense, synonymous, frameshift, splice, intronic, intergenic) and structural variants (deletions, duplications, inversions, translocations), applies VAF/depth/quality/consequence filters, annotates with ClinVar/dbSNP/gnomAD/CADD via ToolUniverse, interprets SV/CNV clinical significance using ClinGen dosage sensitivity scores, computes variant statistics, and generates reports. Solves questions like "What fraction of variants with VAF < 0.3 are missense?", "How many non-reference variants remain after filtering intronic/intergenic?", "What is the pathogenicity of this deletion affecting BRCA1?", or "Which dosage-sensitive genes overlap this CNV?". Use when processing VCF files, annotating variants, filtering by VAF/depth/consequence, classifying mutations, interpreting structural variants, assessing CNV pathogenicity, comparing cohorts, or answering variant analysis questions.

tooluniverse-vaccine-design

1202

from mims-harvard/ToolUniverse

Design and evaluate vaccine candidates using computational immunology tools. Covers epitope prediction (MHC-I/II binding via IEDB), population coverage analysis, antigen selection, adjuvant matching, and immunogenicity assessment. Integrates IEDB for epitope prediction, UniProt for antigen sequences, PDB/AlphaFold for structural epitopes, BVBRC for pathogen proteomes, and literature for clinical precedent. Use when asked about vaccine design, epitope prediction, immunogenicity, MHC binding, T-cell epitopes, B-cell epitopes, or population coverage for vaccine candidates.

tooluniverse-toxicology

1202

from mims-harvard/ToolUniverse

Assess chemical and drug toxicity via adverse outcome pathways, real-world adverse event signals, and toxicogenomic evidence. Integrates AOPWiki (AOPWiki_list_aops, AOPWiki_get_aop) for mechanism- level pathway tracing, FAERS for post-market adverse event quantification, OpenFDA for label mining, and CTD for chemical-gene-disease evidence. Produces structured toxicity reports with evidence grading (T1-T4). Use when asked about toxicity mechanisms, adverse outcome pathways, AOP mapping, FAERS signal detection, or chemical-disease relationships for drugs or environmental chemicals.

tooluniverse-target-research

1202

from mims-harvard/ToolUniverse

Gather comprehensive biological target intelligence from 9 parallel research paths covering protein info, structure, interactions, pathways, expression, variants, drug interactions, and literature. Features collision-aware searches, evidence grading (T1-T4), explicit Open Targets coverage, and mandatory completeness auditing. Use when users ask about drug targets, proteins, genes, or need target validation, druggability assessment, or comprehensive target profiling.

tooluniverse-systems-biology

1202

from mims-harvard/ToolUniverse

Comprehensive systems biology and pathway analysis using multiple pathway databases (Reactome, KEGG, WikiPathways, Pathway Commons, BioModels). Performs pathway enrichment, protein-pathway mapping, keyword searches, and systems-level analysis. Use when analyzing gene sets, exploring biological pathways, or investigating systems-level biology.

tooluniverse-structural-variant-analysis

1202

from mims-harvard/ToolUniverse

Comprehensive structural variant (SV) analysis skill for clinical genomics. Classifies SVs (deletions, duplications, inversions, translocations), assesses pathogenicity using ACMG-adapted criteria, evaluates gene disruption and dosage sensitivity, and provides clinical interpretation with evidence grading. Use when analyzing CNVs, large deletions/duplications, chromosomal rearrangements, or any structural variants requiring clinical interpretation.

tooluniverse-structural-proteomics

1202

from mims-harvard/ToolUniverse

Integrate structural biology data with proteomics for drug target validation. Retrieves protein structures from PDB (RCSB, PDBe), AlphaFold predictions, antibody structures (SAbDab), GPCR data (GPCRdb), binding pocket analysis (ProteinsPlus), and ligand interactions (BindingDB). Use when asked to find structures for a drug target, identify binding site ligands, cross-validate drug binding with structural data, assess structural druggability, or compare experimental vs predicted structures.

tooluniverse-stem-cell-organoid

1202

from mims-harvard/ToolUniverse

Research stem cells, iPSCs, organoids, and cell differentiation using ToolUniverse tools. Covers pluripotency marker identification, differentiation pathway analysis, organoid model characterization, cell type annotation, and disease modeling. Integrates CellxGene/HCA for single-cell atlas data, CellMarker for cell type markers, GEO for stem cell datasets, and pathway tools for differentiation signaling. Use when asked about stem cells, iPSCs, organoids, cell reprogramming, pluripotency, differentiation protocols, or 3D culture models.