single-cell-clustering-and-batch-correction-with-omicverse

Guide Claude through omicverse's single-cell clustering workflow, covering preprocessing, QC, multimethod clustering, topic modeling, cNMF, and cross-batch integration as demonstrated in t_cluster.ipynb and t_single_batch.ipynb.

1,802 stars

Best use case

single-cell-clustering-and-batch-correction-with-omicverse is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Guide Claude through omicverse's single-cell clustering workflow, covering preprocessing, QC, multimethod clustering, topic modeling, cNMF, and cross-batch integration as demonstrated in t_cluster.ipynb and t_single_batch.ipynb.

Teams using single-cell-clustering-and-batch-correction-with-omicverse should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/single-clustering/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/single-clustering/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/single-clustering/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How single-cell-clustering-and-batch-correction-with-omicverse Compares

Feature / Agentsingle-cell-clustering-and-batch-correction-with-omicverseStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Guide Claude through omicverse's single-cell clustering workflow, covering preprocessing, QC, multimethod clustering, topic modeling, cNMF, and cross-batch integration as demonstrated in t_cluster.ipynb and t_single_batch.ipynb.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Single-cell clustering and batch correction with omicverse

## Overview
This skill distills the single-cell tutorials [`t_cluster.ipynb`](../../omicverse_guide/docs/Tutorials-single/t_cluster.ipynb) and [`t_single_batch.ipynb`](../../omicverse_guide/docs/Tutorials-single/t_single_batch.ipynb). Use it when a user wants to preprocess an `AnnData` object, explore clustering alternatives (Leiden, Louvain, scICE, GMM, topic/cNMF models), and evaluate or harmonise batches with omicverse utilities.

## Instructions
1. **Import libraries and set plotting defaults**
   - Load `omicverse as ov`, `scanpy as sc`, and plotting helpers (`scvelo as scv` when using dentate gyrus demo data).
   - Apply `ov.plot_set()` or `ov.utils.ov_plot_set()` so figures adopt omicverse styling before embedding plots.
2. **Load data and annotate batches**
   - For demo clustering, fetch `scv.datasets.dentategyrus()`; for integration, read provided `.h5ad` files via `ov.read()` and set `adata.obs['batch']` identifiers for each cohort.
   - Confirm inputs are sparse numeric matrices; convert with `adata.X = adata.X.astype(np.int64)` when required for QC steps.
3. **Run quality control**
   - Execute `ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, batch_key='batch')` to drop low-quality cells and inspect summary statistics per batch.
   - Save intermediate filtered objects (`adata.write_h5ad(...)`) so users can resume from clean checkpoints.
4. **Preprocess and select features**
   - Call `ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=3000, batch_key=None)` to normalise, log-transform, and flag highly variable genes; assign `adata.raw = adata` and subset to `adata.var.highly_variable_features` for downstream modelling.
   - Scale expression (`ov.pp.scale(adata)`) and compute PCA scores with `ov.pp.pca(adata, layer='scaled', n_pcs=50)`. Encourage reviewing variance explained via `ov.utils.plot_pca_variance_ratio(adata)`.
5. **Construct neighbourhood graph and baseline clustering**
   - Build neighbour graph using `sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, use_rep='scaled|original|X_pca')` or `ov.pp.neighbors(...)`.
   - Generate Leiden or Louvain labels through `ov.utils.cluster(adata, method='leiden'|'louvain', resolution=1)`, `ov.single.leiden(adata, resolution=1.0)`, or `ov.pp.leiden(adata, resolution=1)`; remind users that resolution tunes granularity.
   - **IMPORTANT - Dependency checks**: Always verify prerequisites before clustering or plotting:
     ```python
     # Before clustering: check neighbors graph exists
     if 'neighbors' not in adata.uns:
         if 'X_pca' in adata.obsm:
             ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
         else:
             raise ValueError("PCA must be computed before neighbors graph")

     # Before plotting by cluster: check clustering was performed
     if 'leiden' not in adata.obs:
         ov.single.leiden(adata, resolution=1.0)
     ```
   - Visualise embeddings with `ov.pl.embedding(adata, basis='X_umap', color=['clusters','leiden'], frameon='small', wspace=0.5)` and confirm cluster separation. Always check that columns in `color=` parameter exist in `adata.obs` before plotting.
6. **Explore advanced clustering strategies**
   - **scICE consensus**: instantiate `model = ov.utils.cluster(adata, method='scICE', use_rep='scaled|original|X_pca', resolution_range=(4,20), n_boot=50, n_steps=11)` and inspect stability via `model.plot_ic(figsize=(6,4))` before selecting `model.best_k` groups.
   - **Gaussian mixtures**: run `ov.utils.cluster(..., method='GMM', n_components=21, covariance_type='full', tol=1e-9, max_iter=1000)` for model-based assignments.
   - **Topic modelling**: fit `LDA_obj = ov.utils.LDA_topic(...)`, review `LDA_obj.plot_topic_contributions(6)`, derive cluster calls with `LDA_obj.predicted(k)` and optionally refine using `LDA_obj.get_results_rfc(...)`.
   - **cNMF programs**: initialise `cnmf_obj = ov.single.cNMF(... components=np.arange(5,11), n_iter=20, num_highvar_genes=2000, output_dir=...)`, factorise (`factorize`, `combine`), select K via `k_selection_plot`, and propagate usage scores back with `cnmf_obj.get_results(...)` and `cnmf_obj.get_results_rfc(...)`.
7. **Evaluate clustering quality**
   - Compare predicted labels against known references with `adjusted_rand_score(adata.obs['clusters'], adata.obs['leiden'])` and report metrics for each method (Leiden, Louvain, GMM, LDA variants, cNMF models) to justify chosen parameters.
8. **Embed with multiple layouts**
   - Use `ov.utils.mde(...)` to create MDE projections from different latent spaces (`adata.obsm["scaled|original|X_pca"]`, harmonised embeddings, topic compositions) and plot via `ov.utils.embedding(..., color=['batch','cell_type'])` or `ov.pl.embedding` for consistent review of cluster/batch mixing.
9. **Perform batch correction and integration**
   - Apply `ov.single.batch_correction(adata, batch_key='batch', methods='harmony'|'combat'|'scanorama'|'scVI'|'CellANOVA', n_pcs=50, ...)` sequentially to generate harmonised embeddings stored in `adata.obsm` (`X_harmony`, `X_combat`, `X_scanorama`, `X_scVI`, `X_cellanova`). For `scVI`, mention latent size (`n_latent=30`) and `gene_likelihood="nb"`; for CellANOVA define control pools via `control_dict`.
   - After each correction, project to 2D with `ov.utils.mde` and visualise `batch` vs `cell_type` to check mixing and conservation.
10. **Benchmark integration performance**
    - Persist final object (`adata.write_h5ad('neurips2021_batch_all.h5ad', compression='gzip')`) and reload when necessary.
    - Use `scib_metrics.benchmark.Benchmarker` with embeddings list (`["X_pca", "X_combat", "X_harmony", "X_cellanova", "X_scanorama", "X_mira_topic", "X_mira_feature", "X_scVI"]`) to compute batch-vs-biology trade-offs via `bm.benchmark()` and summarise with `bm.plot_results_table(min_max_scale=False)`.
11. **General troubleshooting**
    - Ensure `adata.raw` captures the unscaled log-normalised matrix before subsetting to HVGs.
    - Confirm `use_rep='scaled|original|X_pca'` strings exist in `adata.obsm` prior to clustering; rerun preprocessing if missing.
    - Monitor memory when running cNMF or scVI; adjust `n_iter`, `components`, or latent dimensions for smaller datasets.
    - **Pipeline dependency errors**: When you encounter errors like "Could not find 'leiden' in adata.obs", always check and add prerequisites:
      - Before leiden/louvain clustering → ensure `'neighbors' in adata.uns`
      - Before plotting by clustering → ensure the cluster column exists in `adata.obs`
      - Before UMAP/embedding → ensure PCA or another dimensionality reduction is complete
    - **Code generation pattern**: When generating multi-step code, use defensive checks rather than assuming prior steps completed successfully. This prevents cascading failures when users run steps out of order or in separate sessions.

## Examples
- "Normalise dentate gyrus cells, compare Leiden, scICE, and GMM clusters, and report ARI scores versus provided `clusters`."
- "Batch-correct three NeurIPS datasets with Harmony and scVI, produce MDE embeddings coloured by `batch` and `cell_type`, and benchmark the embeddings."
- "Fit topic and cNMF models on a preprocessed AnnData object, retrieve classifier-refined cluster calls, and visualise the resulting programs on UMAP."

## References
- Clustering walkthrough: [`t_cluster.ipynb`](../../omicverse_guide/docs/Tutorials-single/t_cluster.ipynb)
- Batch integration walkthrough: [`t_single_batch.ipynb`](../../omicverse_guide/docs/Tutorials-single/t_single_batch.ipynb)
- Quick copy/paste commands: [`reference.md`](reference.md)

Related Skills

tooluniverse-single-cell

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Production-ready single-cell and expression matrix analysis using scanpy, anndata, and scipy. Performs scRNA-seq QC, normalization, PCA, UMAP, Leiden/Louvain clustering, differential expression (Wilcoxon, t-test, DESeq2), cell type annotation, per-cell-type statistical analysis, gene-expression correlation, batch correction (Harmony), trajectory inference, and cell-cell communication analysis. NEW: Analyzes ligand-receptor interactions between cell types using OmniPath (CellPhoneDB, CellChatDB), scores communication strength, identifies signaling cascades, and handles multi-subunit receptor complexes. Integrates with ToolUniverse gene annotation tools (HPA, Ensembl, MyGene, UniProt) and enrichment tools (gseapy, PANTHER, STRING). Supports h5ad, 10X, CSV/TSV count matrices, and pre-annotated datasets. Use when analyzing single-cell RNA-seq data, studying cell-cell interactions, performing cell type differential expression, computing gene-expression correlations by cell type, analyzing tumor-immune communication, or answering questions about scRNA-seq datasets.

tcga-bulk-data-preprocessing-with-omicverse

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Guide Claude through ingesting TCGA sample sheets, expression archives, and clinical carts into omicverse, initialising survival metadata, and exporting annotated AnnData files.

spatial-transcriptomics-tutorials-with-omicverse

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Guide users through omicverse's spatial transcriptomics tutorials covering preprocessing, deconvolution, and downstream modelling workflows across Visium, Visium HD, Stereo-seq, and Slide-seq datasets.

single-trajectory-analysis

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Guide to reproducing OmicVerse trajectory workflows spanning PAGA, Palantir, VIA, velocity coupling, and fate scoring notebooks.

single2spatial-spatial-mapping

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Map scRNA-seq atlases onto spatial transcriptomics slides using omicverse's Single2Spatial workflow for deep-forest training, spot-level assessment, and marker visualisation.

single-cell-preprocessing-with-omicverse

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Walk through omicverse's single-cell preprocessing tutorials to QC PBMC3k data, normalise counts, detect HVGs, and run PCA/embedding pipelines on CPU, CPU–GPU mixed, or GPU stacks.

single-cell-multi-omics-integration

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Quick-reference sheet for OmicVerse tutorials spanning MOFA, GLUE pairing, SIMBA integration, TOSICA transfer, and StaVIA cartography.

single-cell-downstream-analysis

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Checklist-style reference for OmicVerse downstream tutorials covering AUCell scoring, metacell DEG, and related exports.

single-cell-cellphonedb-communication-mapping

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Run omicverse's CellPhoneDB v5 wrapper on annotated single-cell data to infer ligand-receptor networks and produce CellChat-style visualisations.

single-cell-rna-qc

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.

single-cell-annotation-skills-with-omicverse

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Guide Claude through SCSA, MetaTiME, CellVote, CellMatch, GPTAnno, and weighted KNN transfer workflows for annotating single-cell modalities.

cellxgene-census

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Query CZ CELLxGENE Census (61M+ cells). Filter by cell type/tissue/disease, retrieve expression data, integrate with scanpy/PyTorch, for population-scale single-cell analysis.