single-cell-clustering-and-batch-correction-with-omicverse

Guide Claude through omicverse's single-cell clustering workflow, covering preprocessing, QC, multimethod clustering, topic modeling, cNMF, and cross-batch integration as demonstrated in t_cluster.ipynb and t_single_batch.ipynb.

by FreedomIntelligence

Best use case

single-cell-clustering-and-batch-correction-with-omicverse is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using single-cell-clustering-and-batch-correction-with-omicverse should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/single-clustering/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/single-clustering/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/single-clustering/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How single-cell-clustering-and-batch-correction-with-omicverse Compares

Feature / Agent	single-cell-clustering-and-batch-correction-with-omicverse	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

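What does this skill do?

It guides Claude through omicverse's single-cell clustering workflow: preprocessing, QC, multi-method clustering, topic modelling, cNMF, and cross-batch integration, following t_cluster.ipynb and t_single_batch.ipynb.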

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Single-cell clustering and batch correction with omicverse

## Overview
This skill distills the single-cell tutorials [`t_cluster.ipynb`](../../omicverse_guide/docs/Tutorials-single/t_cluster.ipynb) and [`t_single_batch.ipynb`](../../omicverse_guide/docs/Tutorials-single/t_single_batch.ipynb). Use it when a user wants to preprocess an `AnnData` object, explore clustering alternatives (Leiden, Louvain, scICE, GMM, topic/cNMF models), and evaluate or harmonise batches with omicverse utilities.

## Instructions
1. **Import libraries and set plotting defaults**
   - Import `omicverse as ov`, `scanpy as sc`, and `numpy as np` (needed below for `np.int64` and `np.arange`), plus `scvelo as scv` when using the dentate gyrus demo data.
   - Apply `ov.plot_set()` or `ov.utils.ov_plot_set()` so figures adopt omicverse styling before embedding plots.
2. **Load data and annotate batches**
   - For demo clustering, fetch `scv.datasets.dentategyrus()`; for integration, read provided `.h5ad` files via `ov.read()` and set `adata.obs['batch']` identifiers for each cohort.
   - Confirm `adata.X` holds raw counts in a numeric (ideally sparse) matrix; cast with `adata.X = adata.X.astype(np.int64)` when the QC steps require integer counts.
3. **Run quality control**
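   - Execute `adata = ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, batch_key='batch')` to drop low-quality cells and inspect summary statistics per batch. Keep the `tresh` spelling exactly as written, and assign the result back to `adata` as the tutorials do.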
   - Save intermediate filtered objects (`adata.write_h5ad(...)`) so users can resume from clean checkpoints.
4. **Preprocess and select features**
   - Call `adata = ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=3000, batch_key=None)` to normalise, log-transform, and flag highly variable genes. Then assign `adata.raw = adata` and subset with `adata = adata[:, adata.var.highly_variable_features]` for downstream modelling.
   - Scale expression (`ov.pp.scale(adata)`) and compute PCA scores with `ov.pp.pca(adata, layer='scaled', n_pcs=50)`. Encourage reviewing variance explained via `ov.utils.plot_pca_variance_ratio(adata)`.
5. **Construct neighbourhood graph and baseline clustering**
   - Build neighbour graph using `sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, use_rep='scaled|original|X_pca')` or `ov.pp.neighbors(...)`.
   - Generate Leiden or Louvain labels through `ov.utils.cluster(adata, method='leiden', resolution=1)` (or `method='louvain'`), `ov.single.leiden(adata, resolution=1.0)`, or `ov.pp.leiden(adata, resolution=1)`; remind users that `resolution` tunes granularity, with higher values yielding more, smaller clusters.
   - **IMPORTANT - Dependency checks**: Always verify prerequisites before clustering or plotting:
     ```python
     # Before clustering: check neighbors graph exists
     if 'neighbors' not in adata.uns:
         if 'X_pca' in adata.obsm:
             ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
         else:
             raise ValueError("PCA must be computed before neighbors graph")

     # Before plotting by cluster: check clustering was performed
     if 'leiden' not in adata.obs:
         ov.single.leiden(adata, resolution=1.0)
     ```
   - Visualise embeddings with `ov.pl.embedding(adata, basis='X_umap', color=['clusters','leiden'], frameon='small', wspace=0.5)` and confirm cluster separation. Before plotting, always check that every column passed to `color=` exists in `adata.obs` and that the chosen `basis` exists in `adata.obsm`.
6. **Explore advanced clustering strategies**
   - **scICE consensus**: instantiate `model = ov.utils.cluster(adata, method='scICE', use_rep='scaled|original|X_pca', resolution_range=(4,20), n_boot=50, n_steps=11)` and inspect stability via `model.plot_ic(figsize=(6,4))` before selecting `model.best_k` groups.
   - **Gaussian mixtures**: run `ov.utils.cluster(..., method='GMM', n_components=21, covariance_type='full', tol=1e-9, max_iter=1000)` for model-based assignments.
   - **Topic modelling**: fit `LDA_obj = ov.utils.LDA_topic(...)`, review `LDA_obj.plot_topic_contributions(6)`, derive cluster calls with `LDA_obj.predicted(k)` and optionally refine using `LDA_obj.get_results_rfc(...)`.
   - **cNMF programs**: initialise `cnmf_obj = ov.single.cNMF(... components=np.arange(5,11), n_iter=20, num_highvar_genes=2000, output_dir=...)`, factorise (`factorize`, `combine`), select K via `k_selection_plot`, and propagate usage scores back with `cnmf_obj.get_results(...)` and `cnmf_obj.get_results_rfc(...)`.
7. **Evaluate clustering quality**
   - Compare predicted labels against known references with `adjusted_rand_score(adata.obs['clusters'], adata.obs['leiden'])` (imported from `sklearn.metrics`) and report the score for each method (Leiden, Louvain, GMM, LDA variants, cNMF models) to justify the chosen parameters.
8. **Embed with multiple layouts**
   - Use `ov.utils.mde(...)` to create MDE projections from different latent spaces (`adata.obsm["scaled|original|X_pca"]`, harmonised embeddings, topic compositions) and plot via `ov.utils.embedding(..., color=['batch','cell_type'])` or `ov.pl.embedding` for consistent review of cluster/batch mixing.
9. **Perform batch correction and integration**
   - Apply `ov.single.batch_correction(adata, batch_key='batch', methods='harmony', n_pcs=50, ...)` once per method, setting `methods` in turn to `'harmony'`, `'combat'`, `'scanorama'`, `'scVI'`, or `'CellANOVA'`, to generate harmonised embeddings stored in `adata.obsm` (`X_harmony`, `X_combat`, `X_scanorama`, `X_scVI`, `X_cellanova`). For `scVI`, mention latent size (`n_latent=30`) and `gene_likelihood="nb"`; for CellANOVA define control pools via `control_dict`.
   - After each correction, project to 2D with `ov.utils.mde` and visualise `batch` vs `cell_type` to check mixing and conservation.
10. **Benchmark integration performance**
    - Persist final object (`adata.write_h5ad('neurips2021_batch_all.h5ad', compression='gzip')`) and reload when necessary.
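    - Use `scib_metrics.benchmark.Benchmarker` with an embeddings list (`["X_pca", "X_combat", "X_harmony", "X_cellanova", "X_scanorama", "X_mira_topic", "X_mira_feature", "X_scVI"]`) to compute batch-vs-biology trade-offs via `bm.benchmark()` and summarise with `bm.plot_results_table(min_max_scale=False)`. List only keys that actually exist in `adata.obsm`; the `X_mira_*` entries are not produced by step 9 and must be dropped unless they were computed separately.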
11. **General troubleshooting**
    - Ensure `adata.raw` captures the unscaled log-normalised matrix before subsetting to HVGs.
    - Confirm that the key passed to `use_rep` (for example `'scaled|original|X_pca'`) exists in `adata.obsm` prior to clustering; rerun preprocessing if it is missing.
    - Monitor memory when running cNMF or scVI; adjust `n_iter`, `components`, or latent dimensions for smaller datasets.
    - **Pipeline dependency errors**: When you encounter errors like "Could not find 'leiden' in adata.obs", always check and add prerequisites:
      - Before leiden/louvain clustering → ensure `'neighbors' in adata.uns`
      - Before plotting by clustering → ensure the cluster column exists in `adata.obs`
      - Before UMAP/embedding → ensure PCA or another dimensionality reduction is complete
    - **Code generation pattern**: When generating multi-step code, use defensive checks rather than assuming prior steps completed successfully. This prevents cascading failures when users run steps out of order or in separate sessions.
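
The defensive-check pattern described above can be collected into one small helper. The following is only a minimal sketch: `missing_prerequisites` is a hypothetical name and is not part of omicverse or scanpy, and the `SimpleNamespace` object merely stands in for an `AnnData` so the snippet runs without any single-cell packages installed. It relies only on membership tests (`key in adata.obsm`, `key in adata.uns`, `key in adata.obs`), which a real `AnnData` also supports.

```python
from types import SimpleNamespace


def missing_prerequisites(adata, use_rep="X_pca", cluster_key="leiden"):
    """Return a list of problems; an empty list means every prerequisite is present.

    Hypothetical helper for illustration only (not an omicverse function).
    """
    problems = []
    if use_rep not in adata.obsm:
        problems.append(f"adata.obsm[{use_rep!r}] is missing: compute PCA first")
    if "neighbors" not in adata.uns:
        problems.append("adata.uns['neighbors'] is missing: build the neighbour graph")
    if cluster_key not in adata.obs:
        problems.append(f"adata.obs[{cluster_key!r}] is missing: run clustering")
    return problems


# Stand-in for an AnnData object that has PCA scores but nothing else yet.
demo = SimpleNamespace(obsm={"X_pca": [[0.1, 0.2]]}, uns={}, obs={})
for problem in missing_prerequisites(demo):
    print(problem)
```

Returning a list rather than raising on the first gap lets generated code report every missing step at once, which suits the case where users run steps out of order or in separate sessions.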

## Examples
- "Normalise dentate gyrus cells, compare Leiden, scICE, and GMM clusters, and report ARI scores versus provided `clusters`."
- "Batch-correct three NeurIPS datasets with Harmony and scVI, produce MDE embeddings coloured by `batch` and `cell_type`, and benchmark the embeddings."
- "Fit topic and cNMF models on a preprocessed AnnData object, retrieve classifier-refined cluster calls, and visualise the resulting programs on UMAP."
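
For the first prompt, the ARI comparison is normally done with scikit-learn's `adjusted_rand_score`, as in step 7. To show what that score measures, here is a dependency-free sketch of the standard adjusted Rand index formula. The function name `adjusted_rand_index` and the toy label lists are made up for illustration; real workflows would pass `adata.obs['clusters']` and each method's label column instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n_pairs = comb(len(labels_true), 2)
    # Count cell pairs that share a cluster in both labelings, and in each one alone.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Agreement expected by chance, and the maximum attainable agreement.
    expected = in_true * in_pred / n_pairs if n_pairs else 0.0
    maximum = (in_true + in_pred) / 2
    if maximum == expected:
        # Degenerate case (e.g. one cluster in both labelings): treat as perfect agreement.
        return 1.0
    return (both - expected) / (maximum - expected)


# Toy data: a reference annotation and two hypothetical clustering results.
reference = ["a", "a", "b", "b", "c"]
candidates = {"leiden": [1, 1, 0, 0, 2], "gmm": [0, 1, 0, 1, 1]}
for method, labels in candidates.items():
    print(method, round(adjusted_rand_index(reference, labels), 3))
```

The score is invariant to how clusters are numbered, so a method that recovers the reference groups under different label names still scores 1.0, while chance-level agreement scores near 0.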

## References
- Clustering walkthrough: [`t_cluster.ipynb`](../../omicverse_guide/docs/Tutorials-single/t_cluster.ipynb)
- Batch integration walkthrough: [`t_single_batch.ipynb`](../../omicverse_guide/docs/Tutorials-single/t_single_batch.ipynb)
- Quick copy/paste commands: [`reference.md`](reference.md)
