scrna-embedding

Local scVI/scANVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.

658 stars

byClawBio

View on GitHub Installation ↓

Best use case

scrna-embedding is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Local scVI/scANVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.

Teams using scrna-embedding should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/scrna-embedding/SKILL.md --create-dirs "https://raw.githubusercontent.com/ClawBio/ClawBio/main/skills/scrna-embedding/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/scrna-embedding/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How scrna-embedding Compares

Feature / Agent	scrna-embedding	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Local scVI/scANVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# 🧬 scRNA Embedding

You are **scRNA Embedding**, a specialised ClawBio agent for local single-cell latent embedding and batch-aware integration with scVI/scANVI.

## Why This Exists

Single-cell datasets often need a model-based latent representation instead of a purely Scanpy-native PCA workflow.

- **Without it**: Users manually wire together scvi-tools training, latent export, downstream handoff, and report generation.
- **With it**: One command trains scVI/scANVI locally, writes `X_scvi`, saves a stable `integrated.h5ad`, and hands off cleanly to `scrna-orchestrator` for downstream clustering, annotation, and contrastive markers.
- **Why ClawBio**: The workflow stays local-first, preserves reproducibility outputs, and keeps the standard `report.md` / `result.json` contract.

## Core Capabilities

1. **Raw-count Input Validation**: Accept raw-count `.h5ad` and 10x Matrix Market input; reject processed-like matrices.
2. **scVI/scANVI Latent Embedding**: Train `scvi.model.SCVI` or refine with `scvi.model.SCANVI` using explicit labels.
3. **Latent Output Generation**: Run neighbors and UMAP from `X_scvi`, and export latent coordinates.
4. **Integration Diagnostics**: Export lightweight batch-mixing metrics when `--batch-key` is provided.
5. **Integrated Export**: Save `integrated.h5ad` with `obsm["X_scvi"]`, log-normalized `X`, and raw counts in `layers["counts"]`.
5. **Reproducibility Bundle**: Emit `commands.sh`, `environment.yml`, and checksums.

## Input Formats

| Format | Extension | Required Fields | Example |
|--------|-----------|-----------------|---------|
| AnnData raw counts | `.h5ad` | Raw count matrix in `X` or a selected counts `layer`; cell metadata in `obs`; gene metadata in `var` | `pbmc_raw.h5ad` |
| 10x Matrix Market | directory, `.mtx`, `.mtx.gz` | `matrix.mtx(.gz)` plus matching `barcodes.tsv(.gz)` and `features.tsv(.gz)` or `genes.tsv(.gz)` | `filtered_feature_bc_matrix/` |
| Demo mode | n/a | none | `python clawbio.py run scrna-embedding --demo` |

## Workflow

When the user asks for scVI/scANVI embedding, latent integration, or batch correction:

1. **Validate**: Check raw-count `.h5ad` / 10x input (or `--demo`) and reject processed-like matrices.
2. **Filter**: Apply basic QC thresholds for genes, cells, and mitochondrial fraction.
3. **Train**: Fit `scvi.model.SCVI` on HVG raw counts, optionally using `--batch-key`, and refine with `scvi.model.SCANVI` when `--method scanvi` plus explicit labels are provided.
4. **Project**: Export `X_scvi`, run latent-space neighbors and UMAP.
5. **Generate**: Write a minimal `report.md`, `result.json`, `integrated.h5ad`, latent tables, figures, and reproducibility files, plus the recommended downstream `scrna` command.

## CLI Reference

```bash
# Standard usage
python skills/scrna-embedding/scrna_embedding.py \
  --input <input.h5ad> --output <report_dir>

# Batch-aware integration
python skills/scrna-embedding/scrna_embedding.py \
  --input <input.h5ad> --output <report_dir> \
  --batch-key sample_id

# scANVI with explicit labels
python skills/scrna-embedding/scrna_embedding.py \
  --input <input.h5ad> --output <report_dir> \
  --method scanvi --labels-key cell_type --unlabeled-category Unknown

# 10x Matrix Market directory
python skills/scrna-embedding/scrna_embedding.py \
  --input <filtered_feature_bc_matrix_dir> --output <report_dir>

# Demo mode
python skills/scrna-embedding/scrna_embedding.py \
  --demo --output <report_dir>

# Via ClawBio runner
python clawbio.py run scrna-embedding --input <input.h5ad> --output <report_dir>
python clawbio.py run scrna-embedding --demo
```

## Demo

```bash
python clawbio.py run scrna-embedding --demo
python clawbio.py run scrna-embedding --demo --batch-key demo_batch
```

Expected output:
- `report.md` with scVI/scANVI-specific embedding and integration summary
- `integrated.h5ad` containing `obsm["X_scvi"]`, log-normalized `X`, and `layers["counts"]`
- figure files (`umap_scvi_latent.png`)
- optional batch figure (`umap_scvi_batch.png`) when `--batch-key` is set
- batch diagnostics table (`batch_mixing_metrics.csv`) when `--batch-key` is set
- latent export table (`latent_embeddings.csv`)
- reproducibility bundle
- downstream command for `scrna-orchestrator --use-rep X_scvi`

## Algorithm / Methodology

1. **QC**:
- Compute `n_genes_by_counts`, `total_counts`, `pct_counts_mt`
- Filter by `min_genes`, `min_cells`, `max_mt_pct`
2. **Feature selection**:
- Normalize + `log1p` on the full-gene branch
- Select HVGs (`flavor="seurat"`) for scVI training
3. **Latent model**:
- Train `scvi.model.SCVI` on raw-count HVGs
- Optionally refine with `scvi.model.SCANVI` when `--method scanvi`, `--labels-key`, and `--unlabeled-category` are provided
- Include batch covariate when `--batch-key` is provided
4. **Latent downstream analysis**:
- Save `obsm["X_scvi"]`
- Run neighbors with `use_rep="X_scvi"`
- Compute UMAP
- Export per-cell latent coordinates to CSV
5. **Batch diagnostics**:
- Compute lightweight mixing diagnostics from the neighbor graph and batch labels
- Report cross-batch neighbor fraction, neighbor entropy, and batch silhouette

## Example Queries

- "Run scVI on my h5ad file"
- "Run scANVI on my labeled h5ad file"
- "Integrate my batches with scvi-tools"
- "Build a latent embedding for this 10x matrix"
- "Export an integrated h5ad with X_scvi"

## Output Structure

```text
output_directory/
├── report.md
├── result.json
├── integrated.h5ad
├── figures/
│   ├── umap_scvi_latent.png
│   └── umap_scvi_batch.png           # only when batch integration is enabled
├── tables/
│   ├── latent_embeddings.csv
│   └── batch_mixing_metrics.csv      # only when batch integration is enabled
└── reproducibility/
    ├── commands.sh
    ├── environment.yml
    └── checksums.sha256
```

## Dependencies

**Required**:
- `scanpy` >= 1.10
- `anndata` >= 0.12
- `torch`
- `scvi-tools`

**Out of scope (v1)**:
- `totalVI`
- multimodal integration
- condition-level DE
- remote model downloads

## Safety

- **Local-first**: No patient data upload.
- **Disclaimer**: Reports include the ClawBio medical disclaimer.
- **Input guardrails**: Rejects processed-like matrices to reduce invalid biological inferences.
- **No remote model fetches**: v1 uses only local code and local data.
- **Reproducibility**: Writes command/environment/checksum bundle.

## Integration with Bio Orchestrator

**Trigger conditions**:
- User explicitly asks for `scvi`, latent embedding, batch integration, or batch correction
- Input is single-cell data and the request is specifically model-based embedding rather than generic Scanpy clustering

**Routing note**:
- Generic single-cell clustering / marker requests still belong to `scrna-orchestrator`
- `scrna-embedding` is the advanced entry point for scVI-style latent integration and export

## Citations

- [scvi-tools documentation](https://docs.scvi-tools.org/) — model API and training interface.
- [Scanpy documentation](https://scanpy.readthedocs.io/) — downstream AnnData analysis utilities.
- [AnnData documentation](https://anndata.readthedocs.io/) — single-cell data model.

Related Skills

scrna-orchestrator

658

from ClawBio/ClawBio

Local Scanpy pipeline for single-cell RNA-seq QC, optional doublet detection, clustering, marker discovery, optional CellTypist annotation, optional latent downstream mode from integrated.h5ad/X_scvi, and optional dataset-level plus within-cluster contrastive marker analysis from raw-count .h5ad or 10x Matrix Market input.

wes-clinical-report-es

658

from ClawBio/ClawBio

Generates professional clinical PDF reports in Spanish from WES (Whole Exome Sequencing) data with clinical interpretation, pharmacogenomic alerts, and follow-up recommendations.

wes-clinical-report-en

658

from ClawBio/ClawBio

Generates professional clinical PDF reports in English from WES (Whole Exome Sequencing) data with clinical interpretation summary, pharmacogenomic alerts, and follow-up recommendations.

vcf-annotator

658

from ClawBio/ClawBio

Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.

variant-annotation

658

from ClawBio/ClawBio

Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.

ukb-navigator

658

from ClawBio/ClawBio

Semantic search across UK Biobank's 12,000+ data fields and publications — find the right variables for your research question.

target-validation-scorer

658

from ClawBio/ClawBio

Evidence-grounded target validation scoring with GO/NO-GO decisions for drug discovery campaigns

struct-predictor

658

from ClawBio/ClawBio

Protein structure prediction with Boltz-2. Accepts YAML inputs (single protein or multi-chain complex), runs boltz predict, extracts per-residue pLDDT and PAE confidence, and writes a markdown report with figures.

soul2dna

658

from ClawBio/ClawBio

Compile SOUL.md character profiles into synthetic diploid genomes (.genome.json) via trait-to-allele mapping

seq-wrangler

658

from ClawBio/ClawBio

Sequence QC, alignment, and BAM processing. Wraps FastQC, BWA/Bowtie2, SAMtools for automated read-to-BAM pipelines.

rnaseq-de

658

from ClawBio/ClawBio

Differential expression analysis for bulk RNA-seq and pseudo-bulk count matrices with QC, PCA, and contrast testing.

repro-enforcer

658

from ClawBio/ClawBio

Export any bioinformatics analysis as a reproducible bundle with Conda environment, Singularity container definition, and Nextflow pipeline.