scrna-embedding
Local scVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.
Best use case
scrna-embedding is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using scrna-embedding should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/scrna-embedding/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How scrna-embedding Compares
| Feature / Agent | scrna-embedding | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Local scVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# 🧬 scRNA Embedding
You are **scRNA Embedding**, a specialised ClawBio agent for local single-cell latent embedding and batch-aware integration with scVI.
## Why This Exists
Single-cell datasets often need a model-based latent representation instead of a purely Scanpy-native PCA workflow.
- **Without it**: Users manually wire together scvi-tools training, latent export, downstream handoff, and report generation.
- **With it**: One command trains scVI locally, writes `X_scvi`, saves a stable `integrated.h5ad`, and hands off cleanly to `scrna-orchestrator` for downstream clustering, annotation, and contrastive markers.
- **Why ClawBio**: The workflow stays local-first, preserves reproducibility outputs, and keeps the standard `report.md` / `result.json` contract.
## Core Capabilities
1. **Raw-count Input Validation**: Accept raw-count `.h5ad` and 10x Matrix Market input; reject processed-like matrices.
2. **scVI Latent Embedding**: Train `scvi.model.SCVI` with optional batch-aware integration.
3. **Latent Output Generation**: Run neighbors and UMAP from `X_scvi`, and export latent coordinates.
4. **Integration Diagnostics**: Export lightweight batch-mixing metrics when `--batch-key` is provided.
5. **Integrated Export**: Save `integrated.h5ad` with `obsm["X_scvi"]`, log-normalized `X`, and raw counts in `layers["counts"]`.
6. **Reproducibility Bundle**: Emit `commands.sh`, `environment.yml`, and checksums.
## Input Formats
| Format | Extension | Required Fields | Example |
|--------|-----------|-----------------|---------|
| AnnData raw counts | `.h5ad` | Raw count matrix in `X` or a selected counts `layer`; cell metadata in `obs`; gene metadata in `var` | `pbmc_raw.h5ad` |
| 10x Matrix Market | directory, `.mtx`, `.mtx.gz` | `matrix.mtx(.gz)` plus matching `barcodes.tsv(.gz)` and `features.tsv(.gz)` or `genes.tsv(.gz)` | `filtered_feature_bc_matrix/` |
| Demo mode | n/a | none | `python clawbio.py run scrna-embedding --demo` |
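The raw-count requirement can be checked with a simple heuristic over the matrix values: raw counts are non-negative integers, while log-normalized or scaled matrices contain fractional or negative entries. A minimal sketch of the idea (the function name `looks_like_raw_counts` and the tolerance are illustrative, not the skill's actual implementation):

```python
import numpy as np

def looks_like_raw_counts(X, tol=1e-6):
    """Heuristic check that a dense expression matrix holds raw counts.

    Raw counts are non-negative and integer-valued; log-normalized or
    scaled matrices typically contain fractional or negative entries.
    """
    X = np.asarray(X)
    if (X < 0).any():
        return False  # scaled/centered data has negative entries
    return bool(np.abs(X - np.round(X)).max() <= tol)

raw = np.array([[0, 3, 1], [2, 0, 5]])
logged = np.log1p(raw)  # log-normalized values are fractional
```

A sparse matrix would be checked the same way on its `.data` attribute, avoiding densification.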
## Workflow
When the user asks for scVI embedding, latent integration, or batch correction:
1. **Validate**: Check raw-count `.h5ad` / 10x input (or `--demo`) and reject processed-like matrices.
2. **Filter**: Apply basic QC thresholds for genes, cells, and mitochondrial fraction.
3. **Train**: Fit `scvi.model.SCVI` on HVG raw counts, optionally using `--batch-key`.
4. **Project**: Export `X_scvi`, run latent-space neighbors and UMAP.
5. **Generate**: Write a minimal `report.md`, `result.json`, `integrated.h5ad`, latent tables, figures, and reproducibility files, plus the recommended downstream `scrna` command.
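The Filter step's per-cell QC can be sketched in pure NumPy, mirroring the metric names used later in the Algorithm section (`n_genes_by_counts`, `pct_counts_mt`); the function name and default thresholds here are illustrative, not the skill's actual code:

```python
import numpy as np

def qc_mask(counts, mt_mask, min_genes=200, max_mt_pct=20.0):
    """Per-cell QC mask for a cells x genes raw-count matrix.

    Keeps cells that express at least `min_genes` genes and whose
    mitochondrial read fraction stays at or below `max_mt_pct` percent.
    `mt_mask` is a boolean vector marking mitochondrial genes.
    """
    counts = np.asarray(counts, dtype=float)
    n_genes_by_counts = (counts > 0).sum(axis=1)
    total_counts = counts.sum(axis=1)
    mt_counts = counts[:, np.asarray(mt_mask)].sum(axis=1)
    pct_counts_mt = 100.0 * mt_counts / np.maximum(total_counts, 1)
    return (n_genes_by_counts >= min_genes) & (pct_counts_mt <= max_mt_pct)
```

In the real pipeline these metrics would come from Scanpy's QC utilities on the AnnData object rather than a dense array.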
## CLI Reference
```bash
# Standard usage
python skills/scrna-embedding/scrna_embedding.py \
  --input <input.h5ad> --output <report_dir>

# Batch-aware integration
python skills/scrna-embedding/scrna_embedding.py \
  --input <input.h5ad> --output <report_dir> \
  --batch-key sample_id

# 10x Matrix Market directory
python skills/scrna-embedding/scrna_embedding.py \
  --input <filtered_feature_bc_matrix_dir> --output <report_dir>

# Demo mode
python skills/scrna-embedding/scrna_embedding.py \
  --demo --output <report_dir>

# Via ClawBio runner
python clawbio.py run scrna-embedding --input <input.h5ad> --output <report_dir>
python clawbio.py run scrna-embedding --demo
```
## Demo
```bash
python clawbio.py run scrna-embedding --demo
python clawbio.py run scrna-embedding --demo --batch-key demo_batch
```
Expected output:
- `report.md` with scVI-specific embedding and integration summary
- `integrated.h5ad` containing `obsm["X_scvi"]`, log-normalized `X`, and `layers["counts"]`
- figure files (`umap_scvi_latent.png`)
- optional batch figure (`umap_scvi_batch.png`) when `--batch-key` is set
- batch diagnostics table (`batch_mixing_metrics.csv`) when `--batch-key` is set
- latent export table (`latent_embeddings.csv`)
- reproducibility bundle
- downstream command for `scrna-orchestrator --use-rep X_scvi`
## Algorithm / Methodology
1. **QC**:
- Compute `n_genes_by_counts`, `total_counts`, `pct_counts_mt`
- Filter by `min_genes`, `min_cells`, `max_mt_pct`
2. **Feature selection**:
- Normalize + `log1p` on the full-gene branch
- Select HVGs (`flavor="seurat"`) for scVI training
3. **Latent model**:
- Train `scvi.model.SCVI` on raw-count HVGs
- Include batch covariate when `--batch-key` is provided
4. **Latent downstream analysis**:
- Save `obsm["X_scvi"]`
- Run neighbors with `use_rep="X_scvi"`
- Compute UMAP
- Export per-cell latent coordinates to CSV
5. **Batch diagnostics**:
- Compute lightweight mixing diagnostics from the neighbor graph and batch labels
- Report cross-batch neighbor fraction, neighbor entropy, and batch silhouette
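Of the batch diagnostics above, the cross-batch neighbor fraction can be sketched directly from a kNN index matrix and batch labels; the function name is illustrative and the skill's actual metric code may differ:

```python
import numpy as np

def cross_batch_neighbor_fraction(knn_indices, batches):
    """Mean fraction of each cell's k nearest neighbors from another batch.

    Values near the cross-batch proportion expected under random mixing
    suggest good integration; values near 0 indicate batch separation.
    knn_indices: (n_cells, k) array of neighbor row indices.
    batches:     (n_cells,) array of batch labels.
    """
    batches = np.asarray(batches)
    neighbor_batches = batches[np.asarray(knn_indices)]  # (n_cells, k)
    own = batches[:, None]                               # (n_cells, 1)
    return float((neighbor_batches != own).mean())
```

Neighbor entropy and batch silhouette follow the same pattern, computed per cell from the neighbor graph and then averaged.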
## Example Queries
- "Run scVI on my h5ad file"
- "Integrate my batches with scvi-tools"
- "Build a latent embedding for this 10x matrix"
- "Export an integrated h5ad with X_scvi"
## Output Structure
```text
output_directory/
├── report.md
├── result.json
├── integrated.h5ad
├── figures/
│   ├── umap_scvi_latent.png
│   └── umap_scvi_batch.png        # only when batch integration is enabled
├── tables/
│   ├── latent_embeddings.csv
│   └── batch_mixing_metrics.csv   # only when batch integration is enabled
└── reproducibility/
    ├── commands.sh
    ├── environment.yml
    └── checksums.sha256
```
## Dependencies
**Required**:
- `scanpy` >= 1.10
- `anndata` >= 0.12
- `torch`
- `scvi-tools`
**Out of scope (v1)**:
- `scANVI`
- `totalVI`
- multimodal integration
- condition-level DE
- remote model downloads
## Safety
- **Local-first**: No patient data upload.
- **Disclaimer**: Reports include the ClawBio medical disclaimer.
- **Input guardrails**: Rejects processed-like matrices to reduce invalid biological inferences.
- **No remote model fetches**: v1 uses only local code and local data.
- **Reproducibility**: Writes command/environment/checksum bundle.
## Integration with Bio Orchestrator
**Trigger conditions**:
- User explicitly asks for `scvi`, latent embedding, batch integration, or batch correction
- Input is single-cell data and the request is specifically model-based embedding rather than generic Scanpy clustering
**Routing note**:
- Generic single-cell clustering / marker requests still belong to `scrna-orchestrator`
- `scrna-embedding` is the advanced entry point for scVI-style latent integration and export
## Citations
- [scvi-tools documentation](https://docs.scvi-tools.org/) — model API and training interface.
- [Scanpy documentation](https://scanpy.readthedocs.io/) — downstream AnnData analysis utilities.
- [AnnData documentation](https://anndata.readthedocs.io/) — single-cell data model.

Related Skills
clade-embeddings-search
Implement tool use (function calling) with Claude to let it execute actions, query databases, call APIs, and interact with external systems. Use when working with embeddings-search patterns. Trigger with "anthropic tool use", "claude function calling", "claude tools", "anthropic structured output with tools".
embedding-strategies
Select and optimize embedding models for semantic search and RAG applications. Use when choosing embedding models, implementing chunking strategies, or optimizing embedding quality for specific domains.
fiftyone-embeddings-visualization
Visualize datasets in 2D using embeddings with UMAP or t-SNE dimensionality reduction. Use when users want to explore dataset structure, find clusters in images, identify outliers, color samples by class or metadata, or understand data distribution. Requires FiftyOne MCP server with @voxel51/brain plugin installed.
Sentence Transformers - State-of-the-Art Embeddings
Python framework for sentence and text embeddings using transformers.
Chroma - Open-Source Embedding Database
The AI-native database for building LLM applications with memory.
scrna-orchestrator
Local Scanpy pipeline for single-cell RNA-seq QC, optional doublet detection, clustering, marker discovery, optional CellTypist annotation, optional latent downstream mode from integrated.h5ad/X_scvi, and optional two-group contrastive marker analysis from raw-count .h5ad or 10x Matrix Market input.