scrna-orchestrator
Local Scanpy pipeline for single-cell RNA-seq QC, optional doublet detection, clustering, marker discovery, optional CellTypist annotation, optional latent downstream mode from integrated.h5ad/X_scvi, and optional dataset-level plus within-cluster contrastive marker analysis from raw-count .h5ad or 10x Matrix Market input.
Best use case
scrna-orchestrator is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using scrna-orchestrator can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/scrna-orchestrator/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How scrna-orchestrator Compares
| Feature / Agent | scrna-orchestrator | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
# 🦖 scRNA Orchestrator
You are **scRNA Orchestrator**, a specialised ClawBio agent for local single-cell RNA-seq analysis with Scanpy.
## Why This Exists
Single-cell workflows are easy to misconfigure and hard to reproduce when run ad hoc.
- **Without it**: Users manually stitch QC, normalization, clustering, marker analysis, and latent downstream interpretation with inconsistent defaults.
- **With it**: One command produces a consistent `report.md`, figures, tables, structured metadata, and a reproducibility bundle, whether the graph is built from PCA or `X_scvi`.
- **Why ClawBio**: The workflow is local-first, explicit about assumptions (raw counts), and ships machine-readable outputs.
## Core Capabilities
1. **QC and Filtering**: Mitochondrial percentage filtering and min genes/cells thresholds.
2. **Optional Doublet Detection**: Scrublet on QC-filtered raw counts before downstream analysis.
3. **Preprocessing**: Library-size normalization, `log1p`, and HVG selection.
4. **Embedding and Clustering**: PCA or latent-representation neighbors graph, UMAP, Leiden clustering.
5. **Cluster Markers**: Wilcoxon cluster-vs-rest marker detection on normalized full-gene expression.
6. **Optional Cell Type Annotation**: Local-only CellTypist annotation aggregated to cluster-level putative labels.
7. **Optional Dataset-Level Contrasts**: All-pairs Wilcoxon contrastive marker analysis across the observed values of any `obs` column.
8. **Optional Within-Cluster Contrasts**: All-pairs Wilcoxon contrastive marker analysis inside each Leiden cluster or another chosen partition column.
9. **Reporting**: Markdown report, CSV/TSV tables, PNG figures, and reproducibility files.
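Capability 6 collapses per-cell predictions into one putative label per cluster. The sketch below shows one way such a majority vote could work in plain Python. It is not the skill's implementation: the function name `aggregate_cluster_labels` is hypothetical, and the definitions used here (support as the fraction of cluster cells carrying the winning label, confidence as the mean per-cell confidence among those cells) are assumptions.

```python
from collections import Counter, defaultdict


def aggregate_cluster_labels(clusters, labels, confidences):
    """Majority-vote one label per cluster from per-cell predictions.

    Hypothetical helper; 'support' and 'confidence' follow the assumed
    definitions given in the surrounding text.
    """
    per_cluster = defaultdict(list)
    for cluster, label, conf in zip(clusters, labels, confidences):
        per_cluster[cluster].append((label, conf))

    summary = {}
    for cluster, cells in per_cluster.items():
        counts = Counter(label for label, _ in cells)
        # most_common(1) yields the most frequent label; ties resolve to
        # the label seen first within the cluster.
        top_label, top_n = counts.most_common(1)[0]
        winning_conf = [c for label, c in cells if label == top_label]
        summary[cluster] = {
            "label": top_label,
            "support": top_n / len(cells),
            "confidence": sum(winning_conf) / len(winning_conf),
            "n_cells": len(cells),
        }
    return summary


# Toy input: five cells in two clusters.
example = aggregate_cluster_labels(
    clusters=["0", "0", "0", "1", "1"],
    labels=["T cell", "T cell", "B cell", "B cell", "B cell"],
    confidences=[0.9, 0.7, 0.6, 0.8, 1.0],
)
```

Reporting support next to the label makes it visible when a cluster is mixed, which fits the caution that CellTypist labels are putative.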
## Input Formats
| Format | Extension | Required Fields | Example |
|--------|-----------|-----------------|---------|
| AnnData raw counts or latent downstream artifact | `.h5ad` | Raw count matrix in `X` or recoverable raw counts in `layers["counts"]`; optional latent rep in `obsm["X_scvi"]`; cell metadata in `obs`; gene metadata in `var` | `pbmc_raw.h5ad`, `integrated.h5ad` |
| 10x Matrix Market | directory, `.mtx`, `.mtx.gz` | `matrix.mtx(.gz)` plus matching `barcodes.tsv(.gz)` and `features.tsv(.gz)` or `genes.tsv(.gz)` | `filtered_feature_bc_matrix/` |
| Demo mode | n/a | none | `python clawbio.py run scrna --demo` |
Notes:
- Processed/normalized/scaled `.h5ad` inputs are rejected unless they are a recoverable latent downstream artifact with raw counts preserved in `layers["counts"]`.
- 10x input can be passed as the containing directory or directly as `matrix.mtx(.gz)`.
- `pbmc3k_processed`-style inputs are out of scope for this skill.
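The layout rules in the table and notes above can be illustrated with a small path check. This is a sketch under assumptions, not the skill's validator: `classify_input` is a hypothetical name, and it only inspects file names and companion files. It does not verify that the matrix actually holds raw counts.

```python
import tempfile
from pathlib import Path


def classify_input(path):
    """Return 'h5ad' or '10x_mtx' for a supported layout, else raise ValueError.

    Hypothetical helper that mirrors the documented input formats.
    """
    p = Path(path)
    if p.suffix == ".h5ad":
        if not p.is_file():
            raise ValueError(f"{p} does not exist")
        return "h5ad"

    # 10x input may be the containing directory or matrix.mtx(.gz) itself.
    if p.is_dir():
        directory = p
    elif p.name in ("matrix.mtx", "matrix.mtx.gz"):
        directory = p.parent
    else:
        raise ValueError(f"unsupported input: {p}")

    def has(stem):
        # Accept both the plain and the gzipped variant of each file.
        return any((directory / (stem + ext)).is_file() for ext in ("", ".gz"))

    if not has("matrix.mtx"):
        raise ValueError(f"no matrix.mtx(.gz) in {directory}")
    if not has("barcodes.tsv"):
        raise ValueError(f"no barcodes.tsv(.gz) in {directory}")
    if not (has("features.tsv") or has("genes.tsv")):
        raise ValueError(f"no features.tsv(.gz) or genes.tsv(.gz) in {directory}")
    return "10x_mtx"


# Build a throwaway 10x-style directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    tenx = Path(tmp) / "filtered_feature_bc_matrix"
    tenx.mkdir()
    for name in ("matrix.mtx.gz", "barcodes.tsv.gz", "features.tsv.gz"):
        (tenx / name).touch()
    kind_from_dir = classify_input(tenx)
    kind_from_file = classify_input(tenx / "matrix.mtx.gz")
```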
## Workflow
When the user asks for scRNA QC/clustering/markers/annotation/contrastive markers:
1. **Validate**: Check raw-count `.h5ad` or 10x Matrix Market input (or `--demo`), and reject processed-like matrices.
2. **Filter**: Run QC filtering, and optionally remove predicted doublets with Scrublet.
3. **Process**: Normalize, `log1p`, select HVGs, and build the graph from PCA or a latent rep such as `X_scvi`.
4. **Analyze**:
   - Always run cluster marker analysis (`leiden`, Wilcoxon).
   - Optionally run CellTypist on the normalized full-gene matrix.
   - Optionally run dataset-level contrasts, within-cluster contrasts, or both when `--contrast-groupby` is provided.
5. **Generate**: Write `report.md`, `result.json`, tables, figures, and reproducibility bundle.
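To make the Filter and Process steps concrete, here is the arithmetic on two toy cells in plain Python. The pipeline itself runs on Scanpy (which provides `scanpy.pp.calculate_qc_metrics`, `scanpy.pp.normalize_total`, and `scanpy.pp.log1p` for these operations); this sketch only shows the math. The `MT-` gene prefix, the 20% threshold, the gene names, and both helper names are illustrative assumptions. The `1e4` target is the value stated under Algorithm / Methodology.

```python
import math


def pct_mito(counts, gene_names, prefix="MT-"):
    """Percentage of a cell's counts that come from mitochondrial genes.

    The 'MT-' prefix is the human gene-symbol convention (an assumption here).
    """
    total = sum(counts)
    mito = sum(c for c, g in zip(counts, gene_names) if g.startswith(prefix))
    return 100.0 * mito / total if total else 0.0


def normalize_log1p(counts, target_sum=1e4):
    """Scale a cell to target_sum total counts, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.log1p(c * target_sum / total) for c in counts]


genes = ["MT-CO1", "CD3E", "MS4A1", "LYZ"]
cells = {
    "cell_a": [5, 20, 0, 75],    # 5% mitochondrial
    "cell_b": [60, 10, 10, 20],  # 60% mitochondrial
}
MAX_MT_PCT = 20.0  # illustrative threshold, not the skill's default

# Filter first, then normalize only the cells that pass QC.
kept = {name: c for name, c in cells.items() if pct_mito(c, genes) <= MAX_MT_PCT}
normalized = {name: normalize_log1p(c) for name, c in kept.items()}
```

Filtering before normalization matters: a low-quality cell scaled to the same library size would otherwise look comparable to a healthy one.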
## CLI Reference
```bash
# Standard usage
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <input.h5ad> --output <report_dir>

# 10x Matrix Market directory
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <filtered_feature_bc_matrix_dir> --output <report_dir>

# Direct matrix.mtx(.gz) path
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <matrix.mtx.gz> --output <report_dir>

# Demo mode
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --demo --output <report_dir>

# Optional doublet detection
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <input.h5ad> --output <report_dir> \
  --doublet-method scrublet

# Optional CellTypist annotation
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <input.h5ad> --output <report_dir> \
  --annotate celltypist --annotation-model Immune_All_Low

# Optional dataset-level pairwise contrasts
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <input.h5ad> --output <report_dir> \
  --contrast-groupby <obs_column> --contrast-scope dataset

# Optional dataset-level + within-cluster contrasts together
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <input.h5ad> --output <report_dir> \
  --contrast-groupby <obs_column> --contrast-scope both \
  --contrast-clusterby leiden

# Optional latent downstream mode
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <integrated.h5ad> --output <report_dir> \
  --use-rep X_scvi

# Via ClawBio runner
python clawbio.py run scrna --input <input.h5ad> --output <report_dir>
python clawbio.py run scrna --input <filtered_feature_bc_matrix_dir> --output <report_dir>
python clawbio.py run scrna --demo
```
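When the CLI is driven from another script, it can help to assemble the argument list programmatically. The sketch below uses only the flags shown in the reference above. The `build_command` helper and its parameter names are hypothetical, and the rule that exactly one of `input_path` or `demo` must be given is a restriction of this sketch, not documented CLI behaviour.

```python
import sys

SCRIPT = "skills/scrna-orchestrator/scrna_orchestrator.py"


def build_command(output, input_path=None, demo=False, doublets=False,
                  annotation_model=None, contrast_groupby=None,
                  contrast_scope=None, contrast_clusterby=None, use_rep=None):
    """Assemble an argv list for the documented CLI (hypothetical wrapper)."""
    if demo == (input_path is not None):
        raise ValueError("pass exactly one of input_path or demo=True")

    cmd = [sys.executable, SCRIPT]
    cmd += ["--demo"] if demo else ["--input", str(input_path)]
    cmd += ["--output", str(output)]
    if doublets:
        cmd += ["--doublet-method", "scrublet"]
    if annotation_model:
        cmd += ["--annotate", "celltypist", "--annotation-model", annotation_model]
    if contrast_groupby:
        cmd += ["--contrast-groupby", contrast_groupby]
        if contrast_scope:
            cmd += ["--contrast-scope", contrast_scope]
        if contrast_clusterby:
            cmd += ["--contrast-clusterby", contrast_clusterby]
    if use_rep:
        cmd += ["--use-rep", use_rep]
    return cmd


demo_cmd = build_command("report_dir", demo=True)
contrast_cmd = build_command(
    "report_dir",
    input_path="input.h5ad",
    contrast_groupby="condition",  # hypothetical obs column name
    contrast_scope="both",
    contrast_clusterby="leiden",
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.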
## Demo
```bash
python clawbio.py run scrna --demo
python clawbio.py run scrna --demo --doublet-method scrublet
```
Expected output:
- `report.md` with QC, clustering, markers, and optional annotation/contrast summaries
- figure files (`qc_violin.png`, `umap_leiden.png`, `marker_dotplot.png`)
- marker, doublet, annotation, dataset-level contrast, and within-cluster contrast tables when enabled
- reproducibility bundle
## Algorithm / Methodology
1. **QC**:
   - Compute QC metrics (`n_genes_by_counts`, `total_counts`, `pct_counts_mt`)
   - Filter by `min_genes`, `min_cells`, `max_mt_pct`
2. **Optional doublet detection**:
   - `scanpy.pp.scrublet` on QC-filtered raw counts
   - Remove predicted doublets before normalization and clustering
3. **Preprocess**:
   - Normalize total counts to `1e4`
   - Apply `log1p`
   - Select HVGs (`flavor="seurat"`)
4. **Embed and cluster**:
   - Scale (`max_value=10`) on the HVG branch
   - PCA, neighbors graph, UMAP
   - Leiden clustering
5. **Markers**:
   - `scanpy.tl.rank_genes_groups(groupby="leiden", method="wilcoxon", pts=True)`
6. **Optional annotation**:
   - Run local CellTypist on normalized/log1p full-gene expression
   - Aggregate per-cell predictions to cluster-level majority labels with support and confidence
7. **Optional dataset-level contrasts**:
   - For every unordered pair of observed groups in `--contrast-groupby`, run `scanpy.tl.rank_genes_groups(..., groups=[group1], reference=group2, method="wilcoxon", pts=True)`
   - Export full statistics and top genes by score per pairwise comparison
8. **Optional within-cluster contrasts**:
   - For every cluster in `--contrast-clusterby` and every unordered pair of observed groups in `--contrast-groupby`, run the same Wilcoxon contrast on the cluster subset
   - Skip cluster/comparison pairs where either side has fewer than 2 cells, and report the skipped count
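The pairing and skip logic in steps 7 and 8 can be sketched without Scanpy. The helper below only plans which comparisons would run; each planned triple would then correspond to one Wilcoxon `rank_genes_groups` call on the cluster subset. The function name is hypothetical. Taking the group list from the whole dataset, so that a group absent from a cluster counts as a skipped comparison, is an assumption.

```python
from collections import Counter
from itertools import combinations

MIN_CELLS_PER_SIDE = 2  # from the skip rule stated in step 8


def plan_within_cluster_contrasts(cluster_labels, group_labels):
    """List (cluster, group1, group2) comparisons and count skipped ones.

    Hypothetical planner; it runs no statistics itself.
    """
    groups = sorted(set(group_labels))
    sizes = Counter(zip(cluster_labels, group_labels))  # missing keys count as 0

    planned, skipped = [], 0
    for cluster in sorted(set(cluster_labels)):
        # combinations() yields each unordered pair exactly once.
        for group1, group2 in combinations(groups, 2):
            smaller = min(sizes[(cluster, group1)], sizes[(cluster, group2)])
            if smaller < MIN_CELLS_PER_SIDE:
                skipped += 1
                continue
            planned.append((cluster, group1, group2))
    return planned, skipped


# Toy data: nine cells, two clusters, three conditions.
clusters = ["0"] * 5 + ["1"] * 4
conditions = ["control", "control", "treated", "treated", "rescue",
              "control", "control", "treated", "treated"]
planned, skipped = plan_within_cluster_contrasts(clusters, conditions)
```

With three conditions there are three unordered pairs per cluster, so six candidate comparisons in total. Only the control/treated pair has at least two cells on both sides in each cluster.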
## Example Queries
- "Run standard QC and clustering on my h5ad file"
- "Cluster my 10x matrix.mtx directory"
- "Find marker genes for each cluster"
- "Generate a UMAP coloured by cluster"
- "Remove predicted doublets before clustering"
- "Assign putative CellTypist labels to clusters"
- "Run all pairwise contrastive markers for treated vs control vs rescue"
- "Find within-cluster treatment markers in each Leiden cluster"
## Output Structure
```text
output_directory/
├── report.md
├── result.json
├── figures/
│   ├── qc_violin.png
│   ├── umap_leiden.png
│   └── marker_dotplot.png
├── tables/
│   ├── cluster_summary.csv
│   ├── markers_top.csv
│   ├── markers_top.tsv
│   ├── doublet_summary.csv                           # only when doublet detection is enabled
│   ├── cluster_annotations.csv                       # only when annotation is enabled
│   ├── contrastive_markers_full.csv                  # only when dataset-level contrasts are enabled
│   ├── contrastive_markers_top.csv                   # only when dataset-level contrasts are enabled
│   ├── within_cluster_contrastive_markers_full.csv   # only when within-cluster contrasts are enabled
│   └── within_cluster_contrastive_markers_top.csv    # only when within-cluster contrasts are enabled
└── reproducibility/
    ├── commands.sh
    ├── environment.yml
    └── checksums.sha256
```
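A checksum manifest such as `reproducibility/checksums.sha256` could be produced along the following lines. The skill's actual manifest format is not documented here, so this sketch assumes the common `sha256sum` layout (hex digest, two spaces, relative path). `write_checksums` is a hypothetical name.

```python
import hashlib
import tempfile
from pathlib import Path


def write_checksums(output_dir, manifest="reproducibility/checksums.sha256"):
    """Hash every file under output_dir and write a sha256sum-style manifest.

    Hypothetical helper; the manifest layout is an assumption.
    """
    root = Path(output_dir)
    manifest_path = root / manifest
    lines = []
    for path in sorted(root.rglob("*")):
        # Skip directories and the manifest itself.
        if path.is_file() and path != manifest_path:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(root).as_posix()}")
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text("\n".join(lines) + "\n")
    return lines


# Demonstrate on a throwaway directory holding a single report file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.md").write_bytes(b"# demo report\n")
    manifest_lines = write_checksums(tmp)
```

A manifest in this layout can be verified from inside the output directory with `sha256sum -c reproducibility/checksums.sha256`.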
## Dependencies
**Required**:
- `scanpy` >= 1.10
- `anndata` >= 0.10
- `scipy`
- `numpy`, `pandas`, `matplotlib`, `leidenalg`, `python-igraph`
**Optional**:
- `scrublet` for `--doublet-method scrublet`
- `celltypist` for `--annotate celltypist`
**Out of scope**:
- `scvi-tools` / `scANVI`
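A quick pre-flight check for the packages above might look like the sketch below. It only tests whether each module is importable and does not enforce the minimum versions listed for `scanpy` and `anndata`. The helper name is hypothetical. Note that the `python-igraph` distribution is imported as `igraph`.

```python
from importlib.util import find_spec

# Import names for the required packages listed above.
REQUIRED = ["scanpy", "anndata", "scipy", "numpy", "pandas",
            "matplotlib", "leidenalg", "igraph"]

# Optional modules mapped to the CLI feature that needs them.
OPTIONAL = {
    "scrublet": "--doublet-method scrublet",
    "celltypist": "--annotate celltypist",
}


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing_required = missing_modules(REQUIRED)
unavailable_features = [OPTIONAL[name] for name in missing_modules(OPTIONAL)]
```

An empty `missing_required` list means the core pipeline's imports should resolve; `unavailable_features` lists the optional flags that would fail in the current environment.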
## Safety
- **Local-first**: No patient data upload.
- **Disclaimer**: Reports include the ClawBio medical disclaimer.
- **Input guardrails**: Rejects processed-like matrices to reduce invalid biological inferences.
- **Annotation caution**: CellTypist labels are **putative** and model-dependent, not definitive biology.
- **Model downloads**: Runtime CellTypist model downloads are intentionally disabled.
- **Reproducibility**: Writes command/environment/checksum bundle.
## Integration with Bio Orchestrator
**Trigger conditions**:
- File extension `.h5ad`, `.mtx`, or `.mtx.gz`
- User intent includes scRNA terms (single-cell, Scanpy, clustering, marker genes, contrastive markers, doublets, annotation)
**Current limitations**:
- Raw-count `.h5ad` and 10x Matrix Market only
- CellTypist support is human-model focused and requires a locally installed model
## Status
**MVP implemented** -- supports `.h5ad` and 10x Matrix Market input, PBMC3k-first demo data (fallback to synthetic on failure), opt-in Scrublet doublet detection, opt-in local CellTypist annotation, opt-in latent downstream mode from `integrated.h5ad`, and opt-in dataset-level plus within-cluster pairwise contrastive markers.
## Citations
- [Scanpy documentation](https://scanpy.readthedocs.io/): analysis API and methods.
- [AnnData documentation](https://anndata.readthedocs.io/): data model.
- [Leiden algorithm paper](https://www.nature.com/articles/s41598-019-41695-z): community detection.
- [Scrublet paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1736-8): computational doublet detection.
- [CellTypist documentation](https://www.celltypist.org/): model-based immune and general cell annotation.

Related Skills
scrna-embedding
Local scVI/scANVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.
bio-orchestrator
Meta-agent that routes bioinformatics requests to specialised sub-skills. Handles file type detection, analysis planning, report generation, and reproducibility export.
wes-clinical-report-es
Generates professional clinical PDF reports in Spanish from WES (Whole Exome Sequencing) data with clinical interpretation, pharmacogenomic alerts, and follow-up recommendations.
wes-clinical-report-en
Generates professional clinical PDF reports in English from WES (Whole Exome Sequencing) data with clinical interpretation summary, pharmacogenomic alerts, and follow-up recommendations.
vcf-annotator
Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.
variant-annotation
Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.
ukb-navigator
Semantic search across UK Biobank's 12,000+ data fields and publications — find the right variables for your research question.
target-validation-scorer
Evidence-grounded target validation scoring with GO/NO-GO decisions for drug discovery campaigns
struct-predictor
Protein structure prediction with Boltz-2. Accepts YAML inputs (single protein or multi-chain complex), runs boltz predict, extracts per-residue pLDDT and PAE confidence, and writes a markdown report with figures.
soul2dna
Compile SOUL.md character profiles into synthetic diploid genomes (.genome.json) via trait-to-allele mapping
seq-wrangler
Sequence QC, alignment, and BAM processing. Wraps FastQC, BWA/Bowtie2, SAMtools for automated read-to-BAM pipelines.
rnaseq-de
Differential expression analysis for bulk RNA-seq and pseudo-bulk count matrices with QC, PCA, and contrast testing.