bio-single-cell-doublet-detection

Detect and remove doublets (multiple cells captured in one droplet) from single-cell RNA-seq data. Uses Scrublet (Python), DoubletFinder (R), and scDblFinder (R). Essential QC step before clustering to avoid artificial cell populations. Use when identifying and removing doublets from scRNA-seq data.

1,802 stars

Best use case

bio-single-cell-doublet-detection is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Detect and remove doublets (multiple cells captured in one droplet) from single-cell RNA-seq data. Uses Scrublet (Python), DoubletFinder (R), and scDblFinder (R). Essential QC step before clustering to avoid artificial cell populations. Use when identifying and removing doublets from scRNA-seq data.

Teams using bio-single-cell-doublet-detection should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/bio-single-cell-doublet-detection/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/bio-single-cell-doublet-detection/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/bio-single-cell-doublet-detection/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How bio-single-cell-doublet-detection Compares

Feature / Agentbio-single-cell-doublet-detectionStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Detect and remove doublets (multiple cells captured in one droplet) from single-cell RNA-seq data. Uses Scrublet (Python), DoubletFinder (R), and scDblFinder (R). Essential QC step before clustering to avoid artificial cell populations. Use when identifying and removing doublets from scRNA-seq data.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

## Version Compatibility

Reference examples tested with: matplotlib 3.8+, numpy 1.26+, scanpy 1.10+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Doublet Detection

Doublets are droplets containing two or more cells. They appear as artificial intermediate cell populations and must be removed before analysis.

## Scrublet (Python)

**Goal:** Detect and score doublets in scRNA-seq data using simulated doublet profiles.

**Approach:** Simulate artificial doublets by combining random cell pairs, embed real and simulated cells together, and score each cell's similarity to simulated doublets.

**"Remove doublets from my data"** → Identify droplets containing multiple cells by comparing each cell's profile to computationally simulated doublets, then filter flagged cells.

### Basic Usage

```python
import scrublet as scr
import scanpy as sc
import numpy as np

adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets()

adata.obs['doublet_score'] = doublet_scores
adata.obs['predicted_doublet'] = predicted_doublets

print(f'Detected {predicted_doublets.sum()} doublets ({100*predicted_doublets.mean():.1f}%)')
```

### Adjust Parameters

```python
scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets(
    min_counts=2,
    min_cells=3,
    min_gene_variability_pctl=85,
    n_prin_comps=30,
    synthetic_doublet_umi_subsampling=1.0
)
```

### Visualize Doublet Scores

```python
import matplotlib.pyplot as plt

scrub.plot_histogram()
plt.savefig('doublet_histogram.pdf')

# UMAP with doublet scores
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

sc.pl.umap(adata, color=['doublet_score', 'predicted_doublet'], save='_doublets.pdf')
```

### Filter Doublets

```python
adata_filtered = adata[~adata.obs['predicted_doublet']].copy()
print(f'Kept {adata_filtered.n_obs} cells after doublet removal')
```

### Set Manual Threshold

```python
scrub = scr.Scrublet(adata.X)
doublet_scores, _ = scrub.scrub_doublets()

threshold = 0.25
predicted_doublets = doublet_scores > threshold
adata.obs['predicted_doublet'] = predicted_doublets
```

## DoubletFinder (R)

**Goal:** Detect doublets in Seurat objects using DoubletFinder's pANN-based classification.

**Approach:** Optimize the pK neighborhood parameter via parameter sweep, compute artificial nearest neighbor proportions, and classify cells as singlets or doublets.

### Basic Usage

```r
library(Seurat)
library(DoubletFinder)

seurat_obj <- Read10X(data.dir = 'filtered_feature_bc_matrix/')
seurat_obj <- CreateSeuratObject(counts = seurat_obj, min.cells = 3, min.features = 200)

seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- RunUMAP(seurat_obj, dims = 1:20)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5)

sweep.res <- paramSweep(seurat_obj, PCs = 1:20, sct = FALSE)
sweep.stats <- summarizeSweep(sweep.res, GT = FALSE)
bcmvn <- find.pK(sweep.stats)

optimal_pk <- as.numeric(as.character(bcmvn$pK[which.max(bcmvn$BCmetric)]))

nExp_poi <- round(0.06 * nrow(seurat_obj@meta.data))
seurat_obj <- doubletFinder(seurat_obj, PCs = 1:20, pN = 0.25, pK = optimal_pk,
                             nExp = nExp_poi, reuse.pANN = FALSE, sct = FALSE)

colnames(seurat_obj@meta.data)
```

### With SCTransform

```r
seurat_obj <- SCTransform(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- RunUMAP(seurat_obj, dims = 1:30)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:30)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5)

sweep.res <- paramSweep(seurat_obj, PCs = 1:30, sct = TRUE)
sweep.stats <- summarizeSweep(sweep.res, GT = FALSE)
bcmvn <- find.pK(sweep.stats)

optimal_pk <- as.numeric(as.character(bcmvn$pK[which.max(bcmvn$BCmetric)]))
nExp_poi <- round(0.06 * nrow(seurat_obj@meta.data))

seurat_obj <- doubletFinder(seurat_obj, PCs = 1:30, pN = 0.25, pK = optimal_pk,
                             nExp = nExp_poi, reuse.pANN = FALSE, sct = TRUE)
```

### Filter Doublets

```r
df_col <- grep('DF.classifications', colnames(seurat_obj@meta.data), value = TRUE)
seurat_obj$doublet <- seurat_obj@meta.data[[df_col]]

DimPlot(seurat_obj, group.by = 'doublet')

seurat_obj <- subset(seurat_obj, subset = doublet == 'Singlet')
```

### Adjust Expected Doublet Rate

```r
n_cells <- ncol(seurat_obj)
doublet_rate <- n_cells / 1000 * 0.008
nExp_poi <- round(doublet_rate * n_cells)
```

## scDblFinder (R/Bioconductor)

**Goal:** Detect doublets using scDblFinder's gradient-boosted classifier for fast, accurate identification.

**Approach:** Simulate doublets, train a gradient boosting classifier on real vs simulated profiles, and score each cell.

### Basic Usage

```r
library(scDblFinder)
library(SingleCellExperiment)

sce <- SingleCellExperiment(assays = list(counts = counts_matrix))
sce <- scDblFinder(sce)

table(sce$scDblFinder.class)
```

### From Seurat Object

```r
library(scDblFinder)
library(Seurat)

sce <- as.SingleCellExperiment(seurat_obj)

sce <- scDblFinder(sce)

seurat_obj$scDblFinder_class <- sce$scDblFinder.class
seurat_obj$scDblFinder_score <- sce$scDblFinder.score

DimPlot(seurat_obj, group.by = 'scDblFinder_class')

seurat_obj <- subset(seurat_obj, subset = scDblFinder_class == 'singlet')
```

### Multi-Sample Processing

```r
sce <- scDblFinder(sce, samples = 'sample_id')
```

### Adjust Parameters

```r
sce <- scDblFinder(sce,
    dbr = 0.06,
    dbr.sd = 0.015,
    nfeatures = 1500,
    dims = 20,
    k = 30
)
```

## Expected Doublet Rates

| Cells Loaded | Expected Rate |
|--------------|---------------|
| 1,000 | ~0.8% |
| 2,000 | ~1.6% |
| 5,000 | ~4.0% |
| 10,000 | ~8.0% |
| 15,000 | ~12% |

Formula: `rate ≈ cells_loaded / 1000 * 0.008`

## Compare Methods

```r
library(scDblFinder)

seurat_obj$scrublet <- scrublet_results
sce <- as.SingleCellExperiment(seurat_obj)
sce <- scDblFinder(sce)
seurat_obj$scDblFinder <- sce$scDblFinder.class

DimPlot(seurat_obj, group.by = c('doublet', 'scDblFinder', 'scrublet'), ncol = 3)

table(seurat_obj$doublet, seurat_obj$scDblFinder)
```

## Handling Heterotypic vs Homotypic Doublets

### Heterotypic Doublets
- Two different cell types
- Easier to detect (intermediate expression)
- All methods handle well

### Homotypic Doublets
- Same cell type
- Harder to detect (no intermediate signature)
- May have higher total counts

```python
adata.obs['log_counts'] = np.log1p(adata.obs['total_counts'])
sc.pl.violin(adata, 'log_counts', groupby='predicted_doublet')
```

## Scanpy Integration Pipeline

**Goal:** Run doublet detection as part of a complete Scanpy preprocessing workflow.

**Approach:** Detect and remove doublets with Scrublet before QC filtering, then proceed through normalization, HVG selection, and clustering.

```python
import scanpy as sc
import scrublet as scr

adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)

scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
adata.obs['doublet_score'] = doublet_scores
adata.obs['is_doublet'] = predicted_doublets

print(f'Before filtering: {adata.n_obs} cells')
adata = adata[~adata.obs['is_doublet']].copy()
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
print(f'After filtering: {adata.n_obs} cells')

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata)
```

## Seurat Integration Pipeline

**Goal:** Run DoubletFinder as part of a complete Seurat preprocessing workflow.

**Approach:** Preprocess and cluster, run DoubletFinder parameter sweep and classification, filter doublets, then re-preprocess clean singlets.

```r
library(Seurat)
library(DoubletFinder)

seurat_obj <- Read10X('filtered_feature_bc_matrix/')
seurat_obj <- CreateSeuratObject(counts = seurat_obj, min.cells = 3, min.features = 200)

seurat_obj[['percent.mt']] <- PercentageFeatureSet(seurat_obj, pattern = '^MT-')

seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- RunUMAP(seurat_obj, dims = 1:20)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5)

sweep.res <- paramSweep(seurat_obj, PCs = 1:20)
sweep.stats <- summarizeSweep(sweep.res)
bcmvn <- find.pK(sweep.stats)
pk <- as.numeric(as.character(bcmvn$pK[which.max(bcmvn$BCmetric)]))
nExp <- round(0.06 * ncol(seurat_obj))

seurat_obj <- doubletFinder(seurat_obj, PCs = 1:20, pN = 0.25, pK = pk, nExp = nExp)

df_col <- grep('DF.classifications', colnames(seurat_obj@meta.data), value = TRUE)
seurat_obj <- subset(seurat_obj, cells = colnames(seurat_obj)[seurat_obj@meta.data[[df_col]] == 'Singlet'])
seurat_obj <- subset(seurat_obj, subset = percent.mt < 20)

seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- RunUMAP(seurat_obj, dims = 1:20)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)
seurat_obj <- FindClusters(seurat_obj)
```

## Method Comparison

| Method | Speed | Accuracy | Language |
|--------|-------|----------|----------|
| Scrublet | Fast | Good | Python |
| DoubletFinder | Slow | Good | R |
| scDblFinder | Fast | Excellent | R |

## Related Skills

- preprocessing - QC before doublet detection
- clustering - Run after filtering doublets
- data-io - Load data before processing

Related Skills

tooluniverse-single-cell

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Production-ready single-cell and expression matrix analysis using scanpy, anndata, and scipy. Performs scRNA-seq QC, normalization, PCA, UMAP, Leiden/Louvain clustering, differential expression (Wilcoxon, t-test, DESeq2), cell type annotation, per-cell-type statistical analysis, gene-expression correlation, batch correction (Harmony), trajectory inference, and cell-cell communication analysis. NEW: Analyzes ligand-receptor interactions between cell types using OmniPath (CellPhoneDB, CellChatDB), scores communication strength, identifies signaling cascades, and handles multi-subunit receptor complexes. Integrates with ToolUniverse gene annotation tools (HPA, Ensembl, MyGene, UniProt) and enrichment tools (gseapy, PANTHER, STRING). Supports h5ad, 10X, CSV/TSV count matrices, and pre-annotated datasets. Use when analyzing single-cell RNA-seq data, studying cell-cell interactions, performing cell type differential expression, computing gene-expression correlations by cell type, analyzing tumor-immune communication, or answering questions about scRNA-seq datasets.

tooluniverse-adverse-event-detection

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Detect and analyze adverse drug event signals using FDA FAERS data, drug labels, disproportionality analysis (PRR, ROR, IC), and biomedical evidence. Generates quantitative safety signal scores (0-100) with evidence grading. Use for post-market surveillance, pharmacovigilance, drug safety assessment, adverse event investigation, and regulatory decision support.

single-trajectory-analysis

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Guide to reproducing OmicVerse trajectory workflows spanning PAGA, Palantir, VIA, velocity coupling, and fate scoring notebooks.

single2spatial-spatial-mapping

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Map scRNA-seq atlases onto spatial transcriptomics slides using omicverse's Single2Spatial workflow for deep-forest training, spot-level assessment, and marker visualisation.

single-cell-preprocessing-with-omicverse

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Walk through omicverse's single-cell preprocessing tutorials to QC PBMC3k data, normalise counts, detect HVGs, and run PCA/embedding pipelines on CPU, CPU–GPU mixed, or GPU stacks.

single-cell-multi-omics-integration

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Quick-reference sheet for OmicVerse tutorials spanning MOFA, GLUE pairing, SIMBA integration, TOSICA transfer, and StaVIA cartography.

single-cell-downstream-analysis

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Checklist-style reference for OmicVerse downstream tutorials covering AUCell scoring, metacell DEG, and related exports.

single-cell-clustering-and-batch-correction-with-omicverse

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Guide Claude through omicverse's single-cell clustering workflow, covering preprocessing, QC, multimethod clustering, topic modeling, cNMF, and cross-batch integration as demonstrated in t_cluster.ipynb and t_single_batch.ipynb.

single-cell-cellphonedb-communication-mapping

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Run omicverse's CellPhoneDB v5 wrapper on annotated single-cell data to infer ligand-receptor networks and produce CellChat-style visualisations.

single-cell-rna-qc

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.

single-cell-annotation-skills-with-omicverse

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Guide Claude through SCSA, MetaTiME, CellVote, CellMatch, GPTAnno, and weighted KNN transfer workflows for annotating single-cell modalities.

crisis-detection-intervention-ai

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Detect crisis signals in user content using NLP, mental health sentiment analysis, and safe intervention protocols. Implements suicide ideation detection, automated escalation, and crisis resource integration. Use for mental health apps, recovery platforms, support communities. Activate on "crisis detection", "suicide prevention", "mental health NLP", "intervention protocol". NOT for general sentiment analysis, medical diagnosis, or replacing professional help.