bio-differential-expression-batch-correction

Remove batch effects from RNA-seq data using ComBat, ComBat-Seq, limma removeBatchEffect, and SVA for unknown batch variables. Use when correcting batch effects in expression data.

1,802 stars

byFreedomIntelligence

View on GitHub Installation ↓

Best use case

bio-differential-expression-batch-correction is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Remove batch effects from RNA-seq data using ComBat, ComBat-Seq, limma removeBatchEffect, and SVA for unknown batch variables. Use when correcting batch effects in expression data.

Teams using bio-differential-expression-batch-correction should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/bio-differential-expression-batch-correction/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/bio-differential-expression-batch-correction/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/bio-differential-expression-batch-correction/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How bio-differential-expression-batch-correction Compares

Feature / Agent	bio-differential-expression-batch-correction	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Remove batch effects from RNA-seq data using ComBat, ComBat-Seq, limma removeBatchEffect, and SVA for unknown batch variables. Use when correcting batch effects in expression data.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

## Version Compatibility

Reference examples tested with: DESeq2 1.42+, ggplot2 3.5+, limma 3.58+, scanpy 1.10+

Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Batch Effect Correction

## ComBat-Seq (Count Data)

**Goal:** Remove batch effects from raw count data while preserving biological group differences.

**Approach:** Apply ComBat-Seq's negative binomial regression to adjust counts, keeping the integer nature of the data.

**"Remove batch effects from my RNA-seq counts"** → Adjust raw count matrix for known batch labels using negative binomial modeling, preserving biological condition effects.

```r
library(sva)

# counts: raw count matrix (genes x samples)
# batch: vector of batch labels
# group: vector of biological condition (optional, to preserve)

corrected_counts <- ComBat_seq(counts = as.matrix(counts),
                                batch = batch,
                                group = condition,
                                full_mod = TRUE)

# Result is batch-corrected count matrix
# Use for visualization, clustering, but NOT for DE (use design formula instead)
```

## ComBat (Normalized Data)

**Goal:** Remove batch effects from normalized (log-transformed or TPM) expression data.

**Approach:** Apply parametric empirical Bayes adjustment to normalized expression while protecting biological covariates.

```r
library(sva)

# For normalized expression (log-transformed, TPM, etc.)
# NOT for raw counts

# Create model matrix
mod <- model.matrix(~ condition, data = metadata)
mod0 <- model.matrix(~ 1, data = metadata)

# Run ComBat
corrected_expr <- ComBat(dat = as.matrix(normalized_expr),
                          batch = metadata$batch,
                          mod = mod,
                          par.prior = TRUE)
```

## limma removeBatchEffect

**Goal:** Produce batch-corrected expression values for visualization while preserving group differences.

**Approach:** Regress out the batch effect from normalized expression using limma's linear model.

```r
library(limma)

# For visualization/clustering only
# Preserves group differences while removing batch

design <- model.matrix(~ condition, data = metadata)
corrected_expr <- removeBatchEffect(normalized_expr,
                                     batch = metadata$batch,
                                     design = design)

# For PCA, heatmaps, etc.
```

## DESeq2 Design Formula (Recommended for DE)

**Goal:** Account for batch effects during DE testing without modifying the count data.

**Approach:** Include batch as a covariate in the DESeq2 design formula so batch variance is modeled, not removed.

```r
library(DESeq2)

# Include batch in design formula - preferred for DE analysis
dds <- DESeqDataSetFromMatrix(countData = counts,
                               colData = metadata,
                               design = ~ batch + condition)

# Batch is modeled, not removed
# DE results are adjusted for batch
dds <- DESeq(dds)
res <- results(dds, contrast = c('condition', 'treatment', 'control'))
```

## Surrogate Variable Analysis (SVA)

**Goal:** Discover and correct for unknown sources of variation (hidden batch effects).

**Approach:** Estimate surrogate variables from the residual variation not explained by the biological model.

**"Correct for unknown batch effects in my expression data"** → Estimate latent surrogate variables capturing unwanted variation, then include them as covariates in the DE model.

```r
library(sva)

# When batch is unknown, estimate surrogate variables
mod <- model.matrix(~ condition, data = metadata)
mod0 <- model.matrix(~ 1, data = metadata)

# Estimate number of surrogate variables
n_sv <- num.sv(normalized_expr, mod, method = 'leek')

# Estimate surrogate variables
svobj <- sva(normalized_expr, mod, mod0, n.sv = n_sv)

# Add SVs to design for DE
design_with_sv <- cbind(mod, svobj$sv)
```

## SVA with DESeq2

**Goal:** Integrate surrogate variables into DESeq2 to adjust for hidden confounders during DE testing.

**Approach:** Estimate SVs from normalized counts, add them to colData, and update the design formula.

```r
library(DESeq2)
library(sva)

# Normalize for SV estimation
dds <- DESeqDataSetFromMatrix(countData = counts, colData = metadata, design = ~ condition)
dds <- estimateSizeFactors(dds)
norm_counts <- counts(dds, normalized = TRUE)

# Estimate SVs
mod <- model.matrix(~ condition, data = metadata)
mod0 <- model.matrix(~ 1, data = metadata)
svobj <- sva(norm_counts, mod, mod0)

# Add SVs to colData
for (i in seq_len(ncol(svobj$sv))) {
    colData(dds)[[paste0('SV', i)]] <- svobj$sv[, i]
}

# Update design
sv_formula <- as.formula(paste('~', paste(paste0('SV', 1:ncol(svobj$sv)), collapse = ' + '), '+ condition'))
design(dds) <- sv_formula

# Run DESeq2
dds <- DESeq(dds)
```

## Visualize Batch Effects

**Goal:** Confirm batch effect removal by comparing PCA plots before and after correction.

**Approach:** Run PCA on pre- and post-correction expression, coloring points by batch and condition.

```r
library(ggplot2)

# PCA before correction
pca_before <- prcomp(t(normalized_expr), scale. = TRUE)
pca_df <- data.frame(PC1 = pca_before$x[, 1], PC2 = pca_before$x[, 2],
                     batch = metadata$batch, condition = metadata$condition)

p1 <- ggplot(pca_df, aes(PC1, PC2, color = batch, shape = condition)) +
    geom_point(size = 3) + ggtitle('Before Correction')

# PCA after correction
pca_after <- prcomp(t(corrected_expr), scale. = TRUE)
pca_df_after <- data.frame(PC1 = pca_after$x[, 1], PC2 = pca_after$x[, 2],
                           batch = metadata$batch, condition = metadata$condition)

p2 <- ggplot(pca_df_after, aes(PC1, PC2, color = batch, shape = condition)) +
    geom_point(size = 3) + ggtitle('After Correction')

library(patchwork)
p1 + p2
```

## Quantify Batch Effect

**Goal:** Measure the proportion of variance attributable to batch versus biological condition.

**Approach:** Correlate principal components with batch and condition labels, or use PVCA.

```r
# PVCA - Principal Variance Component Analysis
library(pvca)

# Proportion of variance explained by batch vs condition
pvcaObj <- pvcaBatchAssess(normalized_expr, metadata, threshold = 0.6,
                            theInteractionTerms = c('batch', 'condition'))

# Or manual approach
pca <- prcomp(t(normalized_expr), scale. = TRUE)
variance_explained <- summary(pca)$importance[2, 1:5]

# Correlation of PCs with batch
cor(pca$x[, 1], as.numeric(as.factor(metadata$batch)))
```

## Harmony (Single-Cell Integration)

**Goal:** Integrate single-cell data from multiple batches into a shared embedding.

**Approach:** Apply Harmony to PCA embeddings to iteratively remove batch effects while preserving cell-type structure.

```r
library(harmony)
library(Seurat)

# For single-cell data with multiple batches
seurat_obj <- RunHarmony(seurat_obj, group.by.vars = 'batch', reduction = 'pca',
                          dims.use = 1:30)

# Use harmony reduction for downstream
seurat_obj <- RunUMAP(seurat_obj, reduction = 'harmony', dims = 1:30)
seurat_obj <- FindNeighbors(seurat_obj, reduction = 'harmony', dims = 1:30)
```

## When NOT to Correct

```r
# DON'T use batch-corrected values for:
# - Differential expression (use design formula instead)
# - Count-based methods expecting raw/normalized counts

# DO use batch-corrected values for:
# - Visualization (PCA, UMAP, heatmaps)
# - Clustering
# - Machine learning features
# - Cross-study comparisons
```

## Related Skills

- differential-expression/deseq2-basics - DE with batch in design
- single-cell/clustering - Integration methods
- expression-matrix/matrix-operations - Data transformation

Related Skills

tooluniverse-expression-data-retrieval

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Retrieves gene expression and omics datasets from ArrayExpress and BioStudies with gene disambiguation, experiment quality assessment, and structured reports. Creates comprehensive dataset profiles with metadata, sample information, and download links. Use when users need expression data, omics datasets, or mention ArrayExpress (E-MTAB, E-GEOD) or BioStudies (S-BSST) accessions.

single-cell-clustering-and-batch-correction-with-omicverse

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Guide Claude through omicverse's single-cell clustering workflow, covering preprocessing, QC, multimethod clustering, topic modeling, cNMF, and cross-batch integration as demonstrated in t_cluster.ipynb and t_single_batch.ipynb.

cell-free-expression

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Guidance for cell-free protein synthesis (CFPS) optimization. Use when: (1) Planning CFPS experiments, (2) Troubleshooting low yield or aggregation, (3) Optimizing DNA template design for CFPS, (4) Expressing difficult proteins (disulfide-rich, toxic, membrane).

bulk-rna-seq-differential-expression-with-omicverse

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Guide Claude through omicverse's bulk RNA-seq DEG pipeline, from gene ID mapping and DESeq2 normalization to statistical testing, visualization, and pathway enrichment. Use when a user has bulk count matrices and needs differential expression analysis in omicverse.

bulk-rna-seq-batch-correction-with-combat

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Use omicverse's pyComBat wrapper to remove batch effects from merged bulk RNA-seq or microarray cohorts, export corrected matrices, and benchmark pre/post correction visualisations.

bio-single-cell-batch-integration

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Integrate multiple scRNA-seq samples/batches using Harmony, scVI, Seurat anchors, and fastMNN. Remove technical variation while preserving biological differences. Use when integrating multiple scRNA-seq batches or datasets.

bio-proteomics-differential-abundance

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Statistical testing for differentially abundant proteins between conditions. Covers limma and MSstats workflows with multiple testing correction. Use when identifying proteins with significant abundance changes between experimental groups.

bio-microbiome-differential-abundance

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Differential abundance testing for microbiome data using compositionally-aware methods like ALDEx2, ANCOM-BC2, and MaAsLin2. Use when identifying taxa that differ between experimental groups while accounting for the compositional nature of microbiome data.

bio-hi-c-analysis-hic-differential

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Compare Hi-C contact matrices between conditions to identify differential chromatin interactions. Compute log2 fold changes, statistical significance, and visualize differential contact maps. Use when comparing Hi-C contacts between conditions.

bio-flow-cytometry-differential-analysis

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Differential abundance and state analysis for cytometry data. Compare cell populations between conditions using statistical methods. Use when testing for significant changes in cell frequencies or marker expression between groups.

bio-differential-splicing

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Detects differential alternative splicing between conditions using rMATS-turbo (BAM-based) or SUPPA2 diffSplice (TPM-based). Reports events with FDR-corrected significance and delta PSI effect sizes. Use when comparing splicing patterns between treatment groups, tissues, or disease states.

bio-differential-expression-timeseries-de

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Analyze time-series RNA-seq data using limma voom with splines, maSigPro, and ImpulseDE2. Identify genes with dynamic expression patterns. Use when analyzing time-series or longitudinal expression data.