bio-crispr-screens-batch-correction
Batch effect correction for CRISPR screens. Covers normalization across batches, technical replicate handling, and batch-aware analysis. Use when combining screens from multiple batches or correcting systematic technical variation.
Best use case
bio-crispr-screens-batch-correction is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using bio-crispr-screens-batch-correction should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/bio-crispr-screens-batch-correction/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How bio-crispr-screens-batch-correction Compares
| Feature / Agent | bio-crispr-screens-batch-correction | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
## Version Compatibility
Reference examples tested with: DESeq2 1.42+, MAGeCK 0.5+, matplotlib 3.8+, numpy 1.26+, pandas 2.2+, scikit-learn 1.4+, scipy 1.12+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
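The version check described above can also be scripted. This is a minimal sketch, assuming the PyPI distribution names listed in the code; the helper name `installed_versions` is hypothetical. It reports `None` for anything that is not installed instead of raising.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names are assumptions; adjust them to your environment.
report = installed_versions(['numpy', 'pandas', 'scipy', 'scikit-learn', 'matplotlib'])
for name, ver in report.items():
    print(f'{name}: {ver or "not installed"}')
```

Compare the printed versions against the list above before adapting any example.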
# Batch Correction
**"Correct batch effects in my CRISPR screens"** → Normalize and harmonize sgRNA count data across screen batches to remove systematic technical variation while preserving biological signal.
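- Python: `numpy`/`pandas` for normalization, `scipy`/`sklearn` for replicate and PCA checks, pyComBat for batch correction
- CLI: `mageck mle` with a design matrix that encodes batch (`mageck test` takes treatment and control labels rather than a design matrix)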
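For the CLI route, one way to make the analysis batch-aware is to encode batch as an extra column in the design matrix that `mageck mle` reads. The sketch below only builds that file with pandas. The sample labels, the `treatment` and `batch2` column names, and the output filename are hypothetical, and whether a batch covariate suits your design is an assumption to check against the MAGeCK documentation.

```python
import pandas as pd

# Hypothetical layout: six samples, two batches, controls are sample_3 and sample_6
samples = ['sample_1', 'sample_2', 'sample_3', 'sample_4', 'sample_5', 'sample_6']
is_treatment = [1, 1, 0, 1, 1, 0]
batches = [1, 1, 1, 2, 2, 2]

design = pd.DataFrame({
    'Samples': samples,                         # must match count-table column names
    'baseline': 1,                              # intercept column, always 1
    'treatment': is_treatment,                  # condition effect of interest
    'batch2': [int(b == 2) for b in batches],   # indicator for the second batch
})
design.to_csv('design_matrix.txt', sep='\t', index=False)
print(design)
```

The file would then be passed along the lines of `mageck mle -k counts.txt -d design_matrix.txt -n out`. This only works when every batch contains both treated and control samples; if batch and treatment are fully confounded, the two columns cannot be separated.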
## Median Normalization
**Goal:** Remove systematic library-size differences between batches.
**Approach:** Scale each sample within a batch so that sample medians match a global median, correcting for sequencing depth variation.
```python
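import numpy as np
import pandas as pd
from scipy import stats  # used later for replicate correlations

def median_normalize(counts_df, batch_column='batch'):
    '''Scale every sample, batch by batch, to one global median.'''
    guide_columns = [c for c in counts_df.columns if c not in [batch_column, 'gene', 'guide']]
    # Work in float so scaled values are not written into integer columns
    normalized = counts_df.astype({c: float for c in guide_columns})
    # Median of each sample column within each batch (batches x samples)
    batch_medians = normalized.groupby(batch_column)[guide_columns].median()
    # One shared target across all batches, so batches end up on the same scale
    global_median = np.nanmedian(batch_medians.to_numpy())
    for batch in batch_medians.index:
        batch_mask = normalized[batch_column] == batch
        # A zero median would give an infinite factor; leave such samples unscaled
        scale_factors = (global_median / batch_medians.loc[batch].replace(0, np.nan)).fillna(1.0)
        normalized.loc[batch_mask, guide_columns] = normalized.loc[batch_mask, guide_columns] * scale_factors
    return normalized

counts_df = pd.read_csv('screen_counts.csv')
normalized = median_normalize(counts_df, 'batch')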
```
## Size Factor Normalization
```python
def size_factor_normalize(counts_df, reference='geometric_mean'):
    '''DESeq2-style size factor normalization (median of ratios).'''
    guide_cols = [c for c in counts_df.columns if c.startswith('sample_')]
    counts = counts_df[guide_cols].to_numpy(dtype=float)
    # Mask zeros so they do not enter the logs or the ratios
    counts_nonzero = np.where(counts == 0, np.nan, counts)
    if reference == 'geometric_mean':
        reference_values = np.exp(np.nanmean(np.log(counts_nonzero), axis=1))
    else:
        # Arithmetic mean of the nonzero counts for each guide
        reference_values = np.nanmean(counts_nonzero, axis=1)
    ratios = counts_nonzero / reference_values[:, np.newaxis]
    # Size factor = median ratio of each sample to the per-guide reference
    size_factors = np.nanmedian(ratios, axis=0)
    normalized_df = counts_df.copy()
    normalized_df[guide_cols] = counts / size_factors
    return normalized_df, size_factors

normalized, size_factors = size_factor_normalize(counts_df)
print('Size factors:', size_factors)
```
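To see what the size factors capture, it helps to run the median-of-ratios idea on a toy matrix where the answer is known. This is a minimal sketch, not the function above: `median_of_ratios` is a hypothetical helper, and unlike the function above it keeps only guides with nonzero counts in every sample, which is an assumption about how zeros should be handled.

```python
import numpy as np

def median_of_ratios(counts):
    """Size factors for a guides x samples array via the median-of-ratios idea."""
    counts = np.asarray(counts, dtype=float)
    # Use only guides with nonzero counts in every sample, so logs are finite
    usable = (counts > 0).all(axis=1)
    log_counts = np.log(counts[usable])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Median log-ratio per sample, back-transformed to a multiplicative factor
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy data: the second sample is the first sequenced at exactly twice the depth
toy = np.array([[10, 20], [50, 100], [200, 400], [0, 3]])
factors = median_of_ratios(toy)
print(factors)
print(toy / factors)
```

Because the second sample has twice the depth, its size factor is twice the first sample's, and dividing by the factors puts the nonzero guides on the same scale in both samples.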
## Quantile Normalization
```python
def quantile_normalize(counts_df, guide_cols=None):
    '''Quantile normalization across samples.'''
    if guide_cols is None:
        guide_cols = [c for c in counts_df.columns if c.startswith('sample_')]
    data = counts_df[guide_cols].values.copy()
    # Mean of the k-th smallest value across samples defines the target distribution
    sorted_data = np.sort(data, axis=0)
    mean_values = sorted_data.mean(axis=1)
    # Double argsort gives each value's 0-based rank within its sample (ties broken by order)
    ranks = np.argsort(np.argsort(data, axis=0), axis=0)
    normalized = mean_values[ranks]
    result = counts_df.copy()
    result[guide_cols] = normalized
    return result

qn_counts = quantile_normalize(counts_df)
```
## Control-Based Normalization
```python
def normalize_to_controls(counts_df, control_genes, method='median'):
    '''Normalize using non-targeting or negative control guides.'''
    guide_cols = [c for c in counts_df.columns if c.startswith('sample_')]
    is_control = counts_df['gene'].isin(control_genes)
    if not is_control.any():
        raise ValueError('No control guides found in counts_df')
    control_data = counts_df.loc[is_control, guide_cols]
    if method == 'median':
        control_values = control_data.median(axis=0)
    elif method == 'mean':
        control_values = control_data.mean(axis=0)
    elif method == 'sum':
        control_values = control_data.sum(axis=0)
    else:
        raise ValueError(f'Unknown method: {method!r}')
    # Scale each sample so its control summary matches the median across samples
    reference = control_values.median()
    scale_factors = reference / control_values
    normalized = counts_df.copy()
    normalized[guide_cols] = counts_df[guide_cols] * scale_factors
    return normalized, scale_factors

# Control guides are selected by gene labels that start with 'NonTargeting'
nontargeting = counts_df[counts_df['gene'].str.startswith('NonTargeting')]['gene'].unique()
normalized, factors = normalize_to_controls(counts_df, nontargeting)
```
## Batch Effect Removal with ComBat
**Goal:** Remove batch effects using empirical Bayes adjustment while preserving biological signal.
**Approach:** Log-transform counts, apply pyCombat with a batch vector, and back-transform to count space.
```python
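def combat_correction(counts_df, batch_vector, guide_cols=None):
    '''ComBat batch correction for count data (requires the `combat` package).'''
    from combat.pycombat import pycombat
    if guide_cols is None:
        guide_cols = [c for c in counts_df.columns if c.startswith('sample_')]
    if len(batch_vector) != len(guide_cols):
        raise ValueError('batch_vector needs one entry per sample column')
    # pycombat takes a DataFrame with features (guides) in rows and samples in columns
    log_data = np.log2(counts_df[guide_cols].astype(float) + 1)
    corrected = pycombat(log_data, list(batch_vector))
    # Back-transform to count space and clip negatives introduced by the adjustment
    corrected_counts = np.maximum(np.power(2, np.asarray(corrected)) - 1, 0)
    result = counts_df.copy()
    result[guide_cols] = corrected_counts
    return result

# One batch label per sample column, in column order
batches = [1, 1, 1, 2, 2, 2]
corrected = combat_correction(counts_df, batches)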
```
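ComBat needs an extra package, so for intuition it can help to look at the simplest location-only adjustment. The sketch below is not ComBat: it has no empirical Bayes shrinkage and no scale adjustment. It only centers each guide's log2 counts within each batch and restores the guide's overall mean. The function name `center_batches` and the toy values are hypothetical. Like any batch adjustment, it assumes batch is not confounded with the condition of interest.

```python
import numpy as np

def center_batches(log_values, batch_vector):
    """Remove per-guide batch means from a guides x samples log2 matrix."""
    log_values = np.asarray(log_values, dtype=float)
    batches = np.asarray(batch_vector)
    if batches.shape[0] != log_values.shape[1]:
        raise ValueError('batch_vector needs one entry per sample column')
    grand_mean = log_values.mean(axis=1, keepdims=True)
    adjusted = log_values.copy()
    for batch in np.unique(batches):
        cols = batches == batch
        # Shift this batch so each guide's batch mean equals its overall mean
        batch_mean = log_values[:, cols].mean(axis=1, keepdims=True)
        adjusted[:, cols] = log_values[:, cols] - batch_mean + grand_mean
    return adjusted

# Toy data: batch 2 (last two samples) sits roughly one log2 unit above batch 1
toy_counts = np.array([[100, 120, 200, 240], [30, 34, 60, 68]])
adjusted = center_batches(np.log2(toy_counts + 1), [1, 1, 2, 2])
print(adjusted.round(2))
```

After centering, each guide has the same mean in both batches while the differences between samples inside a batch are untouched.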
## Batch-Aware Log-Fold Change
```python
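def batch_aware_lfc(counts_df, treatment_cols, control_cols, batch_vector):
    '''Calculate LFC accounting for batch structure.

    batch_vector gives the batch of each column in treatment_cols followed by
    control_cols, in that order.
    '''
    all_cols = list(treatment_cols) + list(control_cols)
    if len(batch_vector) != len(all_cols):
        raise ValueError('batch_vector needs one entry per treatment and control column')
    batch_of = dict(zip(all_cols, batch_vector))
    lfc_by_batch = []
    for batch in np.unique(batch_vector):
        batch_treat = [c for c in treatment_cols if batch_of[c] == batch]
        batch_ctrl = [c for c in control_cols if batch_of[c] == batch]
        # Skip batches that lack either arm; they cannot give a within-batch contrast
        if len(batch_treat) == 0 or len(batch_ctrl) == 0:
            continue
        treat_mean = counts_df[batch_treat].mean(axis=1)
        ctrl_mean = counts_df[batch_ctrl].mean(axis=1)
        lfc_by_batch.append(np.log2((treat_mean + 1) / (ctrl_mean + 1)))
    if not lfc_by_batch:
        raise ValueError('No batch contains both treatment and control samples')
    per_batch = pd.concat(lfc_by_batch, axis=1)
    # Mean and variance of the per-batch LFCs (variance is NaN with a single batch)
    return per_batch.mean(axis=1), per_batch.var(axis=1)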
```
## Replicate Correlation Check
```python
def check_replicate_correlation(counts_df, sample_cols, replicate_groups):
    '''Check correlation between replicates.'''
    correlations = []
    for group, replicates in replicate_groups.items():
        if len(replicates) < 2:
            continue
        for i in range(len(replicates)):
            for j in range(i+1, len(replicates)):
                r1, r2 = replicates[i], replicates[j]
                if r1 in sample_cols and r2 in sample_cols:
                    log_r1 = np.log2(counts_df[r1] + 1)
                    log_r2 = np.log2(counts_df[r2] + 1)
                    corr, pval = stats.pearsonr(log_r1, log_r2)
                    correlations.append({
                        'group': group,
                        'rep1': r1,
                        'rep2': r2,
                        'pearson_r': corr,
                        'pvalue': pval
                    })
    return pd.DataFrame(correlations)

replicate_groups = {
    'treatment_batch1': ['sample_1', 'sample_2'],
    'treatment_batch2': ['sample_4', 'sample_5'],
    'control_batch1': ['sample_3'],
    'control_batch2': ['sample_6']
}
corr_df = check_replicate_correlation(counts_df, counts_df.columns[3:], replicate_groups)
print(corr_df)
```
## Batch QC Metrics
**Goal:** Quantify batch effect magnitude to determine whether correction is needed.
**Approach:** Run PCA on log-transformed counts, compute between-batch vs within-batch variance ratio, and assess whether batch structure dominates the first principal components.
```python
def batch_qc_metrics(counts_df, batch_vector, sample_cols):
    '''Calculate batch-related QC metrics.'''
    from sklearn.decomposition import PCA
    from scipy.spatial.distance import pdist
    # Samples in rows, guides in columns
    log_counts = np.log2(counts_df[sample_cols].values.T + 1)
    pca = PCA(n_components=min(5, *log_counts.shape))
    pcs = pca.fit_transform(log_counts)
    batch_labels = np.array(batch_vector)
    unique_batches = np.unique(batch_labels)
    if len(unique_batches) > 1:
        batch_means = np.array([pcs[batch_labels == b].mean(axis=0) for b in unique_batches])
        # Mean pairwise distance between batch centroids in PC space
        batch_separation = np.mean(pdist(batch_means))
        # Sum variances over components so both terms are on the same scale
        within_batch_var = np.mean([pcs[batch_labels == b].var(axis=0).sum() for b in unique_batches])
        between_batch_var = batch_means.var(axis=0).sum()
        batch_effect_ratio = between_batch_var / (within_batch_var + 1e-10)
    else:
        batch_separation = 0
        batch_effect_ratio = 0
    return {
        'batch_separation': batch_separation,
        'batch_effect_ratio': batch_effect_ratio,
        'pca_variance_explained': pca.explained_variance_ratio_,
        'n_batches': len(unique_batches)
    }

# Sample columns follow the 'sample_' naming used in the earlier examples
sample_cols = [c for c in counts_df.columns if c.startswith('sample_')]
qc = batch_qc_metrics(counts_df, [1,1,1,2,2,2], sample_cols)
print(f"Batch effect ratio: {qc['batch_effect_ratio']:.2f}")
```
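The metrics above summarize PC space as a whole. To check whether batch dominates individual components, one option is the share of each component's variance explained by the batch labels (between-batch sum of squares over total sum of squares). This is a minimal sketch that uses NumPy's SVD in place of scikit-learn's PCA; `pc_batch_r2` and the synthetic data are hypothetical.

```python
import numpy as np

def pc_batch_r2(log_counts, batch_vector, n_components=3):
    """Fraction of each principal component's variance explained by batch."""
    X = np.asarray(log_counts, dtype=float)          # samples x guides
    batches = np.asarray(batch_vector)
    X = X - X.mean(axis=0)                           # center each guide
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    n_components = min(n_components, len(S))
    scores = U[:, :n_components] * S[:n_components]  # PC scores per sample
    r2 = []
    for k in range(n_components):
        pc = scores[:, k]
        total = ((pc - pc.mean()) ** 2).sum()
        between = sum(
            (batches == b).sum() * (pc[batches == b].mean() - pc.mean()) ** 2
            for b in np.unique(batches)
        )
        r2.append(between / total if total > 0 else 0.0)
    return np.array(r2)

# Synthetic check: 6 samples x 200 guides with a uniform shift added to batch 2
rng = np.random.default_rng(0)
log_counts = rng.normal(8.0, 0.1, size=(6, 200))
batch_vector = [1, 1, 1, 2, 2, 2]
log_counts[3:] += 2.0
r2 = pc_batch_r2(log_counts, batch_vector)
print(r2.round(3))
```

A value near 1 for an early component means that component mostly tracks batch membership; values near 0 mean batch explains little of it.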
## Visualization
```python
import matplotlib.pyplot as plt

def plot_batch_effect(counts_df, batch_vector, sample_cols, output_file):
    '''Visualize batch effects with PCA.'''
    from sklearn.decomposition import PCA
    # Samples in rows, guides in columns
    log_counts = np.log2(counts_df[sample_cols].values.T + 1)
    pca = PCA(n_components=2)
    pcs = pca.fit_transform(log_counts)
    fig, ax = plt.subplots(figsize=(8, 6))
    # One scatter series per batch so the legend shows batch membership
    for batch in np.unique(batch_vector):
        mask = np.array(batch_vector) == batch
        ax.scatter(pcs[mask, 0], pcs[mask, 1], label=f'Batch {batch}', s=100)
    ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
    ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
    ax.legend()
    ax.set_title('PCA - Batch Effects')
    plt.tight_layout()
    plt.savefig(output_file, dpi=150)
    plt.close()

# Sample columns follow the 'sample_' naming used in the earlier examples
sample_cols = [c for c in counts_df.columns if c.startswith('sample_')]
plot_batch_effect(counts_df, [1,1,1,2,2,2], sample_cols, 'batch_pca.png')
```
## Related Skills
- mageck-analysis - Batch-aware MAGeCK analysis
- screen-qc - Quality control before correction
- hit-calling - Analysis after batch correction
- library-design - Control guide design
Related Skills
tooluniverse-crispr-screen-analysis
Comprehensive CRISPR screen analysis for functional genomics. Analyze pooled or arrayed CRISPR screens (knockout, activation, interference) to identify essential genes, synthetic lethal interactions, and drug targets. Perform sgRNA count processing, gene-level scoring (MAGeCK, BAGEL), quality control, pathway enrichment, and drug target prioritization. Use for CRISPR screen analysis, gene essentiality studies, synthetic lethality detection, functional genomics, drug target validation, or identifying genetic vulnerabilities.
single-cell-clustering-and-batch-correction-with-omicverse
Guide Claude through omicverse's single-cell clustering workflow, covering preprocessing, QC, multimethod clustering, topic modeling, cNMF, and cross-batch integration as demonstrated in t_cluster.ipynb and t_single_batch.ipynb.
bulk-rna-seq-batch-correction-with-combat
Use omicverse's pyComBat wrapper to remove batch effects from merged bulk RNA-seq or microarray cohorts, export corrected matrices, and benchmark pre/post correction visualisations.
bio-single-cell-batch-integration
Integrate multiple scRNA-seq samples/batches using Harmony, scVI, Seurat anchors, and fastMNN. Remove technical variation while preserving biological differences. Use when integrating multiple scRNA-seq batches or datasets.
bio-differential-expression-batch-correction
Remove batch effects from RNA-seq data using ComBat, ComBat-Seq, limma removeBatchEffect, and SVA for unknown batch variables. Use when correcting batch effects in expression data.
bio-crispr-screens-screen-qc
Quality control for pooled CRISPR screens. Covers library representation, read distribution, replicate correlation, and essential gene recovery. Use when assessing screen quality before hit calling or diagnosing poor screen performance.
bio-crispr-screens-mageck-analysis
MAGeCK (Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout) for pooled CRISPR screen analysis. Covers count normalization, gene ranking, and pathway analysis. Use when identifying essential genes, drug targets, or resistance mechanisms from dropout or enrichment screens.
bio-crispr-screens-library-design
CRISPR library design for genetic screens. Covers sgRNA selection, library composition, control design, and oligo ordering. Use when designing custom sgRNA libraries for knockout, activation, or interference screens.
bio-crispr-screens-jacks-analysis
JACKS (Joint Analysis of CRISPR/Cas9 Knockout Screens) for modeling sgRNA efficacy and gene essentiality. Use when analyzing multiple CRISPR screens simultaneously or when accounting for variable sgRNA efficiency across experiments.
bio-crispr-screens-hit-calling
Statistical methods for calling hits in CRISPR screens. Covers MAGeCK, BAGEL2, drugZ, and custom approaches for identifying essential and resistance genes. Use when identifying significant genes from screen count data after QC passes.
bio-crispr-screens-crispresso-editing
CRISPResso2 for analyzing CRISPR gene editing outcomes. Quantifies indels, HDR efficiency, and generates comprehensive editing reports. Use when analyzing amplicon sequencing data from CRISPR editing experiments to assess editing efficiency.
bio-crispr-screens-base-editing-analysis
Analyzes base editing and prime editing outcomes including editing efficiency, bystander edits, and indel frequencies. Use when quantifying CRISPR base editor results, comparing ABE vs CBE efficiency, or assessing prime editing fidelity.