bio-metabolomics-normalization-qc
Quality control and normalization for metabolomics data. Covers QC-based correction, batch effect removal, and data transformation methods. Use when correcting technical variation in metabolomics data before statistical analysis.
Best use case
bio-metabolomics-normalization-qc is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using bio-metabolomics-normalization-qc should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/bio-metabolomics-normalization-qc/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How bio-metabolomics-normalization-qc Compares
| Feature / Agent | bio-metabolomics-normalization-qc | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
## Version Compatibility
Reference examples tested with: xcms 4.0+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code fails with an error such as `could not find function`, `unused argument`, or
`there is no package called`, introspect the installed package and adapt the example to match the actual API rather than retrying.
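The version check described above can be scripted. The following is a minimal sketch; `check_packages` is a hypothetical helper name, not part of any package, and the package list in the usage line is an assumption based on the libraries loaded in the examples below.

```r
# Hypothetical helper: report whether each package is installed and at which version
check_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  versions <- vapply(pkgs, function(p) {
    if (installed[[p]]) as.character(packageVersion(p)) else NA_character_
  }, character(1))
  data.frame(package = pkgs, installed = installed, version = versions,
             row.names = NULL)
}

# 'stats' ships with R; the second name is deliberately not a real package
pkg_status <- check_packages(c('stats', 'notARealPackageXyz'))
print(pkg_status)
```

In practice the call would be something like `check_packages(c('tidyverse', 'sva', 'impute'))`, followed by `?function_name` for any function whose arguments look different from the examples.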
# Metabolomics Normalization and QC
## Load and Inspect Data
**Goal:** Load the feature table and sample metadata, separating QC and biological samples for downstream processing.
**Approach:** Read CSV files, partition by sample type, and assess missing value prevalence.
**"Normalize my metabolomics data and correct for batch effects"** → Apply QC-based signal correction, handle missing values, transform intensities, and assess normalization quality via RSD and PCA.
```r
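library(tidyverse)
library(pcaMethods)

# Load feature table (samples x features); row names are sample names
data <- read.csv('feature_table.csv', row.names = 1)
sample_info <- read.csv('sample_info.csv')

# Align feature-table rows with the metadata rows; later steps index samples by position
sample_names <- as.character(sample_info$sample_name)
stopifnot(all(sample_names %in% rownames(data)))
data <- data[sample_names, ]

# Separate QC samples
qc_samples <- sample_names[sample_info$sample_type == 'QC']
bio_samples <- sample_names[sample_info$sample_type != 'QC']
data_qc <- data[qc_samples, ]
data_bio <- data[bio_samples, ]

# Missing value summary (percentage of samples missing each feature)
missing_pct <- colMeans(is.na(data)) * 100
cat('Features with >50% missing:', sum(missing_pct > 50), '\n')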
```
## QC-Based Normalization (QC-RSC)
**Goal:** Remove injection-order-dependent signal drift using QC sample trends.
**Approach:** Fit a LOESS curve to QC sample intensities over injection order, then correct all samples by dividing by the predicted drift and rescaling to the QC median.
```r
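# QC-based drift correction using a LOESS fit over injection order (base R only).
# Assumes rows of `data` are in the same order as rows of `sample_info`.
qc_rsc_normalize <- function(data, sample_info, span = 0.75) {
  injection_order <- sample_info$injection_order
  is_qc <- sample_info$sample_type == 'QC'
  normalized <- data

  for (feature in colnames(data)) {
    qc_values <- data[is_qc, feature]
    qc_order <- injection_order[is_qc]

    # Too few measured QC points for a stable fit: leave the feature unchanged
    if (sum(!is.na(qc_values)) < 5) next

    # surface = 'direct' lets predict() extrapolate to samples injected before
    # the first QC or after the last QC instead of returning NA
    fit <- loess(qc_values ~ qc_order, span = span,
                 control = loess.control(surface = 'direct'))

    # Predict the drift trend for all samples
    predicted <- predict(fit, newdata = data.frame(qc_order = injection_order))
    predicted[which(predicted <= 0)] <- NA  # a non-positive trend cannot be divided out

    # Correct: divide by trend, multiply by QC median
    median_val <- median(qc_values, na.rm = TRUE)
    normalized[, feature] <- data[, feature] / predicted * median_val
  }
  return(normalized)
}

data_corrected <- qc_rsc_normalize(data, sample_info)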
```
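The function above fits a LOESS curve, while the QC-RSC name in the heading is usually associated with a cubic smoothing spline. The following is a minimal sketch of a spline-based variant for a single feature, using base R `smooth.spline`. The helper name `spline_drift_correct` and the simulated data are assumptions for illustration, not the method of any specific package.

```r
# Hypothetical helper: spline-based drift correction for one feature vector
spline_drift_correct <- function(values, injection_order, is_qc) {
  ok <- is_qc & !is.na(values)
  # smooth.spline() needs at least 4 distinct x values
  if (length(unique(injection_order[ok])) < 4) return(values)
  fit <- smooth.spline(x = injection_order[ok], y = values[ok])
  # Evaluate the QC trend at every injection (extrapolates beyond the QC range)
  trend <- predict(fit, x = injection_order)$y
  values / trend * median(values[ok])
}

# Simulated feature: 40 injections, every 4th is a QC, signal decays linearly
set.seed(1)
injection_order <- 1:40
is_qc <- injection_order %% 4 == 1
values <- 1000 * (1 - 0.01 * injection_order) * exp(rnorm(40, sd = 0.02))

corrected <- spline_drift_correct(values, injection_order, is_qc)

rsd <- function(x) sd(x) / mean(x) * 100
rsd_qc_before <- rsd(values[is_qc])
rsd_qc_after <- rsd(corrected[is_qc])
cat(sprintf('QC RSD before: %.1f%%, after: %.1f%%\n', rsd_qc_before, rsd_qc_after))
```

Because the trend is fitted to the same QC samples it is evaluated on, the QC RSD after correction is optimistic; holding out some QC injections gives a fairer estimate.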
## Total Ion Current (TIC) Normalization
**Goal:** Correct for differences in total signal intensity across samples.
**Approach:** Divide each sample by its total intensity sum, then rescale to the median total intensity.
```r
# Simple sum normalization
tic_normalize <- function(data) {
  row_sums <- rowSums(data, na.rm = TRUE)
  normalized <- data / row_sums * median(row_sums)
  return(normalized)
}

data_tic <- tic_normalize(data)
```
## Probabilistic Quotient Normalization (PQN)
**Goal:** Normalize samples while being robust to large fold changes in individual features.
**Approach:** Compute a reference spectrum from sample medians, calculate per-sample quotients, and divide each sample by its median quotient.
```r
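pqn_normalize <- function(data) {
  # Reference spectrum: per-feature median across all samples
  reference <- apply(data, 2, median, na.rm = TRUE)
  reference[which(reference == 0)] <- NA  # avoid division by zero

  # Quotients: divide each feature (column) by its reference value.
  # sweep() is required here; `data / reference` would recycle the
  # reference down the rows and pair values with the wrong features.
  quotients <- sweep(data, 2, reference, '/')

  # Normalization factor = median of quotients per sample (row)
  factors <- apply(quotients, 1, median, na.rm = TRUE)

  # Normalize: divide each sample (row) by its factor
  normalized <- sweep(data, 1, factors, '/')
  return(normalized)
}

data_pqn <- pqn_normalize(data)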
```
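A small worked example shows why PQN is described as robust to large fold changes. This is a sketch on made-up numbers: the matrix `toy` and its values are assumptions chosen so the arithmetic can be checked by hand.

```r
# Toy matrix (3 samples x 5 features):
#   s2 is s1 diluted 2x; s3 equals s1 except feature f5 is 50x higher
base <- c(f1 = 100, f2 = 200, f3 = 50, f4 = 400, f5 = 10)
toy <- rbind(s1 = base, s2 = base / 2, s3 = base)
toy['s3', 'f5'] <- 500

# PQN factors: median quotient against the per-feature median reference
reference <- apply(toy, 2, median)
quotients <- sweep(toy, 2, reference, '/')
pqn_factors <- apply(quotients, 1, median)

# TIC-style factors for comparison: total intensity relative to the median total
tic_factors <- rowSums(toy) / median(rowSums(toy))

print(rbind(pqn = pqn_factors, tic = tic_factors))
```

The PQN factors come out as 1, 0.5, and 1, so only the diluted sample is rescaled. The TIC factor for s3 is about 1.64, because the single elevated feature inflates its total and would push every other feature in that sample down.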
## Batch Correction (ComBat)
**Goal:** Remove systematic technical variation between processing batches while preserving biological effects.
**Approach:** Apply ComBat empirical Bayes batch correction on log-transformed data, using a design matrix to protect the biological variable of interest.
```r
library(sva)
# ComBat for batch correction
batch <- sample_info$batch
mod <- model.matrix(~ sample_info$group) # Keep biological effect
# Log transform first
data_log <- log2(data + 1)
# Apply ComBat
data_combat <- ComBat(dat = t(data_log), batch = batch, mod = mod)
data_combat <- t(data_combat)
```
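Batch correction with a protected biological variable only works when batch and group are not confounded, that is, when each group's effect can be estimated separately from the batch effect. The following sketch checks this before running ComBat. `is_confounded` is a hypothetical helper, and the two small metadata tables are invented for illustration.

```r
# Hypothetical helper: TRUE when batch and group cannot be separated,
# i.e. the combined design matrix is rank-deficient
is_confounded <- function(batch, group) {
  design <- model.matrix(~ factor(batch) + factor(group))
  qr(design)$rank < ncol(design)
}

# Both groups appear in both batches
balanced <- data.frame(
  batch = c(1, 1, 1, 2, 2, 2),
  group = c('control', 'control', 'case', 'control', 'case', 'case')
)

# Each batch contains exactly one group
confounded <- data.frame(
  batch = c(1, 1, 1, 2, 2, 2),
  group = c('control', 'control', 'control', 'case', 'case', 'case')
)

print(table(batch = balanced$batch, group = balanced$group))
balanced_flag <- is_confounded(balanced$batch, balanced$group)        # FALSE
confounded_flag <- is_confounded(confounded$batch, confounded$group)  # TRUE
```

A cross-tabulation with empty cells is not always a problem, but a design flagged by this check cannot be corrected without also removing the biological effect.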
## Missing Value Handling
**Goal:** Filter features with excessive missing values and impute remaining gaps for complete-case analysis.
**Approach:** Remove features missing in more than 20% of samples (optionally per group), then impute via KNN or, for left-censored data, half-minimum replacement (half of each feature's smallest observed value).
```r
# Filter features with too many missing values
filter_missing <- function(data, max_missing = 0.2, by_group = TRUE, groups = NULL) {
  if (by_group && !is.null(groups)) {
    # Keep if present in at least 80% of samples in at least one group
    keep <- sapply(colnames(data), function(f) {
      any(sapply(unique(na.omit(groups)), function(g) {
        group_data <- data[which(groups == g), f]
        mean(is.na(group_data)) <= max_missing
      }))
    })
  } else {
    keep <- colMeans(is.na(data)) <= max_missing
  }
  return(data[, keep, drop = FALSE])
}

data_filtered <- filter_missing(data, max_missing = 0.2, by_group = TRUE,
                                groups = sample_info$group)

# Impute remaining missing values
# KNN imputation: impute.knn() expects features in rows and samples in columns,
# so transpose on the way in and back on the way out
library(impute)
data_imputed <- t(impute.knn(t(as.matrix(data_filtered)), k = 5)$data)

# Or half-minimum imputation (for left-censored data)
min_impute <- function(data) {
  data_imp <- data
  for (col in colnames(data)) {
    min_val <- min(data[, col], na.rm = TRUE) / 2
    data_imp[is.na(data_imp[, col]), col] <- min_val
  }
  return(data_imp)
}
```
## Data Transformation
**Goal:** Transform and scale feature intensities to approximate normality and equalize feature variance.
**Approach:** Apply log2 transformation followed by Pareto scaling (divide by sqrt of SD) or auto-scaling (z-score).
```r
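# Log transformation
data_log <- log2(data + 1)

# Pareto scaling (mean-centered, divided by sqrt of SD)
pareto_scale <- function(data) {
  sds <- apply(data, 2, sd, na.rm = TRUE)
  # scale() divides each column by the matching element of `scale`;
  # dividing the centered matrix by a plain vector would recycle it
  # down the rows and pair values with the wrong features.
  # Zero-variance features produce NaN and should be removed beforehand.
  scaled <- scale(data, center = TRUE, scale = sqrt(sds))
  return(scaled)
}

data_pareto <- pareto_scale(data_log)

# Auto-scaling (z-score)
data_auto <- scale(data_log)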
```
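The difference between the two scalings can be checked numerically. After auto-scaling every feature has SD 1. After Pareto scaling a feature with original SD s has SD sqrt(s), so high-variance features keep some of their weight. The sketch below uses simulated data; the column names and parameters are assumptions.

```r
set.seed(42)
# Two simulated features with very different spread
x <- cbind(low_var = rnorm(50, mean = 10, sd = 1),
           high_var = rnorm(50, mean = 10, sd = 16))

sds <- apply(x, 2, sd)
pareto <- scale(x, center = TRUE, scale = sqrt(sds))
auto <- scale(x)

sd_pareto <- apply(pareto, 2, sd)  # equals sqrt(sds)
sd_auto <- apply(auto, 2, sd)      # equals 1 for every feature

print(rbind(original = sds, pareto = sd_pareto, auto = sd_auto))
```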
## QC Assessment
**Goal:** Evaluate normalization success by measuring QC sample reproducibility and visualizing sample clustering.
**Approach:** Calculate relative standard deviation (RSD) across QC samples (target <30%) and compare PCA before and after correction.
```r
# RSD in QC samples (should be <30%)
qc_rsd <- function(data, qc_samples) {
  qc_data <- data[qc_samples, ]
  rsd <- apply(qc_data, 2, function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100)
  return(rsd)
}

rsd_before <- qc_rsd(data, qc_samples)
rsd_after <- qc_rsd(data_corrected, qc_samples)
cat('Features with RSD <30% before:', sum(rsd_before < 30, na.rm = TRUE), '\n')
cat('Features with RSD <30% after:', sum(rsd_after < 30, na.rm = TRUE), '\n')

# PCA to check correction: samples as rows, features as columns.
# Features with missing values or zero variance are dropped, since
# prcomp() with scale. = TRUE cannot use them.
run_pca <- function(data) {
  mat <- as.matrix(data)
  usable <- colSums(!is.finite(mat)) == 0 & apply(mat, 2, sd) > 0
  prcomp(mat[, usable, drop = FALSE], scale. = TRUE)
}
pca_before <- run_pca(data)
pca_after <- run_pca(data_corrected)

# Plot sample scores; QC samples (red) should cluster tightly after correction
sample_col <- ifelse(rownames(data) %in% qc_samples, 'red', 'blue')
par(mfrow = c(1, 2))
plot(pca_before$x[, 1:2], col = sample_col, main = 'Before correction', pch = 16)
plot(pca_after$x[, 1:2], col = sample_col, main = 'After correction', pch = 16)
```
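The RSD target above is often applied as a filter, keeping only features that are reproducible in the QC samples. The following is a minimal sketch of that step. `rsd_filter` is a hypothetical helper, the 30% cutoff follows the target stated above, and the toy values are invented.

```r
# Hypothetical helper: keep features whose QC RSD is below a threshold
rsd_filter <- function(data, qc_rows, max_rsd = 30) {
  qc_data <- as.matrix(data[qc_rows, , drop = FALSE])
  rsd <- apply(qc_data, 2, function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100)
  keep <- !is.na(rsd) & rsd < max_rsd
  list(data = data[, keep, drop = FALSE], rsd = rsd, keep = keep)
}

# Toy table: five QC injections plus two biological samples
toy <- data.frame(
  stable = c(100, 102, 98, 101, 99, 150, 80),
  noisy  = c(100, 200, 50, 150, 20, 90, 300),
  row.names = c(paste0('QC', 1:5), 'S1', 'S2')
)

filtered <- rsd_filter(toy, paste0('QC', 1:5))
print(round(filtered$rsd, 1))
print(colnames(filtered$data))
```

Here `stable` has a QC RSD below 2% and is kept, while `noisy` is around 70% and is dropped. Filtering is done on untransformed intensities, since RSD is not meaningful on log-scaled or centered data.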
## Quality Report
**Goal:** Generate a summary report of key QC metrics for the processed dataset.
**Approach:** Compute feature count, sample count, missing percentage, median RSD, and features passing RSD threshold.
```r
generate_qc_report <- function(data, sample_info) {
  qc_samples <- sample_info$sample_name[sample_info$sample_type == 'QC']

  report <- list(
    n_features = ncol(data),
    n_samples = nrow(data),
    n_qc = length(qc_samples),
    missing_pct = mean(is.na(data)) * 100,
    qc_rsd_median = median(qc_rsd(data, qc_samples), na.rm = TRUE),
    features_rsd_lt30 = sum(qc_rsd(data, qc_samples) < 30, na.rm = TRUE)
  )

  cat('=== QC Report ===\n')
  for (name in names(report)) {
    cat(sprintf('%s: %s\n', name, round(report[[name]], 2)))
  }
  return(report)
}

report <- generate_qc_report(data_corrected, sample_info)
```
## Related Skills
- xcms-preprocessing - Generate feature table
- statistical-analysis - Downstream analysis
- differential-expression/batch-correction - Similar concepts
Related Skills
tooluniverse-metabolomics
Comprehensive metabolomics research skill for identifying metabolites, analyzing studies, and searching metabolomics databases. Integrates HMDB (220k+ metabolites), MetaboLights, Metabolomics Workbench, and PubChem. Use when asked to identify or annotate metabolites (HMDB IDs, chemical properties, pathways), retrieve metabolomics study information from MetaboLights (MTBLS*) or Metabolomics Workbench (ST*), search for studies by keywords or disease, or generate comprehensive metabolomics research reports.
tooluniverse-metabolomics-analysis
Analyze metabolomics data including metabolite identification, quantification, pathway analysis, and metabolic flux. Processes LC-MS, GC-MS, NMR data from targeted and untargeted experiments. Performs normalization, statistical analysis, pathway enrichment, metabolite-enzyme integration, and biomarker discovery. Use when analyzing metabolomics datasets, identifying differential metabolites, studying metabolic pathways, integrating with transcriptomics/proteomics, discovering metabolic biomarkers, performing flux balance analysis, or characterizing metabolic phenotypes in disease, drug response, or physiological conditions.
metabolomics-workbench-database
Access NIH Metabolomics Workbench via REST API (4,200+ studies). Query metabolites, RefMet nomenclature, MS/NMR data, m/z searches, study metadata, for metabolomics and biomarker discovery.
bio-variant-normalization
Normalize indel representation and split multiallelic variants using bcftools norm. Use when comparing variants from different callers or preparing VCF for downstream analysis.
bio-metabolomics-xcms-preprocessing
XCMS3 workflow for LC-MS/MS metabolomics preprocessing. Covers peak detection, retention time alignment, correspondence (grouping), and gap filling. Use when processing raw LC-MS data into a feature table for untargeted metabolomics.
bio-metabolomics-targeted-analysis
Targeted metabolomics analysis using MRM/SRM with standard curves. Covers absolute quantification, method validation, and quality assessment. Use when quantifying specific metabolites using calibration curves and internal standards.
bio-metabolomics-statistical-analysis
Statistical analysis for metabolomics data. Covers univariate testing, multivariate methods (PCA, PLS-DA), and biomarker discovery. Use when identifying differentially abundant metabolites or building classification models.
bio-metabolomics-pathway-mapping
Map metabolites to biological pathways using KEGG, Reactome, and MetaboAnalyst. Perform pathway enrichment and topology analysis. Use when interpreting metabolomics results in the context of biochemical pathways.
bio-metabolomics-msdial-preprocessing
MS-DIAL-based metabolomics preprocessing as alternative to XCMS. Covers peak detection, alignment, annotation, and export for downstream analysis. Use when processing MS-DIAL output files for R/Python analysis or when preferring GUI-based preprocessing.
bio-metabolomics-metabolite-annotation
Metabolite identification from m/z and retention time. Covers database matching, MS/MS spectral matching, and confidence level assignment. Use when assigning compound identities to detected features in untargeted metabolomics.
bio-metabolomics-lipidomics
Specialized lipidomics analysis for lipid identification, quantification, and pathway interpretation. Covers LC-MS lipidomics with LipidSearch, MS-DIAL, and LipidMaps annotation. Use when analyzing lipid classes, chain composition, or lipid-specific pathways.
bio-flow-cytometry-bead-normalization
Bead-based normalization for CyTOF and high-parameter flow cytometry. Covers EQ bead normalization, signal drift correction, and batch normalization. Use when correcting instrument drift in CyTOF or harmonizing data across batches.