methylation-variability-analysis

This skill provides a streamlined workflow for methylation variability and epigenetic heterogeneity analysis of whole-genome bisulfite sequencing (WGBS) data. It is designed for researchers who want to quantify CpG-level variability across biological samples or conditions, identify highly variable CpGs (HVCs), and explore epigenetic heterogeneity.

by diegosouzapw

Best use case

methylation-variability-analysis is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using methylation-variability-analysis should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/methylation-variability-analysis/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/methylation-variability-analysis/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/methylation-variability-analysis/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

SKILL.md Source

# SKILL: Methylation Variability & Heterogeneity Analysis

## Overview

Main steps include:

- Refer to the **Inputs & Outputs** section to check the available inputs and set up the output structure.
- **Always prompt the user** for the genome assembly used (e.g., hg38).
- **Always prompt the user** for which columns in the BED files hold the methylation level (fraction or percent), the coverage, and the strand.
- Build a multi-sample CpG methylation matrix from the WGBS coverage files.
- Compute **between-sample variability** at the CpG level (variance, MAD, CV).
- Select highly variable CpGs (HVCs) and visualize them (heatmap, variance distribution, mean vs variance scatter).

---

## When to use this skill

Use this methylKit-based variability pipeline when you want to:

- Quantify **between-sample variability** at CpG level (e.g., across replicates, cell types, conditions).
- Identify **highly variable CpGs (HVCs)** as candidate epigenetically heterogeneous loci.
- Explore **epigenetic heterogeneity** between groups (e.g., GM12878 vs K562, disease vs control).

---

## Inputs & Outputs

### Inputs

`<sample1>.bed`
`<sample2>.bed`
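
Because column layouts differ between pipelines, it can help to look at the first few lines of an input file before asking the user which columns hold strand, coverage, and methylation level. The following is a minimal base R sketch using two made-up lines in an 11-column, tab-separated layout; the values, the layout, and the names `example.lines`, `peek`, `strand.candidates`, and `numeric.candidates` are illustrative assumptions, not part of the workflow.

```r
# Two made-up lines in an 11-column, tab-separated layout (illustrative only).
example.lines <- c(
  "chr1\t10468\t10469\t.\t12\t+\t10468\t10469\t255,0,0\t12\t92",
  "chr1\t10470\t10471\t.\t15\t-\t10470\t10471\t0,255,0\t15\t7"
)

# Read without a header so columns get positional names V1, V2, ...
peek <- read.table(text = example.lines, sep = "\t", header = FALSE,
                   stringsAsFactors = FALSE)

# Candidate strand columns: every value is one of "+", "-", "." or "*"
strand.candidates <- which(vapply(
  peek,
  function(x) all(x %in% c("+", "-", ".", "*")),
  logical(1)
))

# Candidate numeric columns (positions, coverage, or methylation level)
numeric.candidates <- which(vapply(peek, is.numeric, logical(1)))

print(peek)
print(strand.candidates)
print(numeric.candidates)
```

The candidate lists only narrow the choice; the user still has to confirm which numeric column is coverage and which is the methylation level, and whether that level is a fraction or a percentage.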

### Outputs
```bash
methylation_variability/
  stats/
    top_variable_CpGs.tsv
    CpG_variability_stats.tsv
  plots/
    heatmap_top_variable_CpGs.pdf
    distribution_CpG_variance.pdf
    mean_vs_variance_scatter.pdf
  temp/
```

---

## Decision Tree

### Step 1: Prepare the sample metadata
```r
library(methylKit)
# Input files (see Inputs above); replace with the actual paths
file.list <- list(
  "sample1.bed",
  "sample2.bed",
  "sample3.bed"
)

sample.id <- list("S1", "S2", "S3")
treatment <- c(0, 1, 1)  # e.g. 0 = control, 1 = treated

# Read methylation data
# Note: methRead() assumes a header line by default; pass header = FALSE for headerless files
myobj <- methRead(
  location = file.list,
  sample.id = sample.id,
  assembly  = "hg38", # provided by user
  treatment = treatment,
  context   = "CpG",
  pipeline = list(
    fraction = FALSE,  # FALSE if the methylation column is a percentage (0-100); TRUE if it is a fraction (0-1)
    chr.col = 1,
    start.col = 2,
    end.col = 3,
    strand.col = 6, # provided by user
    coverage.col = 10, # provided by user
    freqC.col = 11 # provided by user
  )
)

# Optional filtering: remove CpGs with fewer than 10 reads and CpGs above the
# 99.9th percentile of coverage in each sample
filtered.myobj <- filterByCoverage(
  myobj,
  lo.count = 10, lo.perc = NULL,
  hi.count = NULL, hi.perc = 99.9
)

# Unite CpGs across samples (keep only sites covered in all samples);
# destrand = TRUE merges reads from both strands of each CpG
meth.united <- unite(filtered.myobj, destrand = TRUE)
```
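
To make the coverage filter concrete, here is a minimal base R sketch of the same idea on a hypothetical coverage vector: drop CpGs below a minimum read count and above a high percentile of coverage. It only illustrates the logic and is not methylKit's implementation; the vector, the 90th-percentile cutoff (chosen because the toy vector is tiny), and the variable names are assumptions.

```r
# Hypothetical per-CpG read coverage for one sample (illustrative values).
coverage <- c(3, 12, 15, 18, 22, 25, 30, 11, 14, 950)

lo.count <- 10  # minimum number of reads per CpG
hi.perc  <- 90  # upper percentile cutoff (99.9 is more typical on real data)

# Coverage value at the chosen percentile
hi.cutoff <- quantile(coverage, probs = hi.perc / 100)

# Keep CpGs with enough reads that are not in the extreme upper tail
keep <- coverage >= lo.count & coverage <= hi.cutoff
print(coverage[keep])
```

Here the CpG with 3 reads fails the lower bound and the CpG with 950 reads falls above the percentile cutoff, so both are removed.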
### Step 2: Statistical analysis

```r
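library(matrixStats)  # provides rowVars() and rowMads()

# Per-sample methylation fraction (0-1) = methylated C counts / coverage
d <- getData(meth.united)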
numCs.cols <- grep("numCs", colnames(d), value = TRUE)
cov.cols   <- grep("coverage", colnames(d), value = TRUE)
pmat01 <- d[, numCs.cols] / d[, cov.cols]
pmat01 <- as.matrix(data.frame(pmat01))

var.cpg <- rowVars(pmat01, na.rm = TRUE) # Variance across samples
mad.cpg <- rowMads(pmat01, na.rm = TRUE) # Median absolute deviation (MAD)

# Coefficient of variation (CV = sd / mean)
mean.cpg <- rowMeans(pmat01, na.rm = TRUE)
sd.cpg <- sqrt(var.cpg)
cv.cpg <- sd.cpg / (mean.cpg + 1e-6)  # add small constant to avoid division by zero

# Assemble statistics table
var.stats <- data.frame(
  chr = d$chr,
  start = d$start,
  end = d$end,
  mean = mean.cpg,
  variance = var.cpg,
  MAD = mad.cpg,
  CV = cv.cpg,
  stringsAsFactors = FALSE
)

var.stats <- var.stats[order(-var.stats$variance), ] # Sort by variance (descending)

# Save full table
write.table(
  var.stats,
  file = "CpG_variability_stats.tsv",
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)
```
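
As a small worked example of the three measures, the sketch below computes variance, MAD, and CV on a made-up 3 x 4 matrix of methylation fractions. It uses base R (`apply` with `var` and `mad`) in place of `rowVars` and `rowMads` so it runs without extra packages; the matrix `toy` and its values are purely illustrative.

```r
# Toy methylation fractions: 3 CpGs (rows) x 4 samples (columns); values are made up.
toy <- rbind(
  stable   = c(0.90, 0.90, 0.90, 0.90),
  moderate = c(0.40, 0.50, 0.60, 0.50),
  variable = c(0.10, 0.90, 0.20, 0.80)
)

toy.var  <- apply(toy, 1, var)                 # sample variance (n - 1 denominator)
toy.mad  <- apply(toy, 1, mad)                 # MAD, scaled by 1.4826 by default
toy.mean <- rowMeans(toy)
toy.cv   <- sqrt(toy.var) / (toy.mean + 1e-6)  # same small offset as in the step above

toy.stats <- cbind(mean = toy.mean, variance = toy.var, MAD = toy.mad, CV = toy.cv)
print(round(toy.stats, 4))
```

The `moderate` and `variable` rows share the same mean (0.5) but differ widely in variance, which is the kind of difference a mean-based comparison alone would miss.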

### Step 3: Highly variable CpG selection

```r
topN <- 1000
top.idx <- head(order(-var.cpg), topN)

pmat.top <- pmat01[top.idx, , drop = FALSE]

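# Save top-variable CpGs table
# var.stats is already sorted by descending variance (Step 2), so its first
# topN rows are the same CpGs as top.idx. (pmat.top has no row names, so
# matching on rownames(pmat.top) would select nothing.)
write.table(
  head(var.stats, topN),
  file = "top_variable_CpGs.tsv",
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)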
```
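
A fixed `topN` is one way to define HVCs. Another option, shown here only as an illustrative alternative and not as part of the workflow above, is to keep CpGs above a variance quantile so that the cutoff scales with the number of CpGs. The simulated vector `toy.var.cpg` (standing in for `var.cpg`) and the 99th-percentile threshold are assumptions.

```r
set.seed(1)
# Hypothetical per-CpG variances (simulated; stands in for var.cpg).
toy.var.cpg <- rbeta(10000, shape1 = 0.5, shape2 = 20)

# Option 1: a fixed number of CpGs, as in the step above
topN <- 100
top.by.rank <- head(order(toy.var.cpg, decreasing = TRUE), topN)

# Option 2: every CpG above the 99th percentile of variance
cutoff <- quantile(toy.var.cpg, probs = 0.99, na.rm = TRUE)
top.by.quantile <- which(toy.var.cpg > cutoff)

print(length(top.by.rank))
print(length(top.by.quantile))
```

With a quantile cutoff the number of selected CpGs grows with the dataset, whereas a fixed `topN` keeps downstream plots the same size regardless of how many CpGs pass filtering.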

### Step 4: Visualization

```r
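library(ComplexHeatmap)
library(ggplot2)  # geom_hex() below also needs the hexbin package installed

# Group labels follow the GM12878 vs K562 example; adjust to your own groups
group.factor <- factor(ifelse(treatment == 0, "GM12878", "K562"))
ha <- HeatmapAnnotation(Group = group.factor)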

Heatmap(
  pmat.top,
  name = "methylation",
  show_row_names = FALSE,
  show_column_names = TRUE,
  top_annotation = ha,
  cluster_rows = TRUE,
  cluster_columns = TRUE
)

# Distribution of the CpG variability
var.df <- data.frame(
  variance = var.cpg,
  log10_variance = log10(var.cpg + 1e-8)
)

ggplot(var.df, aes(x = log10_variance)) +
  geom_histogram(bins = 50) +
  theme_minimal() +
  labs(
    title = "CpG-wise methylation variance (log10 scale)",
    x = "log10(variance + 1e-8)",
    y = "Count of CpGs"
  )

# Mean vs variance scatter plot (the mean column in var.stats is named "mean")
ggplot(var.stats, aes(x = mean, y = variance)) +
    geom_hex(bins = 50) +
    scale_fill_viridis_c(trans = "log10") +
    theme_minimal() +
    labs(
      title = "Mean Methylation vs Variance",
      x = "Mean Methylation",
      y = "Variance",
      fill = "Count (log10)"
    ) +
    theme(
      plot.title = element_text(hjust = 0.5, size = 14, face = "bold")
    )
```
---

## Recommended Extensions

- You can change `lo.count`, `hi.perc`, and `topN` depending on coverage and dataset size.
- For group-wise differential variability (e.g., GM12878 vs K562), apply variance, Bartlett, or Levene tests per CpG using `pmat01` and `treatment`.
- Add region-level annotation (promoters, gene bodies, CpG islands) using `GenomicRanges` and TxDb annotations, then compute variability at region level by aggregating CpG variability.
- Combine this variability pipeline with DMR analysis from methylKit to simultaneously look at mean shifts and heterogeneity.
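
The per-CpG differential variability test mentioned above could be sketched as follows, using `bartlett.test` and `p.adjust` from base R on simulated data. Everything here is an assumption for illustration: the matrix `toy` stands in for `pmat01`, `group` stands in for `treatment`, and the group sizes and effect size are made up. Bartlett's test is sensitive to non-normality and has little power with few replicates, so treat the result as a screening step rather than a definitive call.

```r
set.seed(42)
# Simulated methylation fractions: 200 CpGs x 10 samples (made-up data).
group <- factor(rep(c("A", "B"), each = 5))
toy <- matrix(rnorm(200 * 10, mean = 0.5, sd = 0.05), nrow = 200)

# Make the first 10 CpGs much more variable in group B
toy[1:10, group == "B"] <- rnorm(10 * 5, mean = 0.5, sd = 0.25)

# Keep values inside the valid range for a methylation fraction
toy[toy < 0] <- 0
toy[toy > 1] <- 1

# Per-CpG Bartlett test for equal variance between the two groups
p.values <- apply(toy, 1, function(x) bartlett.test(x, g = group)$p.value)

# Benjamini-Hochberg correction across CpGs
p.adj <- p.adjust(p.values, method = "BH")

diff.var <- data.frame(cpg = seq_len(nrow(toy)), p.value = p.values, p.adj = p.adj)
print(head(diff.var[order(diff.var$p.adj), ]))
```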
