tooluniverse-epigenomics

Production-ready genomics and epigenomics data processing for BixBench questions. Handles methylation array analysis (CpG filtering, differential methylation, age-related CpG detection, chromosome-level density), ChIP-seq peak analysis (peak calling, motif enrichment, coverage stats), ATAC-seq chromatin accessibility, multi-omics integration (expression + methylation correlation), and genome-wide statistics. Pure Python computation (pandas, scipy, numpy, pysam, statsmodels) plus ToolUniverse annotation tools (Ensembl, ENCODE, SCREEN, JASPAR, ReMap, RegulomeDB, ChIPAtlas). Supports BED, BigWig, methylation beta-value matrices, Illumina manifest files, and multi-sample clinical data. Use when processing methylation data, ChIP-seq peaks, ATAC-seq signals, or answering questions about CpG sites, differential methylation, chromatin accessibility, histone marks, or epigenomic statistics.

42 stars

by Zaoqu-Liu


Best use case

tooluniverse-epigenomics is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using tooluniverse-epigenomics should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/tooluniverse-epigenomics/SKILL.md --create-dirs "https://raw.githubusercontent.com/Zaoqu-Liu/ScienceClaw/main/skills/tooluniverse-epigenomics/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/tooluniverse-epigenomics/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How tooluniverse-epigenomics Compares

Feature / Agent	tooluniverse-epigenomics	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Genomics and Epigenomics Data Processing

Production-ready computational skill for processing and analyzing epigenomics data. Combines local Python computation (pandas, scipy, numpy, pysam, statsmodels) with ToolUniverse annotation tools for regulatory context. Designed to solve BixBench-style questions about methylation, ChIP-seq, ATAC-seq, and multi-omics integration.

## When to Use This Skill

**Triggers**:
- User provides methylation data (beta-value matrices, Illumina arrays) and asks about CpG sites
- Questions about differential methylation analysis
- Age-related CpG detection or epigenetic clock questions
- Chromosome-level methylation density or statistics
- ChIP-seq peak files (BED format) with analysis questions
- ATAC-seq chromatin accessibility questions
- Multi-omics integration (expression + methylation, expression + ChIP-seq)
- Genome-wide epigenomic statistics
- Questions mentioning "methylation", "CpG", "ChIP-seq", "ATAC-seq", "histone", "chromatin", "epigenetic"
- Questions about missing data across clinical/genomic/epigenomic modalities
- Regulatory element annotation for processed epigenomic data

**NOT for** (use other skills instead):
- Gene regulation lookup without data files -> Use existing epigenomics annotation pattern
- RNA-seq differential expression -> Use `tooluniverse-rnaseq-deseq2`
- Variant calling/annotation from VCF -> Use `tooluniverse-variant-analysis`
- Gene enrichment analysis -> Use `tooluniverse-gene-enrichment`

---

## Required Python Packages

```python
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.stats.multitest as mt
# Optional: pysam (BAM/CRAM), gseapy (enrichment)
from tooluniverse import ToolUniverse
```

---

## KEY PRINCIPLES

1. **Data-first approach** - Load and inspect data files BEFORE any analysis
2. **Question-driven** - Parse what the user is actually asking and extract the specific numeric answer
3. **File format detection** - Automatically detect methylation arrays, BED files, BigWig, clinical data
4. **Coordinate system awareness** - Track genome build (hg19, hg38, mm10), handle chr prefix differences
5. **Statistical rigor** - Proper multiple testing correction, effect size filtering, sample size awareness
6. **Missing data handling** - Explicitly report and handle NaN/missing values
7. **Chromosome normalization** - Always normalize chromosome names (chr1 vs 1, chrX vs X)
8. **Report-first** - Create output file first, populate progressively
9. **English-first queries** - Use English in all tool calls

---

## Complete Workflow

### Phase 0: Question Parsing and Data Discovery

**CRITICAL FIRST STEP**: Before writing ANY code, parse the question to identify what is being asked and what data files are available.

```python
import os, glob
data_dir = "."
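# Keep regular files only; the recursive glob also returns directories
all_files = [f for f in glob.glob(os.path.join(data_dir, "**/*"), recursive=True)
             if os.path.isfile(f)]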

methylation_files = [f for f in all_files if any(x in f.lower() for x in
    ['methyl', 'beta', 'cpg', 'illumina', '450k', '850k', 'epic', 'mval'])]
chipseq_files = [f for f in all_files if any(x in f.lower() for x in
    ['chip', 'peak', 'narrowpeak', 'broadpeak', 'histone'])]
atacseq_files = [f for f in all_files if any(x in f.lower() for x in
    ['atac', 'accessibility', 'openchromatin', 'dnase'])]  # keywords must be lowercase to match f.lower()
bed_files = [f for f in all_files if f.endswith(('.bed', '.bed.gz', '.narrowPeak', '.broadPeak'))]
clinical_files = [f for f in all_files if any(x in f.lower() for x in
    ['clinical', 'patient', 'sample', 'metadata', 'phenotype', 'survival'])]
expression_files = [f for f in all_files if any(x in f.lower() for x in
    ['express', 'rnaseq', 'fpkm', 'tpm', 'counts', 'transcriptom'])]
manifest_files = [f for f in all_files if any(x in f.lower() for x in
    ['manifest', 'annotation', 'probe', 'platform'])]
```

**Extract parameters from question**:

| Parameter | Default | Example |
|-----------|---------|---------|
| Significance threshold | 0.05 | "padj < 0.05", "FDR < 0.01" |
| Beta difference threshold | 0 | "\|delta_beta\| > 0.2" |
| Variance filter | None | "variance > 0.01", "top 5000 most variable" |
| Chromosome filter | All | "chromosome 17", "autosomes only" |
| Genome build | hg38 | "hg19", "GRCh37", "mm10" |
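
The parameters in the table above can often be pulled out of the question text with a few regular expressions. The following is a minimal sketch under that assumption, not part of the skill itself: `extract_parameters` and its patterns are hypothetical and only cover phrasings like the examples in the table.

```python
import re

# Number pattern that stops before a trailing sentence period ("0.05." -> "0.05")
NUM = r"([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)"

def extract_parameters(question):
    """Hypothetical helper: read thresholds from a question, else use the table defaults."""
    params = {"alpha": 0.05, "delta_beta": 0.0, "chromosome": None, "genome": "hg38"}

    # Significance threshold, e.g. "padj < 0.05" or "FDR < 0.01"
    m = re.search(r"(?:padj|fdr|q-?value|adjusted p-?value)\s*(?:<=|<)\s*" + NUM, question, re.I)
    if m:
        params["alpha"] = float(m.group(1))

    # Absolute beta difference, e.g. "|delta_beta| > 0.2"
    m = re.search(r"\|?\s*delta[_ ]?beta\s*\|?\s*(?:>=|>)\s*" + NUM, question, re.I)
    if m:
        params["delta_beta"] = float(m.group(1))

    # Single chromosome, e.g. "chromosome 17" or "chrX"
    m = re.search(r"\b(?:chromosome\s+|chr)([0-9]{1,2}|[XYM])\b", question, re.I)
    if m:
        params["chromosome"] = "chr" + m.group(1).upper()

    # Genome build; GRCh37/GRCh38 are mapped to their UCSC names
    builds = {"hg19": "hg19", "grch37": "hg19", "hg38": "hg38", "grch38": "hg38", "mm10": "mm10"}
    m = re.search(r"\b(hg19|hg38|grch37|grch38|mm10)\b", question, re.I)
    if m:
        params["genome"] = builds[m.group(1).lower()]
    return params

params = extract_parameters(
    "How many CpGs on chromosome 17 have padj < 0.01 and |delta_beta| > 0.2 (GRCh37)?"
)
```

Anything the patterns miss keeps its default, so the result should still be checked against the question before it drives an analysis.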

**Decision Tree**:
```
METHYLATION data? -> Phase 1 (load, filter, differential methylation, age CpG, density)
CHIP-SEQ data?    -> Phase 2 (load BED, peak stats, annotation, overlap)
ATAC-SEQ data?    -> Phase 3 (open chromatin, NFR analysis)
MULTI-OMICS?      -> Phase 4 (expression-methylation correlation)
CLINICAL?         -> Phase 5 (missing data, complete cases)
ANNOTATION?       -> Phase 6 (ToolUniverse regulatory annotation)
GENOME-WIDE?      -> Phase 7 (chromosome density, ratios)
```
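
One way to turn the decision tree into code is a small routing function over the file lists found during discovery. This is an illustrative sketch: `select_phases` and the dictionary keys are hypothetical names, and the rule that multi-omics needs expression data plus an epigenomic modality is an assumption based on the Phase 4 description.

```python
def select_phases(detected):
    """Hypothetical router: map non-empty file categories to workflow phases."""
    rules = [
        ("methylation", "Phase 1"),
        ("chipseq", "Phase 2"),
        ("atacseq", "Phase 3"),
        ("clinical", "Phase 5"),
    ]
    phases = [phase for key, phase in rules if detected.get(key)]
    # Assumption: multi-omics integration needs expression data plus methylation or ChIP-seq
    if detected.get("expression") and (detected.get("methylation") or detected.get("chipseq")):
        phases.append("Phase 4")
    return sorted(phases)

phases = select_phases({
    "methylation": ["beta_values.csv"],
    "expression": ["tpm.tsv"],
    "clinical": ["clinical.csv"],
    "chipseq": [],
})
```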

---

### Phase 1: Methylation Data Processing

**Core functions** (copy-paste ready):

```python
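def load_methylation_data(file_path):
    """Load methylation beta/M-value matrix. Rows=probes, Cols=samples."""
    name = file_path.lower()
    for suffix in ('.gz', '.bz2', '.zip'):  # pandas infers compression from the extension
        if name.endswith(suffix): name = name[:-len(suffix)]
    ext = os.path.splitext(name)[1]
    if ext == '.csv': return pd.read_csv(file_path, index_col=0)
    elif ext in ('.tsv', '.txt'): return pd.read_csv(file_path, sep='\t', index_col=0)
    elif ext == '.parquet': return pd.read_parquet(file_path)
    else:
        # Unknown extension: try tab-delimited first. A comma-delimited file parsed
        # this way raises no error but yields zero data columns, so retry with commas.
        df = pd.read_csv(file_path, sep='\t', index_col=0)
        return df if df.shape[1] > 0 else pd.read_csv(file_path, index_col=0)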

def normalize_chromosome(chrom):
    if chrom is None or pd.isna(chrom): return None
    chrom = str(chrom).strip()
    return chrom if chrom.startswith('chr') else 'chr' + chrom

def differential_methylation(beta_df, group1_samples, group2_samples,
                              test='ttest', correction='fdr_bh', alpha=0.05):
    """DMP analysis between two groups. Returns DataFrame with padj, delta_beta."""
    g1, g2 = beta_df[group1_samples], beta_df[group2_samples]
    results = []
    for probe in beta_df.index:
        vals1, vals2 = g1.loc[probe].dropna().values, g2.loc[probe].dropna().values
        if len(vals1) < 2 or len(vals2) < 2:
            results.append({'probe': probe, 'mean_g1': np.nan, 'mean_g2': np.nan,
                'delta_beta': np.nan, 'pvalue': np.nan}); continue
        mean1, mean2 = np.nanmean(vals1), np.nanmean(vals2)
        if test == 'ttest': _, pval = stats.ttest_ind(vals1, vals2, equal_var=False)
        elif test == 'wilcoxon': _, pval = stats.mannwhitneyu(vals1, vals2, alternative='two-sided')
        else: _, pval = stats.ttest_ind(vals1, vals2, equal_var=False)
        results.append({'probe': probe, 'mean_g1': mean1, 'mean_g2': mean2,
            'delta_beta': mean2 - mean1, 'pvalue': pval})
    result_df = pd.DataFrame(results).set_index('probe')
    valid = result_df['pvalue'].dropna()
    if len(valid) > 0:
        _, padj, _, _ = mt.multipletests(valid.values, alpha=alpha, method=correction)
        result_df.loc[valid.index, 'padj'] = padj
    else: result_df['padj'] = np.nan
    return result_df

def chromosome_cpg_density(cpg_probes, manifest, genome='hg38'):
    """CpG density per chromosome. Returns DataFrame with chr, n_cpgs, density_per_bp."""
    chr_lengths = get_chromosome_lengths(genome)
    probe_id_col = 'probe_id' if 'probe_id' in manifest.columns else manifest.columns[0]
    probe_chr = manifest.set_index(probe_id_col) if probe_id_col in manifest.columns else manifest
    chr_col = 'chr' if 'chr' in probe_chr.columns else 'CHR'
    probe_chrs = probe_chr.loc[probe_chr.index.isin(cpg_probes), chr_col].apply(normalize_chromosome)
    chr_counts = probe_chrs.value_counts()
    results = []
    for chrom, count in chr_counts.items():
        if chrom in chr_lengths:
            length = chr_lengths[chrom]
            results.append({'chr': chrom, 'n_cpgs': count, 'chr_length': length,
                'density_per_bp': count / length, 'density_per_mb': count / length * 1e6})
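    if not results:  # no probe mapped to a chromosome with a known length
        return pd.DataFrame(columns=['chr', 'n_cpgs', 'chr_length', 'density_per_bp', 'density_per_mb'])
    # Sort autosomes numerically, then X, Y, M; any other contig goes last
    special = {'X': 1000, 'Y': 1001, 'M': 1002, 'MT': 1002}
    rank = lambda c: int(c[3:]) if c[3:].isdigit() else special.get(c[3:], 2000)
    return pd.DataFrame(results).sort_values('chr', key=lambda x: x.map(rank))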
```
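
The per-probe loop in `differential_methylation()` can be slow on full arrays. The sketch below computes the same kind of Welch t-test for all probes at once with NumPy and then applies Benjamini-Hochberg correction. `welch_dmp` is a hypothetical name, the input matrix is synthetic, and this is an alternative written for illustration rather than the implementation shipped with the skill.

```python
import warnings
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

def welch_dmp(beta_df, group1_samples, group2_samples, alpha=0.05):
    """Row-wise Welch t-test (NaN-aware) with BH-adjusted p-values."""
    g1 = beta_df[group1_samples].to_numpy(dtype=float)
    g2 = beta_df[group2_samples].to_numpy(dtype=float)
    n1 = (~np.isnan(g1)).sum(axis=1)
    n2 = (~np.isnan(g2)).sum(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"), warnings.catch_warnings():
        warnings.simplefilter("ignore", RuntimeWarning)  # all-NaN rows, n < 2
        m1, m2 = np.nanmean(g1, axis=1), np.nanmean(g2, axis=1)
        se1 = np.nanvar(g1, axis=1, ddof=1) / n1  # squared standard error per group
        se2 = np.nanvar(g2, axis=1, ddof=1) / n2
        t = (m1 - m2) / np.sqrt(se1 + se2)
        # Welch-Satterthwaite degrees of freedom
        dof = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
        pvals = 2 * stats.t.sf(np.abs(t), dof)
    pvals[(n1 < 2) | (n2 < 2)] = np.nan  # same minimum group size as the loop version

    out = pd.DataFrame({"mean_g1": m1, "mean_g2": m2, "delta_beta": m2 - m1,
                        "pvalue": pvals}, index=beta_df.index)
    out["padj"] = np.nan
    ok = out["pvalue"].notna()
    if ok.any():
        out.loc[ok, "padj"] = multipletests(out.loc[ok, "pvalue"].to_numpy(),
                                            alpha=alpha, method="fdr_bh")[1]
    return out

# Synthetic beta values, for illustration only
rng = np.random.default_rng(0)
beta = pd.DataFrame(rng.uniform(0, 1, size=(5, 6)),
                    index=[f"cg{i}" for i in range(5)], columns=list("abcdef"))
beta.loc["cg0", "a"] = np.nan
result = welch_dmp(beta, ["a", "b", "c"], ["d", "e", "f"])
```

As in the loop version, `delta_beta` is group 2 minus group 1, so a positive value means higher methylation in the second group.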

**Genome chromosome lengths** (hg38/hg19/mm10): see `get_chromosome_lengths()` in REFERENCE.md.
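
If REFERENCE.md is not at hand, chromosome lengths can also be read from a UCSC-style `chrom.sizes` file (two tab-separated columns: name, length). The sketch below assumes such a file is available; `load_chrom_sizes` is a hypothetical helper, and the in-memory example holds the hg38 lengths of chr1 and chr17 plus one unplaced contig that is only there to exercise the filter.

```python
import io
import pandas as pd

def load_chrom_sizes(path_or_buffer, primary_only=True):
    """Read a two-column chrom.sizes table into a {chromosome: length} dict."""
    sizes = pd.read_csv(path_or_buffer, sep="\t", header=None, usecols=[0, 1],
                        names=["chr", "length"], dtype={"chr": str, "length": "int64"})
    if primary_only:
        # Alt, random, and unplaced contigs carry an underscore in their UCSC names
        sizes = sizes[~sizes["chr"].str.contains("_")]
    return {c: int(n) for c, n in zip(sizes["chr"], sizes["length"])}

example = io.StringIO(
    "chr1\t248956422\n"
    "chr17\t83257441\n"
    "chr1_KI270706v1_random\t175055\n"
)
lengths = load_chrom_sizes(example)

# Density for a hypothetical count of 1200 CpG probes on chr17
density_per_mb = 1200 / lengths["chr17"] * 1e6
```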

---

### Phase 2-3: ChIP-seq / ATAC-seq (summary)

```python
def load_bed_file(file_path, format='auto'):
    """Load BED/narrowPeak/broadPeak. Returns DataFrame with chrom, start, end, ..."""
    # Auto-detect format from extension, read TSV, normalize chromosomes
    # Full implementation in REFERENCE.md

def peak_statistics(peaks_df):
    """Return dict: total_peaks, mean/median_peak_length, total_coverage_bp, per-chromosome counts."""

def find_overlaps(peaks_a, peaks_b, min_overlap=1):
    """Pure-Python interval overlap between two BED DataFrames."""

def annotate_peaks_to_genes(peaks_df, gene_annotation, tss_upstream=2000):
    """Annotate peaks to nearest gene: promoter/gene_body/proximal/distal."""
```

Full implementations with all edge cases are in REFERENCE.md.
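
To make the stubs above concrete, here is a minimal sketch of `peak_statistics()` and `find_overlaps()`. It assumes a DataFrame with `chrom`, `start`, and `end` columns in BED coordinates (0-based, half-open) and is not the REFERENCE.md implementation; the overlap search is a simple per-chromosome scan, adequate for modest peak sets.

```python
import numpy as np
import pandas as pd

def peak_statistics(peaks_df):
    """Summary statistics for BED-like peaks (chrom, start, end)."""
    lengths = peaks_df["end"] - peaks_df["start"]
    return {
        "total_peaks": int(len(peaks_df)),
        "mean_peak_length": float(lengths.mean()),
        "median_peak_length": float(lengths.median()),
        "total_coverage_bp": int(lengths.sum()),  # overlapping peaks are counted twice
        "peaks_per_chromosome": peaks_df["chrom"].value_counts().to_dict(),
    }

def find_overlaps(peaks_a, peaks_b, min_overlap=1):
    """Return (index_a, index_b) pairs sharing at least min_overlap bases."""
    pairs = []
    for chrom, a_chr in peaks_a.groupby("chrom"):
        b_chr = peaks_b[peaks_b["chrom"] == chrom]
        for a_idx, a in a_chr.iterrows():
            # Shared bases = min(end) - max(start); zero or negative means no overlap
            shared = np.minimum(b_chr["end"], a["end"]) - np.maximum(b_chr["start"], a["start"])
            pairs.extend((a_idx, b_idx) for b_idx in shared[shared >= min_overlap].index)
    return pairs

# Toy intervals, for illustration only
peaks_a = pd.DataFrame({"chrom": ["chr1", "chr1", "chr2"],
                        "start": [100, 300, 50], "end": [200, 400, 150]})
peaks_b = pd.DataFrame({"chrom": ["chr1", "chr1", "chr2"],
                        "start": [150, 400, 0], "end": [250, 500, 60]})
summary = peak_statistics(peaks_a)
overlaps = find_overlaps(peaks_a, peaks_b)
```

Because BED intervals are half-open, the peaks `chr1:300-400` and `chr1:400-500` touch but do not overlap, so they are not reported.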

---

### Phase 4-5: Multi-Omics & Clinical Integration (summary)

```python
def correlate_methylation_expression(beta_df, expression_df, probe_gene_map, method='pearson'):
    """Correlate methylation with expression per probe-gene pair. Returns corr + padj."""

def missing_data_analysis(clinical_df=None, expression_df=None, methylation_df=None):
    """Count patients with complete data across modalities."""

def find_complete_cases(data_frames, variables=None):
    """Find sample IDs present in ALL provided DataFrames with no missing values."""
```

Full implementations in REFERENCE.md.
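
A complete-case count across modalities comes down to a set intersection followed by a NaN check. The sketch below assumes clinical data with one row per sample (IDs in the index) and expression/methylation matrices with samples as columns; `complete_cases_across_modalities` is a hypothetical name and its signature differs from the `find_complete_cases()` stub above.

```python
import numpy as np
import pandas as pd

def complete_cases_across_modalities(clinical_df, expression_df, methylation_df,
                                     clinical_vars=None):
    """Sample IDs present in all three tables with no NaN in the chosen clinical variables."""
    shared = set(clinical_df.index) & set(expression_df.columns) & set(methylation_df.columns)
    subset = clinical_df.loc[sorted(shared)]
    if clinical_vars is not None:
        subset = subset[clinical_vars]
    return subset.dropna().index.tolist()

# Toy tables, for illustration only
clinical = pd.DataFrame({"age": [50, np.nan, 60, 70], "stage": ["I", "II", None, "III"]},
                        index=["P1", "P2", "P3", "P4"])
expression = pd.DataFrame(1.0, index=["GENE_A"], columns=["P1", "P2", "P3"])
methylation = pd.DataFrame(0.5, index=["cg1"], columns=["P1", "P3", "P4"])

with_age = complete_cases_across_modalities(clinical, expression, methylation, ["age"])
with_all = complete_cases_across_modalities(clinical, expression, methylation)
```

The answer depends on which clinical variables are required, so the variable list should be taken from the question rather than defaulted silently.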

---

### Phase 6: ToolUniverse Annotation Tools

| Tool | Key Parameters | Returns |
|------|---------------|---------|
| `ensembl_lookup_gene` | `id`, `species='homo_sapiens'` (REQUIRED) | gene coords, biotype |
| `ensembl_get_regulatory_features` | `region` (NO "chr" prefix!), `feature`, `species` | regulatory features |
| `SCREEN_get_regulatory_elements` | `gene_name`, `element_type`, `limit` | cCREs (enhancers, promoters) |
| `ChIPAtlas_get_experiments` | `operation='get_experiment_list'` (REQUIRED), `genome`, `antigen` | experiment list |
| `ReMap_get_transcription_factor_binding` | `gene_name`, `cell_type`, `limit` | TF binding sites |
| `RegulomeDB_query_variant` | `rsid` | regulatory score |
| `jaspar_search_matrices` | `search`, `collection`, `species` | TF matrices |
| `ENCODE_search_experiments` | `assay_title`, `target`, `organism`, `limit` | experiment metadata |

**CRITICAL**: Ensembl region format is `"17:start-end"` (NO "chr" prefix). ChIPAtlas and FourDN tools ALL require the `operation` parameter.
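
Since locally processed data usually carries `chr`-prefixed names, a small formatter helps avoid malformed Ensembl queries. This is a sketch with a hypothetical helper name, `to_ensembl_region`; it only builds the region string and does not call any tool.

```python
def to_ensembl_region(chrom, start, end):
    """Format a region as 'chrom:start-end' without the 'chr' prefix."""
    name = str(chrom).strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    if name == "M":
        name = "MT"  # Ensembl names the mitochondrial genome MT
    start, end = int(start), int(end)
    if start > end:
        raise ValueError("start must not exceed end")
    return f"{name}:{start}-{end}"

region = to_ensembl_region("chr17", 1000000, 1001000)
```

Ensembl coordinates are 1-based and inclusive, whereas BED is 0-based and half-open, so a BED interval should be passed as `to_ensembl_region(chrom, start + 1, end)`.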

---

## Common Use Patterns

**Methylation array**: Load beta matrix + manifest → filter CpGs → define groups → `differential_methylation()` → apply thresholds → report DMPs

**Age-related CpG density**: Load beta + ages → correlate per probe → filter significant → map to chromosomes → `chromosome_cpg_density()` → compute ratios
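
The "correlate per probe, filter significant" steps of this pattern might look like the sketch below. It assumes `ages` is a pandas Series indexed by sample ID and uses Pearson correlation with Benjamini-Hochberg correction; `age_associated_cpgs` is a hypothetical name and the input values are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

def age_associated_cpgs(beta_df, ages, alpha=0.05):
    """Pearson correlation of each probe (row) with age, plus BH-adjusted p-values."""
    ages = ages.reindex(beta_df.columns).astype(float)  # align ages to sample columns
    rows = []
    for probe, values in beta_df.iterrows():
        ok = values.notna() & ages.notna()
        # Need at least 3 paired observations and a non-constant probe
        if ok.sum() < 3 or values[ok].nunique() < 2:
            rows.append((probe, np.nan, np.nan))
            continue
        r, p = stats.pearsonr(values[ok].astype(float), ages[ok])
        rows.append((probe, r, p))
    out = pd.DataFrame(rows, columns=["probe", "r", "pvalue"]).set_index("probe")
    out["padj"] = np.nan
    valid = out["pvalue"].notna()
    if valid.any():
        out.loc[valid, "padj"] = multipletests(out.loc[valid, "pvalue"].to_numpy(),
                                               alpha=alpha, method="fdr_bh")[1]
    out["significant"] = out["padj"] < alpha
    return out

# Made-up values, for illustration only
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
ages = pd.Series([20, 30, 40, 50, 60, 70], index=samples)
beta = pd.DataFrame([[0.30, 0.36, 0.41, 0.44, 0.51, 0.55],
                     [0.50, 0.40, 0.50, 0.40, 0.50, 0.40]],
                    index=["cg_up", "cg_flat"], columns=samples)
age_hits = age_associated_cpgs(beta, ages)
```

The probe IDs where `significant` is true can then be passed to `chromosome_cpg_density()` as `cpg_probes`.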

**Multi-omics missing data**: Load clinical + expression + methylation → extract sample IDs → find intersection → check NaN in clinical vars → report complete cases

**ChIP-seq annotation**: Load BED → load gene annotation → `annotate_peaks_to_genes()` → `classify_peak_regions()` → report fractions

**Methylation-expression**: Load both matrices → build probe-gene map → align samples → `correlate_methylation_expression()` → report anti-correlations
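
For the methylation-expression pattern, the following sketch fills in the `correlate_methylation_expression()` stub under stated assumptions: both matrices have samples as columns, and `probe_gene_map` is a plain dict from probe ID to gene symbol. It is an illustration with made-up data, not the REFERENCE.md implementation.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

def correlate_methylation_expression(beta_df, expression_df, probe_gene_map, method="pearson"):
    """Correlate each mapped probe with its gene across shared samples; adds BH padj."""
    samples = beta_df.columns.intersection(expression_df.columns)
    corr = stats.spearmanr if method == "spearman" else stats.pearsonr
    rows = []
    for probe, gene in probe_gene_map.items():
        if probe not in beta_df.index or gene not in expression_df.index:
            continue  # pair cannot be evaluated
        x = beta_df.loc[probe, samples].astype(float)
        y = expression_df.loc[gene, samples].astype(float)
        ok = x.notna() & y.notna()
        if ok.sum() < 3 or x[ok].nunique() < 2 or y[ok].nunique() < 2:
            continue
        r, p = corr(x[ok], y[ok])
        rows.append({"probe": probe, "gene": gene, "r": r, "pvalue": p, "n": int(ok.sum())})
    out = pd.DataFrame(rows, columns=["probe", "gene", "r", "pvalue", "n"]).set_index("probe")
    out["padj"] = np.nan
    if len(out) > 0:
        out["padj"] = multipletests(out["pvalue"].to_numpy(), method="fdr_bh")[1]
    return out

# Made-up matrices, for illustration only
samples = ["s1", "s2", "s3", "s4", "s5"]
beta = pd.DataFrame([[0.1, 0.3, 0.5, 0.7, 0.9], [0.5, 0.4, 0.6, 0.5, 0.45]],
                    index=["cg1", "cg2"], columns=samples)
expr = pd.DataFrame([[10, 8, 6.5, 4, 2], [5, 5.5, 4.8, 5.2, 5.1]],
                    index=["GENE_A", "GENE_B"], columns=samples)
pairs = correlate_methylation_expression(
    beta, expr, {"cg1": "GENE_A", "cg2": "GENE_B", "cg_missing": "GENE_A"})
anti_correlated = pairs[(pairs["r"] < 0) & (pairs["padj"] < 0.05)]
```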

---

## Fallback Strategies

| Scenario | Fallback |
|----------|----------|
| No manifest file | Build minimal from Ensembl lookup |
| No pybedtools | Pure Python overlap (pandas intervals) |
| Low sample count | Non-parametric test (`test='wilcoxon'`, which runs a Mann-Whitney U / Wilcoxon rank-sum test) |
| Large dataset (>500K probes) | Pre-filter by variance, chunk processing |
| Sample ID mismatches | Try truncating TCGA barcodes, partial matching |
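
Two of these fallbacks are short enough to sketch directly. Both helper names are hypothetical. The barcode widths follow the TCGA barcode layout (the first 12 characters identify the patient, the first 15 the sample), and the example barcode is invented.

```python
import pandas as pd

def truncate_tcga_barcodes(ids, level="patient"):
    """Shorten TCGA barcodes so IDs from different assays can be matched."""
    width = {"patient": 12, "sample": 15}[level]  # TCGA-XX-XXXX / TCGA-XX-XXXX-01
    return [str(i)[:width] for i in ids]

def top_variable_probes(beta_df, n=5000):
    """Keep the n probes with the highest variance across samples (NaNs ignored)."""
    variances = beta_df.var(axis=1, skipna=True)
    return beta_df.loc[variances.nlargest(n).index]

# Invented barcode and toy matrix, for illustration only
patient_ids = truncate_tcga_barcodes(["TCGA-AB-1234-01A-11D-A10M-05"])
sample_ids = truncate_tcga_barcodes(["TCGA-AB-1234-01A-11D-A10M-05"], level="sample")
beta = pd.DataFrame([[0.1, 0.9, 0.5], [0.5, 0.5, 0.5], [0.4, 0.6, 0.5]],
                    index=["cgA", "cgB", "cgC"], columns=["s1", "s2", "s3"])
filtered = top_variable_probes(beta, n=2)
```

Truncating to patient level can merge several samples from one patient (for example tumor and normal), so duplicated IDs after truncation should be checked before joining tables.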

---

> **Extended Reference**: Full code implementations for all phases (including `get_chromosome_lengths()`, `load_bed_file()`, peak overlap, multi-omics integration, edge cases, and limitations) are in `REFERENCE.md`.

Related Skills

tooluniverse

from Zaoqu-Liu/ScienceClaw

Router skill for ToolUniverse tasks. First checks if specialized tooluniverse skills (34+ skills covering disease/drug/target research, clinical decision support, genomics, epigenomics, chemical safety, systems biology, and more) can solve the problem, then falls back to general strategies for using 1400+ scientific tools. Covers tool discovery, multi-hop queries, comprehensive research workflows, disambiguation, evidence grading, and report generation. Use when users need to research any scientific topic, find biological data, or explore drug/target/disease relationships.

tooluniverse-variant-interpretation

from Zaoqu-Liu/ScienceClaw

Systematic clinical variant interpretation from raw variant calls to ACMG-classified recommendations with structural impact analysis. Aggregates evidence from ClinVar, gnomAD, CIViC, UniProt, and PDB across ACMG criteria. Produces pathogenicity scores (0-100), clinical recommendations, and treatment implications. Use when interpreting genetic variants, classifying variants of uncertain significance (VUS), performing ACMG variant classification, or translating variant calls to clinical actionability.

tooluniverse-variant-analysis

from Zaoqu-Liu/ScienceClaw

Production-ready VCF processing, variant annotation, mutation analysis, and structural variant (SV/CNV) interpretation for bioinformatics questions. Parses VCF files (streaming, large files), classifies mutation types (missense, nonsense, synonymous, frameshift, splice, intronic, intergenic) and structural variants (deletions, duplications, inversions, translocations), applies VAF/depth/quality/consequence filters, annotates with ClinVar/dbSNP/gnomAD/CADD via ToolUniverse, interprets SV/CNV clinical significance using ClinGen dosage sensitivity scores, computes variant statistics, and generates reports. Solves questions like "What fraction of variants with VAF < 0.3 are missense?", "How many non-reference variants remain after filtering intronic/intergenic?", "What is the pathogenicity of this deletion affecting BRCA1?", or "Which dosage-sensitive genes overlap this CNV?". Use when processing VCF files, annotating variants, filtering by VAF/depth/consequence, classifying mutations, interpreting structural variants, assessing CNV pathogenicity, comparing cohorts, or answering variant analysis questions.

tooluniverse-target-research

from Zaoqu-Liu/ScienceClaw

Gather comprehensive biological target intelligence from 9 parallel research paths covering protein info, structure, interactions, pathways, expression, variants, drug interactions, and literature. Features collision-aware searches, evidence grading (T1-T4), explicit Open Targets coverage, and mandatory completeness auditing. Use when users ask about drug targets, proteins, genes, or need target validation, druggability assessment, or comprehensive target profiling.

tooluniverse-systems-biology

from Zaoqu-Liu/ScienceClaw

Comprehensive systems biology and pathway analysis using multiple pathway databases (Reactome, KEGG, WikiPathways, Pathway Commons, BioModels). Performs pathway enrichment, protein-pathway mapping, keyword searches, and systems-level analysis. Use when analyzing gene sets, exploring biological pathways, or investigating systems-level biology.

tooluniverse-structural-variant-analysis

from Zaoqu-Liu/ScienceClaw

Comprehensive structural variant (SV) analysis skill for clinical genomics. Classifies SVs (deletions, duplications, inversions, translocations), assesses pathogenicity using ACMG-adapted criteria, evaluates gene disruption and dosage sensitivity, and provides clinical interpretation with evidence grading. Use when analyzing CNVs, large deletions/duplications, chromosomal rearrangements, or any structural variants requiring clinical interpretation.

tooluniverse-statistical-modeling

from Zaoqu-Liu/ScienceClaw

Perform statistical modeling and regression analysis on biomedical datasets. Supports linear regression, logistic regression (binary/ordinal/multinomial), mixed-effects models, Cox proportional hazards survival analysis, Kaplan-Meier estimation, and comprehensive model diagnostics. Extracts odds ratios, hazard ratios, confidence intervals, p-values, and effect sizes. Designed to solve BixBench statistical reasoning questions involving clinical/experimental data. Use when asked to fit regression models, compute odds ratios, perform survival analysis, run statistical tests, or interpret model coefficients from provided data.

tooluniverse-spatial-transcriptomics

from Zaoqu-Liu/ScienceClaw

Analyze spatial transcriptomics data to map gene expression in tissue architecture. Supports 10x Visium, MERFISH, seqFISH, Slide-seq, and imaging-based platforms. Performs spatial clustering, domain identification, cell-cell proximity analysis, spatial gene expression patterns, tissue architecture mapping, and integration with single-cell data. Use when analyzing spatial transcriptomics datasets, studying tissue organization, identifying spatial expression patterns, mapping cell-cell interactions in tissue context, characterizing tumor microenvironment spatial structure, or integrating spatial and single-cell RNA-seq data for comprehensive tissue analysis.

tooluniverse-spatial-omics-analysis

from Zaoqu-Liu/ScienceClaw

Computational analysis framework for spatial multi-omics data integration. Given spatially variable genes (SVGs), spatial domain annotations, tissue type, and disease context from spatial transcriptomics/proteomics experiments (10x Visium, MERFISH, DBiTplus, SLIDE-seq, etc.), performs comprehensive biological interpretation including pathway enrichment, cell-cell interaction inference, druggable target identification, immune microenvironment characterization, and multi-modal integration. Produces a detailed markdown report with Spatial Omics Integration Score (0-100), domain-by-domain characterization, and validation recommendations. Uses 70+ ToolUniverse tools across 9 analysis phases. Use when users ask about spatial transcriptomics analysis, spatial omics interpretation, tissue heterogeneity, spatial gene expression patterns, tumor microenvironment mapping, tissue zonation, or cell-cell communication from spatial data.

tooluniverse-single-cell

from Zaoqu-Liu/ScienceClaw

Production-ready single-cell and expression matrix analysis using scanpy, anndata, and scipy. Performs scRNA-seq QC, normalization, PCA, UMAP, Leiden/Louvain clustering, differential expression (Wilcoxon, t-test, DESeq2), cell type annotation, per-cell-type statistical analysis, gene-expression correlation, batch correction (Harmony), trajectory inference, and cell-cell communication analysis. NEW: Analyzes ligand-receptor interactions between cell types using OmniPath (CellPhoneDB, CellChatDB), scores communication strength, identifies signaling cascades, and handles multi-subunit receptor complexes. Integrates with ToolUniverse gene annotation tools (HPA, Ensembl, MyGene, UniProt) and enrichment tools (gseapy, PANTHER, STRING). Supports h5ad, 10X, CSV/TSV count matrices, and pre-annotated datasets. Use when analyzing single-cell RNA-seq data, studying cell-cell interactions, performing cell type differential expression, computing gene-expression correlations by cell type, analyzing tumor-immune communication, or answering questions about scRNA-seq datasets.

tooluniverse-sequence-retrieval

from Zaoqu-Liu/ScienceClaw

Retrieves biological sequences (DNA, RNA, protein) from NCBI and ENA with gene disambiguation, accession type handling, and comprehensive sequence profiles. Creates detailed reports with sequence metadata, cross-database references, and download options. Use when users need nucleotide sequences, protein sequences, genome data, or mention GenBank, RefSeq, EMBL accessions.

tooluniverse-sdk

from Zaoqu-Liu/ScienceClaw

Build AI scientist systems using ToolUniverse Python SDK for scientific research. Use when users need to access 1000+ scientific tools through Python code, create scientific workflows, perform drug discovery, protein analysis, genomics analysis, literature research, or any computational biology task. Triggers include requests to use scientific tools programmatically, build research pipelines, analyze biological data, search literature, predict drug properties, or create AI-powered scientific workflows.