bio-atac-seq-atac-qc
Quality control metrics for ATAC-seq data including fragment size distribution, TSS enrichment, FRiP, and library complexity. Use when assessing ATAC-seq library quality before or after peak calling to identify problematic samples.
Best use case
bio-atac-seq-atac-qc is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using bio-atac-seq-atac-qc should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/bio-atac-seq-atac-qc/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
SKILL.md Source
## Version Compatibility
Reference examples tested with: bedtools 2.31+, deepTools 3.5+, numpy 1.26+, pandas 2.2+, picard 3.1+, pyBigWig 0.3+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# ATAC-seq Quality Control
**"Check the quality of my ATAC-seq library"** → Evaluate fragment size distribution (nucleosome periodicity), TSS enrichment, FRiP, and library complexity to assess chromatin accessibility experiment quality.
- CLI: `bamPEFragmentSize` (deepTools), `picard CollectInsertSizeMetrics`
- Python: `pysam` for custom fragment analysis
## Fragment Size Distribution
**Goal:** Assess ATAC-seq library quality by visualizing the characteristic nucleosome periodicity in fragment sizes.
**Approach:** Extract insert sizes from the BAM file using Picard or samtools, producing a distribution that should show NFR (<100 bp) and mono-nucleosome (~200 bp) peaks.
```bash
# Using Picard
java -jar picard.jar CollectInsertSizeMetrics \
I=sample.bam \
O=insert_sizes.txt \
H=insert_sizes.pdf \
M=0.5
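# Using samtools
# -f 66 keeps first-in-pair reads of properly paired alignments, so each fragment is counted once
samtools view -f 66 sample.bam | \
awk '{print sqrt($9^2)}' | \
sort -n | uniq -c | \
awk '{print $2"\t"$1}' > fragment_sizes.txt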
```
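The two-column `fragment_sizes.txt` table (size, count) can be reduced to a few summary fractions. The sketch below is one possible way to do that in Python; the function names and the size cutoffs (`nfr_max=100`, `mono_range=(150, 300)`) are illustrative assumptions, not established standards.

```python
def load_size_counts(path):
    """Read a 'size<TAB>count' file such as fragment_sizes.txt into a dict."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            size, count = line.split()
            counts[int(size)] = counts.get(int(size), 0) + int(count)
    return counts


def summarize_fragment_sizes(size_counts, nfr_max=100, mono_range=(150, 300)):
    """Return the fraction of fragments in the NFR and mono-nucleosome size classes.

    The cutoffs are assumptions for illustration; adjust them to your protocol.
    Fragments with a reported size of 0 are ignored.
    """
    total = sum(c for s, c in size_counts.items() if s > 0)
    if total == 0:
        return {'nfr_fraction': 0.0, 'mono_fraction': 0.0}
    nfr = sum(c for s, c in size_counts.items() if 0 < s < nfr_max)
    mono = sum(c for s, c in size_counts.items()
               if mono_range[0] <= s < mono_range[1])
    return {'nfr_fraction': nfr / total, 'mono_fraction': mono / total}


# Toy counts, not real data
example = {0: 5, 50: 60, 200: 30, 400: 10}
print(summarize_fragment_sizes(example))
```

A library with almost no mono-nucleosome fraction, or with no NFR enrichment, is worth inspecting in the full histogram before drawing conclusions.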
## TSS Enrichment Score
**Goal:** Quantify signal enrichment at transcription start sites as a key ATAC-seq quality metric.
**Approach:** Create a TSS BED file, compute a signal matrix around TSS positions using deepTools, then plot the enrichment profile.
```bash
# Using deepTools
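# 1. Create TSS BED file (from GTF); the TSS is the start on + strand and the end on - strand
#    $14 is an attribute value whose position depends on the GTF; adjust as needed
awk 'BEGIN{OFS="\t"} $3=="transcript" {tss = ($7=="-") ? $5 : $4; print $1, tss-1, tss, $14, 0, $7}' genes.gtf | \
tr -d '";' | sort -k1,1 -k2,2n > tss.bed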
# 2. Compute matrix around TSS
computeMatrix reference-point \
-S sample.bw \
-R tss.bed \
-a 2000 -b 2000 \
-o tss_matrix.gz
# 3. Plot TSS enrichment
plotProfile -m tss_matrix.gz \
-o tss_enrichment.png \
--perGroup
```
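TSS extraction from a GTF is strand-sensitive: for minus-strand transcripts the TSS is the end coordinate, not the start. A minimal Python sketch of that conversion follows; `tss_bed_record` is a hypothetical helper name, and it assumes a standard 9-column GTF with a `transcript_id` attribute.

```python
def tss_bed_record(gtf_line):
    """Convert one GTF 'transcript' line to a 6-column BED tuple for its TSS.

    Returns None for comments and for features other than 'transcript'.
    """
    if gtf_line.startswith('#'):
        return None
    fields = gtf_line.rstrip('\n').split('\t')
    if len(fields) < 9 or fields[2] != 'transcript':
        return None
    chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
    # GTF coordinates are 1-based inclusive; BED is 0-based half-open
    tss = end if strand == '-' else start
    name = '.'
    for attr in fields[8].split(';'):
        attr = attr.strip()
        if attr.startswith('transcript_id '):
            name = attr.split(' ', 1)[1].strip('"')
            break
    return (chrom, tss - 1, tss, name, 0, strand)


# Toy GTF lines, not taken from a real annotation
plus = 'chr1\tsrc\ttranscript\t1000\t5000\t.\t+\t.\tgene_id "G1"; transcript_id "T1";'
minus = 'chr1\tsrc\ttranscript\t1000\t5000\t.\t-\t.\tgene_id "G2"; transcript_id "T2";'
print(tss_bed_record(plus))
print(tss_bed_record(minus))
```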
## Calculate TSS Enrichment Score
**Goal:** Compute a numeric TSS enrichment score from a bigWig signal track.
**Approach:** Sample signal values in windows around TSS positions, average across all TSSs, then divide center signal by flanking background.
```python
import numpy as np
import pyBigWig

def calculate_tss_enrichment(bigwig_file, tss_bed, flank=2000):
    '''Calculate TSS enrichment score.'''
    bw = pyBigWig.open(bigwig_file)
    chrom_sizes = bw.chroms()
    signals = []
    with open(tss_bed) as bed:
        for line in bed:
            fields = line.rstrip('\n').split('\t')
            # BED start (0-based) marks the TSS base
            chrom, tss = fields[0], int(fields[1])
            strand = fields[5] if len(fields) > 5 else '+'
            # Skip windows that run off a chromosome so every row has 2 * flank values
            if chrom not in chrom_sizes or tss - flank < 0 or tss + flank > chrom_sizes[chrom]:
                continue
            vals = bw.values(chrom, tss - flank, tss + flank)
            if strand == '-':
                vals = vals[::-1]  # orient minus-strand genes 5' to 3'
            signals.append(vals)
    bw.close()
    if not signals:
        raise ValueError('No usable TSS windows found')
    avg_signal = np.nanmean(np.array(signals, dtype=float), axis=0)
    # TSS enrichment = signal at TSS / background (outer 100 bp on each side)
    background = np.nanmean(np.concatenate([avg_signal[:100], avg_signal[-100:]]))
    tss_signal = np.nanmean(avg_signal[flank-50:flank+50])
    enrichment = tss_signal / background if background > 0 else 0
    return enrichment, avg_signal

enrichment, signal = calculate_tss_enrichment('sample.bw', 'tss.bed')
print(f'TSS Enrichment Score: {enrichment:.2f}')
```
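To sanity-check the center-over-flank arithmetic without a bigWig file, the same ratio can be computed on a synthetic profile. This is a small sketch using only the standard library; `enrichment_from_profile` and its window widths are assumptions that mirror the 100 bp edges and the 100 bp central window described above.

```python
def enrichment_from_profile(profile, center_half_width=50, edge_width=100):
    """Ratio of mean signal around the profile midpoint to mean signal at both edges."""
    mid = len(profile) // 2
    edges = profile[:edge_width] + profile[-edge_width:]
    background = sum(edges) / len(edges)
    center = profile[mid - center_half_width: mid + center_half_width]
    tss_signal = sum(center) / len(center)
    return tss_signal / background if background > 0 else 0.0


# Synthetic average profile: flat background of 1.0 with an 8x bump at the TSS
flank = 2000
toy_profile = [1.0] * (2 * flank)
for i in range(flank - 50, flank + 50):
    toy_profile[i] = 8.0

score = enrichment_from_profile(toy_profile)
print(f'Toy TSS enrichment: {score:.2f}')
```

Because the bump covers exactly the central window and the edges are untouched, the score here equals the bump height relative to background.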
## FRiP (Fraction of Reads in Peaks)
```bash
# Total reads
total=$(samtools view -c -F 4 sample.bam)
# Reads in peaks
in_peaks=$(bedtools intersect -a sample.bam -b peaks.narrowPeak -u | \
samtools view -c)
# FRiP
frip=$(echo "scale=4; $in_peaks / $total" | bc)
echo "FRiP: $frip"
# Good FRiP for ATAC-seq: >0.2 (20%); see QC Thresholds for graded cutoffs
```
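The FRiP calculation boils down to counting reads that overlap at least one peak. The following Python sketch shows that logic on in-memory intervals, assuming half-open `(chrom, start, end)` tuples as in BED; the function names are hypothetical, and for real BAM files the `bedtools` pipeline above remains the practical route.

```python
from bisect import bisect_left
from collections import defaultdict


def merge_peaks(peaks):
    """Merge overlapping peaks per chromosome; return {chrom: (starts, ends)}."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged


def frip(reads, peaks):
    """Fraction of reads overlapping any peak by at least 1 bp."""
    merged = merge_peaks(peaks)
    in_peaks = 0
    for chrom, start, end in reads:
        if chrom not in merged:
            continue
        starts, ends = merged[chrom]
        # Last merged peak that begins before the read ends
        i = bisect_left(starts, end) - 1
        if i >= 0 and ends[i] > start:
            in_peaks += 1
    return in_peaks / len(reads) if reads else 0.0


# Toy intervals, not real data
toy_peaks = [('chr1', 100, 200), ('chr1', 150, 300), ('chr2', 50, 60)]
toy_reads = [('chr1', 90, 110), ('chr1', 300, 350), ('chr1', 250, 260),
             ('chr2', 0, 50), ('chr3', 1, 2)]
print(f'FRiP: {frip(toy_reads, toy_peaks):.2f}')
```

Merging peaks first keeps the lookup to a single binary search per read, since merged intervals are disjoint and sorted by both start and end.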
## Mitochondrial Read Fraction
```bash
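# Mitochondrial reads (region query needs an indexed BAM; the contig is named "MT" in some references)
mt_reads=$(samtools view -c sample.bam chrM)
total_reads=$(samtools view -c sample.bam)
mt_frac=$(echo "scale=4; $mt_reads / $total_reads" | bc)
echo "Mitochondrial fraction: $mt_frac"
# Ideal: <20%, concerning: >50% (QC Thresholds below uses stricter, graded cutoffs)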
```
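An alternative that avoids scanning the BAM twice is to parse the output of `samtools idxstats`, which prints one tab-separated line per reference: name, length, mapped count, unmapped count. The sketch below assumes that text has already been captured into a string; the function name and the list of mitochondrial contig names are illustrative assumptions. Note that the denominator here is mapped reads, which differs slightly from the total used in the shell snippet.

```python
def mito_fraction_from_idxstats(idxstats_text, mito_names=('chrM', 'MT', 'chrMT', 'M')):
    """Compute mitochondrial fraction of mapped reads from `samtools idxstats` text."""
    mapped_total = 0
    mapped_mito = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split('\t')
        mapped_total += int(mapped)
        if name in mito_names:
            mapped_mito += int(mapped)
    return mapped_mito / mapped_total if mapped_total else 0.0


# Toy idxstats output, not real data
toy_idxstats = 'chr1\t1000\t800\t0\nchrM\t16569\t200\t0\n*\t0\t0\t50\n'
print(f'Mitochondrial fraction: {mito_fraction_from_idxstats(toy_idxstats):.4f}')
```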
## Library Complexity (NRF, PBC1, PBC2)
**Goal:** Measure library complexity to detect over-amplification or low-diversity libraries.
**Approach:** Calculate NRF (unique/total reads), PBC1 (1-read locations / all locations), and PBC2 (1-read / 2-read locations) using Picard or custom counting.
```bash
# Using Picard EstimateLibraryComplexity
java -jar picard.jar EstimateLibraryComplexity \
I=sample.bam \
O=complexity.txt
# Or calculate from BAM
# NRF = unique reads / total reads
# PBC1 = locations with exactly 1 read / locations with >= 1 read
# PBC2 = locations with exactly 1 read / locations with exactly 2 reads
```
```python
import pysam

def calculate_complexity(bam_file):
    '''Calculate library complexity metrics.

    Run on a BAM that still contains duplicates; after duplicate removal
    these metrics are no longer informative.
    '''
    positions = {}
    total = 0
    # fetch() without a region needs a coordinate-sorted, indexed BAM
    with pysam.AlignmentFile(bam_file, 'rb') as bam:
        for read in bam.fetch():
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            total += 1
            # Reads sharing chromosome, start, and strand count as one location
            pos = (read.reference_name, read.reference_start, read.is_reverse)
            positions[pos] = positions.get(pos, 0) + 1
    distinct = len(positions)
    m1 = sum(1 for v in positions.values() if v == 1)
    m2 = sum(1 for v in positions.values() if v == 2)
    nrf = distinct / total if total > 0 else 0
    pbc1 = m1 / distinct if distinct > 0 else 0
    pbc2 = m1 / m2 if m2 > 0 else 0
    return {'NRF': nrf, 'PBC1': pbc1, 'PBC2': pbc2}
```
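A worked toy example makes the three ratios concrete. The sketch below applies the NRF, PBC1, and PBC2 definitions to a plain list of read positions, with no BAM file involved; `complexity_from_positions` is a hypothetical helper and the positions are made up.

```python
from collections import Counter


def complexity_from_positions(positions):
    """Compute NRF, PBC1, and PBC2 from an iterable of hashable read locations."""
    counts = Counter(positions)
    total = sum(counts.values())
    distinct = len(counts)
    m1 = sum(1 for c in counts.values() if c == 1)  # locations seen exactly once
    m2 = sum(1 for c in counts.values() if c == 2)  # locations seen exactly twice
    return {
        'NRF': distinct / total if total else 0.0,
        'PBC1': m1 / distinct if distinct else 0.0,
        'PBC2': m1 / m2 if m2 else 0.0,
    }


# 10 reads: six singleton locations plus two locations hit twice each
toy_positions = [('chr1', p) for p in (10, 20, 30, 40, 50, 60)]
toy_positions += [('chr1', 70)] * 2 + [('chr1', 80)] * 2
print(complexity_from_positions(toy_positions))
```

Here 8 distinct locations out of 10 reads gives NRF = 0.8, 6 of the 8 locations are singletons (PBC1 = 0.75), and singletons outnumber doubletons 6 to 2 (PBC2 = 3.0).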
## deepTools QC
```bash
# Fingerprint plot (assesses enrichment)
plotFingerprint \
-b sample.bam \
--labels sample \
-o fingerprint.png
# Correlation between replicates
multiBamSummary bins \
-b sample1.bam sample2.bam \
-o results.npz
plotCorrelation \
-in results.npz \
--corMethod pearson \
--whatToPlot heatmap \
-o correlation.png
```
## ATACseqQC (R)
```r
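library(ATACseqQC)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# BAM must be coordinate-sorted and indexed
bamfile <- 'sample.bam'
# Fragment size distribution (the second argument is a plot label, so open a PDF device for output)
pdf('fragment_size.pdf')
fragSizeDist(bamfile, bamFiles.labels = 'sample')
dev.off()
# The scoring functions take a GAlignments object and a GRanges of transcripts
# Loading a whole-genome BAM this way can be memory-intensive
gal <- readBamFile(bamfile, asMates = FALSE)
txs <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
# TSS enrichment
tsse <- TSSEscore(gal, txs)
print(paste('TSS Enrichment:', round(tsse$TSSEscore, 2)))
# Nucleosome-free region score; verify against your installed ATACseqQC version
nfr <- NFRscore(gal, txs)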
```
## Comprehensive QC Report
**Goal:** Generate a single QC summary combining all major ATAC-seq quality metrics.
**Approach:** Run samtools and bedtools commands to collect total reads, mapping rate, mitochondrial fraction, FRiP, and peak count, then write a consolidated report.
```python
import shlex
import subprocess

def count(cmd):
    '''Run a shell pipeline that prints one integer and return it.'''
    return int(subprocess.check_output(cmd, shell=True).strip())

def atac_qc_report(bam_file, peaks_file, output_prefix):
    '''Generate comprehensive ATAC-seq QC report.'''
    bam, peaks = shlex.quote(bam_file), shlex.quote(peaks_file)
    metrics = {}
    # Total reads: primary records only (-F 2304 drops secondary and supplementary)
    metrics['total_reads'] = count(f'samtools view -c -F 2304 {bam}')
    # Mapped reads: additionally drop unmapped reads (-F 2308)
    metrics['mapped_reads'] = count(f'samtools view -c -F 2308 {bam}')
    metrics['mapping_rate'] = metrics['mapped_reads'] / metrics['total_reads']
    # Mitochondrial reads (needs an indexed BAM; the contig is "MT" in some references)
    metrics['mt_reads'] = count(f'samtools view -c -F 2308 {bam} chrM')
    metrics['mt_fraction'] = metrics['mt_reads'] / metrics['mapped_reads']
    # Reads in peaks (FRiP)
    metrics['reads_in_peaks'] = count(
        f'bedtools intersect -a {bam} -b {peaks} -u | samtools view -c -F 2308 -')
    metrics['frip'] = metrics['reads_in_peaks'] / metrics['mapped_reads']
    # Peak count
    metrics['peak_count'] = count(f'wc -l < {peaks}')
    # Write report
    with open(f'{output_prefix}_qc.txt', 'w') as f:
        for k, v in metrics.items():
            if isinstance(v, float):
                f.write(f'{k}: {v:.4f}\n')
            else:
                f.write(f'{k}: {v}\n')
    return metrics
```
## QC Thresholds
| Metric | Good | Acceptable | Poor |
|--------|------|------------|------|
| TSS Enrichment | >10 | 5-10 | <5 |
| FRiP | >0.3 | 0.1-0.3 | <0.1 |
| MT Fraction | <0.1 | 0.1-0.3 | >0.3 |
| NRF | >0.9 | 0.8-0.9 | <0.8 |
| PBC1 | >0.9 | 0.7-0.9 | <0.7 |
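The table can be applied programmatically when screening many samples. The following is a minimal sketch that encodes the cutoffs above; the `THRESHOLDS` layout, the metric keys, and the `classify` function are illustrative names, and values that fall exactly on a boundary are treated as "acceptable" by assumption.

```python
# metric: (good_cutoff, poor_cutoff, higher_is_better), taken from the table above
THRESHOLDS = {
    'tss_enrichment': (10, 5, True),
    'frip': (0.3, 0.1, True),
    'mt_fraction': (0.1, 0.3, False),
    'nrf': (0.9, 0.8, True),
    'pbc1': (0.9, 0.7, True),
}


def classify(metric, value):
    """Return 'good', 'acceptable', or 'poor' for one metric value."""
    good, poor, higher_is_better = THRESHOLDS[metric]
    if higher_is_better:
        if value > good:
            return 'good'
        if value < poor:
            return 'poor'
        return 'acceptable'
    if value < good:
        return 'good'
    if value > poor:
        return 'poor'
    return 'acceptable'


# Toy metric values, not real data
sample = {'tss_enrichment': 12.4, 'frip': 0.22, 'mt_fraction': 0.35, 'nrf': 0.85, 'pbc1': 0.95}
for metric, value in sample.items():
    print(f'{metric}: {value} -> {classify(metric, value)}')
```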
## Related Skills
- atac-seq/atac-peak-calling - Peak calling
- alignment-files/bam-statistics - Alignment QC
- chip-seq/chipseq-visualization - Visualization approaches