bio-cfdna-preprocessing
Preprocesses cell-free DNA sequencing data including adapter trimming, alignment optimized for short fragments, and UMI-aware duplicate removal using fgbio. Applies cfDNA-specific quality thresholds and fragment length filtering. Use when processing plasma cfDNA sequencing data before downstream analysis.
Best use case
bio-cfdna-preprocessing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using bio-cfdna-preprocessing can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/bio-cfdna-preprocessing/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
SKILL.md Source
## Version Compatibility
Reference examples tested with: BWA 0.7.17+, fgbio 2.1+, matplotlib 3.8+, numpy 1.26+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
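
As a minimal sketch of the version check described above, the helper below compares a version string printed by a tool against one of the minimums listed in this section. The function names `parse_version` and `meets_minimum` are hypothetical, and the parser assumes the version appears as dotted integers (for example `1.19.2`) somewhere in the text.

```python
import re


def parse_version(text):
    """Return the first dotted version number found in `text` as a tuple of ints."""
    match = re.search(r'(\d+(?:\.\d+)+)', text)
    if match is None:
        raise ValueError(f'no dotted version number found in {text!r}')
    return tuple(int(part) for part in match.group(1).split('.'))


def _pad(version, width):
    """Right-pad a version tuple with zeros so tuples of unequal length compare fairly."""
    return version + (0,) * (width - len(version))


def meets_minimum(installed, minimum):
    """True when `installed` is numerically at least `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    return _pad(a, width) >= _pad(b, width)


if __name__ == '__main__':
    # Strings here stand in for whatever `<tool> --version` printed.
    print(meets_minimum('samtools 1.19.2', '1.19'))
    print(meets_minimum('1.9', '1.19'))
```

Comparing integer tuples rather than raw strings avoids treating `1.9` as newer than `1.19`.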
# cfDNA Preprocessing
**"Preprocess my cfDNA sequencing data"** → Process cell-free DNA reads with UMI extraction, consensus calling, and error suppression for sensitive variant detection.
- CLI: `fgbio FastqToBam` → `fgbio GroupReadsByUmi` → `fgbio CallMolecularConsensusReads`
Preprocess cell-free DNA sequencing data with UMI-aware deduplication.
## Pre-Analytical Considerations
| Factor | Requirement | Rationale |
|--------|-------------|-----------|
| Collection tube | Streck (7 days) or EDTA (6 hrs) | Prevents cell lysis |
| Processing time | ASAP or per tube specs | Minimizes genomic DNA contamination |
| Hemolysis | Avoid | Releases cellular DNA |
| Storage | -80 °C after extraction | Prevents degradation |
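
The collection-tube row can be turned into a simple intake check. This is a sketch only: it reads the durations in the table (7 days for Streck, 6 hours for EDTA) as the maximum time from blood draw to plasma processing, which is an interpretation, and the names `MAX_HOURS_BY_TUBE` and `within_processing_window` are hypothetical.

```python
# Assumed limits in hours, taken from the table above and interpreted as
# the maximum time from blood draw to plasma processing.
MAX_HOURS_BY_TUBE = {
    'streck': 7 * 24,
    'edta': 6,
}


def within_processing_window(tube_type, hours_since_draw):
    """Return True if a sample was processed within the limit for its tube type."""
    key = tube_type.strip().lower()
    if key not in MAX_HOURS_BY_TUBE:
        raise ValueError(f'unknown tube type: {tube_type!r}')
    if hours_since_draw < 0:
        raise ValueError('hours_since_draw must be non-negative')
    return hours_since_draw <= MAX_HOURS_BY_TUBE[key]


if __name__ == '__main__':
    print(within_processing_window('EDTA', 4))
    print(within_processing_window('Streck', 200))
```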
## UMI-Aware Pipeline with fgbio
```bash
# Tested with fgbio 2.1+ (see Version Compatibility)
# Step 1: Extract UMIs from reads and annotate
fgbio ExtractUmisFromBam \
--input raw.bam \
--output with_umis.bam \
--read-structure 3M2S+T 3M2S+T \
--molecular-index-tags ZA ZB \
--single-tag RX
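# Step 2: Align with BWA-MEM
# bwa mem reads FASTQ, not BAM: stream interleaved FASTQ and align with -p,
# then use ZipperBams to copy the RX tag back from the unaligned BAM.
# -Y soft-clips supplementary alignments instead of hard-clipping them.
# ZipperBams expects a sequence dictionary (reference.dict) for reference.fa.
samtools fastq with_umis.bam | \
bwa mem -t 8 -p -Y reference.fa - | \
fgbio ZipperBams \
--unmapped with_umis.bam \
--ref reference.fa \
--output aligned.bam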
# Step 3: Group reads by UMI
fgbio GroupReadsByUmi \
--input aligned.bam \
--output grouped.bam \
--strategy adjacency \
--edits 1 \
--min-map-q 20
# Step 4: Call molecular consensus reads
fgbio CallMolecularConsensusReads \
--input grouped.bam \
--output consensus.bam \
--min-reads 2 \
--min-input-base-quality 20
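# Step 5: Filter consensus reads
# Note: consensus reads from Step 4 are unaligned; realign filtered_consensus.bam
# (same approach as Step 2) before variant calling or fragment size analysis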
fgbio FilterConsensusReads \
--input consensus.bam \
--output filtered_consensus.bam \
--ref reference.fa \
--min-reads 2 \
--max-read-error-rate 0.05 \
--min-base-quality 30
```
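
To make the `3M2S+T` read structure used in Step 1 concrete, here is a minimal sketch of how such a string splits one read into UMI, skipped, and template bases. It is an illustration, not fgbio's implementation: it handles only the `T`, `B`, `M`, and `S` operators, and the function names are hypothetical.

```python
import re

# One segment is a length (digits, or '+' meaning "rest of the read")
# followed by an operator: T=template, B=sample barcode, M=UMI, S=skip.
SEGMENT = re.compile(r'(\d+|\+)([TBMS])')


def parse_read_structure(structure):
    """Parse e.g. '3M2S+T' into [(3, 'M'), (2, 'S'), (None, 'T')]."""
    segments, pos = [], 0
    for match in SEGMENT.finditer(structure):
        if match.start() != pos:
            raise ValueError(f'unparseable read structure: {structure!r}')
        length, kind = match.groups()
        segments.append((None if length == '+' else int(length), kind))
        pos = match.end()
    if not segments or pos != len(structure):
        raise ValueError(f'unparseable read structure: {structure!r}')
    if any(length is None for length, _ in segments[:-1]):
        raise ValueError("'+' is only allowed on the last segment")
    return segments


def apply_read_structure(bases, structure):
    """Return (umi, template) for a single read's bases."""
    umi, template, offset = [], [], 0
    for length, kind in parse_read_structure(structure):
        end = len(bases) if length is None else offset + length
        piece = bases[offset:end]
        if kind == 'M':
            umi.append(piece)
        elif kind == 'T':
            template.append(piece)
        # 'S' and 'B' bases are dropped in this sketch
        offset = end
    return ''.join(umi), ''.join(template)


if __name__ == '__main__':
    print(apply_read_structure('ACGTTGGCCAAT', '3M2S+T'))
```

With `3M2S+T`, the first three bases become the UMI, the next two are skipped, and the remainder is kept as template sequence for alignment.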
## Python Implementation
**Goal:** Run the complete cfDNA UMI-consensus pipeline from raw BAM to error-suppressed consensus reads in a single Python function call.
**Approach:** Chain fgbio operations (UMI extraction, grouping, consensus calling, filtering) with BWA alignment, handling intermediate files and cleanup within the function.
```python
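import shlex
import subprocess
from pathlib import Path

import pysam


def preprocess_cfdna(input_bam, output_bam, reference, read_structure='3M2S+T 3M2S+T',
                     min_reads=2, min_base_quality=30, threads=8):
    '''
    Full cfDNA preprocessing pipeline with fgbio.

    Args:
        input_bam: Unaligned input BAM with UMIs inline in the reads
        output_bam: Output filtered consensus BAM (consensus reads are unaligned)
        reference: Reference FASTA path (BWA-indexed, with a sequence dictionary)
        read_structure: Space-separated read structures, one per read
        min_reads: Minimum reads per UMI group
        min_base_quality: Consensus bases below this quality are masked to N
        threads: CPU threads
    '''
    work_dir = Path(output_bam).parent
    prefix = Path(output_bam).stem

    # Extract UMIs; each read structure must be its own command-line argument
    with_umis = work_dir / f'{prefix}_umis.bam'
    subprocess.run([
        'fgbio', 'ExtractUmisFromBam',
        '--input', str(input_bam),
        '--output', str(with_umis),
        '--read-structure', *read_structure.split(),
        '--single-tag', 'RX'
    ], check=True)

    # Align: bwa mem needs FASTQ, so stream interleaved FASTQ from the unaligned
    # BAM and let ZipperBams copy the RX tag back onto the alignments
    aligned = work_dir / f'{prefix}_aligned.bam'
    cmd = (
        'set -o pipefail; '
        f'samtools fastq {shlex.quote(str(with_umis))} | '
        f'bwa mem -t {int(threads)} -p -Y {shlex.quote(str(reference))} - | '
        f'fgbio ZipperBams --unmapped {shlex.quote(str(with_umis))} '
        f'--ref {shlex.quote(str(reference))} --output {shlex.quote(str(aligned))}'
    )
    # bash is requested explicitly because pipefail is not available in every sh
    subprocess.run(cmd, shell=True, check=True, executable='/bin/bash')

    # Sort
    sorted_bam = work_dir / f'{prefix}_sorted.bam'
    pysam.sort('-@', str(threads), '-o', str(sorted_bam), str(aligned))

    # Group by UMI
    grouped = work_dir / f'{prefix}_grouped.bam'
    subprocess.run([
        'fgbio', 'GroupReadsByUmi',
        '--input', str(sorted_bam),
        '--output', str(grouped),
        '--strategy', 'adjacency',
        '--edits', '1'
    ], check=True)

    # Consensus calling
    consensus = work_dir / f'{prefix}_consensus.bam'
    subprocess.run([
        'fgbio', 'CallMolecularConsensusReads',
        '--input', str(grouped),
        '--output', str(consensus),
        '--min-reads', str(min_reads)
    ], check=True)

    # Filter consensus (FilterConsensusReads requires --min-base-quality)
    subprocess.run([
        'fgbio', 'FilterConsensusReads',
        '--input', str(consensus),
        '--output', str(output_bam),
        '--ref', str(reference),
        '--min-reads', str(min_reads),
        '--min-base-quality', str(min_base_quality)
    ], check=True)

    # Clean up intermediate files
    for tmp in (with_umis, aligned, sorted_bam, grouped, consensus):
        tmp.unlink(missing_ok=True)

    return output_bam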
```
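
The `--strategy adjacency --edits 1` options group UMIs that differ by at most one base. The sketch below illustrates the idea with a simplified single-linkage clustering on Hamming distance. It is not fgbio's algorithm, which is also count-aware and applied to reads sharing the same mapping position; the function names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError('UMIs must have the same length')
    return sum(x != y for x, y in zip(a, b))


def group_umis(umis, edits=1):
    """Cluster UMIs observed at one position; returns a list of UMI groups.

    Seeds are visited from most to least abundant, and any unassigned UMI
    within `edits` mismatches of a group member joins that group.
    """
    counts = Counter(umis)
    order = sorted(counts, key=lambda u: (-counts[u], u))
    assigned, groups = set(), []
    for seed in order:
        if seed in assigned:
            continue
        assigned.add(seed)
        group, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            for other in order:
                if other not in assigned and hamming(current, other) <= edits:
                    assigned.add(other)
                    group.append(other)
                    frontier.append(other)
        groups.append(group)
    return groups


if __name__ == '__main__':
    observed = ['ACGT'] * 5 + ['ACGA'] + ['TTTT'] * 2 + ['TTTA']
    print(group_umis(observed))
```

Here `ACGA` is treated as a sequencing error of the abundant `ACGT` and `TTTA` as an error of `TTTT`, so the nine reads collapse into two source molecules.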
## Fragment Size Analysis
```python
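import pysam
import numpy as np
import matplotlib.pyplot as plt


def analyze_fragment_sizes(bam_path, max_size=500, plot_path=None):
    '''Analyze cfDNA fragment size distribution from an aligned, paired-end BAM.'''
    sizes = []
    # until_eof=True reads the file sequentially, so no BAM index is required
    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for read in bam.fetch(until_eof=True):
            # template_length > 0 selects the leftmost read, counting each pair once
            if (read.is_proper_pair and not read.is_secondary
                    and not read.is_supplementary
                    and 0 < read.template_length <= max_size):
                sizes.append(read.template_length)

    # cfDNA signature: peak at ~167bp (mononucleosome)
    # Shorter fragments (90-150bp) enriched in ctDNA
    sizes = np.array(sizes, dtype=int)
    print(f'Fragments analyzed: {len(sizes)}')
    if len(sizes) == 0:
        return sizes  # median and mode are undefined for an empty array
    print(f'Median size: {np.median(sizes):.0f} bp')
    print(f'Mode: {np.bincount(sizes).argmax()} bp')

    # Optional histogram, written only when plot_path is given
    if plot_path is not None:
        fig, ax = plt.subplots()
        ax.hist(sizes, bins=np.arange(0, max_size + 2))
        ax.set_xlabel('Fragment size (bp)')
        ax.set_ylabel('Fragments')
        fig.savefig(plot_path)
        plt.close(fig)
    return sizes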
```
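
Building on the comments in the function above (mononucleosome peak near 167 bp, shorter 90-150 bp fragments enriched in ctDNA), the following sketch summarizes a list of fragment sizes. The name `summarize_fragment_sizes` and the returned keys are hypothetical, and the 90-150 bp window is taken from the comment rather than from a validated cutoff.

```python
from collections import Counter


def summarize_fragment_sizes(sizes, short_range=(90, 150)):
    """Return fragment count, modal size, and fraction of short fragments."""
    sizes = [int(s) for s in sizes]
    if not sizes:
        raise ValueError('no fragment sizes supplied')
    low, high = short_range
    n_short = sum(low <= s <= high for s in sizes)
    counts = Counter(sizes)
    # On ties, report the smallest of the most frequent sizes
    mode = min(counts, key=lambda s: (-counts[s], s))
    return {
        'n_fragments': len(sizes),
        'mode_bp': mode,
        'short_fraction': n_short / len(sizes),
    }


if __name__ == '__main__':
    example = [167, 167, 167, 145, 120, 200, 310, 166]
    print(summarize_fragment_sizes(example))
```

The function accepts any iterable of integers, so the array returned by `analyze_fragment_sizes` can be passed in directly.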
## Quality Thresholds
| Metric | Threshold | Notes |
|--------|-----------|-------|
| Modal fragment size | 150-180 bp | Peak ~167 bp indicates good cfDNA |
| UMI families >= 2 reads | > 50% | Sufficient for consensus |
| Mean base quality | >= 30 | After consensus |
| Mapping quality | >= 20 | Exclude multi-mappers |
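
A sketch of how the sample-level thresholds in this table could be checked programmatically. The function names are hypothetical; mapping quality is left out because it is applied as a per-read filter (for example `--min-map-q 20`) rather than reported as a sample metric.

```python
def fraction_families_at_least(family_sizes, min_reads=2):
    """Fraction of UMI families containing at least `min_reads` reads."""
    family_sizes = list(family_sizes)
    if not family_sizes:
        raise ValueError('no UMI families supplied')
    return sum(size >= min_reads for size in family_sizes) / len(family_sizes)


def check_cfdna_qc(modal_fragment_bp, frac_families_ge2, mean_base_quality):
    """Compare sample metrics with the thresholds in the table above.

    Returns a dict mapping each metric name to True (pass) or False (fail).
    """
    return {
        'modal_fragment_size': 150 <= modal_fragment_bp <= 180,
        'umi_families_ge2': frac_families_ge2 > 0.50,
        'mean_base_quality': mean_base_quality >= 30,
    }


if __name__ == '__main__':
    frac = fraction_families_at_least([1, 2, 3, 5, 1, 4])
    print(check_cfdna_qc(167, frac, 34.1))
```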
## Related Skills
- fragment-analysis - Analyze fragmentomics after preprocessing
- tumor-fraction-estimation - Estimate ctDNA from sWGS
- ctdna-mutation-detection - Detect mutations from panel data