bio-cfdna-preprocessing

Preprocesses cell-free DNA sequencing data including adapter trimming, alignment optimized for short fragments, and UMI-aware duplicate removal using fgbio. Applies cfDNA-specific quality thresholds and fragment length filtering. Use when processing plasma cfDNA sequencing data before downstream analysis.

1,802 stars

by FreedomIntelligence


Best use case

bio-cfdna-preprocessing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using bio-cfdna-preprocessing should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/bio-cfdna-preprocessing/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/bio-cfdna-preprocessing/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/bio-cfdna-preprocessing/SKILL.md inside your project.
3. Restart your AI agent so that it discovers the skill.

How bio-cfdna-preprocessing Compares

Feature / Agent	bio-cfdna-preprocessing	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

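What does this skill do?

It preprocesses plasma cfDNA sequencing data with fgbio: UMI extraction, alignment, UMI-aware read grouping, consensus calling, and consensus filtering, followed by fragment size analysis against cfDNA-specific quality thresholds.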

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

## Version Compatibility

Reference examples tested with: BWA 0.7.17+, fgbio 2.1+, matplotlib 3.8+, numpy 1.26+, pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
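
The checks above can be partly automated. The following is a minimal sketch, not part of the original skill: `check_environment` is a hypothetical helper that reports whether each CLI tool is on `PATH` and which version of each Python package is installed. It deliberately does not parse CLI version strings, because each tool prints them differently.

```python
import shutil
from importlib import metadata


def check_environment(cli_tools=('bwa', 'fgbio', 'samtools'),
                      packages=('pysam', 'numpy', 'matplotlib')):
    '''Report CLI tool locations and installed Python package versions.

    Hypothetical helper: a value of None means the tool or package was not found.
    '''
    report = {'cli': {}, 'python': {}}
    for tool in cli_tools:
        # shutil.which returns the executable path, or None if it is not on PATH
        report['cli'][tool] = shutil.which(tool)
    for package in packages:
        try:
            report['python'][package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            report['python'][package] = None
    return report
```

Compare the reported versions by hand against the list above; a missing entry (`None`) is a reason to stop before running any pipeline step.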

# cfDNA Preprocessing

**"Preprocess my cfDNA sequencing data"** → Process cell-free DNA reads with UMI extraction, consensus calling, and error suppression for sensitive variant detection.
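- CLI: `fgbio ExtractUmisFromBam` → `bwa mem` → `fgbio GroupReadsByUmi` → `fgbio CallMolecularConsensusReads` → `fgbio FilterConsensusReads` (start with `fgbio FastqToBam` instead when the input is FASTQ rather than an unmapped BAM)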

Preprocess cell-free DNA sequencing data with UMI-aware deduplication.

## Pre-Analytical Considerations

| Factor | Requirement | Rationale |
|--------|-------------|-----------|
| Collection tube | Streck (7 days) or EDTA (6 hrs) | Prevents cell lysis |
| Processing time | ASAP or per tube specs | Minimizes genomic DNA contamination |
| Hemolysis | Avoid | Releases cellular DNA |
| Storage | -80 °C after extraction | Prevents degradation |

## UMI-Aware Pipeline with fgbio

```bash
# Tested with fgbio 2.1+ (see Version Compatibility); confirm flags with `fgbio <Tool> --help`

# Step 1: Extract UMIs from reads and annotate
fgbio ExtractUmisFromBam \
    --input raw.bam \
    --output with_umis.bam \
    --read-structure 3M2S+T 3M2S+T \
    --molecular-index-tags ZA ZB \
    --single-tag RX

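# Step 2: Align with BWA-MEM
# bwa mem reads FASTQ, not BAM: stream the reads with samtools fastq and align
# them as interleaved pairs (-p). -Y soft-clips supplementary alignments instead
# of hard-clipping them; it does not carry UMI tags through the aligner.
# fgbio ZipperBams copies the tags (including RX) back from the unmapped BAM.
# ZipperBams needs reference.dict next to reference.fa (e.g. from `samtools dict`).
samtools fastq with_umis.bam | \
    bwa mem -t 8 -p -Y reference.fa - | \
    fgbio ZipperBams \
        --unmapped with_umis.bam \
        --ref reference.fa \
        --output aligned.bam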

# Step 3: Group reads by UMI
fgbio GroupReadsByUmi \
    --input aligned.bam \
    --output grouped.bam \
    --strategy adjacency \
    --edits 1 \
    --min-map-q 20

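# Step 4: Call molecular consensus reads
# The consensus reads are written as an unmapped BAM
fgbio CallMolecularConsensusReads \
    --input grouped.bam \
    --output consensus_unmapped.bam \
    --min-reads 2 \
    --min-input-base-quality 20

# Step 4b: Re-align the consensus reads and restore their tags
# --tags-to-reverse/--tags-to-revcomp keep per-base consensus tags in sync
# with reads that map to the negative strand
samtools fastq consensus_unmapped.bam | \
    bwa mem -t 8 -p -Y reference.fa - | \
    fgbio ZipperBams \
        --unmapped consensus_unmapped.bam \
        --ref reference.fa \
        --tags-to-reverse Consensus \
        --tags-to-revcomp Consensus \
        --output consensus.bam

# Step 5: Filter consensus reads
fgbio FilterConsensusReads \
    --input consensus.bam \
    --output filtered_consensus.bam \
    --ref reference.fa \
    --min-reads 2 \
    --max-read-error-rate 0.05 \
    --min-base-quality 30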
```
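
To make `--strategy adjacency --edits 1` in Step 3 more concrete, here is a simplified sketch of directional adjacency clustering for the UMIs observed at one genomic position. It is an illustration only, not fgbio's implementation: the function names are hypothetical, and the count rule (`parent >= 2 * child - 1`) is the one commonly described for directional UMI clustering, assumed here for the example.

```python
from collections import deque


def hamming(a, b):
    '''Number of mismatching positions between two equal-length strings.'''
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(umi_counts, edits=1):
    '''Group UMIs that likely derive from the same source molecule.

    umi_counts maps each UMI sequence to its read count. A UMI joins the
    cluster of a more abundant UMI when the two differ by at most `edits`
    mismatches and the count gradient supports a sequencing-error origin.
    '''
    # Visit the most abundant UMIs first; break ties alphabetically
    order = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    assigned = set()
    clusters = []
    for root in order:
        if root in assigned:
            continue
        assigned.add(root)
        cluster = [root]
        queue = deque([root])
        while queue:
            parent = queue.popleft()
            for child in order:
                if child in assigned or len(child) != len(parent):
                    continue
                close = hamming(parent, child) <= edits
                gradient = umi_counts[parent] >= 2 * umi_counts[child] - 1
                if close and gradient:
                    assigned.add(child)
                    cluster.append(child)
                    queue.append(child)
        clusters.append(cluster)
    return clusters
```

For example, with counts `{'ACGT': 10, 'ACGA': 2, 'ACGG': 9, 'TTTT': 5}`, `ACGA` is absorbed into the `ACGT` cluster, while `ACGG` stays separate because its count is too high to be explained as an error copy of `ACGT`.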

## Python Implementation

**Goal:** Run the complete cfDNA UMI-consensus pipeline from raw BAM to error-suppressed consensus reads in a single Python function call.

**Approach:** Chain fgbio operations (UMI extraction, grouping, consensus calling, filtering) with BWA alignment, writing the intermediate files next to the output BAM.

```python
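import shlex
import subprocess
import pysam
from pathlib import Path


def _align_unmapped_bam(unmapped_bam, aligned_bam, reference, threads, zipper_args=()):
    '''
    Align an unmapped BAM with bwa mem and restore its tags with fgbio ZipperBams.

    bwa mem reads FASTQ rather than BAM, so reads are streamed through
    samtools fastq. ZipperBams needs a sequence dictionary (.dict) for the reference.
    '''
    q = shlex.quote
    cmd = (
        f'samtools fastq {q(str(unmapped_bam))} | '
        f'bwa mem -t {int(threads)} -p -Y {q(str(reference))} - | '
        f'fgbio ZipperBams --unmapped {q(str(unmapped_bam))} '
        f'--ref {q(str(reference))} --output {q(str(aligned_bam))} '
        + ' '.join(q(arg) for arg in zipper_args)
    )
    # pipefail: a failure in any stage of the pipe raises CalledProcessError
    subprocess.run(['bash', '-o', 'pipefail', '-c', cmd], check=True)


def preprocess_cfdna(input_bam, output_bam, reference, read_structure='3M2S+T 3M2S+T',
                     min_reads=2, min_base_quality=30, threads=8):
    '''
    Full cfDNA preprocessing pipeline with fgbio.

    Args:
        input_bam: Input unmapped BAM with UMIs in reads
        output_bam: Output consensus BAM (coordinate-sorted and indexed)
        reference: Reference FASTA path (with bwa index and .dict)
        read_structure: UMI read structures, one per read, space-separated
        min_reads: Minimum reads per UMI group
        min_base_quality: Consensus bases below this quality are masked
        threads: CPU threads
    '''
    work_dir = Path(output_bam).parent
    prefix = Path(output_bam).stem

    # Extract UMIs; each read structure must be passed as its own argument
    with_umis = work_dir / f'{prefix}_umis.bam'
    subprocess.run([
        'fgbio', 'ExtractUmisFromBam',
        '--input', str(input_bam),
        '--output', str(with_umis),
        '--read-structure', *read_structure.split(),
        '--single-tag', 'RX'
    ], check=True)

    # Align raw reads
    aligned = work_dir / f'{prefix}_aligned.bam'
    _align_unmapped_bam(with_umis, aligned, reference, threads)

    # Group by UMI (GroupReadsByUmi handles the ordering it needs)
    grouped = work_dir / f'{prefix}_grouped.bam'
    subprocess.run([
        'fgbio', 'GroupReadsByUmi',
        '--input', str(aligned),
        '--output', str(grouped),
        '--strategy', 'adjacency',
        '--edits', '1'
    ], check=True)

    # Consensus calling (output is unmapped)
    consensus = work_dir / f'{prefix}_consensus_unmapped.bam'
    subprocess.run([
        'fgbio', 'CallMolecularConsensusReads',
        '--input', str(grouped),
        '--output', str(consensus),
        '--min-reads', str(min_reads)
    ], check=True)

    # Re-align consensus reads, keeping per-base consensus tags in sync
    consensus_aligned = work_dir / f'{prefix}_consensus_aligned.bam'
    _align_unmapped_bam(consensus, consensus_aligned, reference, threads,
                        zipper_args=('--tags-to-reverse', 'Consensus',
                                     '--tags-to-revcomp', 'Consensus'))

    # Filter consensus (--min-base-quality is a required argument)
    filtered = work_dir / f'{prefix}_filtered.bam'
    subprocess.run([
        'fgbio', 'FilterConsensusReads',
        '--input', str(consensus_aligned),
        '--output', str(filtered),
        '--ref', str(reference),
        '--min-reads', str(min_reads),
        '--min-base-quality', str(min_base_quality)
    ], check=True)

    # Coordinate-sort and index the final BAM for downstream tools
    pysam.sort('-@', str(threads), '-o', str(output_bam), str(filtered))
    pysam.index(str(output_bam))

    return output_bam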
```
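
To show what `min_reads` controls, the sketch below calls a consensus for one UMI family by per-position majority vote. This is a deliberately simplified stand-in: `fgbio CallMolecularConsensusReads` uses a base-quality-aware model, not a plain vote, and the function name and the tie-handling rule here are assumptions made for illustration.

```python
from collections import Counter


def majority_consensus(reads, min_reads=2, no_call='N'):
    '''Per-position majority vote across the reads of one UMI family.

    Returns None when the family has fewer than min_reads reads, mirroring
    how small families produce no consensus read. Ties become no_call.
    '''
    if not reads or len(reads) < min_reads:
        return None
    # Only vote over positions covered by every read in the family
    length = min(len(read) for read in reads)
    consensus = []
    for i in range(length):
        top = Counter(read[i] for read in reads).most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            consensus.append(no_call)  # no clear majority at this position
        else:
            consensus.append(top[0][0])
    return ''.join(consensus)
```

With three reads, a single discordant base is outvoted; with two reads that disagree, the position is masked, which is one reason larger families give more error suppression.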

## Fragment Size Analysis

```python
import pysam
import numpy as np
import matplotlib.pyplot as plt


def analyze_fragment_sizes(bam_path, max_size=500):
    '''Analyze cfDNA fragment size distribution.'''
    sizes = []

    # until_eof=True streams every read and does not require a BAM index
    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for read in bam.fetch(until_eof=True):
            # Count each fragment once: only the leftmost mate has a
            # positive template_length
            if (read.is_proper_pair and not read.is_secondary
                    and not read.is_supplementary
                    and 0 < read.template_length <= max_size):
                sizes.append(read.template_length)

    # cfDNA signature: peak at ~167bp (mononucleosome)
    # Shorter fragments (90-150bp) enriched in ctDNA
    sizes = np.array(sizes, dtype=int)

    print(f'Fragments analyzed: {len(sizes)}')
    if len(sizes) == 0:
        # Median and mode are undefined without any fragments
        return sizes
    print(f'Median size: {np.median(sizes):.0f} bp')
    print(f'Mode: {np.bincount(sizes).argmax()} bp')

    return sizes
```
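
The comment in the function above notes that 90-150 bp fragments are enriched in ctDNA. As a minimal follow-up sketch, the hypothetical helper below computes the fraction of fragments in that range from the returned sizes. The name and the inclusive bounds are assumptions, and it uses only the standard library so it works on a list or a NumPy array.

```python
def short_fragment_fraction(sizes, short_range=(90, 150)):
    '''Fraction of fragments whose length lies in short_range (inclusive).

    Returns 0.0 for an empty input instead of dividing by zero.
    '''
    sizes = list(sizes)
    if not sizes:
        return 0.0
    low, high = short_range
    n_short = sum(1 for size in sizes if low <= size <= high)
    return n_short / len(sizes)
```

For example, `short_fragment_fraction(analyze_fragment_sizes('filtered_consensus.bam'))` gives a single number that can be tracked across samples; the file name is only a placeholder taken from the pipeline above.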

## Quality Thresholds

| Metric | Threshold | Notes |
|--------|-----------|-------|
| Modal fragment size | 150-180 bp | Peak ~167 bp indicates good cfDNA |
| UMI families >= 2 reads | > 50% | Sufficient for consensus |
| Mean base quality | >= 30 | After consensus |
| Mapping quality | >= 20 | Exclude multi-mappers |
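
The table can be applied programmatically once the metrics are computed. The sketch below is one possible encoding, with a hypothetical function name and argument names. Mapping quality is left out because it is applied as a read filter during grouping (`--min-map-q 20`), not measured as a summary metric.

```python
def evaluate_qc(modal_size_bp, family_fraction_ge2, mean_base_quality):
    '''Compare summary metrics with the thresholds in the table above.

    family_fraction_ge2 is the fraction (0-1) of UMI families with at
    least 2 reads. Returns a dict of metric name to pass/fail.
    '''
    return {
        'modal_fragment_size': 150 <= modal_size_bp <= 180,
        'umi_families_ge2': family_fraction_ge2 > 0.50,
        'mean_base_quality': mean_base_quality >= 30,
    }
```

A sample passes overall when `all(evaluate_qc(...).values())` is true; a failing entry points to the pipeline stage to inspect first.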

## Related Skills

- fragment-analysis - Analyze fragmentomics after preprocessing
- tumor-fraction-estimation - Estimate ctDNA from sWGS
- ctdna-mutation-detection - Detect mutations from panel data
