bio-read-qc-umi-processing

Extract, process, and deduplicate reads using Unique Molecular Identifiers (UMIs) with umi_tools. Use when library prep includes UMIs and accurate molecule counting is needed, such as in single-cell RNA-seq, low-input RNA-seq, or targeted sequencing to distinguish PCR from biological duplicates.

1,802 stars

byFreedomIntelligence

View on GitHub Installation ↓

Best use case

bio-read-qc-umi-processing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using bio-read-qc-umi-processing should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/bio-read-qc-umi-processing/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/bio-read-qc-umi-processing/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/bio-read-qc-umi-processing/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How bio-read-qc-umi-processing Compares

Feature / Agent	bio-read-qc-umi-processing	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

Top AI Agents for Productivity

See the top AI agent skills for productivity, workflow automation, operational systems, documentation, and everyday task execution.

SKILL.md Source

## Version Compatibility

Reference examples tested with: pandas 2.2+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# UMI Processing

**"Deduplicate reads using UMIs"** → Extract UMI barcodes, group reads by UMI+position, and collapse PCR duplicates to count unique molecules.
- CLI: `umi_tools extract` + `umi_tools dedup` (UMI-tools)
- CLI: `fgbio GroupReadsByUmi` + `fgbio CallMolecularConsensusReads`

UMIs (Unique Molecular Identifiers) are short random sequences added during library preparation to tag individual molecules before PCR amplification. This enables accurate PCR duplicate removal and molecule counting.

## UMI Workflow Overview

```
Raw FASTQ with UMIs
    |
    v
[umi_tools extract] --> Move UMI to read header
    |
    v
[Alignment] --> bwa/STAR/bowtie2
    |
    v
[umi_tools dedup] --> Remove PCR duplicates based on UMI + position
    |
    v
Deduplicated BAM
```

## Extract UMIs from Reads

### UMI in Read Sequence

```bash
# UMI at start of R1 (8bp UMI)
umi_tools extract \
    --stdin=R1.fastq.gz \
    --read2-in=R2.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --read2-out=R2_extracted.fastq.gz \
    --bc-pattern=NNNNNNNN

# UMI at start of R2
umi_tools extract \
    --stdin=R1.fastq.gz \
    --read2-in=R2.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --read2-out=R2_extracted.fastq.gz \
    --bc-pattern2=NNNNNNNN

# UMI in both reads
umi_tools extract \
    --stdin=R1.fastq.gz \
    --read2-in=R2.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --read2-out=R2_extracted.fastq.gz \
    --bc-pattern=NNNNNNNN \
    --bc-pattern2=NNNNNNNN
```

### UMI Pattern Syntax

| Pattern | Meaning |
|---------|---------|
| `N` | UMI base (extracted) |
| `C` | Cell barcode (extracted, kept separate) |
| `X` | Discard base |
| `NNNNNNNN` | 8bp UMI |
| `CCCCCCCCNNNNNNNN` | 8bp cell barcode + 8bp UMI |
| `NNNXXXNNN` | 3bp UMI, skip 3bp, 3bp UMI |

### Complex Patterns

```bash
# 10X Genomics 3' v3 (16bp cell barcode + 12bp UMI in R1)
umi_tools extract \
    --stdin=R1.fastq.gz \
    --read2-in=R2.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --read2-out=R2_extracted.fastq.gz \
    --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNNNN

# Skip bases between barcode and UMI
umi_tools extract \
    --stdin=R1.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --bc-pattern=NNNNNNNNXXXX  # 8bp UMI, skip 4bp

# Fixed anchor sequence
umi_tools extract \
    --stdin=R1.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --bc-pattern='(?P<umi_1>.{8})ATGC(?P<discard_1>.{4})'
```

### UMI in Separate Index Read

```bash
# UMI in I1 index read
umi_tools extract \
    --stdin=R1.fastq.gz \
    --read2-in=R2.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --read2-out=R2_extracted.fastq.gz \
    --bc-pattern=NNNNNNNN \
    --extract-method=string \
    --umi-separator=":"
```

## Quality Filtering During Extraction

```bash
# Filter by UMI quality
umi_tools extract \
    --stdin=R1.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --bc-pattern=NNNNNNNN \
    --quality-filter-threshold=20 \
    --quality-encoding=phred33

# Filter UMIs with N bases
umi_tools extract \
    --stdin=R1.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --bc-pattern=NNNNNNNN \
    --filter-cell-barcode
```

## Deduplication

### Basic Deduplication

```bash
# Must be sorted and indexed first
samtools sort -o aligned_sorted.bam aligned.bam
samtools index aligned_sorted.bam

# Deduplicate
umi_tools dedup \
    --stdin=aligned_sorted.bam \
    --stdout=deduplicated.bam \
    --log=dedup.log
```

### Deduplication Methods

```bash
# Default: directional (recommended for most cases)
umi_tools dedup -I input.bam -S output.bam --method=directional

# Unique: only exact UMI matches (most stringent)
umi_tools dedup -I input.bam -S output.bam --method=unique

# Cluster: network-based clustering
umi_tools dedup -I input.bam -S output.bam --method=cluster

# Adjacency: cluster with adjacency
umi_tools dedup -I input.bam -S output.bam --method=adjacency

# Percentile: for highly duplicated data
umi_tools dedup -I input.bam -S output.bam --method=percentile
```

### Method Selection Guide

| Method | Use Case | Speed |
|--------|----------|-------|
| `directional` | Standard RNA-seq, most cases | Fast |
| `unique` | Very high diversity, PCR-free | Fastest |
| `cluster` | Low diversity, high errors | Slow |
| `adjacency` | Balance of accuracy/speed | Medium |
| `percentile` | Extremely high duplication | Fast |

### Paired-End Deduplication

```bash
# Paired-end mode
umi_tools dedup \
    -I aligned_sorted.bam \
    -S deduplicated.bam \
    --paired

# Use read2 for grouping (for R2-based libraries)
umi_tools dedup \
    -I aligned_sorted.bam \
    -S deduplicated.bam \
    --paired \
    --read2-in-read1
```

### Gene-Level Deduplication

```bash
# Deduplicate per gene (for RNA-seq)
umi_tools dedup \
    -I aligned_sorted.bam \
    -S deduplicated.bam \
    --per-gene \
    --gene-tag=GX

# With GTF file for gene assignment
umi_tools dedup \
    -I aligned_sorted.bam \
    -S deduplicated.bam \
    --per-gene \
    --per-cell \
    --gene-tag=XT
```

## UMI Counting

### Count UMIs per Gene

```bash
# Count table (gene x cell for single-cell)
umi_tools count \
    -I deduplicated.bam \
    -S counts.tsv \
    --per-gene \
    --gene-tag=GX \
    --per-cell \
    --cell-tag=CB

# Wide format (matrix)
umi_tools count \
    -I deduplicated.bam \
    -S counts.tsv \
    --per-gene \
    --gene-tag=GX \
    --wide-format-cell-counts
```

### Count Table Format

```
gene    cell    count
ENSG00000139618    ACGT    15
ENSG00000139618    TGCA    8
ENSG00000141510    ACGT    42
```

## Group UMIs Without Deduplication

```bash
# Add UMI group tag to BAM (BX tag)
umi_tools group \
    -I aligned_sorted.bam \
    -S grouped.bam \
    --group-out=groups.tsv \
    --output-bam
```

## Complete Workflows

### Standard RNA-seq with UMIs

**Goal:** Process UMI-tagged RNA-seq reads from raw FASTQ through alignment to a deduplicated BAM file.

**Approach:** Extract UMIs from read headers with umi_tools, align with STAR, then deduplicate based on UMI + mapping position to produce a PCR-artifact-free BAM.

```bash
#!/bin/bash
set -euo pipefail

SAMPLE=$1
REFERENCE=$2

# 1. Extract UMIs (8bp at start of R1)
umi_tools extract \
    --stdin=${SAMPLE}_R1.fastq.gz \
    --read2-in=${SAMPLE}_R2.fastq.gz \
    --stdout=${SAMPLE}_R1_umi.fastq.gz \
    --read2-out=${SAMPLE}_R2_umi.fastq.gz \
    --bc-pattern=NNNNNNNN

# 2. Align with STAR
STAR --runThreadN 8 \
    --genomeDir $REFERENCE \
    --readFilesIn ${SAMPLE}_R1_umi.fastq.gz ${SAMPLE}_R2_umi.fastq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix ${SAMPLE}_ \
    --outSAMtype BAM SortedByCoordinate

# 3. Index
samtools index ${SAMPLE}_Aligned.sortedByCoord.out.bam

# 4. Deduplicate
umi_tools dedup \
    -I ${SAMPLE}_Aligned.sortedByCoord.out.bam \
    -S ${SAMPLE}_deduplicated.bam \
    --output-stats=${SAMPLE}_dedup_stats \
    --paired

# 5. Index deduplicated BAM
samtools index ${SAMPLE}_deduplicated.bam

echo "Done: ${SAMPLE}_deduplicated.bam"
```

### Single-Cell Workflow (Post-CellRanger)

```bash
# CellRanger output has CB (cell barcode) and UB (UMI) tags
# Deduplicate per cell per gene
umi_tools dedup \
    -I possorted_genome_bam.bam \
    -S deduplicated.bam \
    --per-cell \
    --cell-tag=CB \
    --umi-tag=UB \
    --extract-umi-method=tag \
    --per-gene \
    --gene-tag=GX
```

## Statistics and QC

### Deduplication Stats

```bash
# Generate stats file
umi_tools dedup \
    -I input.bam \
    -S output.bam \
    --output-stats=dedup_stats

# Output files:
# dedup_stats_per_umi_per_position.tsv
# dedup_stats_per_umi.tsv
# dedup_stats_edit_distance.tsv
```

### Interpret Deduplication Rate

```python
import pandas as pd

stats = pd.read_csv('dedup.log', sep='\t', comment='#')
total_reads = stats['total_reads'].iloc[0]
unique_reads = stats['unique_reads'].iloc[0]
dedup_rate = 1 - (unique_reads / total_reads)
print(f'Deduplication rate: {dedup_rate:.1%}')
```

## Performance Tips

```bash
# Increase speed with multiple cores (dedup only)
umi_tools dedup -I input.bam -S output.bam --parallel

# Reduce memory for large files
umi_tools dedup -I input.bam -S output.bam --buffer-whole-contig

# Skip statistics for speed
umi_tools dedup -I input.bam -S output.bam --no-sort-output
```

## Alternative: fastp UMI Handling

For simple UMI extraction during QC:

```bash
# Extract 8bp UMI from R1 to header
fastp -i R1.fq.gz -I R2.fq.gz \
      -o R1_umi.fq.gz -O R2_umi.fq.gz \
      --umi --umi_loc read1 --umi_len 8
```

Note: fastp extracts UMIs but doesn't deduplicate - use umi_tools dedup after alignment.

## Related Skills

- fastp-workflow - Simple UMI extraction during preprocessing
- quality-filtering - QC before UMI extraction
- alignment-files/sam-bam-basics - BAM sorting/indexing required before dedup
- single-cell/preprocessing - scRNA-seq workflows use UMI counting

Related Skills

tcga-bulk-data-preprocessing-with-omicverse

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Guide Claude through ingesting TCGA sample sheets, expression archives, and clinical carts into omicverse, initialising survival metadata, and exporting annotated AnnData files.

single-cell-preprocessing-with-omicverse

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Walk through omicverse's single-cell preprocessing tutorials to QC PBMC3k data, normalise counts, detect HVGs, and run PCA/embedding pipelines on CPU, CPU–GPU mixed, or GPU stacks.

post-processing

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Extract, analyze, and visualize simulation output data. Use for field extraction, time series analysis, line profiles, statistical summaries, derived quantity computation, result comparison to references, and automated report generation from simulation results.

pdf-processing

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.

pdf-processing-pro

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Production-ready PDF processing with forms, tables, OCR, validation, and batch operations. Use when working with complex PDF workflows in production environments, processing large volumes of PDFs, or requiring robust error handling and validation.

bio-spatial-transcriptomics-spatial-preprocessing

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Quality control, filtering, normalization, and feature selection for spatial transcriptomics data. Calculate QC metrics, filter spots/cells, normalize counts, and identify highly variable genes. Use when filtering and normalizing spatial transcriptomics data.

bio-single-cell-preprocessing

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Quality control, filtering, and normalization for single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for calculating QC metrics, filtering cells and genes, normalizing counts, identifying highly variable genes, and scaling data. Use when filtering, normalizing, and selecting features in single-cell data.

bio-ribo-seq-riboseq-preprocessing

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Preprocess ribosome profiling data including adapter trimming, size selection, rRNA removal, and alignment. Use when preparing Ribo-seq reads for downstream analysis of translation.

bio-read-sequences

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Read biological sequence files (FASTA, FASTQ, GenBank, EMBL, ABI, SFF) using Biopython Bio.SeqIO. Use when parsing sequence files, iterating multi-sequence files, random access to large files, or high-performance parsing.

bio-read-qc-quality-reports

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Generate and interpret quality reports from FASTQ files using FastQC and MultiQC. Assess per-base quality, adapter content, GC bias, duplication levels, and overrepresented sequences. Use when performing initial QC on raw sequencing data or validating preprocessing results.

bio-read-qc-quality-filtering

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Filter reads by quality scores, length, and N content using Trimmomatic and fastp. Apply sliding window trimming, remove low-quality bases from read ends, and discard reads below thresholds. Use when reads have poor quality tails or require minimum quality for downstream analysis.

bio-read-qc-fastp-workflow

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

All-in-one read preprocessing with fastp including adapter trimming, quality filtering, deduplication, base correction, and HTML report generation. Use when preprocessing Illumina data and wanting a single fast tool instead of separate Cutadapt, Trimmomatic, and FastQC steps.