bio-read-qc-contamination-screening

Detect sample contamination and cross-species reads using FastQ Screen. Screen reads against multiple reference genomes to identify bacterial, viral, adapter, or sample swap contamination. Use when suspecting cross-contamination or working with samples prone to microbial contamination.

1,802 stars

by FreedomIntelligence

Best use case

bio-read-qc-contamination-screening is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using bio-read-qc-contamination-screening should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/bio-read-qc-contamination-screening/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/bio-read-qc-contamination-screening/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/bio-read-qc-contamination-screening/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How bio-read-qc-contamination-screening Compares

Feature / Agent	bio-read-qc-contamination-screening	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

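What does this skill do?

It screens FASTQ reads against multiple reference genomes with FastQ Screen to detect bacterial, viral, adapter, or cross-species contamination.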

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

## Version Compatibility

Reference examples tested with: BBTools 39.0+, Bowtie2 2.5.3+, FastQ Screen 0.15+, FastQC 0.12+, MultiQC 1.21+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Contamination Screening

Screen FASTQ files against multiple genomes to identify contamination sources using FastQ Screen.

**"Check for contamination in sequencing data"** → Align a sample of reads against multiple reference genomes to identify cross-species or cross-sample contamination.
- CLI: `fastq_screen --conf fastq_screen.conf reads.fq`

## FastQ Screen Overview

FastQ Screen aligns a subset of reads against multiple reference genomes to identify:
- Cross-species contamination
- Bacterial/viral contamination
- Adapter sequences
- PhiX spike-in
- Sample swaps

## Basic Usage

```bash
# Screen against configured genomes
fastq_screen sample.fastq.gz

# Multiple files
fastq_screen *.fastq.gz

# Specify output directory
fastq_screen --outdir qc_results/ sample.fastq.gz

# Custom config file
fastq_screen --conf my_screen.conf sample.fastq.gz
```

## Configuration File

Create `fastq_screen.conf`:

```
# Database locations
DATABASE	Human	/path/to/human/genome
DATABASE	Mouse	/path/to/mouse/genome
DATABASE	Ecoli	/path/to/ecoli/genome
DATABASE	PhiX	/path/to/phix/genome
DATABASE	Adapters	/path/to/adapters
DATABASE	rRNA	/path/to/rrna

# Aligner (bowtie2 recommended)
BOWTIE2	/path/to/bowtie2

# Or use BWA
# BWA	/path/to/bwa

# Threads
THREADS	8
```
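
Before running a screen, it can help to confirm that every `DATABASE` entry points at an index that actually exists. The following is a minimal sketch, assuming Bowtie2 indices (files named `<basename>.1.bt2` or `<basename>.1.bt2l`) and whitespace-separated config fields. `check_screen_conf` is a hypothetical helper name, not part of FastQ Screen.

```bash
# Hypothetical helper: report DATABASE entries whose Bowtie2 index is missing.
# Assumes each DATABASE path is an index basename (no .bt2 extension).
check_screen_conf() {
    local conf="$1" status=0 key name path
    # The "|| [ -n "$key" ]" part also handles a last line without a trailing newline.
    while read -r key name path _ || [ -n "$key" ]; do
        [ "$key" = "DATABASE" ] || continue
        if [ -e "${path}.1.bt2" ] || [ -e "${path}.1.bt2l" ]; then
            echo "OK       $name"
        else
            echo "MISSING  $name ($path)"
            status=1
        fi
    done < "$conf"
    return "$status"
}
```

Calling `check_screen_conf fastq_screen.conf` prints one line per database and returns a non-zero status if any index is missing, so it can gate a pipeline step.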

### Pre-built Databases

```bash
# Download common screening databases
fastq_screen --get_genomes

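# Downloads pre-built Bowtie2 indices into a FastQ_Screen_Genomes/ directory
# (created in the current directory, or in --outdir if given), together with a ready-made fastq_screen.conf
# Includes: Human, Mouse, Rat, E. coli, PhiX, Adapters, and others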
```

## Screening Options

```bash
# Number of reads to sample (default 100000)
fastq_screen --subset 200000 sample.fastq.gz

# Use all reads (slow)
fastq_screen --subset 0 sample.fastq.gz

# Set threads
fastq_screen --threads 8 sample.fastq.gz

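# Paired-end: screening R1 alone is usually sufficient
fastq_screen sample_R1.fastq.gz

# To check both mates, pass both files; each is screened as a separate single-end file
fastq_screen sample_R1.fastq.gz sample_R2.fastq.gz

# Older releases offered --paired; recent releases may no longer accept it,
# so confirm with `fastq_screen --help` before relying on it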
```

## Output Options

```bash
# Generate PNG plot (default)
fastq_screen sample.fastq.gz

# No plot (text only)
fastq_screen --nograph sample.fastq.gz

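# Tag each read header with the genomes it did or did not map to (writes a tagged FASTQ)
fastq_screen --tag sample.fastq.gz

# Filter reads by mapping status: one character per DATABASE entry, in config order
# (six characters for the six-database config above); here, keep reads unmapped to all genomes
fastq_screen --tag --filter 000000 sample.fastq.gz

# Keep only reads mapping uniquely to the first genome (e.g., Human), ignoring the rest
fastq_screen --tag --filter 1----- sample.fastq.gz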
```

## Filter Codes

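Use `--filter` to select reads based on mapping status. The filter string has one character per `DATABASE` entry, in the order the entries appear in the config file: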

| Code | Meaning |
|------|---------|
| 0 | Did not map to genome |
| 1 | Mapped uniquely |
| 2 | Mapped more than once |
| 3 | Mapped (unique or multi) |
| - | Ignore this genome |

```bash
# Example: keep reads mapping uniquely to Human (first genome) and to nothing else
# Human: 1, the other five databases in the config above: 0
fastq_screen --tag --filter 100000 sample.fastq.gz

# Keep reads that map to none of the screened genomes
fastq_screen --tag --filter 000000 sample.fastq.gz
```
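
Because the filter string must have exactly one character per `DATABASE` entry, it is easy to get the length wrong when the config changes. The sketch below derives the string from the config instead of typing it by hand. `make_filter` is a hypothetical helper, and it assumes active `DATABASE` lines start in the first column with the database name as the second field.

```bash
# Hypothetical helper: build a --filter string with one character per DATABASE line.
# Usage: make_filter <conf> <genome_name> [code_for_genome] [code_for_others]
# Defaults: "3" (mapped, unique or multi) for the named genome, "-" (ignore) for the rest.
make_filter() {
    local conf="$1" target="$2" hit="${3:-3}" other="${4:--}"
    awk -v target="$target" -v hit="$hit" -v other="$other" '
        $1 == "DATABASE" {
            out = out ($2 == target ? hit : other)
            if ($2 == target) found = 1
        }
        # Fail without printing if the named genome is not in the config.
        END { if (!found) exit 1; print out }
    ' "$conf"
}
```

With the six-database config shown earlier, `make_filter fastq_screen.conf Human` yields `3-----`, which can be passed as `--filter "$(make_filter fastq_screen.conf Human)"`.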

## Output Files

| File | Description |
|------|-------------|
| `*_screen.txt` | Tab-delimited results |
| `*_screen.png` | Visualization |
| `*_screen.html` | HTML report |

### Results Format

```
#Fastq_screen version: 0.15.3
Genome	#Reads_processed	#Unmapped	%Unmapped	#One_hit_one_genome	%One_hit_one_genome	#Multiple_hits_one_genome	%Multiple_hits_one_genome	#One_hit_multiple_genomes	%One_hit_multiple_genomes	Multiple_hits_multiple_genomes	%Multiple_hits_multiple_genomes
Human	100000	2000	2.00	95000	95.00	1000	1.00	1500	1.50	500	0.50
Mouse	100000	98000	98.00	100	0.10	50	0.05	1500	1.50	350	0.35
```
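
The results table can be summarised per genome with a few lines of awk. This is a minimal sketch, assuming the tab-delimited layout shown above (genome name in column 1, `%Unmapped` in column 4) and that any footer line begins with `%`. `screen_mapped_pct` is a hypothetical helper name.

```bash
# Hypothetical helper: print "<genome><TAB><percent mapped>" for each genome row.
# Skips the version line, the column header, blank lines, and any footer starting with "%".
screen_mapped_pct() {
    awk -F'\t' '
        /^#/ || /^%/ || $1 == "Genome" || NF < 4 { next }
        { printf "%s\t%.2f\n", $1, 100 - $4 }
    ' "$1"
}
```

Percentages can sum to more than 100 across genomes, because a read that hits several genomes is counted as mapped for each of them.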

## Interpreting Results

### Expected Results by Sample Type

| Sample Type | Expected Pattern |
|-------------|------------------|
| Human sample | >90% Human, <1% others |
| Mouse sample | >90% Mouse, <1% others |
| Human + PhiX | >80% Human, ~10% PhiX |
| Contaminated | Significant % to unexpected genome |

### Common Issues

| Pattern | Likely Cause |
|---------|--------------|
| High adapter % | Library prep issue |
| High PhiX % | Spike-in not removed |
| High E.coli % | Bacterial contamination |
| High rRNA % | rRNA depletion failed |
| Multiple species | Sample swap or contamination |
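
The patterns above can be checked automatically. The sketch below flags any genome other than the expected one whose genome-specific hits exceed a threshold. It is one possible heuristic, not a FastQ Screen feature: `flag_contaminants` is a hypothetical helper, the 1% default mirrors the "<1% others" guideline in the table above, and it deliberately uses only `%One_hit_one_genome` plus `%Multiple_hits_one_genome` (columns 6 and 8) so that conserved reads hitting several genomes do not trigger a false alarm.

```bash
# Hypothetical helper: list unexpected genomes whose genome-specific hit rate
# (%One_hit_one_genome + %Multiple_hits_one_genome) exceeds a threshold.
# Usage: flag_contaminants <sample_screen.txt> <expected_genome> [threshold_percent]
flag_contaminants() {
    local file="$1" expected="$2" threshold="${3:-1}"
    awk -F'\t' -v expected="$expected" -v threshold="$threshold" '
        /^#/ || /^%/ || $1 == "Genome" || NF < 8 { next }
        { specific = $6 + $8 }
        $1 != expected && specific > threshold { printf "%s\t%.2f\n", $1, specific }
    ' "$file"
}
```

An empty result means no unexpected genome crossed the threshold. For example, `flag_contaminants sample_screen.txt Human 1` would list only genomes with more than 1% of reads mapping specifically to them.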

## MultiQC Integration

FastQ Screen results are automatically detected by MultiQC:

```bash
# Screen all samples
for f in *.fastq.gz; do
    fastq_screen --outdir screen_results/ "$f"
done

# Aggregate with MultiQC
multiqc screen_results/
```
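
When re-running the loop above on a growing directory, it can be useful to skip samples that already have a report. The sketch below guesses the report name for an input file. It assumes the report is named `<input without .gz/.fastq/.fq>_screen.txt`, consistent with the `*_screen.txt` output listed earlier. `screen_report_name` is a hypothetical helper, and the naming rule should be checked against your installed version.

```bash
# Hypothetical helper: guess the report name FastQ Screen writes for an input file.
# Assumption: "<input basename without .gz/.fastq/.fq>_screen.txt".
screen_report_name() {
    local base
    base=$(basename "$1")
    base=${base%.gz}      # drop compression suffix, if present
    base=${base%.fastq}   # drop .fastq, if present
    base=${base%.fq}      # drop .fq, if present
    echo "${base}_screen.txt"
}
```

Inside the loop, a guard such as `[ -e "screen_results/$(screen_report_name "$f")" ] && continue` placed before the `fastq_screen` call would skip samples that were already screened.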

## Custom Database Setup

### Create Bowtie2 Index

```bash
# Index a FASTA file
bowtie2-build reference.fa reference

# Add to config
# DATABASE	MyGenome	/path/to/reference
```

### Common Databases to Include

| Genome | Purpose |
|--------|---------|
| Human (GRCh38) | Human samples |
| Mouse (GRCm39) | Mouse samples |
| E. coli | Bacterial contamination |
| PhiX | Illumina spike-in |
| Adapters | Library prep |
| rRNA | Ribosomal RNA |
| Vectors | Cloning vectors |
| Mycoplasma | Cell culture contamination |

## Example Workflows

### Standard Screening

```bash
# Download databases
fastq_screen --get_genomes

# Screen samples
fastq_screen --outdir screen_results/ --threads 8 *.fastq.gz

# Check results
multiqc screen_results/
```

### Remove Contamination

```bash
# Screen and tag reads
fastq_screen --tag sample.fastq.gz

# Filter to keep only Human reads (assuming Human is the first of six databases;
# the filter string needs one character per DATABASE entry in your config)
fastq_screen --filter 3----- --tag sample.fastq.gz

# Or use BBDuk for removal
bbduk.sh in=sample.fastq.gz out=clean.fastq.gz \
    ref=contaminants.fa k=31 hdist=1
```

## Related Skills

- quality-reports - FastQC shows overrepresented sequences
- adapter-trimming - Remove adapter contamination
- metagenomics/kraken-classification - Deeper taxonomic analysis

Related Skills

bio-virtual-screening

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Performs structure-based virtual screening using AutoDock Vina 1.2 for molecular docking. Prepares receptor PDBQT files, generates ligand conformers, defines binding site boxes, and ranks compounds by predicted binding affinity. Use when screening chemical libraries against a protein structure to find potential binders.

bio-read-sequences

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Read biological sequence files (FASTA, FASTQ, GenBank, EMBL, ABI, SFF) using Biopython Bio.SeqIO. Use when parsing sequence files, iterating multi-sequence files, random access to large files, or high-performance parsing.

bio-read-qc-umi-processing

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Extract, process, and deduplicate reads using Unique Molecular Identifiers (UMIs) with umi_tools. Use when library prep includes UMIs and accurate molecule counting is needed, such as in single-cell RNA-seq, low-input RNA-seq, or targeted sequencing to distinguish PCR from biological duplicates.

bio-read-qc-quality-reports

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Generate and interpret quality reports from FASTQ files using FastQC and MultiQC. Assess per-base quality, adapter content, GC bias, duplication levels, and overrepresented sequences. Use when performing initial QC on raw sequencing data or validating preprocessing results.

bio-read-qc-quality-filtering

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Filter reads by quality scores, length, and N content using Trimmomatic and fastp. Apply sliding window trimming, remove low-quality bases from read ends, and discard reads below thresholds. Use when reads have poor quality tails or require minimum quality for downstream analysis.

bio-read-qc-fastp-workflow

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

All-in-one read preprocessing with fastp including adapter trimming, quality filtering, deduplication, base correction, and HTML report generation. Use when preprocessing Illumina data and wanting a single fast tool instead of separate Cutadapt, Trimmomatic, and FastQC steps.

bio-read-qc-adapter-trimming

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Remove sequencing adapters from FASTQ files using Cutadapt and Trimmomatic. Supports single-end and paired-end reads, Illumina TruSeq, Nextera, and custom adapter sequences. Use when FastQC shows adapter contamination or before alignment of short reads.

bio-longread-structural-variants

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Detect structural variants from long-read alignments using Sniffles, cuteSV, and SVIM. Use when detecting deletions, insertions, inversions, translocations, or complex rearrangements from ONT or PacBio data, especially those missed by short-read methods.

bio-longread-qc

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Quality control for long-read sequencing data using NanoPlot, NanoStat, and chopper. Generate QC reports, filter reads by length and quality, and visualize read characteristics. Use when assessing ONT or PacBio run quality or filtering reads before assembly or alignment.

bio-longread-medaka

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Polish assemblies and call variants from Oxford Nanopore data using medaka. Uses neural networks trained on specific basecaller versions. Use when improving ONT-only assemblies or calling variants from Nanopore data without short-read polishing.

bio-longread-alignment

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Align long reads using minimap2 for Oxford Nanopore and PacBio data. Supports various presets for different read types and applications. Use when aligning ONT or PacBio reads to a reference genome for variant calling, SV detection, or coverage analysis.

bio-long-read-sequencing-nanopore-methylation

1802

from FreedomIntelligence/OpenClaw-Medical-Skills

Calls DNA methylation from Oxford Nanopore sequencing data using signal-level analysis. Use when detecting 5mC or 6mA modifications directly from nanopore reads without bisulfite conversion.