fastq-analysis-pipeline

A guide to OmicVerse's alignment module: SRA downloading, FASTQ quality control, STAR alignment, gene quantification, and single-cell kallisto/bustools pipelines, covering both bulk and single-cell RNA-seq workflows.

by FreedomIntelligence

Best use case

fastq-analysis-pipeline is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using fastq-analysis-pipeline should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/fastq-analysis/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/fastq-analysis/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/fastq-analysis/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How fastq-analysis-pipeline Compares

Feature / Agent	fastq-analysis-pipeline	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

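What does this skill do?

It guides an agent through OmicVerse's alignment module: SRA downloading, FASTQ quality control, STAR alignment, gene quantification, and the single-cell kallisto/bustools path.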

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

## Overview

OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the `ov.alignment` module. This skill covers:

- **SRA data acquisition**: `prefetch` and `fqdump` (fasterq-dump wrapper)
- **Quality control**: `fastp` for adapter trimming and QC reports
- **RNA-seq alignment**: `STAR` aligner with auto-index building
- **Gene quantification**: `featureCount` (subread featureCounts wrapper)
- **Single-cell path**: `ref` and `count` via kb-python (kallisto/bustools)
- **Parallel SRA download**: `parallel_fastq_dump`

All functions share a common CLI infrastructure (`_cli_utils.py`) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.
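
As a rough illustration of the tool-resolution step, the sketch below checks which of the command-line tools named in this skill are visible on `PATH`. It uses only the standard library's `shutil.which`. The helper name `find_missing_tools` is hypothetical and is not part of `_cli_utils.py`, whose actual implementation is not shown in this document.

```python
import shutil

# Tool names copied from the supported-tools list in this skill.
SUPPORTED_TOOLS = (
    "prefetch", "vdb-validate", "fasterq-dump", "fastp",
    "STAR", "samtools", "featureCounts", "pigz", "gzip",
)

def find_missing_tools(tools=SUPPORTED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH.

    Hypothetical helper for a quick preflight check; it does not
    install anything and does not inspect conda environments.
    """
    return [name for name in tools if shutil.which(name) is None]

if __name__ == "__main__":
    missing = find_missing_tools()
    print("missing tools:", ", ".join(missing) if missing else "none")
```

Running this before a long pipeline gives an early hint about which tools `auto_install=True` would need to fetch.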

## Instructions

1. **Environment setup**
   - Bioinformatics tools are resolved automatically from PATH or the active conda environment.
   - If `auto_install=True` (default), missing tools are installed via mamba/conda on demand.
   - Supported tools: `prefetch`, `vdb-validate`, `fasterq-dump`, `fastp`, `STAR`, `samtools`, `featureCounts`, `pigz`, `gzip`.
   - For the single-cell path, ensure `kb-python` is installed: `pip install kb-python`.

2. **SRA data download** (`ov.alignment.prefetch` + `ov.alignment.fqdump`)
   - Use `prefetch` first for reliable downloads with integrity validation (`vdb-validate`).
   - Then convert to FASTQ with `fqdump`. It auto-detects single-end vs paired-end.
   - `fqdump` can also work directly from SRR accessions without prefetch.
   - Both support retry with exponential backoff for network errors.
   ```python
   import omicverse as ov

   # Step 1: Prefetch SRA files (optional but recommended)
   pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4)

   # Step 2: Convert to FASTQ
   fq = ov.alignment.fqdump(['SRR1234567', 'SRR1234568'],
                             output_dir='fastq', sra_dir='prefetch',
                             gzip=True, threads=8, jobs=4)
   ```
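
The retry behavior mentioned in this step follows a common pattern that can be sketched in a few lines. This is an illustration of exponential backoff in general, not omicverse's own code: the helper `retry_with_backoff` and its parameters (`retries`, `base_delay`, `sleep`) are assumptions made for the example.

```python
import time

def retry_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt, then retry.

    Hypothetical helper: `retries` is the number of extra attempts after
    the first call, and `sleep` is injectable so the delay can be tested.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```

With the defaults, a call that keeps failing waits 1 s, 2 s, and 4 s before the final error is raised. Wrapping a download step would look like `retry_with_backoff(lambda: download())`, where `download` stands for any callable of your own.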

3. **FASTQ quality control** (`ov.alignment.fastp`)
   - Runs fastp for adapter trimming, quality filtering, and QC reporting.
   - Supports single-end and paired-end reads.
   - Produces per-sample JSON and HTML QC reports.
   - Sample format: tuple of `(sample_name, fq1_path, fq2_path_or_None)`.
   ```python
   samples = [
       ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'),
       ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'),
   ]
   clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2)
   ```
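
When many samples are involved, the sample tuples can be built from the files on disk rather than typed by hand. The sketch below assumes the `<name>_1.fastq.gz` / `<name>_2.fastq.gz` naming seen in the example paths above. The helper `collect_fastq_samples` is hypothetical, and files that lack the `_1` suffix are ignored.

```python
from pathlib import Path

def collect_fastq_samples(fastq_dir):
    """Build (sample_name, fq1, fq2_or_None) tuples from *_1.fastq.gz files.

    Hypothetical helper: the mate file is looked up next to read 1 and
    set to None when it does not exist (single-end data).
    """
    suffix = "_1.fastq.gz"
    samples = []
    for fq1 in sorted(Path(fastq_dir).rglob("*" + suffix)):
        name = fq1.name[: -len(suffix)]
        fq2 = fq1.with_name(name + "_2.fastq.gz")
        samples.append((name, str(fq1), str(fq2) if fq2.exists() else None))
    return samples
```

The returned list has the same shape as the `samples` list in the example above.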

4. **STAR alignment** (`ov.alignment.STAR`)
   - Aligns FASTQ reads using the STAR aligner.
   - **Auto-index building**: set `auto_index=True` (default) together with `genome_fasta_files` and `gtf` to build the index automatically if it is missing.
   - Produces coordinate-sorted BAM files.
   - Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).
   - With `strict=False` (default), errors are handled gracefully per sample: a failed sample is reported in its result dict under `error` rather than as a `bam` path.
   ```python
   # Prepare samples from fastp output
   star_samples = [
       ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'),
       ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'),
   ]
   bams = ov.alignment.STAR(
       star_samples,
       genome_dir='star_index',
       output_dir='star_out',
       gtf='genes.gtf',
       genome_fasta_files=['genome.fa'],
       threads=8,
       memory='50G',
   )
   ```

5. **Gene quantification** (`ov.alignment.featureCount`)
   - Counts aligned reads per gene using featureCounts (subread).
   - Auto-detects paired-end from BAM headers (via pysam or samtools).
   - `auto_fix=True` (default) retries with the corrected paired-end flag if featureCounts fails.
   - `gene_mapping=True` maps gene_id to gene_name from the GTF.
   - `merge_matrix=True` produces a combined count matrix across all samples.
   ```python
   bam_items = [
       ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'),
       ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'),
   ]
   counts = ov.alignment.featureCount(
       bam_items,
       gtf='genes.gtf',
       output_dir='counts',
       gene_mapping=True,
       merge_matrix=True,
       threads=8,
   )
   # counts is a pandas DataFrame (gene_id x samples)
   ```

6. **Single-cell path** (`ov.alignment.ref` + `ov.alignment.count`)
   - Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.
   - `ref()` builds a kallisto index and transcript-to-gene mapping.
   - `count()` quantifies single-cell data with barcode/UMI handling.
   - Supports technologies: 10XV2, 10XV3, BULK, and custom.
   - Output formats: h5ad, loom, cellranger MTX.
   ```python
   # Build reference index
   ref_result = ov.alignment.ref(
       index_path='kb_ref/index.idx',
       t2g_path='kb_ref/t2g.txt',
       fasta_paths=['genome.fa'],
       gtf_paths=['genes.gtf'],
       threads=8,
   )

   # Quantify 10x v3 data
   count_result = ov.alignment.count(
       index_path='kb_ref/index.idx',
       t2g_path='kb_ref/t2g.txt',
       technology='10XV3',
       fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'],
       output_path='kb_out',
       h5ad=True,
       filter_barcodes=True,
       threads=8,
   )
   ```

7. **Wiring fastp output into STAR input**
   - fastp output is a list of dicts with keys: `sample`, `clean1`, `clean2`, `json`, `html`.
   - Convert to STAR sample tuples:
   ```python
   star_samples = [
       (r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None)
       for r in (clean if isinstance(clean, list) else [clean])
   ]
   ```

8. **Wiring STAR output into featureCount input**
   - STAR output is a list of dicts with keys: `sample`, `bam` (or `error`).
   - Convert to featureCount items:
   ```python
   bam_items = [
       (r['sample'], r['bam'])
       for r in (bams if isinstance(bams, list) else [bams])
       if 'bam' in r
   ]
   ```
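
Because a result dict may carry `error` instead of `bam`, it can help to keep the failures visible rather than silently dropping them. The sketch below is one way to do that. `split_star_results` is a hypothetical helper, and it assumes that failed entries still include the `sample` key.

```python
def split_star_results(results):
    """Separate successful STAR results from failed ones.

    Accepts a single result dict or a list of dicts.
    Returns (bam_items, failures), where bam_items is a list of
    (sample, bam_path) tuples and failures maps sample -> error.
    """
    if isinstance(results, dict):
        results = [results]
    bam_items = [(r["sample"], r["bam"]) for r in results if "bam" in r]
    # Assumption: failed entries keep their 'sample' key.
    failures = {r.get("sample"): r["error"] for r in results if "error" in r}
    return bam_items, failures
```

`bam_items` can be passed on to `featureCount` as shown above, while `failures` can be logged or used to decide whether to stop.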

9. **Skipping completed steps**
   - All functions check for existing outputs and skip work that is already done when `overwrite=False` (default).
   - Set `overwrite=True` to force re-execution.
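
The skip logic can be pictured as a small predicate over the expected output files. This is a sketch of the general idea only. `should_skip` is a hypothetical helper, and treating an empty file as "not done" is an assumption of this example, not documented omicverse behavior.

```python
from pathlib import Path

def should_skip(expected_outputs, overwrite=False):
    """Return True when every expected output exists and overwrite is off.

    Hypothetical helper: an empty list of outputs never counts as done,
    and zero-byte files are treated as missing.
    """
    if overwrite:
        return False
    paths = [Path(p) for p in expected_outputs]
    return bool(paths) and all(p.exists() and p.stat().st_size > 0 for p in paths)
```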

10. **Troubleshooting**
    - If a tool is not found, check that `auto_install=True` is set and that conda/mamba is accessible.
    - For STAR index errors, ensure `genome_fasta_files` points to uncompressed or gzip-compressed FASTA files.
    - For featureCounts paired-end detection errors, `auto_fix=True` handles most cases automatically.
    - GTF files can be gzip-compressed; they are auto-decompressed as needed.

## Critical API Reference

### Sample Format Convention

All alignment functions use a consistent sample tuple format:
- **FASTQ samples**: `(sample_name, fq1_path, fq2_path_or_None)`
- **BAM items**: `(sample_name, bam_path)` or `(sample_name, bam_path, is_paired_bool)`
- Single samples can be passed as a single tuple; multiple as a list of tuples.
- When a single tuple is passed, the return value is a single dict; for a list, a list of dicts.
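
The convention above can be mirrored in your own helper code. The sketch below shows one way to accept either a single tuple or a list, and to unpack BAM items of length two or three. Both function names are hypothetical and are not part of `ov.alignment`.

```python
def as_sample_list(samples):
    """Wrap a single sample tuple in a list.

    Returns (samples_as_list, was_single) so a caller can hand back a
    single dict for a single tuple and a list of dicts for a list.
    """
    was_single = isinstance(samples, tuple)
    return ([samples] if was_single else list(samples)), was_single

def unpack_bam_item(item):
    """Return (name, bam_path, is_paired); is_paired is None when omitted."""
    if len(item) == 2:
        name, bam = item
        return name, bam, None
    name, bam, is_paired = item
    return name, bam, bool(is_paired)
```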

### Auto-installation

```python
# All functions support these parameters:
auto_install=True   # Auto-install missing tools via conda/mamba
overwrite=False     # Skip if outputs already exist
threads=8           # Per-tool thread count
jobs=None           # Concurrent job count (auto-detected from CPU count)
```
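
The exact rule behind `jobs=None` is not spelled out here. One plausible way to balance `threads` against `jobs` is sketched below, under the assumption that the goal is to avoid oversubscribing the CPU. `pick_jobs` is a hypothetical helper, not the library's implementation.

```python
import os

def pick_jobs(threads, n_samples, jobs=None, cpu_count=None):
    """Choose how many samples to process concurrently.

    Assumption for this sketch: jobs * threads should not exceed the
    CPU count, and there is no point in more jobs than samples.
    """
    if jobs is not None:
        return max(1, int(jobs))  # an explicit value always wins
    cpus = cpu_count or os.cpu_count() or 1
    return max(1, min(n_samples, cpus // max(1, threads)))
```

For example, under this rule 16 cores with `threads=8` and four samples gives two concurrent jobs.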

## Examples

- **Bulk RNA-seq from SRA**: `prefetch` -> `fqdump` -> `fastp` -> `STAR` -> `featureCount` -> pandas DataFrame
- **Single-cell 10x v3**: `ref` -> `count` with `technology='10XV3'` -> h5ad AnnData
- **Local FASTQ files**: Skip download steps, start directly with `fastp` -> `STAR` -> `featureCount`
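
The local-FASTQ route can be tied together in a single function. The sketch below reuses only the calls, keyword arguments, and result keys shown in the Instructions section. The wrapper `run_local_bulk` and its `alignment` parameter are hypothetical: the module is passed in, normally as `ov.alignment`, so the wiring can be exercised without the tools installed.

```python
def run_local_bulk(alignment, samples, gtf, genome_fasta_files,
                   genome_dir="star_index", threads=8):
    """fastp -> STAR -> featureCount for local FASTQ files.

    `alignment` is expected to be `ov.alignment`; output directory
    names follow the examples in this skill.
    """
    # 1. Quality control and trimming.
    clean = alignment.fastp(samples, output_dir="fastp", threads=threads)
    clean = clean if isinstance(clean, list) else [clean]
    star_samples = [(r["sample"], r["clean1"], r["clean2"] or None) for r in clean]

    # 2. Alignment (the index is built on demand when auto_index is on).
    bams = alignment.STAR(
        star_samples,
        genome_dir=genome_dir,
        output_dir="star_out",
        gtf=gtf,
        genome_fasta_files=genome_fasta_files,
        threads=threads,
    )
    bams = bams if isinstance(bams, list) else [bams]
    bam_items = [(r["sample"], r["bam"]) for r in bams if "bam" in r]
    if not bam_items:
        raise RuntimeError("STAR produced no BAM files; check the 'error' entries")

    # 3. Gene-level counting, merged into one matrix.
    return alignment.featureCount(
        bam_items, gtf=gtf, output_dir="counts",
        gene_mapping=True, merge_matrix=True, threads=threads,
    )
```

A typical call would be `run_local_bulk(ov.alignment, samples, gtf='genes.gtf', genome_fasta_files=['genome.fa'])` after `import omicverse as ov`.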

## References

- See [reference.md](reference.md) for copy-paste-ready code templates.
