bio-basecalling

Convert raw Nanopore signal data (FAST5/POD5) to nucleotide sequences using Dorado basecaller. Covers model selection, GPU acceleration, modified base detection, and quality filtering. Use when processing raw Nanopore data before alignment. Note: Guppy is deprecated; use Dorado for all new analyses.

16 stars

bydiegosouzapw

View on GitHub Installation ↓

Best use case

bio-basecalling is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using bio-basecalling should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/bio-basecalling/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/bio-basecalling/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/bio-basecalling/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How bio-basecalling Compares

Feature / Agent	bio-basecalling	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Nanopore Basecalling

Convert raw electrical signal from Nanopore sequencing into nucleotide sequences.

## Dorado (Recommended)

Dorado is ONT's current production basecaller, replacing Guppy. It offers better accuracy and speed.

### Basic Basecalling

```bash
dorado basecaller sup pod5_dir/ > calls.bam
```

### Choose Model

```bash
dorado basecaller fast pod5_dir/ > calls.bam
dorado basecaller hac pod5_dir/ > calls.bam
dorado basecaller sup pod5_dir/ > calls.bam
```

### Model Speed vs Accuracy

| Model | Speed | Accuracy | Use Case |
|-------|-------|----------|----------|
| fast | Fastest | Lower | Quick preview |
| hac | Medium | High | General use |
| sup | Slowest | Highest | Publication quality |

### Specific Model Version

```bash
dorado download --model dna_r10.4.1_e8.2_400bps_sup@v5.1.0
dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v5.1.0 pod5_dir/ > calls.bam
```

### List Available Models

```bash
dorado download --list
```

### Output FASTQ Instead of BAM

```bash
dorado basecaller sup pod5_dir/ --emit-fastq > calls.fastq
```

### Modified Base Detection

```bash
dorado basecaller sup,5mCG_5hmCG pod5_dir/ > calls_mods.bam
dorado basecaller sup,5mCG pod5_dir/ > calls_5mc.bam
dorado basecaller sup,6mA pod5_dir/ > calls_6ma.bam
```

### GPU Selection

```bash
dorado basecaller sup pod5_dir/ --device cuda:0 > calls.bam
dorado basecaller sup pod5_dir/ --device cuda:0,1 > calls.bam
dorado basecaller sup pod5_dir/ --device cpu > calls.bam
```

### Batch Size for Memory

```bash
dorado basecaller sup pod5_dir/ --batchsize 64 > calls.bam
```

### Duplex Calling

```bash
dorado duplex sup pod5_dir/ > duplex.bam
```

### Demultiplexing During Basecalling

```bash
dorado basecaller sup pod5_dir/ --kit-name SQK-NBD114-24 > calls.bam
dorado demux calls.bam --output-dir demuxed/ --kit-name SQK-NBD114-24
```

### Trim Adapters

```bash
dorado basecaller sup pod5_dir/ --trim adapters > calls.bam
dorado basecaller sup pod5_dir/ --no-trim > calls_untrimmed.bam
```

### Resume Interrupted Run

```bash
dorado basecaller sup pod5_dir/ --resume-from calls.bam > calls_complete.bam
```

## Guppy (Deprecated - Legacy Only)

Guppy is deprecated and no longer receiving updates. Use Dorado for all new analyses. Guppy examples below are only for maintaining legacy pipelines.

### Basic Basecalling

```bash
guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_sup.cfg \
    --device cuda:0
```

### CPU Mode

```bash
guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_fast.cfg \
    --num_callers 8 \
    --cpu_threads_per_caller 4
```

### High Accuracy Model

```bash
guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_hac.cfg \
    --device cuda:0
```

### Super Accuracy Model

```bash
guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_sup.cfg \
    --device cuda:0
```

### List Available Configs

```bash
guppy_basecaller --print_workflows
ls /opt/ont/guppy/data/*.cfg
```

### Modified Base Calling

```bash
guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_sup.cfg \
    --device cuda:0
```

### Barcoding During Basecalling

```bash
guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_sup.cfg \
    --device cuda:0 \
    --barcode_kits SQK-NBD114-24
```

### Output BAM

```bash
guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_sup.cfg \
    --device cuda:0 \
    --bam_out \
    --index
```

## POD5 File Handling

POD5 is the new format replacing FAST5.

### Convert FAST5 to POD5

```bash
pod5 convert fast5 fast5_dir/*.fast5 --output pod5_dir/
```

### Merge POD5 Files

```bash
pod5 merge pod5_dir/*.pod5 --output merged.pod5
```

### Inspect POD5

```bash
pod5 inspect reads input.pod5
pod5 inspect summary input.pod5
```

### Subset POD5

```bash
pod5 subset input.pod5 --output subset.pod5 --read-id-file read_ids.txt
```

## Quality Filtering

### Filter with Chopper (After Basecalling)

```bash
gunzip -c calls.fastq.gz | chopper -q 10 -l 500 | gzip > filtered.fastq.gz
```

### Filter by Quality Score

```bash
gunzip -c calls.fastq.gz | \
    awk 'BEGIN{OFS="\n"} {h=$0; getline seq; getline plus; getline qual;
         split(h, a, " "); split(a[4], q, "=");
         if(q[2] >= 10) print h, seq, plus, qual}' | \
    gzip > q10_filtered.fastq.gz
```

### NanoFilt (Alternative)

```bash
gunzip -c calls.fastq.gz | NanoFilt -q 10 -l 500 | gzip > filtered.fastq.gz
```

## Basecalling QC

### NanoPlot

```bash
NanoPlot --fastq calls.fastq.gz -o qc_report/ --plots hex dot
NanoPlot --bam calls.bam -o qc_report/
```

### pycoQC (From Sequencing Summary)

```bash
pycoQC -f sequencing_summary.txt -o pycoqc_report.html
```

### Basic Stats

```bash
seqkit stats calls.fastq.gz

awk 'NR%4==2 {sum+=length($0); count++} END {print "Reads:", count, "Mean length:", sum/count}' calls.fastq
```

## Model Selection Guide

### R10.4.1 Chemistry (Current)

| Model | Use |
|-------|-----|
| dna_r10.4.1_e8.2_400bps_fast | Quick analysis |
| dna_r10.4.1_e8.2_400bps_hac | Routine work |
| dna_r10.4.1_e8.2_400bps_sup | High accuracy |

### R9.4.1 Chemistry (Legacy)

| Model | Use |
|-------|-----|
| dna_r9.4.1_450bps_fast | Quick analysis |
| dna_r9.4.1_450bps_hac | Routine work |
| dna_r9.4.1_450bps_sup | High accuracy |

## Complete Pipeline

```bash
#!/bin/bash
INPUT=$1
OUTPUT=$2
MODEL=${3:-sup}

mkdir -p $OUTPUT

if [ -d "$INPUT/fast5" ]; then
    echo "Converting FAST5 to POD5..."
    pod5 convert fast5 $INPUT/fast5/*.fast5 --output $OUTPUT/pod5/
    INPUT_DIR="$OUTPUT/pod5"
else
    INPUT_DIR="$INPUT"
fi

echo "Basecalling with $MODEL model..."
dorado basecaller $MODEL $INPUT_DIR > $OUTPUT/calls.bam

echo "Converting to FASTQ..."
samtools fastq $OUTPUT/calls.bam | gzip > $OUTPUT/calls.fastq.gz

echo "Filtering..."
gunzip -c $OUTPUT/calls.fastq.gz | chopper -q 10 -l 500 | gzip > $OUTPUT/filtered.fastq.gz

echo "QC report..."
NanoPlot --fastq $OUTPUT/filtered.fastq.gz -o $OUTPUT/qc/

echo "Done!"
```

## GPU Requirements

| Model | VRAM Required | Speed (R10.4.1) |
|-------|--------------|-----------------|
| fast | 4 GB | ~450 bases/s |
| hac | 8 GB | ~200 bases/s |
| sup | 12 GB | ~50 bases/s |

## Troubleshooting

### Out of Memory

```bash
dorado basecaller sup pod5_dir/ --batchsize 32 > calls.bam
```

### Slow CPU Basecalling

```bash
dorado basecaller fast pod5_dir/ --device cpu > calls.bam
```

### Check GPU Usage

```bash
nvidia-smi -l 1
watch -n 1 nvidia-smi
```

## Related Skills

- **long-read-alignment** - Align basecalled reads
- **long-read-qc** - QC after basecalling
- **medaka-polishing** - Polish using basecalled reads
- **structural-variants** - SV detection from long reads

Related Skills

bgo

from diegosouzapw/awesome-omni-skill

Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.

Coding & Development

large-data-with-dask

from diegosouzapw/awesome-omni-skill

Specific optimization strategies for Python scripts working with larger-than-memory datasets via Dask.

langsmith-fetch

from diegosouzapw/awesome-omni-skill

Debug LangChain and LangGraph agents by fetching execution traces from LangSmith Studio. Use when debugging agent behavior, investigating errors, analyzing tool calls, checking memory operations, or examining agent performance. Automatically fetches recent traces and analyzes execution patterns. Requires langsmith-fetch CLI installed.

langchain-tool-calling

from diegosouzapw/awesome-omni-skill

How chat models call tools - includes bind_tools, tool choice strategies, parallel tool calling, and tool message handling

langchain-notes

from diegosouzapw/awesome-omni-skill

LangChain 框架学习笔记 - 快速查找概念、代码示例和最佳实践。包含 Core components、Middleware、Advanced usage、Multi-agent patterns、RAG retrieval、Long-term memory 等主题。当用户询问 LangChain、Agent、RAG、向量存储、工具使用、记忆系统时使用此 Skill。

langchain-js

from diegosouzapw/awesome-omni-skill

Builds LLM-powered applications with LangChain.js for chat, agents, and RAG. Use when creating AI applications with chains, memory, tools, and retrieval-augmented generation in JavaScript.

langchain-agents

from diegosouzapw/awesome-omni-skill

Expert guidance for building LangChain agents with proper tool binding, memory, and configuration. Use when creating agents, configuring models, or setting up tool integrations in LangConfig.

lang-python

from diegosouzapw/awesome-omni-skill

Python 3.13+ development specialist covering FastAPI, Django, async patterns, data science, testing with pytest, and modern Python features. Use when developing Python APIs, web applications, data pipelines, or writing tests.

kramme:agents-md

from diegosouzapw/awesome-omni-skill

This skill should be used when the user asks to "update AGENTS.md", "add to AGENTS.md", "maintain agent docs", or needs to add guidelines to agent instructions. Guides discovery of local skills and enforces structured, keyword-based documentation style.

kontent-ai-automation

from diegosouzapw/awesome-omni-skill

Automate Kontent AI tasks via Rube MCP (Composio). Always search tools first for current schemas.

kitt-create-slash-commands

from diegosouzapw/awesome-omni-skill

Expert guidance for creating slash commands. Use when working with slash commands, creating custom commands, understanding command structure, or learning YAML configuration.

kitt-create-plans

from diegosouzapw/awesome-omni-skill

Create hierarchical project plans optimized for solo agentic development. Use when planning projects, phases, or tasks that the AI agent will execute. Produces agent-executable plans with verification criteria, not enterprise documentation. Handles briefs, roadmaps, phase plans, and context handoffs.