bio-format-conversion

Convert between sequence file formats (FASTA, FASTQ, GenBank, EMBL) using Biopython Bio.SeqIO. Use when changing file formats or preparing data for different tools.

Best use case

bio-format-conversion is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using bio-format-conversion can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/bio-format-conversion/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/bio-format-conversion/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/bio-format-conversion/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How bio-format-conversion Compares

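| Feature / Agent | bio-format-conversion | Standard Approach |
|-----------------|-----------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |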

Frequently Asked Questions

Where can I find the source code?

You can find the source code on GitHub in the FreedomIntelligence/OpenClaw-Medical-Skills repository, under skills/bio-format-conversion.

SKILL.md Source

## Version Compatibility

Reference examples tested with: Biopython 1.83+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
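
The version check described above can also be scripted. This is a minimal sketch, assuming Biopython is installed under the distribution name `biopython`; the helper names `parse_version` and `biopython_at_least` are hypothetical, and the parsing only handles plain numeric versions such as `1.83`.

```python
import inspect
import re
from importlib import metadata


def parse_version(text):
    """Turn a version string like '1.83' into a tuple of ints, e.g. (1, 83)."""
    parts = []
    for piece in text.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break  # stop at the first non-numeric piece such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def biopython_at_least(minimum='1.83'):
    """Return True if an installed Biopython meets the minimum version."""
    try:
        installed = metadata.version('biopython')
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= parse_version(minimum)


if biopython_at_least():
    from Bio import SeqIO

    # Show the real signature before relying on any example below.
    print(inspect.signature(SeqIO.convert))
else:
    print('Biopython 1.83+ not found; adapt the examples to the installed API')
```

Comparing tuples of integers avoids the string-comparison trap where `'1.9'` sorts after `'1.83'`.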

# Format Conversion

**"Convert this file to a different format"** → Read records in one format, optionally add missing annotations, and write in the target format.
- Python: `SeqIO.convert()` for direct conversion, or `SeqIO.parse()` + `SeqIO.write()` when modifications are needed (Biopython)
- CLI: `seqkit seq` (SeqKit) for FASTA/FASTQ; `samtools view` for SAM/BAM/CRAM

Convert sequence files between formats using Biopython's Bio.SeqIO module.

## Required Import

```python
from Bio import SeqIO
```

## Core Function

### SeqIO.convert() - Direct Conversion
Convert between formats in a single call. This is usually the most efficient method when no records need to be modified.

```python
count = SeqIO.convert('input.gb', 'genbank', 'output.fasta', 'fasta')
print(f'Converted {count} records')
```

**Parameters:**
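- `in_file` - Input filename or handle
- `in_format` - Input format string (lowercase, e.g. `'genbank'`)
- `out_file` - Output filename or handle
- `out_format` - Output format string (lowercase, e.g. `'fasta'`)
- `molecule_type` - Optional string (`'DNA'`, `'RNA'`, or `'protein'`) applied to every record; useful when the output format requires it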

**Returns:** Number of records converted
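
Because the file arguments accept handles as well as filenames, a conversion can be tried entirely in memory. The sketch below uses two made-up FASTQ reads, purely for illustration, and shows the returned record count.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up FASTQ records used only for illustration.
fastq_text = (
    '@read1\nACGT\n+\nIIII\n'
    '@read2\nGGCC\n+\n!!!!\n'
)

in_handle = StringIO(fastq_text)
out_handle = StringIO()

# Handles work anywhere a filename does; the return value is the record count.
count = SeqIO.convert(in_handle, 'fastq', out_handle, 'fasta')
fasta_text = out_handle.getvalue()

print(f'Converted {count} records')
print(fasta_text)
```

Using `StringIO` handles like this is a convenient way to check a conversion on a small sample before pointing it at real files.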

## Common Conversions

| From | To | Notes |
|------|-----|-------|
| GenBank | FASTA | Loses annotations, keeps sequence |
| FASTA | GenBank | Need to add molecule_type |
| FASTQ | FASTA | Loses quality scores |
| FASTA | FASTQ | Need to add quality scores |
| GenBank | EMBL | Usually works directly |
| Stockholm | FASTA | Alignment to sequences |

## Code Patterns

### Simple Conversion
```python
SeqIO.convert('input.gb', 'genbank', 'output.fasta', 'fasta')
```

### GenBank to FASTA
```python
SeqIO.convert('sequence.gb', 'genbank', 'sequence.fasta', 'fasta')
```

### FASTQ to FASTA (drop quality)
```python
SeqIO.convert('reads.fastq', 'fastq', 'reads.fasta', 'fasta')
```

### FASTA to GenBank (requires molecule_type)

**Goal:** Convert FASTA to GenBank format, which requires molecule_type annotation.

**Approach:** Stream records through a generator that injects the missing annotation, then write.

**Reference (Biopython 1.83+):**
```python
def add_molecule_type(records, molecule_type='DNA'):
    # GenBank output needs a molecule_type annotation on every record
    for record in records:
        record.annotations['molecule_type'] = molecule_type
        yield record

records = SeqIO.parse('input.fasta', 'fasta')
SeqIO.write(add_molecule_type(records), 'output.gb', 'genbank')
```
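
When every record shares the same molecule type, the generator may not be needed. The sketch below assumes the installed Biopython's `SeqIO.convert()` accepts the optional `molecule_type` keyword (confirm with `help(SeqIO.convert)`); the FASTA record is made up for illustration.

```python
from io import StringIO

from Bio import SeqIO

# A made-up FASTA record used only for illustration.
fasta_text = '>demo1 example sequence\nATGGCCATTGTAATGGGCCGC\n'
out_handle = StringIO()

# molecule_type is applied to every record before it is written.
count = SeqIO.convert(
    StringIO(fasta_text), 'fasta', out_handle, 'genbank', molecule_type='DNA'
)
genbank_text = out_handle.getvalue()

print(f'Converted {count} records')
print(genbank_text)
```

The generator approach remains the better fit when records need different molecule types or other per-record annotations.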

### FASTA to FASTQ (add dummy quality)

**Goal:** Convert FASTA to FASTQ by assigning uniform placeholder quality scores.

**Approach:** Stream records through a generator that adds phred_quality to each, then write as FASTQ.

**Reference (Biopython 1.83+):**
```python
def add_quality(records, quality=30):
    for record in records:
        record.letter_annotations['phred_quality'] = [quality] * len(record.seq)
        yield record

records = SeqIO.parse('input.fasta', 'fasta')
SeqIO.write(add_quality(records), 'output.fastq', 'fastq')
```

### Batch Convert Multiple Files

**Goal:** Convert all files of one format in a directory to another format.

**Approach:** Glob for input files, apply `SeqIO.convert()` to each, and report per-file counts.

**Reference (Biopython 1.83+):**
```python
from pathlib import Path

for gb_file in Path('.').glob('*.gb'):
    fasta_file = gb_file.with_suffix('.fasta')
    count = SeqIO.convert(str(gb_file), 'genbank', str(fasta_file), 'fasta')
    print(f'{gb_file.name}: {count} records')
```
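
Directories often mix extensions for the same format. The following sketch extends the loop above with an extension-to-format lookup and a separate output directory. The `EXTENSION_FORMATS` map and the `convert_directory` helper are hypothetical names introduced for illustration, and the demo files are generated on the fly in a temporary directory.

```python
import tempfile
from pathlib import Path

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical extension-to-format map; extend it for your own data.
EXTENSION_FORMATS = {'.gb': 'genbank', '.gbk': 'genbank', '.embl': 'embl'}


def convert_directory(in_dir, out_dir, out_format='fasta', out_suffix='.fasta'):
    """Convert every recognised file in in_dir; return {filename: record count}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    for path in sorted(Path(in_dir).iterdir()):
        in_format = EXTENSION_FORMATS.get(path.suffix.lower())
        if in_format is None:
            continue  # skip files with unknown extensions
        target = out_dir / (path.stem + out_suffix)
        counts[path.name] = SeqIO.convert(str(path), in_format, str(target), out_format)
    return counts


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'in'
    src.mkdir()
    # Build two tiny GenBank files plus one unrelated file for the demo.
    rec = SeqRecord(Seq('ATGC'), id='demo1', name='demo1', description='demo',
                    annotations={'molecule_type': 'DNA'})
    SeqIO.write(rec, str(src / 'a.gb'), 'genbank')
    SeqIO.write(rec, str(src / 'b.gbk'), 'genbank')
    (src / 'notes.txt').write_text('not a sequence file')

    counts = convert_directory(src, Path(tmp) / 'out')
    print(counts)
```

Writing to a separate output directory keeps converted files from being picked up as inputs on a later run.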

### Convert with Modifications
```python
from Bio.SeqRecord import SeqRecord

def uppercase_record(rec):
    # Build a fresh record; only id and description are carried over
    return SeqRecord(rec.seq.upper(), id=rec.id, description=rec.description)

records = SeqIO.parse('input.fasta', 'fasta')
modified = (uppercase_record(rec) for rec in records)
SeqIO.write(modified, 'output.fasta', 'fasta')
```

### Alignment Format Conversion
```python
from Bio import AlignIO

AlignIO.convert('alignment.sto', 'stockholm', 'alignment.phy', 'phylip')
```
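
The Stockholm to FASTA row in the Common Conversions table can be sketched the same way with `AlignIO.convert()`, which returns the number of alignments rather than the number of sequences. The two-sequence alignment below is made up for illustration.

```python
from io import StringIO

from Bio import AlignIO

# A made-up two-sequence Stockholm alignment used only for illustration.
stockholm_text = (
    '# STOCKHOLM 1.0\n'
    'seq1 ACGT-ACGT\n'
    'seq2 ACGTTACGT\n'
    '//\n'
)

out_handle = StringIO()

# The return value counts alignments, not individual sequences.
n_alignments = AlignIO.convert(StringIO(stockholm_text), 'stockholm', out_handle, 'fasta')
fasta_text = out_handle.getvalue()

print(f'Converted {n_alignments} alignment(s)')
print(fasta_text)
```

Gap characters are kept in the FASTA output, so the result is still an aligned FASTA file rather than raw sequences.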

## Format Compatibility Matrix

**Can convert directly (no modifications needed):**
- GenBank <-> EMBL
- Any format -> FASTA (always works, may lose data)
- FASTQ -> FASTA
- FASTA -> formats that need no extra per-record data (e.g. `tab`); other targets are listed under "Requires adding data"

**Requires adding data:**
- FASTA -> FASTQ (need quality scores)
- FASTA -> GenBank (need molecule_type)

**May lose data:**
- GenBank -> FASTA (loses features, annotations)
- FASTQ -> FASTA (loses quality scores)
- Any rich format -> FASTA

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
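| `ValueError: missing molecule_type in annotations` | FASTA to GenBank | Add `molecule_type` to `record.annotations` |
| `ValueError: No suitable quality scores found in letter_annotations of SeqRecord` | FASTA to FASTQ | Add `phred_quality` to `record.letter_annotations` |
| `KeyError: 'phred_quality'` | Wrong FASTQ variant (e.g. parsed as `'fastq-solexa'`, which stores `solexa_quality`) | Try `'fastq-sanger'` or `'fastq-illumina'` |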
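
The first two rows of the table can be reproduced and fixed in a few lines. This is a sketch with a made-up record; the exact error text may differ between Biopython versions, so the code prints whatever message is raised instead of matching on it.

```python
from io import StringIO

from Bio import SeqIO

# A made-up FASTA record: no molecule_type and no quality scores.
record = next(SeqIO.parse(StringIO('>demo1\nACGTACGT\n'), 'fasta'))

genbank_error = None
fastq_error = None

try:
    SeqIO.write(record, StringIO(), 'genbank')
except ValueError as err:
    genbank_error = str(err)
    print(f'GenBank write failed: {genbank_error}')

try:
    SeqIO.write(record, StringIO(), 'fastq')
except ValueError as err:
    fastq_error = str(err)
    print(f'FASTQ write failed: {fastq_error}')

# Supply the missing data, then both writes succeed.
record.annotations['molecule_type'] = 'DNA'
record.letter_annotations['phred_quality'] = [30] * len(record.seq)

genbank_count = SeqIO.write(record, StringIO(), 'genbank')
fastq_count = SeqIO.write(record, StringIO(), 'fastq')
print(genbank_count, fastq_count)
```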

## Decision Tree

```
Converting formats?
├── Simple conversion (no data changes)?
│   └── Use SeqIO.convert() directly
├── Need to add annotations?
│   └── Parse, modify records, then write
├── Need to transform sequences?
│   └── Parse, apply transformation, then write
└── Multiple files?
    └── Loop with SeqIO.convert() or batch generator
```
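
The first three branches of the tree can be folded into one small helper. This is a sketch only: `convert_records` and its `transform` parameter are hypothetical names, not part of Biopython, and the input record is made up.

```python
from io import StringIO

from Bio import SeqIO


def convert_records(in_file, in_format, out_file, out_format, transform=None):
    """Use SeqIO.convert() when nothing changes, else parse, transform, write."""
    if transform is None:
        return SeqIO.convert(in_file, in_format, out_file, out_format)
    records = SeqIO.parse(in_file, in_format)
    # The generator keeps memory use flat: one record is handled at a time.
    return SeqIO.write((transform(rec) for rec in records), out_file, out_format)


# A made-up lowercase FASTA record used only for illustration.
fasta_text = '>demo1\nacgtacgt\n'

# Simple conversion: no transform, so SeqIO.convert() is used.
plain_out = StringIO()
plain_count = convert_records(StringIO(fasta_text), 'fasta', plain_out, 'tab')

# Conversion with a transformation: uppercase each record on the way through.
upper_out = StringIO()
upper_count = convert_records(
    StringIO(fasta_text), 'fasta', upper_out, 'fasta',
    transform=lambda rec: rec.upper(),
)

print(plain_out.getvalue())
print(upper_out.getvalue())
```

The same `transform` hook covers the "add annotations" branch, for example a function that sets `molecule_type` and returns the record.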

## Related Skills

- read-sequences - Parse sequences for custom conversion logic
- write-sequences - Write converted sequences with modifications
- batch-processing - Convert multiple files at once
- compressed-files - Handle compressed input/output during conversion
- alignment-files - For SAM/BAM/CRAM conversion, use samtools view
