bio-compressed-files

Read and write compressed sequence files (gzip, bzip2, BGZF) using Biopython. Use when working with .gz or .bz2 sequence files. Use BGZF for indexable compressed files.

1,802 stars

by FreedomIntelligence

Best use case

bio-compressed-files is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using bio-compressed-files should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/bio-compressed-files/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/bio-compressed-files/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub
2. Place it in .claude/skills/bio-compressed-files/SKILL.md inside your project
3. Restart your AI agent; it will auto-discover the skill

SKILL.md Source

## Version Compatibility

Reference examples tested with: Biopython 1.83+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
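
The version check above can also be scripted with the standard library. This is a minimal sketch, assuming the package is installed under the distribution name `biopython`; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "biopython" is the distribution name on PyPI; the import name is "Bio".
found = installed_version("biopython")
print("Biopython:", found or "not installed")
```

If the reported version is older than the one listed above, inspect the function signatures before adapting the examples.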

# Compressed Files

Handle gzip, bzip2, and BGZF compressed sequence files with Biopython.

**"Read a compressed sequence file"** → Open a compressed file handle in text mode, then parse with the standard SeqIO interface.
- gzip: `gzip.open(path, 'rt')` (Python stdlib)
- bzip2: `bz2.open(path, 'rt')` (Python stdlib)
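- BGZF: `bgzf.open(path, 'rt')` (Biopython); `gzip.open(path, 'rt')` also reads BGZF sequentially. `SeqIO.parse(path, fmt)` on a bare path does not decompress, so always pass a handle.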

**"Make a compressed file indexable"** → Convert to BGZF format. Only BGZF supports `SeqIO.index()` on compressed data.

## Required Imports

```python
import gzip
import bz2
from Bio import SeqIO
from Bio import bgzf
```

## Reading Compressed Files

**Goal:** Parse sequence records from compressed files without decompressing to disk.

**Approach:** Open a decompression handle in text mode (`'rt'`), then pass the handle to `SeqIO.parse()`. The parser works identically to uncompressed input.

### Gzip (.gz) (Biopython 1.83+)
```python
with gzip.open('sequences.fasta.gz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id, len(record.seq))
```

**Important:** Use `'rt'` (read text) mode, not `'rb'` (read binary). `gzip.open()` and `bz2.open()` default to binary mode, so the mode must be given explicitly.
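
The difference between the two modes is easy to see with the standard library alone. The sketch below writes a tiny gzipped FASTA file (made-up data) to a temporary directory and reads the first line back in each mode.

```python
import gzip
import os
import tempfile

# Write a tiny gzipped FASTA file to a temporary directory (illustrative data).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.fasta.gz")
with gzip.open(path, "wt") as handle:
    handle.write(">seq1\nACGT\n")

# Text mode yields str lines, which is what the SeqIO text parsers expect.
with gzip.open(path, "rt") as handle:
    text_line = handle.readline()

# Binary mode (the default when no mode is given) yields bytes instead.
with gzip.open(path) as handle:
    bytes_line = handle.readline()

print(type(text_line).__name__, repr(text_line))
print(type(bytes_line).__name__, repr(bytes_line))
```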

### Bzip2 (.bz2) (Biopython 1.83+)
```python
with bz2.open('sequences.fasta.bz2', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id, len(record.seq))
```

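### BGZF (Block Gzip) (Biopython 1.83+)
BGZF files are valid gzip streams, so they can be read with either `bgzf.open()` or `gzip.open()`. Unlike regular gzip, they also support indexing. Passing the path of a compressed file straight to `SeqIO.parse()` does not decompress it, so always go through a handle:

```python
# Biopython's BGZF reader
with bgzf.open('sequences.fasta.bgz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id)

# Plain gzip also works for sequential reading of BGZF
with gzip.open('sequences.fasta.bgz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id)
```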
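
Ordinary gzip readers can handle BGZF because a BGZF file is a series of small, independent gzip members concatenated together. The sketch below illustrates only that concatenation property, using the standard library and made-up data. It does not produce real BGZF, which additionally records each block's size in a gzip extra field.

```python
import gzip

# Two independently compressed gzip members (illustrative FASTA text).
member_a = gzip.compress(b">seq1\nACGT\n")
member_b = gzip.compress(b">seq2\nTTGA\n")

# Concatenated members form a valid gzip stream; decompression
# returns the contents of each member back to back.
combined = gzip.decompress(member_a + member_b)
print(combined.decode())
```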

## Writing Compressed Files

**Goal:** Save sequence records directly to compressed files without an intermediate uncompressed step.

**Approach:** Open a compression handle in text mode (`'wt'`), then pass it to `SeqIO.write()`.

### Gzip (.gz)
```python
with gzip.open('output.fasta.gz', 'wt') as handle:
    SeqIO.write(records, handle, 'fasta')
```

### Bzip2 (.bz2)
```python
with bz2.open('output.fasta.bz2', 'wt') as handle:
    SeqIO.write(records, handle, 'fasta')
```

### BGZF (.bgz)
```python
with bgzf.open('output.fasta.bgz', 'wt') as handle:
    SeqIO.write(records, handle, 'fasta')
```

## BGZF: Indexable Compression

**Goal:** Enable random access to records in compressed sequence files.

**Approach:** Write sequences in BGZF (Block GZip Format) — the only compressed format supporting `SeqIO.index()` and `SeqIO.index_db()`. BGZF is a gzip variant used by BAM and tabix-indexed files.
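
Random access is possible because BGZF compresses data in independent blocks of at most 64 KiB. A position is addressed by a 64-bit virtual offset: the block's start in the compressed file occupies the upper 48 bits, and the position inside the decompressed block occupies the lower 16 bits. The sketch below writes that arithmetic out for illustration only. In practice, `Bio.bgzf` provides `make_virtual_offset` and `split_virtual_offset` for the same purpose.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a compressed block start and an in-block position into one integer."""
    if not 0 <= within_block < 2**16:
        raise ValueError("within_block must fit in 16 bits")
    if not 0 <= block_start < 2**48:
        raise ValueError("block_start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Unpack a virtual offset into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# Example values are arbitrary.
voffset = make_virtual_offset(block_start=100_000, within_block=42)
print(voffset, split_virtual_offset(voffset))
```

An index built by `SeqIO.index()` on a BGZF file stores offsets of this kind, so fetching a record only requires decompressing the block that contains it.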

### Create Indexable Compressed File

```python
from Bio import SeqIO, bgzf

records = SeqIO.parse('input.fasta', 'fasta')
with bgzf.open('output.fasta.bgz', 'wt') as handle:
    SeqIO.write(records, handle, 'fasta')
```

### Index a BGZF File

```python
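# In-memory index; works directly on BGZF-compressed files
records = SeqIO.index('sequences.fasta.bgz', 'fasta')
seq = records['target_id'].seq
records.close()

# SQLite-backed index, stored on disk and reused on later runs
db_records = SeqIO.index_db('index.sqlite', 'sequences.fasta.bgz', 'fasta')
seq = db_records['target_id'].seq
db_records.close()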
```

### Convert Gzip to BGZF

**"Convert gzip to indexable format"** → Parse from gzip handle, write through BGZF handle.

```python
from Bio import SeqIO, bgzf
import gzip

with gzip.open('input.fasta.gz', 'rt') as in_handle:
    with bgzf.open('output.fasta.bgz', 'wt') as out_handle:
        SeqIO.write(SeqIO.parse(in_handle, 'fasta'), out_handle, 'fasta')
```

## Code Patterns

### Read Gzipped FASTQ
```python
with gzip.open('reads.fastq.gz', 'rt') as handle:
    records = list(SeqIO.parse(handle, 'fastq'))
print(f'Loaded {len(records)} reads')
```

### Count Records in Gzipped File
```python
with gzip.open('sequences.fasta.gz', 'rt') as handle:
    count = sum(1 for _ in SeqIO.parse(handle, 'fasta'))
print(f'{count} sequences')
```

### Fast Count with Low-Level Parser
```python
from Bio.SeqIO.FastaIO import SimpleFastaParser
import gzip

with gzip.open('sequences.fasta.gz', 'rt') as handle:
    count = sum(1 for _ in SimpleFastaParser(handle))
```
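
When only the record count is needed, the parser can be skipped altogether: every FASTA record starts with a `>` header line, so counting those lines gives the same number. This is a stdlib-only sketch that assumes well-formed FASTA; `count_fasta_records` is a hypothetical helper name. The shortcut does not carry over to FASTQ, where a quality line may itself start with `@`.

```python
import gzip
import os
import tempfile


def count_fasta_records(path):
    """Count FASTA records in a gzip file by counting '>' header lines."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Build a small gzipped FASTA file to demonstrate (illustrative data).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write(">a\nACGT\n>b\nGG\nCC\n>c\nT\n")

print(count_fasta_records(demo_path))
```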

### Convert Compressed to Uncompressed
```python
with gzip.open('input.fasta.gz', 'rt') as in_handle:
    records = SeqIO.parse(in_handle, 'fasta')
    SeqIO.write(records, 'output.fasta', 'fasta')
```

### Convert Uncompressed to Compressed
```python
records = SeqIO.parse('input.fasta', 'fasta')
with gzip.open('output.fasta.gz', 'wt') as out_handle:
    SeqIO.write(records, out_handle, 'fasta')
```

### Auto-Detect Compression

```python
from pathlib import Path
from Bio import SeqIO
import gzip
import bz2

def open_sequence_file(filepath, fmt):
    """Yield records from a plain, gzip/BGZF, or bzip2 sequence file."""
    filepath = Path(filepath)
    suffix = filepath.suffix.lower()
    if suffix in ('.gz', '.bgz'):
        # gzip.open reads both regular gzip and BGZF;
        # bgzf.open raises ValueError on regular gzip
        opener = gzip.open
    elif suffix == '.bz2':
        opener = bz2.open
    else:
        opener = open
    # Generator: the handle stays open while records are consumed, then closes
    with opener(filepath, 'rt') as handle:
        yield from SeqIO.parse(handle, fmt)
```
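
File extensions are not always reliable: BGZF files are frequently named `.gz`, and files get renamed. An alternative is to look at the first bytes of the file. The sketch below is stdlib-only, and `sniff_compression` is a hypothetical helper. It assumes a BGZF writer places the `BC` extra subfield first in the gzip header, which is the usual layout but is stated here as an assumption.

```python
import bz2
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def sniff_compression(path):
    """Guess 'bgzf', 'gzip', 'bzip2', or 'none' from the first bytes of a file."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    if head.startswith(GZIP_MAGIC):
        # Byte 3 holds the gzip flags; bit 0x04 means an extra field follows.
        has_extra = len(head) >= 14 and bool(head[3] & 0x04)
        # Assumption: the BGZF 'BC' subfield comes first, at bytes 12-13.
        if has_extra and head[12:14] == b"BC":
            return "bgzf"
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "none"


# Demonstrate on small files written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "a.fasta.gz")
bz_path = os.path.join(tmp_dir, "a.fasta.bz2")
with gzip.open(gz_path, "wt") as handle:
    handle.write(">seq1\nACGT\n")
with bz2.open(bz_path, "wt") as handle:
    handle.write(">seq1\nACGT\n")

print(sniff_compression(gz_path), sniff_compression(bz_path))
```

The result can then select `gzip.open`, `bz2.open`, or `bgzf.open` before handing the text-mode handle to `SeqIO.parse()`.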

### Process Large Gzipped File (Memory Efficient)
```python
with gzip.open('large.fastq.gz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fastq'):
        if len(record.seq) >= 100:
            process(record)  # process() is a placeholder for your own function
```

### Compress Existing File (Raw Copy)
```python
import gzip
import shutil

with open('sequences.fasta', 'rb') as f_in:
    with gzip.open('sequences.fasta.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
```

## Compression Comparison

| Format | Extension | Indexable | Speed | Compression |
|--------|-----------|-----------|-------|-------------|
| Gzip | `.gz` | No | Fast | Good |
| BGZF | `.bgz` (often `.gz`) | **Yes** | Fast | Good |
| Bzip2 | `.bz2` | No | Slow | Better |
| LZMA | `.xz` | No | Slowest | Best |
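
The speed and compression columns describe general tendencies; actual figures depend on the data. One way to check is to compress the same bytes with each stdlib codec and compare sizes. The sketch below uses pseudo-random bases as stand-in data, so its numbers are illustrative only. BGZF is left out because it requires Biopython.

```python
import bz2
import gzip
import lzma
import random

# Stand-in data: 200,000 pseudo-random bases (seeded for repeatability).
rng = random.Random(0)
raw = "".join(rng.choice("ACGT") for _ in range(200_000)).encode("ascii")

codecs = {"gzip": gzip, "bzip2": bz2, "lzma": lzma}
sizes = {}
for name, module in codecs.items():
    packed = module.compress(raw)
    # Confirm the round trip is lossless before recording the size.
    if module.decompress(packed) != raw:
        raise RuntimeError(f"{name} round trip failed")
    sizes[name] = len(packed)

for name, size in sizes.items():
    print(f"{name}: {size} bytes ({size / len(raw):.1%} of original)")
```

Replacing `raw` with the bytes of a real sequence file gives a more representative comparison.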

## When to Use Each Format

| Use Case | Recommended Format |
|----------|-------------------|
| Archive (no random access needed) | gzip or bzip2 |
| Need to index compressed file | **BGZF** |
| BAM files and tabix | BGZF (native) |
| Maximum compression | bzip2 or xz |
| Best speed | gzip or BGZF |

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
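| `TypeError: a bytes-like object is required, not 'str'` | Text written to a handle opened in binary mode (`'wb'`) | Use `'wt'` for text mode |
| `StreamModeError` (file must be opened in text mode) | Handle opened with `'rb'` or with no mode (`gzip.open()` and `bz2.open()` default to binary) | Use `'rt'` for text mode |
| `UnicodeDecodeError` | Wrong encoding | Try `gzip.open(file, 'rt', encoding='latin-1')` |
| `gzip.BadGzipFile` | Not a gzip file | Check file extension matches actual format |
| `OSError: Not a gzipped file` | Corrupt or wrong format (`BadGzipFile` is a subclass of `OSError`; Python before 3.8 raises plain `OSError`) | Verify file integrity |
| `SeqIO.index() fails on .gz` | Regular gzip not indexable | Convert to BGZF first |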
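
The gzip-related errors in the table can be checked for up front. Note that `gzip.open()` does not fail when the file is opened, only when it is read. The sketch below is stdlib-only, and `is_readable_gzip` is a hypothetical helper. It streams the whole file and reports whether it decompressed cleanly.

```python
import gzip
import os
import tempfile
import zlib


def is_readable_gzip(path):
    """Return True if the whole file decompresses as gzip, else False."""
    try:
        with gzip.open(path, "rb") as handle:
            # Errors surface on read, not on open, so stream the whole file.
            while handle.read(1024 * 1024):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile subclasses OSError; EOFError signals truncation.
        return False


tmp_dir = tempfile.mkdtemp()
good_path = os.path.join(tmp_dir, "good.fasta.gz")
with gzip.open(good_path, "wt") as handle:
    handle.write(">seq1\nACGT\n")

# A plain-text file that merely carries a .gz extension.
fake_path = os.path.join(tmp_dir, "fake.fasta.gz")
with open(fake_path, "w") as handle:
    handle.write(">seq1\nACGT\n")

print(is_readable_gzip(good_path), is_readable_gzip(fake_path))
```

Because the `except` clause catches `OSError`, a missing file also returns `False`.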

## Decision Tree

```
Working with compressed sequence files?
├── Just reading sequentially?
│   └── Use gzip.open() or bz2.open() with 'rt' mode
├── Need to index the compressed file?
│   └── Convert to BGZF, then use SeqIO.index()
├── Writing compressed output?
│   ├── Will need to index later? → Use bgzf.open()
│   └── Just archiving? → Use gzip.open() or bz2.open()
└── Converting between formats?
    └── Parse with SeqIO, write to new handle
```

## Related Skills

- read-sequences - Core parsing functions used with compressed handles
- write-sequences - Write to compressed output files
- batch-processing - Process multiple compressed files
- alignment-files - BAM files use BGZF natively; samtools handles compression
