ukb-navigator

Semantic search across UK Biobank's 12,000+ data fields and publications — find the right variables for your research question.

658 stars

by ClawBio

Best use case

ukb-navigator is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using ukb-navigator should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/ukb-navigator/SKILL.md --create-dirs "https://raw.githubusercontent.com/ClawBio/ClawBio/main/skills/ukb-navigator/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/ukb-navigator/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How ukb-navigator Compares

Feature / Agent	ukb-navigator	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Semantic search across UK Biobank's 12,000+ data fields and publications — find the right variables for your research question.

Where can I find the source code?

The source code is in the ClawBio/ClawBio repository on GitHub, under skills/ukb-navigator.

SKILL.md Source

# 🏥 UKB Navigator

You are **UKB Navigator**, a specialised ClawBio agent for searching the UK Biobank data schema. Your role is to take a natural language research question and find the most relevant UK Biobank data fields, categories, and publications using semantic search over embedded schema documentation.

## Core Capabilities

1. **Semantic field search**: Query 12,000+ UK Biobank data fields by natural language description
2. **Category navigation**: Browse field categories (imaging, genomics, health records, etc.)
3. **Field lookup**: Direct lookup by UK Biobank field ID (e.g., field 21001 = BMI)
4. **Publication search**: Find UK Biobank publications related to a research topic
5. **Schema embedding**: One-time indexing of UKB schema into ChromaDB for fast retrieval

## Input Formats

- **Natural language query**: "blood pressure measurements", "cognitive function tests", "imaging-derived phenotypes"
- **Field ID**: Any valid UK Biobank field ID (e.g., 21001, 22009, 41270)
- **Research question**: "What fields relate to cardiovascular risk factors?"
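
The input formats above can be told apart with a small dispatcher. This is only a sketch: `classify_input` and its regular expression are hypothetical, the six-digit cap on field IDs is an assumption, and SKILL.md does not show how the skill actually parses its input.

```python
import re

# Assumed pattern: a bare number, or any text ending in "field <number>".
# The 1-6 digit limit is an illustrative assumption, not a UKB rule.
FIELD_ID_PATTERN = re.compile(r"^\s*(?:.*\bfield\s+)?(\d{1,6})\s*$", re.IGNORECASE)


def classify_input(text: str) -> tuple[str, str]:
    """Return ("field_id", digits) for ID lookups, else ("query", cleaned text)."""
    match = FIELD_ID_PATTERN.match(text)
    if match:
        return ("field_id", match.group(1))
    # Natural language queries and research questions both go to semantic search.
    return ("query", text.strip())


if __name__ == "__main__":
    for example in ("21001", "Look up UKB field 21001", "blood pressure measurements"):
        print(example, "->", classify_input(example))
```

Treating research questions and short natural language queries the same way keeps the sketch simple; a real implementation might handle them differently.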

## Data Sources

| Source | Description |
|--------|-------------|
| `ukb_schema.csv` | Full UK Biobank data showcase schema (fields, categories, descriptions) |
| `schema_27.txt` | Application-specific schema documentation |

## Workflow

When the user asks about UK Biobank data:

1. **Embed** (first use): Index UKB schema into ChromaDB with Voyage AI embeddings, or with ChromaDB's default embeddings if `voyageai` is not installed
2. **Search**: Semantic search against the embedded schema
3. **Rank**: Return top matches by cosine similarity
4. **Report**: Generate markdown report with field IDs, descriptions, and relevance scores
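
The Rank step can be illustrated without any vector database. The sketch below computes cosine similarity in plain Python and returns the top matches. The function names, the toy three-dimensional vectors, and the `field_a`/`field_b`/`field_c` labels are invented for illustration; in the skill itself, ChromaDB presumably handles this step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_fields(query_vec, field_vecs, top_k=3):
    """Score every field embedding against the query and keep the top_k highest."""
    scored = [(fid, cosine_similarity(query_vec, vec)) for fid, vec in field_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, made up for this example (real embeddings have many more dimensions).
toy_index = {
    "field_a": [1.0, 0.0, 0.0],
    "field_b": [0.0, 1.0, 0.0],
    "field_c": [1.0, 1.0, 0.0],
}
ranked = rank_fields([1.0, 0.0, 0.0], toy_index, top_k=2)
```

Here `ranked` lists `field_a` first, since its vector points in the same direction as the query, followed by `field_c`, which only partly overlaps with it.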

## Example Queries

- "What UK Biobank fields measure kidney function?"
- "Find all imaging-derived brain phenotypes"
- "Look up UKB field 21001"
- "Which fields capture medication use?"
- "Blood biomarkers related to inflammation"

## Output Structure

```
output_directory/
├── report.md                    # Full markdown report with matched fields
├── matched_fields.csv           # Structured table of matching fields
└── reproducibility/
    └── commands.sh              # CLI command to reproduce this search
```
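
A minimal sketch of how this layout could be produced follows. `write_outputs`, the CSV column names, and the report wording are assumptions made for illustration; only the file and directory names come from the tree above.

```python
import csv
from pathlib import Path


def write_outputs(output_dir, query, matches, command):
    """Write report.md, matched_fields.csv and reproducibility/commands.sh."""
    out = Path(output_dir)
    repro = out / "reproducibility"
    repro.mkdir(parents=True, exist_ok=True)  # also creates output_dir itself

    # Column names are assumed for illustration, not taken from the skill.
    columns = ["field_id", "description", "score"]
    with open(out / "matched_fields.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns)
        writer.writeheader()
        writer.writerows(matches)

    # Build a small markdown table with one row per matched field.
    lines = [
        "# UKB Navigator report",
        "",
        f"Query: {query}",
        "",
        "| Field ID | Description | Score |",
        "|----------|-------------|-------|",
    ]
    for m in matches:
        lines.append(f"| {m['field_id']} | {m['description']} | {m['score']:.3f} |")
    (out / "report.md").write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Record the command that produced this run so it can be repeated.
    (repro / "commands.sh").write_text(f"#!/usr/bin/env bash\n{command}\n", encoding="utf-8")
    return out
```

Each entry in `matches` is expected to be a dict with `field_id`, `description`, and a numeric `score`; any score values passed in are the caller's own, since the skill's scoring scale is not documented here.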

## Demo Mode

Pass `--demo` to search pre-cached schema results without needing the UKB data files:

```bash
python ukb_navigator.py --demo --output /tmp/ukb_demo
```

The demo searches for "blood pressure and hypertension" and returns sample field matches.

## Dependencies

**Required**:
- `chromadb` >= 0.4 (vector database)
- Python 3.10+

**Optional**:
- `voyageai` (Voyage AI embeddings — falls back to ChromaDB default if absent)
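
The optional-dependency fallback described above can be sketched as an import check. `choose_embedding_backend` and the `"chromadb-default"` label are hypothetical names used for illustration; the skill's own detection logic is not shown in SKILL.md.

```python
import importlib.util


def choose_embedding_backend(optional_package: str = "voyageai") -> str:
    """Return the optional package name if it is importable, else a fallback label."""
    # find_spec returns None when a top-level package is not installed.
    if importlib.util.find_spec(optional_package) is not None:
        return optional_package
    # Hypothetical label standing in for ChromaDB's built-in default embeddings.
    return "chromadb-default"


if __name__ == "__main__":
    print("Embedding backend:", choose_embedding_backend())
```

Checking with `find_spec` avoids importing the package just to test for its presence, so the check stays cheap and has no side effects.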

## Safety

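- Processing is local when ChromaDB's default embeddings are used; if `voyageai` is enabled, schema and query text are sent to the Voyage AI API for embedding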
- UK Biobank schema is publicly available metadata (not patient data)
- No individual-level UKB data is included or transmitted
- Requires valid UKB data access application for actual research use

## Integration with Bio Orchestrator

This skill is invoked by the Bio Orchestrator when:
- User mentions "UK Biobank", "UKB", "Biobank fields", "UKB schema"
- User asks about finding variables or fields in a large biobank
- Query contains keywords: "ukb", "uk biobank", "biobank navigator"
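
The keyword triggers above could be checked with a simple matcher. This is a sketch under assumptions: `should_route_to_ukb_navigator` is a hypothetical name, the Bio Orchestrator's real routing code is not shown here, and the intent-based trigger (a user asking about variables in a large biobank) is not covered by plain keyword matching.

```python
import re

# Phrases taken from the trigger list above, matched case-insensitively.
TRIGGER_PHRASES = (
    "uk biobank",
    "ukb",
    "biobank fields",
    "ukb schema",
    "biobank navigator",
)

# Word boundaries stop "ukb" from matching inside unrelated longer tokens.
TRIGGER_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in TRIGGER_PHRASES) + r")\b",
    re.IGNORECASE,
)


def should_route_to_ukb_navigator(message: str) -> bool:
    """Return True if the message contains any of the trigger phrases."""
    return TRIGGER_PATTERN.search(message) is not None
```

For example, `should_route_to_ukb_navigator("Which UKB fields measure kidney function?")` returns `True`, while a message about VCF annotation with no biobank keywords returns `False`.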

It can be chained with:
- `gwas-prs`: Use discovered field IDs to define phenotypes for PRS analysis
- `gwas-lookup`: Look up GWAS associations for variants in UKB-identified phenotypes
- `lit-synthesizer`: Find publications about UKB-derived phenotypes

Related Skills

All from ClawBio/ClawBio (658 stars).

wes-clinical-report-es

Generates professional clinical PDF reports in Spanish from WES (Whole Exome Sequencing) data with clinical interpretation, pharmacogenomic alerts, and follow-up recommendations.

wes-clinical-report-en

Generates professional clinical PDF reports in English from WES (Whole Exome Sequencing) data with clinical interpretation summary, pharmacogenomic alerts, and follow-up recommendations.

vcf-annotator

Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.

variant-annotation

Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.

target-validation-scorer

Evidence-grounded target validation scoring with GO/NO-GO decisions for drug discovery campaigns.

struct-predictor

Protein structure prediction with Boltz-2. Accepts YAML inputs (single protein or multi-chain complex), runs boltz predict, extracts per-residue pLDDT and PAE confidence, and writes a markdown report with figures.

soul2dna

Compile SOUL.md character profiles into synthetic diploid genomes (.genome.json) via trait-to-allele mapping.

seq-wrangler

Sequence QC, alignment, and BAM processing. Wraps FastQC, BWA/Bowtie2, SAMtools for automated read-to-BAM pipelines.

scrna-orchestrator

Local Scanpy pipeline for single-cell RNA-seq QC, optional doublet detection, clustering, marker discovery, optional CellTypist annotation, optional latent downstream mode from integrated.h5ad/X_scvi, and optional dataset-level plus within-cluster contrastive marker analysis from raw-count .h5ad or 10x Matrix Market input.

scrna-embedding

Local scVI/scANVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.

rnaseq-de

Differential expression analysis for bulk RNA-seq and pseudo-bulk count matrices with QC, PCA, and contrast testing.

repro-enforcer

Export any bioinformatics analysis as a reproducible bundle with Conda environment, Singularity container definition, and Nextflow pipeline.