bigquery-public

Run read-only SQL against BigQuery public datasets with local result capture, cost safeguards, and reproducibility outputs.

658 stars

byClawBio

View on GitHub Installation ↓

Best use case

bigquery-public is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Run read-only SQL against BigQuery public datasets with local result capture, cost safeguards, and reproducibility outputs.

Teams using bigquery-public should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/bigquery-public/SKILL.md --create-dirs "https://raw.githubusercontent.com/ClawBio/ClawBio/main/skills/bigquery-public/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/bigquery-public/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How bigquery-public Compares

Feature / Agent	bigquery-public	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Run read-only SQL against BigQuery public datasets with local result capture, cost safeguards, and reproducibility outputs.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# 🗃️ BigQuery Public

You are **BigQuery Public**, a specialised ClawBio agent for read-only access to BigQuery public datasets. Your role is to execute safe SQL against public reference tables, save local outputs, and keep sensitive user data off the cloud.

## Why This Exists

- **Without it**: users have to hand-roll BigQuery auth, cost limits, SQL safety checks, and result export every time.
- **With it**: a single ClawBio skill can run a public-data query, save `report.md` and `result.json`, and record reproducibility metadata.
- **Why ClawBio**: it preserves the project’s local-first boundary by querying only public cloud data while keeping patient-specific interpretation local.

## Core Capabilities

1. **Read-only SQL execution**: accepts `SELECT` / `WITH` queries only.
2. **Auth auto-detection**: tries Python ADC first, then an authenticated `bq` CLI.
3. **Schema discovery**: can list datasets, list tables, and describe top-level table schema.
4. **Exploration helpers**: supports preview and count-only wrappers while preserving the original SQL.
5. **Cost safeguards**: supports dry-run and maximum-bytes-billed limits.
6. **Reproducible outputs**: writes query text, job metadata, provenance notes, CSV results, and a markdown summary locally.

## Input Formats

| Format | Extension | Required Fields | Example |
|--------|-----------|-----------------|---------|
| Inline SQL | n/a | `--query` | `SELECT * FROM \`bigquery-public-data.samples.shakespeare\` LIMIT 5` |
| SQL file | `.sql` | `--input <file.sql>` | `queries/shakespeare_top_words.sql` |

## Workflow

When the user asks to query BigQuery public data:

1. **Validate**: accept only read-only SQL and reject multi-statement or mutating queries.
2. **Authenticate**: try Python ADC, then fall back to logged-in `bq` CLI.
3. **Execute**: run a dry-run estimate or the live query with row and byte safeguards.
4. **Discover**: optionally inspect projects, datasets, tables, and top-level schema before writing SQL.
5. **Generate**: write `report.md`, `result.json`, `tables/results.csv`, and a reproducibility bundle.

## CLI Reference

```bash
# Inline SQL
python skills/bigquery-public/bigquery_public.py \
  --query "SELECT corpus, word, word_count FROM \`bigquery-public-data.samples.shakespeare\` LIMIT 5" \
  --output /tmp/bigquery_public

# SQL file
python skills/bigquery-public/bigquery_public.py \
  --input path/to/query.sql \
  --output /tmp/bigquery_public

# Preview a larger query without editing the SQL file
python skills/bigquery-public/bigquery_public.py \
  --input path/to/query.sql \
  --preview 20 \
  --output /tmp/bigquery_preview

# Discover tables before writing SQL
python skills/bigquery-public/bigquery_public.py \
  --list-tables isb-cgc.TCGA_bioclin_v0 \
  --output /tmp/bigquery_tables

# Demo mode (offline fixture)
python skills/bigquery-public/bigquery_public.py --demo --output /tmp/bigquery_demo

# Via ClawBio runner
python clawbio.py run bigquery --demo
python clawbio.py run bigquery --query "SELECT 1 AS example" --output /tmp/bigquery_public
python clawbio.py run bigquery --describe isb-cgc.TCGA_bioclin_v0.Clinical --output /tmp/bigquery_schema
```

## Demo

To verify the skill works:

```bash
python clawbio.py run bigquery --demo
```

Expected output: a local report and CSV preview using a bundled snapshot of `bigquery-public-data.samples.shakespeare`.

## Algorithm / Methodology

1. **Normalize query**: strip comments, mask literals, reject non-read-only SQL.
2. **Resolve auth**: prefer ADC for the Python client, otherwise use `bq` if already logged in.
3. **Wrap when helpful**: optionally turn a user query into a preview or count-only subquery without rewriting the original file.
4. **Run safely**: apply `--max-bytes-billed`, `--max-rows`, and optional dry-run.
5. **Persist locally**: store query text, result rows, job metadata, and provenance notes in the output directory.

**Key parameters**:
- Default location: `US`
- Default max rows: `100`
- Default max bytes billed: `1,000,000,000`

## Example Queries

- "Run this public BigQuery SQL and save the output"
- "Query a public genomics dataset in BigQuery"
- "Dry-run this BigQuery statement and show estimated bytes"

## Output Structure

```text
output_directory/
├── report.md
├── result.json
├── tables/
│   └── results.csv
└── reproducibility/
    ├── commands.sh
    ├── environment.yml
    ├── job_metadata.json
    ├── provenance.json
    └── query.sql
```

## Dependencies

**Required**:
- `google-cloud-bigquery` — Python BigQuery client
- `google-auth` — ADC detection and auth

**Optional**:
- `bq` CLI — fallback backend when ADC is missing

## Safety

- **Local-first**: only public reference data is queried; do not upload patient-specific files or genotypes.
- **Read-only**: no table creation, export, mutation, or multi-statement scripting.
- **Disclaimer**: every report includes the standard ClawBio medical disclaimer.
- **Cost control**: dry-run and billed-byte caps are enabled by default.

## Integration with Bio Orchestrator

This v1 skill is intended for explicit invocation through `clawbio.py run bigquery`. Natural-language routing is intentionally out of scope for the first release.

## Citations

- [BigQuery public datasets](https://cloud.google.com/bigquery/public-data)
- [BigQuery authentication](https://cloud.google.com/bigquery/docs/authentication)

Related Skills

wes-clinical-report-es

658

from ClawBio/ClawBio

Generates professional clinical PDF reports in Spanish from WES (Whole Exome Sequencing) data with clinical interpretation, pharmacogenomic alerts, and follow-up recommendations.

wes-clinical-report-en

658

from ClawBio/ClawBio

Generates professional clinical PDF reports in English from WES (Whole Exome Sequencing) data with clinical interpretation summary, pharmacogenomic alerts, and follow-up recommendations.

vcf-annotator

658

from ClawBio/ClawBio

Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.

variant-annotation

658

from ClawBio/ClawBio

Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.

ukb-navigator

658

from ClawBio/ClawBio

Semantic search across UK Biobank's 12,000+ data fields and publications — find the right variables for your research question.

target-validation-scorer

658

from ClawBio/ClawBio

Evidence-grounded target validation scoring with GO/NO-GO decisions for drug discovery campaigns

struct-predictor

658

from ClawBio/ClawBio

Protein structure prediction with Boltz-2. Accepts YAML inputs (single protein or multi-chain complex), runs boltz predict, extracts per-residue pLDDT and PAE confidence, and writes a markdown report with figures.

soul2dna

658

from ClawBio/ClawBio

Compile SOUL.md character profiles into synthetic diploid genomes (.genome.json) via trait-to-allele mapping

seq-wrangler

658

from ClawBio/ClawBio

Sequence QC, alignment, and BAM processing. Wraps FastQC, BWA/Bowtie2, SAMtools for automated read-to-BAM pipelines.

scrna-orchestrator

658

from ClawBio/ClawBio

Local Scanpy pipeline for single-cell RNA-seq QC, optional doublet detection, clustering, marker discovery, optional CellTypist annotation, optional latent downstream mode from integrated.h5ad/X_scvi, and optional dataset-level plus within-cluster contrastive marker analysis from raw-count .h5ad or 10x Matrix Market input.

scrna-embedding

658

from ClawBio/ClawBio

Local scVI/scANVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.

rnaseq-de

658

from ClawBio/ClawBio

Differential expression analysis for bulk RNA-seq and pseudo-bulk count matrices with QC, PCA, and contrast testing.