tcga-bulk-data-preprocessing-with-omicverse

Guide Claude through ingesting TCGA sample sheets, expression archives, and clinical carts into omicverse, initialising survival metadata, and exporting annotated AnnData files.

1,802 stars

Best use case

tcga-bulk-data-preprocessing-with-omicverse is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Guide Claude through ingesting TCGA sample sheets, expression archives, and clinical carts into omicverse, initialising survival metadata, and exporting annotated AnnData files.

Teams using tcga-bulk-data-preprocessing-with-omicverse should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/tcga-preprocessing/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/tcga-preprocessing/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/tcga-preprocessing/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How tcga-bulk-data-preprocessing-with-omicverse Compares

Feature / Agenttcga-bulk-data-preprocessing-with-omicverseStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Guide Claude through ingesting TCGA sample sheets, expression archives, and clinical carts into omicverse, initialising survival metadata, and exporting annotated AnnData files.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# TCGA bulk data preprocessing with omicverse

## Overview
Follow this skill to recreate the preprocessing routine from [`t_tcga.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_tcga.ipynb). It automates loading TCGA downloads, generating raw/normalised matrices, initialising metadata, and running survival analyses through `ov.bulk.pyTCGA`.

## Instructions
1. **Gather required downloads**
   - Confirm the user has:
     - `gdc_sample_sheet.<date>.tsv` from the TCGA Sample Sheet export.
     - The decompressed `gdc_download_xxxxx` directory containing expression archives.
     - The `clinical.cart.<date>` directory with clinical XML/JSON files.
   - Mention that sample data are available under [`omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/`](../../omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/).
2. **Initialise the TCGA helper**
   - Import `omicverse as ov` (and `scanpy as sc` if plotting) then call `ov.plot_set()`.
   - Instantiate `aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)`.
   - Run `aml_tcga.adata_init()` to build the AnnData object with raw counts, FPKM, and TPM layers.
3. **Persist the dataset**
   - Encourage saving the initial AnnData: `aml_tcga.adata.write_h5ad('data/TCGA_OV/ov_tcga_raw.h5ad', compression='gzip')`.
   - When reloading, reconstruct the class with the same paths and call `aml_tcga.adata_read(<path>)`.
4. **Initialise metadata and clinical information**
   - Populate sample metadata using `aml_tcga.adata_meta_init()` to convert gene IDs to symbols and attach patient info.
   - Add survival attributes via `aml_tcga.survial_init()` (note the intentional spelling in the API).
5. **Perform survival analyses**
   - Plot gene-level survival curves with `aml_tcga.survival_analysis('GENE', layer='deseq_normalize', plot=True)`.
   - To process all genes, call `aml_tcga.survial_analysis_all()`; warn that it may take time.
6. **Export results**
   - Save enriched metadata to a new AnnData file (`aml_tcga.adata.write_h5ad('.../ov_tcga_survial_all.h5ad', compression='gzip')`).
   - Suggest exporting summary tables (e.g., survival statistics) if users need to share outputs outside Python.
7. **Troubleshooting tips**
   - Ensure TCGA archives are fully extracted; missing XML/TSV files trigger parsing errors.
   - The helper expects matching case IDs between the sample sheet and expression files—direct users to re-download if IDs do not
 align.
   - Survival plots require clinical dates; if absent, instruct users to check the `clinical_cart` contents.

## Examples
- "Read my TCGA OV download, initialise metadata, and plot MYC survival curves using DESeq-normalised counts."
- "Reload a saved AnnData file, attach survival annotations, and export the updated `.h5ad`."
- "Run survival analysis for all genes and store the enriched dataset."

## References
- Tutorial notebook: [`t_tcga.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_tcga.ipynb)
- Sample dataset: [`data/TCGA_OV/`](../../omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/)
- Quick copy/paste commands: [`reference.md`](reference.md)

Related Skills

zinc-database

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Access ZINC (230M+ purchasable compounds). Search by ZINC ID/SMILES, similarity searches, 3D-ready structures for docking, analog discovery, for virtual screening and drug discovery.

uspto-database

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Access USPTO APIs for patent/trademark searches, examination history (PEDS), assignments, citations, office actions, TSDR, for IP analysis and prior art searches.

uniprot-database

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Direct REST API access to UniProt. Protein searches, FASTA retrieval, ID mapping, Swiss-Prot/TrEMBL. For Python workflows with multiple databases, prefer bioservices (unified interface to 40+ services). Use this for direct HTTP/REST work or UniProt-specific control.

tooluniverse-expression-data-retrieval

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Retrieves gene expression and omics datasets from ArrayExpress and BioStudies with gene disambiguation, experiment quality assessment, and structured reports. Creates comprehensive dataset profiles with metadata, sample information, and download links. Use when users need expression data, omics datasets, or mention ArrayExpress (E-MTAB, E-GEOD) or BioStudies (S-BSST) accessions.

string-database

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Query STRING API for protein-protein interactions (59M proteins, 20B interactions). Network analysis, GO/KEGG enrichment, interaction discovery, 5000+ species, for systems biology.

spatial-transcriptomics-tutorials-with-omicverse

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Guide users through omicverse's spatial transcriptomics tutorials covering preprocessing, deconvolution, and downstream modelling workflows across Visium, Visium HD, Stereo-seq, and Slide-seq datasets.

single-cell-preprocessing-with-omicverse

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Walk through omicverse's single-cell preprocessing tutorials to QC PBMC3k data, normalise counts, detect HVGs, and run PCA/embedding pipelines on CPU, CPU–GPU mixed, or GPU stacks.

single-cell-clustering-and-batch-correction-with-omicverse

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Guide Claude through omicverse's single-cell clustering workflow, covering preprocessing, QC, multimethod clustering, topic modeling, cNMF, and cross-batch integration as demonstrated in t_cluster.ipynb and t_single_batch.ipynb.

single-cell-annotation-skills-with-omicverse

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Guide Claude through SCSA, MetaTiME, CellVote, CellMatch, GPTAnno, and weighted KNN transfer workflows for annotating single-cell modalities.

reactome-database

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Query Reactome REST API for pathway analysis, enrichment, gene-pathway mapping, disease pathways, molecular interactions, expression analysis, for systems biology studies.

pubmed-database

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Direct REST API access to PubMed. Advanced Boolean/MeSH queries, E-utilities API, batch processing, citation management. For Python workflows, prefer biopython (Bio.Entrez). Use this for direct HTTP/REST work or custom API implementations.

pubchem-database

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Query PubChem via PUG-REST API/PubChemPy (110M+ compounds). Search by name/CID/SMILES, retrieve properties, similarity/substructure searches, bioactivity, for cheminformatics.