tcga-bulk-data-preprocessing-with-omicverse
Guide Claude through ingesting TCGA sample sheets, expression archives, and clinical carts into omicverse, initialising survival metadata, and exporting annotated AnnData files.
Best use case
tcga-bulk-data-preprocessing-with-omicverse is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Guide Claude through ingesting TCGA sample sheets, expression archives, and clinical carts into omicverse, initialising survival metadata, and exporting annotated AnnData files.
Teams using tcga-bulk-data-preprocessing-with-omicverse should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/tcga-preprocessing/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How tcga-bulk-data-preprocessing-with-omicverse Compares
| Feature / Agent | tcga-bulk-data-preprocessing-with-omicverse | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Guide Claude through ingesting TCGA sample sheets, expression archives, and clinical carts into omicverse, initialising survival metadata, and exporting annotated AnnData files.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
SKILL.md Source
# TCGA bulk data preprocessing with omicverse
## Overview
Follow this skill to recreate the preprocessing routine from [`t_tcga.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_tcga.ipynb). It automates loading TCGA downloads, generating raw/normalised matrices, initialising metadata, and running survival analyses through `ov.bulk.pyTCGA`.
## Instructions
1. **Gather required downloads**
- Confirm the user has:
- `gdc_sample_sheet.<date>.tsv` from the TCGA Sample Sheet export.
- The decompressed `gdc_download_xxxxx` directory containing expression archives.
- The `clinical.cart.<date>` directory with clinical XML/JSON files.
- Mention that sample data are available under [`omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/`](../../omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/).
2. **Initialise the TCGA helper**
- Import `omicverse as ov` (and `scanpy as sc` if plotting) then call `ov.plot_set()`.
- Instantiate `aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)`.
- Run `aml_tcga.adata_init()` to build the AnnData object with raw counts, FPKM, and TPM layers.
3. **Persist the dataset**
- Encourage saving the initial AnnData: `aml_tcga.adata.write_h5ad('data/TCGA_OV/ov_tcga_raw.h5ad', compression='gzip')`.
- When reloading, reconstruct the class with the same paths and call `aml_tcga.adata_read(<path>)`.
4. **Initialise metadata and clinical information**
- Populate sample metadata using `aml_tcga.adata_meta_init()` to convert gene IDs to symbols and attach patient info.
- Add survival attributes via `aml_tcga.survial_init()` (note the intentional spelling in the API).
5. **Perform survival analyses**
- Plot gene-level survival curves with `aml_tcga.survival_analysis('GENE', layer='deseq_normalize', plot=True)`.
- To process all genes, call `aml_tcga.survial_analysis_all()`; warn that it may take time.
6. **Export results**
- Save enriched metadata to a new AnnData file (`aml_tcga.adata.write_h5ad('.../ov_tcga_survial_all.h5ad', compression='gzip')`).
- Suggest exporting summary tables (e.g., survival statistics) if users need to share outputs outside Python.
7. **Troubleshooting tips**
- Ensure TCGA archives are fully extracted; missing XML/TSV files trigger parsing errors.
- The helper expects matching case IDs between the sample sheet and expression files—direct users to re-download if IDs do not
align.
- Survival plots require clinical dates; if absent, instruct users to check the `clinical_cart` contents.
## Examples
- "Read my TCGA OV download, initialise metadata, and plot MYC survival curves using DESeq-normalised counts."
- "Reload a saved AnnData file, attach survival annotations, and export the updated `.h5ad`."
- "Run survival analysis for all genes and store the enriched dataset."
## References
- Tutorial notebook: [`t_tcga.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_tcga.ipynb)
- Sample dataset: [`data/TCGA_OV/`](../../omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/)
- Quick copy/paste commands: [`reference.md`](reference.md)Related Skills
zinc-database
Access ZINC (230M+ purchasable compounds). Search by ZINC ID/SMILES, similarity searches, 3D-ready structures for docking, analog discovery, for virtual screening and drug discovery.
uspto-database
Access USPTO APIs for patent/trademark searches, examination history (PEDS), assignments, citations, office actions, TSDR, for IP analysis and prior art searches.
uniprot-database
Direct REST API access to UniProt. Protein searches, FASTA retrieval, ID mapping, Swiss-Prot/TrEMBL. For Python workflows with multiple databases, prefer bioservices (unified interface to 40+ services). Use this for direct HTTP/REST work or UniProt-specific control.
tooluniverse-expression-data-retrieval
Retrieves gene expression and omics datasets from ArrayExpress and BioStudies with gene disambiguation, experiment quality assessment, and structured reports. Creates comprehensive dataset profiles with metadata, sample information, and download links. Use when users need expression data, omics datasets, or mention ArrayExpress (E-MTAB, E-GEOD) or BioStudies (S-BSST) accessions.
string-database
Query STRING API for protein-protein interactions (59M proteins, 20B interactions). Network analysis, GO/KEGG enrichment, interaction discovery, 5000+ species, for systems biology.
spatial-transcriptomics-tutorials-with-omicverse
Guide users through omicverse's spatial transcriptomics tutorials covering preprocessing, deconvolution, and downstream modelling workflows across Visium, Visium HD, Stereo-seq, and Slide-seq datasets.
single-cell-preprocessing-with-omicverse
Walk through omicverse's single-cell preprocessing tutorials to QC PBMC3k data, normalise counts, detect HVGs, and run PCA/embedding pipelines on CPU, CPU–GPU mixed, or GPU stacks.
single-cell-clustering-and-batch-correction-with-omicverse
Guide Claude through omicverse's single-cell clustering workflow, covering preprocessing, QC, multimethod clustering, topic modeling, cNMF, and cross-batch integration as demonstrated in t_cluster.ipynb and t_single_batch.ipynb.
single-cell-annotation-skills-with-omicverse
Guide Claude through SCSA, MetaTiME, CellVote, CellMatch, GPTAnno, and weighted KNN transfer workflows for annotating single-cell modalities.
reactome-database
Query Reactome REST API for pathway analysis, enrichment, gene-pathway mapping, disease pathways, molecular interactions, expression analysis, for systems biology studies.
pubmed-database
Direct REST API access to PubMed. Advanced Boolean/MeSH queries, E-utilities API, batch processing, citation management. For Python workflows, prefer biopython (Bio.Entrez). Use this for direct HTTP/REST work or custom API implementations.
pubchem-database
Query PubChem via PUG-REST API/PubChemPy (110M+ compounds). Search by name/CID/SMILES, retrieve properties, similarity/substructure searches, bioactivity, for cheminformatics.