
tcga-preprocessing skill

/.claude/skills/tcga-preprocessing

This skill guides you through loading TCGA data, initializing metadata, and exporting annotated AnnData while enabling survival analyses.

npx playbooks add skill starlitnightly/omicverse --skill tcga-preprocessing

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: tcga-bulk-data-preprocessing-with-omicverse
title: TCGA bulk data preprocessing with omicverse
description: "TCGA bulk RNA-seq preprocessing with pyTCGA: GDC sample sheets, expression archives, clinical metadata, Kaplan-Meier survival analysis, and annotated AnnData export."
---

# TCGA Bulk Data Preprocessing with OmicVerse

## Overview
Use this skill for loading TCGA data from GDC downloads, building normalised expression matrices, attaching clinical metadata, and running survival analyses through `ov.bulk.pyTCGA`.

## Instructions

### 1. Gather required downloads
Confirm the user has three items from the GDC Data Portal:
- `gdc_sample_sheet.<date>.tsv` — the sample sheet export
- Decompressed `gdc_download_xxxxx/` directory with expression archives
- `clinical.cart.<date>/` directory with clinical XML/JSON files
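
A minimal sketch for locating the three items inside one folder, assuming they all sit directly under a single root directory and follow the naming patterns listed above. The helper name `find_gdc_inputs` and the "exactly one match" rule are illustrative choices, not part of omicverse.

```python
from pathlib import Path


def find_gdc_inputs(root):
    """Locate the sample sheet, expression directory, and clinical directory under root.

    Hypothetical helper: raises FileNotFoundError unless exactly one
    candidate exists for each of the three naming patterns.
    """
    patterns = {
        "sample_sheet": "gdc_sample_sheet.*.tsv",
        "download_dir": "gdc_download_*",
        "clinical_dir": "clinical.cart.*",
    }
    found = {}
    for name, pattern in patterns.items():
        matches = sorted(Path(root).glob(pattern))
        if name != "sample_sheet":
            # The two directories must be decompressed, so ignore leftover archives.
            matches = [m for m in matches if m.is_dir()]
        if len(matches) != 1:
            raise FileNotFoundError(
                f"expected exactly one match for {pattern!r} in {root}, "
                f"found {len(matches)}"
            )
        found[name] = str(matches[0])
    return found
```

The returned dictionary can supply the three positional arguments used in the next step, for example `ov.bulk.pyTCGA(found["sample_sheet"], found["download_dir"], found["clinical_dir"])`.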

### 2. Initialise the TCGA helper
```python
import omicverse as ov
import scanpy as sc
ov.plot_set()

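# The three arguments are the paths gathered in step 1:
#   sample_sheet_path: the gdc_sample_sheet.<date>.tsv file
#   download_dir:      the decompressed gdc_download_xxxxx/ directory
#   clinical_dir:      the clinical.cart.<date>/ directory
aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)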
aml_tcga.adata_init()  # Builds AnnData with raw counts, FPKM, and TPM layers
```

### 3. Persist and reload
```python
aml_tcga.adata.write_h5ad('data/ov_tcga_raw.h5ad', compression='gzip')

# To reload later:
new_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)
new_tcga.adata_read('data/ov_tcga_raw.h5ad')
```
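
The save and reload calls above can be combined into a simple cache check. The sketch below assumes only the three members already shown in this skill (`adata_init()`, `adata_read(path)`, and `.adata`). The helper name `load_or_build` and its return strings are hypothetical.

```python
import os


def load_or_build(tcga, cache_path):
    """Reuse a saved .h5ad if present; otherwise build the AnnData and save it.

    `tcga` is assumed to be an ov.bulk.pyTCGA-like object exposing
    adata_init(), adata_read(path), and an .adata attribute.
    """
    if os.path.exists(cache_path):
        tcga.adata_read(cache_path)
        return "loaded"
    tcga.adata_init()
    parent = os.path.dirname(cache_path)
    if parent:
        # Create the output folder (e.g. data/) if it does not exist yet.
        os.makedirs(parent, exist_ok=True)
    tcga.adata.write_h5ad(cache_path, compression="gzip")
    return "built"
```

Calling `load_or_build(aml_tcga, 'data/ov_tcga_raw.h5ad')` at the top of a notebook would then skip the slow archive parsing on every run after the first.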

### 4. Initialise metadata and survival
```python
aml_tcga.adata_meta_init()   # Gene ID → symbol mapping, patient info
aml_tcga.survial_init()      # NOTE: "survial" spelling — see Critical API Reference below
```

### 5. Run survival analysis
```python
# Single gene
aml_tcga.survival_analysis('MYC', layer='deseq_normalize', plot=True)

# All genes (can take minutes for large gene sets)
aml_tcga.survial_analysis_all()  # NOTE: "survial" spelling
```

### 6. Export results
```python
aml_tcga.adata.write_h5ad('data/ov_tcga_survival.h5ad', compression='gzip')
```

## Critical API Reference

### IMPORTANT: Method Name Spelling Inconsistency

The pyTCGA API spells its survival methods inconsistently, and each name must be typed exactly as defined. Two methods use "survial" (missing the second 'v'), while one uses the correct "survival":

| Method | Spelling | Purpose |
|--------|----------|---------|
| `survial_init()` | **survial** (second 'v' missing) | Initialize survival metadata columns |
| `survival_analysis(gene, layer, plot)` | **survival** (correct) | Single-gene Kaplan-Meier curve |
| `survial_analysis_all()` | **survial** (second 'v' missing) | Sweep all genes for survival significance |

```python
# CORRECT — use the exact method names as documented
aml_tcga.survial_init()                    # "survial": second 'v' missing
aml_tcga.survival_analysis('MYC', layer='deseq_normalize', plot=True)  # "survival": spelled correctly
aml_tcga.survial_analysis_all()            # "survial": second 'v' missing

# WRONG — these will raise AttributeError
# aml_tcga.survival_init()                 # AttributeError! Use survial_init()
# aml_tcga.survival_analysis_all()         # AttributeError! Use survial_analysis_all()
```
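
If scripts need to tolerate either spelling, one option is to look the method up by name at call time. This is a hedged sketch: `call_first_available` is a hypothetical helper, and the idea that a later release might correct the spelling is an assumption, not something this skill states.

```python
def call_first_available(obj, names, *args, **kwargs):
    """Call the first method in `names` that exists on `obj` and return its result."""
    for name in names:
        method = getattr(obj, name, None)
        if callable(method):
            return method(*args, **kwargs)
    raise AttributeError(
        f"{type(obj).__name__} has none of: {', '.join(names)}"
    )
```

For example, `call_first_available(aml_tcga, ["survial_init", "survival_init"])` tries the documented misspelled name first and falls back to the corrected one only if the first is absent.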

### Survival Analysis Methodology

`survival_analysis()` performs Kaplan-Meier analysis:
1. Splits patients into high/low expression groups using the **median** as cutoff
2. Computes a **log-rank test** p-value to assess significance
3. If `plot=True`, renders survival curves with confidence intervals
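
The first two steps can be illustrated in plain Python. This is a sketch of the general technique (median split followed by a two-group log-rank test), not pyTCGA's own implementation. The function names are hypothetical, and assigning samples exactly at the median to the "low" group is an assumption.

```python
import math
from statistics import median


def median_split(expression):
    """Label each patient 'high' if expression exceeds the cohort median, else 'low'."""
    cutoff = median(expression)
    return ["high" if value > cutoff else "low" for value in expression]


def logrank_test(times, events, groups):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times:  follow-up time per patient
    events: 1 if the patient died at that time, 0 if censored
    groups: 'high' or 'low' label per patient
    """
    records = list(zip(times, events, groups))
    observed = expected = variance = 0.0
    for t in sorted({time for time, event, _ in records if event}):
        at_risk = [r for r in records if r[0] >= t]
        deaths = [r for r in at_risk if r[0] == t and r[1]]
        n, d = len(at_risk), len(deaths)
        n_high = sum(1 for r in at_risk if r[2] == "high")
        d_high = sum(1 for r in deaths if r[2] == "high")
        observed += d_high
        expected += d * n_high / n
        if n > 1:
            # Hypergeometric variance of the 'high' death count at time t.
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed - expected) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))
```

A small p-value indicates that the high and low expression groups have different survival experience; it does not by itself say which group fares better, so the curves still need to be inspected.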

**Layer selection matters**: Use `layer='deseq_normalize'` (recommended) because DESeq2 normalization accounts for library size and composition bias, making expression comparable across samples. Alternative: `layer='tpm'` for TPM-normalized values.
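
To make "library size and composition bias" concrete, the sketch below computes DESeq2-style median-of-ratios size factors in plain Python. It assumes the `deseq_normalize` layer follows this general approach; the code is an illustration of the technique, not omicverse's implementation, and `size_factors` is a hypothetical name.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a samples x genes table of raw counts.

    Genes with a zero count in any sample are excluded from the reference,
    because their geometric mean across samples would be zero.
    """
    n_samples = len(counts)
    n_genes = len(counts[0])
    reference = []  # (gene index, log geometric mean across samples)
    for g in range(n_genes):
        column = [counts[s][g] for s in range(n_samples)]
        if all(c > 0 for c in column):
            reference.append((g, sum(math.log(c) for c in column) / n_samples))
    if not reference:
        raise ValueError("no gene has non-zero counts in every sample")
    factors = []
    for s in range(n_samples):
        log_ratios = [math.log(counts[s][g]) - log_mean for g, log_mean in reference]
        factors.append(math.exp(median(log_ratios)))
    return factors
```

Dividing each sample's counts by its size factor puts samples on a common scale. Because the factor is a median across genes, a handful of very highly expressed genes in one sample does not distort it, which is the composition-bias protection mentioned above.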

## Defensive Validation Patterns

```python
import os

# Before pyTCGA init: verify all paths exist
for name, path in [('sample_sheet', sample_sheet_path),
                    ('downloads', download_dir),
                    ('clinical', clinical_dir)]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"TCGA {name} path not found: {path}")

# After adata_init(): verify expected layers were created
expected_layers = ['counts', 'fpkm', 'tpm']
for layer in expected_layers:
    if layer not in aml_tcga.adata.layers:
        print(f"WARNING: Missing layer '{layer}' — check if TCGA archives are fully extracted")

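# Before survival analysis: heuristic check that metadata is initialized.
# Note: testing for the method with dir()/hasattr() only shows that it exists,
# not that it has been called, so inspect the .obs table instead.
if aml_tcga.adata.obs.shape[1] < 5:
    print("WARNING: Run adata_meta_init() and survial_init() before survival analysis")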
```
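
The path checks above confirm that the files exist, but not that the sample sheet parses as tab-separated text. A small additional check using only the standard library might look like the following. The helper name `check_sample_sheet` is hypothetical, and the required column names are supplied by the caller rather than assumed here.

```python
import csv


def check_sample_sheet(path, required_columns):
    """Return the required column names that are missing from a tab-separated sheet.

    Raises ValueError if the header does not split into multiple columns on
    tabs, which usually means the file uses a different delimiter.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    if len(header) <= 1:
        raise ValueError(f"{path} does not look tab-separated: header={header!r}")
    return [column for column in required_columns if column not in header]
```

For example, `check_sample_sheet(sample_sheet_path, ["File ID", "Case ID", "Sample ID"])` returns an empty list when all three headers are present. Those three names are an assumption about the GDC export format and should be compared against a freshly downloaded sheet.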

## Troubleshooting

- **`AttributeError: 'pyTCGA' object has no attribute 'survival_init'`**: Use the misspelled name `survial_init()` (missing the second 'v'). The same applies to `survial_analysis_all()`. See Critical API Reference above.
- **`KeyError` during `adata_meta_init()`**: Gene IDs in the expression matrix don't match expected format. TCGA uses ENSG IDs; the method maps them to symbols internally. Ensure archives are from the same GDC download.
- **Empty survival plot or NaN p-values**: Clinical XML files are missing date fields (days_to_death, days_to_last_follow_up). Check that the `clinical.cart.*` directory contains complete XML files, not just metadata JSONs.
- **`survial_analysis_all()` runs very slowly**: This tests every gene individually. For a genome with ~20,000 genes, expect 5-15 minutes. Consider filtering to genes of interest first.
- **Sample sheet column mismatch**: Verify the TSV uses tab separators and the header row matches GDC's expected format. Re-download from GDC if column names differ.
- **Missing `deseq_normalize` layer**: This layer is created during `adata_meta_init()`. If absent, re-run the metadata initialization step.
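
When the full sweep is too slow, a shortlist can be run through the single-gene method instead. The sketch below assumes that `adata.var_names` holds gene symbols after `adata_meta_init()` and makes no assumption about what `survival_analysis()` returns; it simply stores whatever comes back. `scan_shortlist` is a hypothetical helper.

```python
def scan_shortlist(tcga, genes, layer="deseq_normalize"):
    """Run single-gene survival analysis for each shortlisted gene found in the data.

    Returns (results, skipped): results maps gene -> whatever
    survival_analysis() returned; skipped lists genes absent from var_names.
    """
    available = set(tcga.adata.var_names)
    results, skipped = {}, []
    for gene in genes:
        if gene not in available:
            skipped.append(gene)
            continue
        # plot=False avoids rendering one figure per gene during the scan.
        results[gene] = tcga.survival_analysis(gene, layer=layer, plot=False)
    return results, skipped
```

A call such as `scan_shortlist(aml_tcga, ['MYC', 'TP53'])` would then test two genes instead of the whole matrix. The gene list here is only an example.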

## Examples
- "Read my TCGA OV download, initialise metadata, and plot MYC survival curves using DESeq-normalised counts."
- "Reload a saved AnnData file, attach survival annotations, and export the updated `.h5ad`."
- "Run survival analysis for all genes and store the enriched dataset."

## References
- Tutorial notebook: `t_tcga.ipynb`
- Quick copy/paste commands: [`reference.md`](reference.md)

Overview

This skill guides Claude through ingesting TCGA sample sheets, expression archives, and clinical carts into omicverse to produce annotated AnnData files ready for downstream analysis. It automates building raw and normalized matrices, initializing sample and survival metadata, and exporting enriched .h5ad files. The workflow mirrors a reproducible Jupyter notebook routine for TCGA bulk RNA-seq preprocessing with omicverse.

How this skill works

The skill instructs loading three inputs: the TCGA sample sheet TSV, the decompressed expression download directory, and the clinical cart directory. It shows how to instantiate ov.bulk.pyTCGA, run adata_init() to assemble raw counts, FPKM and TPM layers, initialize metadata and survival attributes (noting the API method name survial_init()), and perform gene-level survival analyses. Final steps cover saving the AnnData object and exporting summary tables for sharing.

When to use it

  • Preparing TCGA bulk RNA-seq downloads for omicverse analysis and visualization.
  • Standardizing raw and normalized expression matrices into a single AnnData file.
  • Annotating samples with clinical metadata and survival attributes for downstream modeling.
  • Running gene-level survival plots or full-cohort survival scans before downstream statistics.
  • Recreating a published preprocessing pipeline or sharing processed TCGA datasets.

Best practices

  • Ensure the sample sheet, extracted expression archives, and clinical cart are complete and match by case IDs.
  • Save the initial assembled AnnData after adata_init() to avoid reprocessing large downloads.
  • Verify clinical XML/JSON contain date fields required for survival; otherwise survival will be incomplete.
  • Use consistent file paths when reconstructing the pyTCGA helper; call adata_read() to reload saved .h5ad.
  • Be aware that survial_analysis_all() processes many genes and can be time-consuming; run on a cluster if available.

Example use cases

  • Read a TCGA OV download set, build the AnnData with raw and DESeq-normalised layers, then plot MYC survival curves.
  • Reload a previously saved ov_tcga_raw.h5ad, run metadata and survival initialization, and export ov_tcga_survial_all.h5ad.
  • Run survial_analysis_all() to compute survival statistics across all genes and save summary tables for publication.
  • Troubleshoot ID mismatches between sample_sheet and expression files and re-download specific archives when needed.

FAQ

What inputs do I need to run this workflow?

You need the TCGA sample sheet TSV, the decompressed gdc_download directory with expression archives, and the clinical.cart directory containing clinical XML/JSON files.

Why is the method named survial_init() instead of survival_init()?

The omicverse API defines the method under the misspelled name survial_init() (missing the second 'v'). Calling survival_init() raises an AttributeError, so use the name exactly as written to initialize survival attributes.