home / skills / starlitnightly / omicverse / bulk-combat-correction

bulk-combat-correction skill

safe

This skill harmonises bulk RNA-seq data across batches using ComBat, exports corrected matrices, and benchmarks pre/post correction visually.

npx playbooks add skill starlitnightly/omicverse --skill bulk-combat-correction

Review the files below or copy the command above to add this skill to your agents.

Files (2)

SKILL.md

3.7 KB

---
name: bulk-rna-seq-batch-correction-with-combat
title: Bulk RNA-seq batch correction with ComBat
description: Use omicverse's pyComBat wrapper to remove batch effects from merged bulk RNA-seq or microarray cohorts, export corrected matrices, and benchmark pre/post correction visualisations.
---

# Bulk RNA-seq batch correction with ComBat

## Overview
Apply this skill when a user has multiple bulk expression matrices measured across different batches and needs to harmonise them
 before downstream analysis. It follows [`t_bulk_combat.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_bulk_combat.ipynb), w
hich demonstrates the pyComBat workflow on ovarian cancer microarray cohorts.

## Instructions
1. **Import core libraries**
   - Load `omicverse as ov`, `anndata`, `pandas as pd`, and `matplotlib.pyplot as plt`.
   - Call `ov.ov_plot_set()` (aliased `ov.plot_set()` in some releases) to align figures with omicverse styling.
2. **Load each batch separately**
   - Read the prepared pickled matrices (or user-provided expression tables) with `pd.read_pickle(...)`/`pd.read_csv(...)`.
   - Transpose to gene × sample before wrapping them in `anndata.AnnData` objects so `adata.obs` stores sample metadata.
   - Assign a `batch` column for every cohort (`adata.obs['batch'] = '1'`, `'2'`, ...). Encourage descriptive labels when availa
ble.
3. **Concatenate on shared genes**
   - Use `anndata.concat([adata1, adata2, adata3], merge='same')` to retain the intersection of genes across batches.
   - Confirm the combined `adata` reports balanced sample counts per batch; if not, prompt users to re-check inputs.
4. **Run ComBat batch correction**
   - Execute `ov.bulk.batch_correction(adata, batch_key='batch')`.
   - Explain that corrected values are stored in `adata.layers['batch_correction']` while the original counts remain in `adata.X`.
5. **Export corrected and raw matrices**
   - Obtain DataFrames via `adata.to_df().T` (raw) and `adata.to_df(layer='batch_correction').T` (corrected).
   - Encourage saving both tables (`.to_csv(...)`) plus the harmonised AnnData (`adata.write_h5ad('adata_batch.h5ad', compressio
n='gzip')`).
6. **Benchmark the correction**
   - For per-sample variance checks, draw before/after boxplots and recolour boxes using `ov.utils.red_color`, `blue_color`, `gree
n_color` palettes to match batches.
   - Copy raw counts to a named layer with `adata.layers['raw'] = adata.X.copy()` before PCA.
   - Run `ov.pp.pca(adata, layer='raw', n_pcs=50)` and `ov.pp.pca(adata, layer='batch_correction', n_pcs=50)`.
   - Visualise embeddings with `ov.utils.embedding(..., basis='raw|original|X_pca', color='batch', frameon='small')` and repeat fo
r the corrected layer to verify mixing.
7. **Troubleshooting tips**
   - Mismatched gene identifiers cause dropped features—remind users to harmonise feature names (e.g., gene symbols) before conca
tenation.
   - pyComBat expects log-scale intensities or similarly distributed counts; recommend log-transforming strongly skewed matrices.
   - If `batch_correction` layer is missing, ensure the `batch_key` matches the column name in `adata.obs`.

## Examples
- "Combine three GEO ovarian cohorts, run ComBat, and export both the raw and corrected CSV matrices."
- "Plot PCA embeddings before and after batch correction to confirm that batches 1–3 overlap."
- "Save the harmonised AnnData file so I can reload it later for downstream DEG analysis."

## References
- Tutorial notebook: [`t_bulk_combat.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_bulk_combat.ipynb)
- Example inputs: [`omicverse_guide/docs/Tutorials-bulk/data/combat/`](../../omicverse_guide/docs/Tutorials-bulk/data/combat/)
- Quick copy/paste commands: [`reference.md`](reference.md)

Overview

This skill harmonises merged bulk RNA-seq or microarray cohorts using omicverse's pyComBat wrapper to remove batch effects, export corrected matrices, and generate benchmark visualisations. It walks through loading per-batch matrices, concatenating on shared genes, running ComBat, and saving both raw and corrected outputs for downstream analysis. The workflow includes PCA and embedding plots to verify mixing of batches after correction.

How this skill works

You load each cohort into an AnnData object, annotate samples with a batch label, and concatenate on the intersection of genes. ComBat is applied via ov.bulk.batch_correction(adata, batch_key='batch'), which writes corrected values into adata.layers['batch_correction'] while preserving original counts in adata.X. The skill then exports CSVs and a compressed h5ad and generates before/after PCA and variance visualisations to benchmark the correction.

When to use it

Combining multiple bulk RNA-seq or microarray cohorts with suspected batch effects.
Preparing a harmonised expression matrix for differential expression or meta-analysis.
Verifying whether batches mix after correction using PCA or embedding plots.
Exporting corrected and raw matrices for downstream tools that require CSV input.
When you need a reproducible AnnData file containing both raw and corrected layers.

Best practices

Ensure gene identifiers are harmonised (same gene symbols or IDs) before concatenation to avoid dropped features.
Assign descriptive batch labels (e.g., 'GEO_cohort_A') rather than only numeric codes for clearer plots and metadata.
Log-transform strongly skewed counts when required; pyComBat expects similarly distributed inputs.
Copy raw counts into a named layer (adata.layers['raw'] = adata.X.copy()) before running PCA to retain originals.
Save both corrected and raw tables plus a compressed adata write_h5ad(...) for reproducibility and downstream reuse.

Example use cases

Combine three GEO ovarian microarray cohorts, run ComBat, export raw and corrected CSVs, and save harmonised h5ad.
Benchmark batch correction by plotting PCA embeddings before and after correction to confirm improved mixing.
Process multiple bulk RNA-seq studies into a single corrected matrix for pan-cancer differential expression analysis.
Convert corrected AnnData layers to gene×sample matrices for input into external pathway or enrichment tools.

FAQ

What happens to genes not shared across all batches?

Concatenation with merge='same' keeps only the intersection; harmonise feature names upstream if you need union or consistent features.

Where are corrected values stored?

Corrected intensities are stored in adata.layers['batch_correction']; the original matrix remains in adata.X.

My batch_correction layer is missing—what should I check?

Confirm the batch_key matches an adata.obs column and that ov.bulk.batch_correction completed without errors; also check for mismatched gene formats.