home / skills / starlitnightly / omicverse / bulk-to-single-deconvolution

bulk-to-single-deconvolution skill

/.claude/skills/bulk-to-single-deconvolution

This skill reconstructs single-cell profiles from bulk RNA-seq using Bulk2Single, trains a beta-VAE, and benchmarks against reference scRNA-seq.

npx playbooks add skill starlitnightly/omicverse --skill bulk-to-single-deconvolution

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
4.4 KB
---
name: bulk-rna-seq-deconvolution-with-bulk2single
title: Bulk RNA-seq deconvolution with Bulk2Single
description: Turn bulk RNA-seq cohorts into synthetic single-cell datasets using omicverse's Bulk2Single workflow for cell fraction estimation, beta-VAE generation, and quality control comparisons against reference scRNA-seq.
---

# Bulk RNA-seq deconvolution with Bulk2Single

## Overview
Use this skill when a user wants to reconstruct single-cell profiles from bulk RNA-seq together with a matched reference scRNA-seq atlas. It follows [`t_bulk2single.ipynb`](../../omicverse_guide/docs/Tutorials-bulk2single/t_bulk2single.ipynb), which demonstrates how to harmonise PDAC bulk replicates, train the beta-VAE generator, and benchmark the output cells against dentate gyrus scRNA-seq.

## Instructions
1. **Load libraries and data**
   - Import `omicverse as ov`, `scanpy as sc`, `scvelo as scv`, `anndata`, and `matplotlib.pyplot as plt`, then call `ov.plot_set()` to match omicverse styling.
   - Read the bulk counts table with `ov.read(...)`/`ov.utils.read(...)` and harmonise gene identifiers via `ov.bulk.Matrix_ID_mapping(<df>, 'genesets/pair_GRCm39.tsv')`.
   - Load the reference scRNA-seq AnnData (e.g., `scv.datasets.dentategyrus()`) and confirm the cluster labels (stored in `adata.obs['clusters']`).
2. **Initialise the Bulk2Single model**
   - Instantiate `ov.bulk2single.Bulk2Single(bulk_data=bulk_df, single_data=adata, celltype_key='clusters', bulk_group=['dg_d_1', 'dg_d_2', 'dg_d_3'], top_marker_num=200, ratio_num=1, gpu=0)`.
   - Explain GPU selection (`gpu=-1` forces CPU) and how `bulk_group` names align with column IDs in the bulk matrix.
3. **Estimate cell fractions**
   - Call `model.predicted_fraction()` to run the integrated TAPE estimator, then plot stacked bar charts per sample to validate proportions.
   - Encourage saving the fraction table for downstream reporting (`df.to_csv(...)`).
4. **Preprocess for beta-VAE**
   - Execute `model.bulk_preprocess_lazy()`, `model.single_preprocess_lazy()`, and `model.prepare_input()` to produce matched feature spaces.
   - Clarify that the lazy preprocessing expects raw counts; skip if the user has already log-normalised data and instead provide aligned matrices manually.
5. **Train or load the beta-VAE**
   - Train with `model.train(batch_size=512, learning_rate=1e-4, hidden_size=256, epoch_num=3500, vae_save_dir='...', vae_save_name='dg_vae', generate_save_dir='...', generate_save_name='dg')`.
   - Mention early stopping via `patience` and how to resume by reloading weights with `model.load('.../dg_vae.pth')`.
   - Use `model.plot_loss()` to monitor convergence.
6. **Generate and filter synthetic cells**
   - Produce an AnnData using `model.generate()` and reduce noise through `model.filtered(generate_adata, leiden_size=25)`.
   - Store the filtered AnnData (`.write_h5ad`) for reuse, noting it contains PCA embeddings in `obsm['X_pca']`.
7. **Benchmark against the reference atlas**
   - Plot cell-type compositions with `ov.bulk2single.bulk2single_plot_cellprop(...)` for both generated and reference data.
   - Assess correlation using `ov.bulk2single.bulk2single_plot_correlation(single_data, generate_adata, celltype_key='clusters')`.
   - Embed with `generate_adata.obsm['X_mde'] = ov.utils.mde(generate_adata.obsm['X_pca'])` and visualise via `ov.utils.embedding(..., color=['clusters'], palette=ov.utils.pyomic_palette())`.
8. **Troubleshooting tips**
   - If marker selection fails, increase `top_marker_num` or provide a curated marker list.
   - Alignment errors typically stem from mismatched `bulk_group` names—double-check column IDs in the bulk matrix.
   - Training on CPU can take several hours; advise switching `gpu` to an available CUDA device for speed.

## Examples
- "Estimate cell fractions for PDAC bulk replicates and generate synthetic scRNA-seq using Bulk2Single."
- "Load a pre-trained Bulk2Single model, regenerate cells, and compare cluster proportions to the dentate gyrus atlas."
- "Plot correlation heatmaps between generated cells and reference clusters after filtering noisy synthetic cells."

## References
- Tutorial notebook: [`t_bulk2single.ipynb`](../../omicverse_guide/docs/Tutorials-bulk2single/t_bulk2single.ipynb)
- Example data and weights: [`omicverse_guide/docs/Tutorials-bulk2single/data/`](../../omicverse_guide/docs/Tutorials-bulk2single/data/)
- Quick copy/paste commands: [`reference.md`](reference.md)

Overview

This skill turns bulk RNA-seq cohorts into synthetic single-cell datasets using omicverse's Bulk2Single workflow. It combines cell-fraction estimation, beta-VAE generation, and quality-control comparisons against a matched reference scRNA-seq atlas. The output is an AnnData object of generated cells plus plots and tables for benchmarking and downstream analysis.

How this skill works

The workflow first harmonises gene identifiers and estimates sample-wise cell fractions from bulk counts using an integrated deconvolution estimator. It then aligns features between bulk and reference single-cell data, trains a beta-variational autoencoder to generate per-cell expression profiles, and filters low-quality synthetic cells. Final steps produce composition plots, correlation heatmaps, and embeddings to compare generated profiles to the reference atlas.

When to use it

  • You have bulk RNA-seq from heterogeneous tissue and a matched scRNA-seq reference atlas.
  • You want per-sample synthetic single-cell profiles for downstream single-cell pipelines.
  • You need cell-type proportion estimates for cohort-level reporting.
  • You plan to benchmark synthetic data quality against a known single-cell reference.
  • You want to augment limited single-cell data with generated cells for method testing.

Best practices

  • Provide raw counts for both bulk and reference single-cell data to allow the lazy preprocessing steps to run correctly.
  • Confirm bulk_group names match the column IDs in your bulk matrix before initialising the model.
  • Start training on GPU (gpu set to a CUDA device) to reduce run time; use gpu=-1 only for CPU runs.
  • Save intermediate outputs (fraction table, trained VAE weights, generated AnnData) to enable resuming and reproducibility.
  • If marker selection fails, increase top_marker_num or supply a curated marker list to improve alignment.

Example use cases

  • Estimate cell fractions for PDAC bulk replicates and generate synthetic single cells for cell-type-specific differential expression.
  • Load pre-trained Bulk2Single weights to quickly regenerate and benchmark cells against the dentate gyrus atlas.
  • Create composition plots and correlation heatmaps to validate how closely generated cells match reference clusters.
  • Filter noisy generated cells and export a filtered AnnData for downstream trajectory or RNA-velocity analyses.

FAQ

Can I run training on CPU?

Yes, set gpu=-1 to force CPU, but expect training to take substantially longer; use GPU when available.

What if bulk_group names do not match my matrix columns?

Double-check and correct bulk_group to match column IDs in the bulk count table; misalignment causes errors in fraction estimation and downstream steps.