
bulk-trajblend-interpolation skill

/.claude/skills/bulk-trajblend-interpolation

This skill bridges gaps in scRNA-seq developmental trajectories with BulkTrajBlend, training beta-VAE and GNN models on matched bulk RNA-seq to interpolate missing cell states.

npx playbooks add skill starlitnightly/omicverse --skill bulk-trajblend-interpolation


SKILL.md
---
name: bulktrajblend-trajectory-interpolation
title: BulkTrajBlend trajectory interpolation
description: Extend scRNA-seq developmental trajectories with BulkTrajBlend by generating intermediate cells from bulk RNA-seq, training beta-VAE and GNN models, and interpolating missing states.
---

# BulkTrajBlend trajectory interpolation

## Overview
Invoke this skill when users need to bridge gaps in single-cell developmental trajectories using matched bulk RNA-seq. It follows [`t_bulktrajblend.ipynb`](../../omicverse_guide/docs/Tutorials-bulk2single/t_bulktrajblend.ipynb), showcasing how BulkTrajBlend deconvolves PDAC bulk samples, identifies overlapping communities with a GNN, and interpolates "interrupted" cell states.

## Instructions
1. **Prepare libraries and inputs**
   - Import `omicverse as ov`, `scanpy as sc`, `scvelo as scv`, and helper functions like `from omicverse.utils import mde`; run `ov.plot_set()`.
   - Load the reference scRNA-seq AnnData (`scv.datasets.dentategyrus()`) and raw bulk counts with `ov.utils.read(...)` followed by `ov.bulk.Matrix_ID_mapping(...)` for gene ID harmonisation.
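   A minimal sketch of this setup, following the tutorial's dentate gyrus example. The file paths and the mapping-file argument to `Matrix_ID_mapping` are placeholders, not the tutorial's actual paths:

   ```python
   import omicverse as ov
   import scanpy as sc
   import scvelo as scv
   from omicverse.utils import mde

   ov.plot_set()  # apply omicverse plotting defaults

   # Reference scRNA-seq atlas used in the tutorial
   adata = scv.datasets.dentategyrus()

   # Raw, unscaled bulk counts; both paths below are placeholders
   bulk_df = ov.utils.read('data/bulk_counts.txt')
   bulk_df = ov.bulk.Matrix_ID_mapping(bulk_df, 'data/gene_id_mapping.tsv')
   ```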
2. **Configure BulkTrajBlend**
   - Instantiate `ov.bulk2single.BulkTrajBlend(bulk_seq=bulk_df, single_seq=adata, bulk_group=['dg_d_1','dg_d_2','dg_d_3'], celltype_key='clusters')`.
   - Explain that `bulk_group` names correspond to raw bulk columns and the method expects unscaled counts.
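   The instantiation mirrors the tutorial call:

   ```python
   # bulk_group lists raw column names in bulk_df; celltype_key names the
   # cluster annotation in adata.obs. Counts must be unscaled.
   bulktb = ov.bulk2single.BulkTrajBlend(
       bulk_seq=bulk_df,
       single_seq=adata,
       bulk_group=['dg_d_1', 'dg_d_2', 'dg_d_3'],
       celltype_key='clusters',
   )
   ```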
3. **Set beta-VAE expectations**
   - Call `bulktb.vae_configure(cell_target_num=100)` (or pass a dictionary) to define expected cell counts per cluster. Mention that omitting the argument triggers TAPE-based estimation.
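   A sketch of both configuration styles; the cluster names in the dictionary are hypothetical examples:

   ```python
   # Fixed target of 100 generated cells per cluster
   bulktb.vae_configure(cell_target_num=100)

   # Or per-cluster targets (keys shown here are illustrative)
   # bulktb.vae_configure(cell_target_num={'Granule mature': 500, 'OPC': 200})

   # Omitting cell_target_num triggers TAPE-based estimation instead
   # bulktb.vae_configure()
   ```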
4. **Train or load the beta-VAE**
   - Use `bulktb.vae_train(batch_size=512, learning_rate=1e-4, hidden_size=256, epoch_num=3500, vae_save_dir='...', vae_save_name='dg_btb_vae', generate_save_dir='...', generate_save_name='dg_btb')`.
   - Highlight resuming with `bulktb.vae_load('.../dg_btb_vae.pth')` and the need to regenerate cells with consistent random seeds for reproducibility.
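   Training and resuming might look like the following; the `save_model`/`output` directories are placeholders for your own paths:

   ```python
   # Train from scratch; checkpoints and generated cells go to the save dirs
   bulktb.vae_train(
       batch_size=512,
       learning_rate=1e-4,
       hidden_size=256,
       epoch_num=3500,
       vae_save_dir='save_model',
       vae_save_name='dg_btb_vae',
       generate_save_dir='output',
       generate_save_name='dg_btb',
   )

   # Or resume from a previous checkpoint instead of retraining
   bulktb.vae_load('save_model/dg_btb_vae.pth')
   ```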
5. **Generate synthetic cells**
   - Produce filtered AnnData via `bulktb.vae_generate(leiden_size=25)` and inspect compositions with `ov.bulk2single.bulk2single_plot_cellprop(...)`.
   - Save outputs to disk for reuse (`adata.write_h5ad`).
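   A sketch of generation and inspection; the exact signature of `bulk2single_plot_cellprop` is assumed here, and the output path is a placeholder:

   ```python
   # Drop Leiden clusters smaller than 25 cells, then inspect composition
   generate_adata = bulktb.vae_generate(leiden_size=25)
   ov.bulk2single.bulk2single_plot_cellprop(generate_adata, celltype_key='clusters')

   # Persist the synthetic cells so the GNN later sees an identical dataset
   generate_adata.write_h5ad('output/dg_btb_generated.h5ad')
   ```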
6. **Configure and train the GNN**
   - Call `bulktb.gnn_configure(max_epochs=2000, use_rep='X', neighbor_rep='X_pca', gpu=0, ...)` to set hyperparameters.
   - Train using `bulktb.gnn_train()`; reload checkpoints with `bulktb.gnn_load('save_model/gnn.pth')`.
   - Generate overlapping community assignments through `bulktb.gnn_generate()`.
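   The GNN stage can be sketched as follows, with hyperparameters mirroring the tutorial and the checkpoint path as a placeholder:

   ```python
   # use_rep/neighbor_rep select which matrices feed graph construction
   bulktb.gnn_configure(max_epochs=2000, use_rep='X', neighbor_rep='X_pca', gpu=0)

   bulktb.gnn_train()
   # Reload a saved checkpoint instead of retraining:
   # bulktb.gnn_load('save_model/gnn.pth')

   # Assign overlapping community labels (stored as 'nocd_n')
   bulktb.gnn_generate()
   ```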
7. **Visualise community structure**
   - Create MDE embeddings: `bulktb.nocd_obj.adata.obsm['X_mde'] = mde(bulktb.nocd_obj.adata.obsm['X_pca'])`.
   - Plot clusters against discovered communities using `sc.pl.embedding(..., color=['clusters','nocd_n'], palette=ov.utils.pyomic_palette())`; optionally plot a filtered subset that excludes hyphenated `nocd_n` labels (cells assigned to overlapping communities).
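   A sketch of the embedding step; passing `basis='X_mde'` to `sc.pl.embedding` is the standard scanpy pattern and is assumed to match the tutorial:

   ```python
   # MDE embedding of the GNN's AnnData for 2-D visualisation
   nocd_adata = bulktb.nocd_obj.adata
   nocd_adata.obsm['X_mde'] = mde(nocd_adata.obsm['X_pca'])

   sc.pl.embedding(
       nocd_adata,
       basis='X_mde',
       color=['clusters', 'nocd_n'],
       palette=ov.utils.pyomic_palette(),
   )
   ```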
8. **Interpolate missing states**
   - Run `bulktb.interpolation('OPC')` (replace with target lineage) to synthesise continuity, then preprocess the interpolated AnnData (HVG selection, scaling, PCA).
   - Compute embeddings with `mde`, visualise with `ov.utils.embedding`, and compare to the original atlas.
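   The interpolation and preprocessing can be sketched as below; the HVG count, scaling, and PCA settings are a generic scanpy recipe rather than the tutorial's exact parameters:

   ```python
   # Synthesise continuity for the OPC lineage (swap in your target cell type)
   adata_interp = bulktb.interpolation('OPC')

   # Generic preprocessing of the interpolated atlas
   sc.pp.highly_variable_genes(adata_interp, n_top_genes=2000)
   adata_interp = adata_interp[:, adata_interp.var.highly_variable]
   sc.pp.scale(adata_interp)
   sc.tl.pca(adata_interp)

   # Embed and visualise alongside the original atlas
   adata_interp.obsm['X_mde'] = mde(adata_interp.obsm['X_pca'])
   ov.utils.embedding(adata_interp, basis='X_mde', color=['clusters'])
   ```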
9. **Analyse trajectories**
   - Initialise `ov.single.pyVIA` on both original and interpolated data to derive pseudotime, followed by `get_pseudotime`, `sc.pp.neighbors`, `ov.utils.cal_paga`, and `ov.utils.plot_paga` for topology validation.
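   A rough sketch of the trajectory validation; the `pyVIA` constructor arguments and the `use_time_prior='pt_via'` key are assumptions based on the omicverse tutorials and should be checked against your installed version:

   ```python
   # VIA pseudotime on the interpolated atlas
   v0 = ov.single.pyVIA(adata=adata_interp, adata_key='X_pca', clusters='clusters')
   v0.run()
   v0.get_pseudotime(adata_interp)

   # PAGA on the same representation to validate trajectory topology
   sc.pp.neighbors(adata_interp, use_rep='X_pca')
   ov.utils.cal_paga(adata_interp, use_time_prior='pt_via', vkey='paga',
                     groups='clusters')
   ov.utils.plot_paga(adata_interp, basis='X_mde')
   ```

   Repeat the same steps on the original atlas and compare the two PAGA graphs to see whether interpolation restores the expected connectivity.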
10. **Troubleshooting tips**
    - If the VAE collapses (high reconstruction loss), lower `learning_rate` or reduce `hidden_size`.
    - Ensure the same generated dataset is used before calling `gnn_train`; regenerating cells changes the graph and can break checkpoint loading.
    - Sparse clusters may need adjusted `cell_target_num` thresholds or a smaller `leiden_size` filter to retain rare populations.

## Examples
- "Train BulkTrajBlend on PDAC cohorts, then interpolate missing OPC states in the trajectory."
- "Load saved beta-VAE and GNN weights to regenerate overlapping communities and plot cluster vs. nocd labels."
- "Run VIA on interpolated cells and compare PAGA graphs with the original scRNA-seq trajectory."

## References
- Tutorial notebook: [`t_bulktrajblend.ipynb`](../../omicverse_guide/docs/Tutorials-bulk2single/t_bulktrajblend.ipynb)
- Example datasets and checkpoints: [`omicverse_guide/docs/Tutorials-bulk2single/data/`](../../omicverse_guide/docs/Tutorials-bulk2single/data/)
- Quick copy/paste commands: [`reference.md`](reference.md)

Overview

This skill extends single-cell developmental trajectories by synthesising intermediate cells from matched bulk RNA-seq using BulkTrajBlend. It trains a beta-VAE to generate realistic cell profiles, builds a graph neural network (GNN) to find overlapping communities, and then interpolates missing lineage states to restore trajectory continuity. The workflow integrates into scanpy/scvelo pipelines for downstream embedding and trajectory analysis.

How this skill works

The method deconvolves bulk samples against a single-cell reference, harmonises gene IDs, and trains a beta-VAE to generate synthetic cells that represent bulk-derived intermediate states. A GNN is trained on the augmented dataset to identify overlapping communities (nocd labels) that connect disconnected clusters. Targeted interpolation synthesises cells for a specified lineage, after which standard preprocessing, embedding (MDE/PCA), and pseudotime inference validate restored trajectories.

When to use it

  • You have matched bulk RNA-seq and single-cell data and suspect missing intermediate states in a developmental trajectory.
  • Single-cell clusters are disconnected or sparse, and bulk samples span transitional states not captured by single-cell profiling.
  • You want to test whether interpolated cells change topology or pseudotime ordering in VIA/PAGA analyses.
  • You need to generate reproducible synthetic cells for downstream differential or lineage analysis.
  • You plan to reuse trained models (VAE/GNN) across sessions to avoid retraining from scratch.

Best practices

  • Provide unscaled raw counts for bulk input and harmonise gene IDs before training.
  • Configure cell_target_num via vae_configure to reflect expected cluster sizes; omit to use TAPE-based estimates if unsure.
  • Save and version VAE and GNN checkpoints; regenerate synthetic cells with fixed random seeds for reproducibility.
  • Monitor VAE reconstruction loss; if collapse occurs, reduce learning_rate or hidden_size and retrain.
  • Keep a consistent generated dataset when training the GNN—regenerating cells will change graph structure and invalidate checkpoints.

Example use cases

  • Train BulkTrajBlend on PDAC cohorts to interpolate missing oligodendrocyte precursor (OPC) states and compare pseudotime before/after interpolation.
  • Load saved beta-VAE and GNN weights to reproduce overlapping community assignments and plot cluster vs nocd labels on an MDE embedding.
  • Interpolate a target lineage, run HVG/PCA/MDE, and compute VIA pseudotime and PAGA to validate restored topology.
  • Use synthetic cells to enhance rare population representation for downstream marker discovery or differential expression testing.

FAQ

What inputs are required?

You need a reference single-cell AnnData and raw bulk counts with matching gene IDs; use Matrix_ID_mapping for harmonisation.

How do I resume training or reuse models?

Save VAE/GNN checkpoints during training and reload with vae_load and gnn_load; ensure the generated dataset is identical when resuming GNN training.

What if the VAE collapses or loss is high?

Lower the learning_rate or reduce hidden_size, check batch_size, and ensure sufficient training epochs; inspect reconstruction metrics to tune hyperparameters.