
single-multiomics skill

This skill provides quick, actionable guidance for integrating and visualizing single-cell multi-omics data with MOFA, GLUE, SIMBA, TOSICA, and StaVIA.

npx playbooks add skill starlitnightly/omicverse --skill single-multiomics


SKILL.md


---
name: single-cell-multi-omics-integration
title: Single-cell multi-omics integration
description: Quick-reference sheet for OmicVerse tutorials spanning MOFA, GLUE pairing, SIMBA integration, TOSICA transfer, and StaVIA cartography.
---

# Single-Cell Multi-Omics Tutorials Cheat Sheet

This skill walk-through summarizes the OmicVerse notebooks that cover paired and unpaired multi-omic integration, multi-batch embedding, reference transfer, and trajectory cartography.

## MOFA on paired scRNA + scATAC (`t_mofa.ipynb`)
- **Data preparation:** Load preprocessed AnnData objects for RNA (`rna_p_n_raw.h5ad`) and ATAC (`atac_p_n_raw.h5ad`) with `ov.utils.read`, and initialise `pyMOFA` with matching `omics` and `omics_name` lists.
- **Model training:** Call `mofa_preprocess()` to select highly variable features and run the factor model with `mofa_run(outfile=...)`, which exports the learned MOFA+ factors to an HDF5 model file.
- **Result inspection:** Reload downstream AnnData, append factor scores via `ov.single.factor_exact`, and explore factor–cluster associations using `factor_correlation`, `get_weights`, and the plotting helpers in `pyMOFAART` (`plot_r2`, `plot_cor`, `plot_factor`, `plot_weights`, etc.).
- **Export workflow:** Persist factors and weights through the MOFA HDF5 artifact and reuse them by instantiating `pyMOFAART(model_path=...)` for later annotation or visualisation sessions.
- **Dependencies & hardware:** Requires `mofapy2`; plotting optionally uses `pymde`/`scvi-tools`, and no GPU is needed.
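
The data preparation and training steps above can be sketched as a small wrapper. This is a minimal sketch, not the notebook's code: it assumes `pyMOFA` is exposed under `ov.single`, the view labels `"RNA"` and `"ATAC"` are illustrative, and `check_views` and `run_paired_mofa` are hypothetical helper names.

```python
from typing import Sequence


def check_views(omics: Sequence, omics_name: Sequence[str]) -> None:
    """Fail early when the view list and the view-name list do not line up."""
    if len(omics) != len(omics_name):
        raise ValueError(
            f"{len(omics)} views but {len(omics_name)} view names were given"
        )
    if len(set(omics_name)) != len(omics_name):
        raise ValueError("view names must be unique")


def run_paired_mofa(rna_path: str, atac_path: str, outfile: str) -> str:
    """Train MOFA on paired RNA + ATAC data and return the HDF5 model path."""
    import omicverse as ov  # lazy import: only needed when training

    rna = ov.utils.read(rna_path)
    atac = ov.utils.read(atac_path)
    names = ["RNA", "ATAC"]  # assumed view labels, chosen for illustration
    check_views([rna, atac], names)

    # Assumption: pyMOFA is available as ov.single.pyMOFA.
    model = ov.single.pyMOFA(omics=[rna, atac], omics_name=names)
    model.mofa_preprocess()          # highly variable feature selection
    model.mofa_run(outfile=outfile)  # exports the learned factors to HDF5
    return outfile
```

Validating the two lists before constructing the model surfaces a mismatch immediately instead of partway through training.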

## MOFA after GLUE pairing (`t_mofa_glue.ipynb`)
- **Data preparation:** Start from GLUE-derived embeddings (`rna-emb.h5ad`, `atac.emb.h5ad`), build a `GLUE_pair` object, and run `correlation()` to align unpaired cells before subsetting to highly variable features.
- **Model training:** Instantiate `pyMOFA` with the aligned AnnData objects, run `mofa_preprocess()`, and save the joint factors through `mofa_run(outfile='models/chen_rna_atac.hdf5')`.
- **Result inspection:** Use `pyMOFAART` plus AnnData that now contains the GLUE embeddings to compute factors (`get_factors`) and visualise variance explained, factor–cluster correlations, and ranked feature weights.
- **Export workflow:** Reuse the saved MOFA HDF5 model for downstream inspection; GLUE embeddings can be projected to 2D with `scvi.model.utils.mde` (GPU-accelerated MDE is optional; `sc.tl.umap` works on CPU).
- **Dependencies & hardware:** Requires both `mofapy2` and the GLUE tooling (`scglue`, `scvi-tools`, `pymde`); GPU acceleration only affects optional MDE visualisation.
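
To make the pairing step more concrete, here is a toy illustration of matching unpaired cells by the correlation of their shared embeddings. It is a sketch of the general idea only, not the `GLUE_pair` implementation: the greedy one-to-one matching, the use of Pearson correlation, and the names `pearson` and `pair_cells` are all assumptions made for this example.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def pair_cells(
    rna_emb: Dict[str, List[float]], atac_emb: Dict[str, List[float]]
) -> List[Tuple[str, str, float]]:
    """Greedy one-to-one pairing, taking the most correlated pairs first."""
    scored = sorted(
        (
            (pearson(r, a), i, j)
            for i, r in rna_emb.items()
            for j, a in atac_emb.items()
        ),
        reverse=True,
    )
    used_rna, used_atac, pairs = set(), set(), []
    for score, i, j in scored:
        if i in used_rna or j in used_atac:
            continue  # each cell may appear in at most one pair
        pairs.append((i, j, score))
        used_rna.add(i)
        used_atac.add(j)
    return pairs
```

The all-pairs scoring is quadratic in the number of cells, so this form only suits small examples; real datasets need a neighbour search or a restricted candidate set.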

## SIMBA batch integration (`t_simba.ipynb`)
- **Data preparation:** Fetch the concatenated AnnData (`simba_adata_raw.h5ad`) derived from multiple pancreas studies and pass it, alongside a results directory, to `pySIMBA`.
- **Model training:** Execute `preprocess(...)` to bin features into SIMBA-compatible inputs, call `gen_graph()` to build the graph, then run `train(num_workers=...)` to launch PyTorch-BigGraph optimisation (scales with CPU workers). Use `load(...)` to reload a trained checkpoint.
- **Result inspection:** Apply `batch_correction()` to obtain the harmonised AnnData with SIMBA embeddings (`X_simba`) and visualise using `mde`/`sc.tl.umap` coloured by cell type or batch.
- **Export workflow:** Training outputs reside in the workdir (e.g., `result_human_pancreas/pbg/graph0`); reuse them with `simba_object.load(...)` for later analyses.
- **Dependencies & hardware:** Requires installing `simba` and `simba_pbg` (PyTorch BigGraph backend). GPU is optional; make sure adequate CPU threads and memory are available for graph training.
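
The SIMBA steps above can be chained so that training is skipped when a checkpoint already exists in the workdir. This is a hedged sketch: it assumes `pySIMBA` is exposed under `ov.single` and accepts the AnnData and workdir positionally, it forwards `preprocess` options untouched rather than guessing their names, and `checkpoint_dir` and `simba_integrate` are hypothetical helpers.

```python
from pathlib import Path


def checkpoint_dir(workdir: str) -> Path:
    """Location of the trained PBG graph inside a SIMBA results directory."""
    return Path(workdir) / "pbg" / "graph0"


def simba_integrate(adata, workdir: str, num_workers: int = 4, **preprocess_kwargs):
    """Preprocess, train (only if needed), load, and return corrected AnnData."""
    import omicverse as ov  # lazy import: only needed when integrating

    # Assumption: pySIMBA is available as ov.single.pySIMBA(adata, workdir).
    simba_object = ov.single.pySIMBA(adata, workdir)
    simba_object.preprocess(**preprocess_kwargs)  # options passed through as-is
    simba_object.gen_graph()

    ckpt = checkpoint_dir(workdir)
    if not ckpt.exists():
        # No saved graph yet, so run PyTorch-BigGraph training.
        simba_object.train(num_workers=num_workers)
    simba_object.load(str(ckpt))

    # Returns the harmonised AnnData carrying the X_simba embedding.
    return simba_object.batch_correction()
```

For the pancreas example, `checkpoint_dir("result_human_pancreas")` resolves to the `result_human_pancreas/pbg/graph0` path mentioned above.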

## TOSICA reference transfer (`t_tosica.ipynb`)
- **Data preparation:** Download demo AnnData references (`demo_train.h5ad`, `demo_test.h5ad`) and required gene-set GMT files via `ov.utils.download_tosica_gmt()`; confirm datasets are log-normalised before training.
- **Model training:** Create `pyTOSICA` with the reference AnnData, chosen pathway mask, label key, project directory, and batch size; train with `train(epochs=...)`, then persist weights with `save()` and optionally reload via `load()`.
- **Result inspection:** Generate predictions on query AnnData through `predicted(pre_adata=...)`, embed with OmicVerse preprocessing and GPU-enabled `mde` (UMAP fallback available), and explore pathway attention to interpret transformer heads.
- **Export workflow:** Saved project folder keeps model checkpoints and attention summaries; reuse the exported assets to annotate future datasets without retraining from scratch.
- **Dependencies & hardware:** Needs TOSICA (PyTorch transformer) plus downloaded gene-set masks; avoid setting `depth=2` if memory is constrained. GPU acceleration improves embedding (`mde`) but training runs on standard PyTorch (CPU/GPU depending on environment).
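
For readers unfamiliar with pathway masks, the sketch below parses GMT text (one gene set per line, tab-separated as name, description, then genes) and builds a binary gene-by-pathway matrix. It only illustrates the concept; TOSICA's own mask construction may filter or order pathways differently, and `parse_gmt` and `pathway_mask` are hypothetical names.

```python
from typing import Dict, List, Set


def parse_gmt(text: str) -> Dict[str, Set[str]]:
    """Parse GMT content into a mapping of pathway name to member genes."""
    gene_sets: Dict[str, Set[str]] = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, genes = fields[0], fields[2:]  # fields[1] is the description
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def pathway_mask(genes: List[str], gene_sets: Dict[str, Set[str]]) -> List[List[int]]:
    """Binary matrix (genes x pathways): 1 where the gene belongs to the pathway.

    Pathway columns follow the sorted pathway names.
    """
    pathways = sorted(gene_sets)
    return [[int(g in gene_sets[p]) for p in pathways] for g in genes]
```

A gene that appears in no pathway gets an all-zero row, which is a quick way to spot naming mismatches between the AnnData `var_names` and the GMT file.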

## StaVIA trajectory cartography (`t_stavia.ipynb`)
- **Data preparation:** Load example dentate gyrus velocity data via `scvelo.datasets.dentategyrus()`, preprocess with OmicVerse (`preprocess`, `scale`, `pca`, neighbours, UMAP) to populate the AnnData matrices used by VIA.
- **Model training:** Configure VIA hyperparameters (components, neighbours, seeds, root selection) and instantiate/run `VIA.core.VIA` on the chosen representation (`adata.obsm['scaled|original|X_pca']`).
- **Result inspection:** Store pseudotime (`single_cell_pt_markov`) back into the AnnData object, then render cluster graph abstractions, trajectory curves, atlas views, and stream plots through the VIA plotting helpers.
- **Export workflow:** Persist derived visualisations and animations (e.g., `animate_streamplot_ov`, `animate_atlas`) to files (`.gif`) for reporting; recompute edge bundles via `make_edgebundle_milestone` when needed.
- **Dependencies & hardware:** Relies on `scvelo`, `pyVIA`, and OmicVerse plotting; computations are CPU-bound, though producing large stream plots or animations benefits from ample memory.
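
The training and inspection steps can be wrapped as below. This is a sketch under stated assumptions: the VIA class is passed in as an argument so no import path is hard-coded, the keyword names (`data`, `true_label`, `knn`, `root_user`, `random_seed`) and the `run_VIA()` method are assumed to match the pyVIA constructor and should be checked against the installed version, and `run_via` is a hypothetical helper.

```python
def run_via(via_cls, embedding, labels, *, knn=30, root=None, seed=42):
    """Run VIA on a precomputed embedding and return (model, pseudotime).

    via_cls is the VIA class itself (referred to above as VIA.core.VIA).
    """
    # Assumption: these keyword names follow pyVIA's constructor.
    model = via_cls(
        data=embedding,
        true_label=labels,
        knn=knn,
        root_user=root,
        random_seed=seed,
    )
    model.run_VIA()
    # Per-cell pseudotime, as named in the notebook summary above.
    return model, model.single_cell_pt_markov
```

A call might look like `run_via(VIA.core.VIA, adata.obsm['scaled|original|X_pca'], adata.obs['clusters'])`, where the `clusters` column is an assumed label key; the returned pseudotime can then be stored in an `adata.obs` column of your choosing.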

Overview

This skill is a quick-reference cheat sheet for OmicVerse Jupyter tutorials covering MOFA, GLUE pairing, SIMBA batch integration, TOSICA reference transfer, and StaVIA trajectory cartography. It condenses setup steps, model training commands, result inspection tips, export workflows, and dependency/hardware notes across paired and unpaired multi-omic scenarios. Use it to find the right notebook and the minimal commands to reproduce core analyses.

How this skill works

Each notebook demonstrates a focused pipeline: prepare AnnData inputs, instantiate the relevant OmicVerse wrapper (pyMOFA, GLUE_pair, pySIMBA, pyTOSICA, or VIA), run preprocessing and training functions, then export factors, embeddings, or model checkpoints. The cheat sheet highlights which files to load/save, which helper functions to call for visualization, and which optional GPU-accelerated steps (MDE, transformer training, or embedding) can speed up workflows.

When to use it

- Integrating paired scRNA and scATAC to get joint latent factors (use MOFA).
- Aligning unpaired modalities with GLUE then learning joint factors (MOFA on GLUE embeddings).
- Correcting batch effects across multi-study single-cell datasets and deriving unified embeddings (SIMBA).
- Transferring cell-type labels or pathway-informed attention from a reference to a query dataset (TOSICA).
- Reconstructing cell-state trajectories, pseudotime, and atlas visualizations from RNA velocity or PCA space (StaVIA/VIA).

Best practices

- Always load preprocessed, log-normalized AnnData objects and confirm that identifiers line up before training: cell barcodes across paired modalities, gene names between reference and query.
- Persist model artifacts (MOFA HDF5, SIMBA workdir, TOSICA project folder) to reuse without retraining.
- Subset to highly variable features or apply GLUE pairing alignment before factor modelling to reduce noise and speed training.
- Use a GPU for optional MDE or transformer-heavy steps, but ensure CPU threads and memory are sufficient for graph training or large animations.
- Inspect variance explained, factor–cluster correlations, and ranked feature weights to validate biological signal before downstream annotation.

Example use cases

- Derive joint RNA–ATAC factors from paired assays to discover regulatory programs with pyMOFA.
- Use GLUE to align unpaired scRNA and scATAC and then run MOFA to capture shared latent factors.
- Integrate multiple pancreas studies with SIMBA to harmonize batches and produce a single embedding for cell-type mapping.
- Train TOSICA on a curated reference to transfer cell-type labels and interpret pathway attention on new samples.
- Run StaVIA on velocity-derived embeddings to map developmental trajectories and export animated stream plots for presentations.

FAQ

Do I always need a GPU to run these notebooks?

No. Core model training for MOFA, SIMBA (PyTorch-BigGraph), TOSICA, and VIA can run on CPU. GPU accelerates optional steps like MDE visualization and transformer training but is not strictly required.

How do I reuse saved models or embeddings?

Load the exported artifacts: MOFA HDF5 via `pyMOFAART(model_path=...)`, SIMBA checkpoints with `simba_object.load(...)`, and TOSICA project folders with `load()`/`predicted()` to annotate new AnnData without retraining.