scientific-pkg-scvi-tools skill

This skill helps you analyze and integrate single-cell data with probabilistic models, batch correction, and multimodal analysis using scvi-tools.

npx playbooks add skill jackspace/claudeskillz --skill scientific-pkg-scvi-tools

---
name: scvi-tools
description: This skill should be used when working with single-cell omics data analysis using scvi-tools, including scRNA-seq, scATAC-seq, CITE-seq, spatial transcriptomics, and other single-cell modalities. Use this skill for probabilistic modeling, batch correction, dimensionality reduction, differential expression, cell type annotation, multimodal integration, and spatial analysis tasks.
---

# scvi-tools

## Overview

scvi-tools is a comprehensive Python framework for probabilistic models in single-cell genomics. Built on PyTorch and PyTorch Lightning, it provides deep generative models using variational inference for analyzing diverse single-cell data modalities.

## When to Use This Skill

Use this skill when:
- Analyzing single-cell RNA-seq data (dimensionality reduction, batch correction, integration)
- Working with single-cell ATAC-seq or chromatin accessibility data
- Integrating multimodal data (CITE-seq, multiome, paired/unpaired datasets)
- Analyzing spatial transcriptomics data (deconvolution, spatial mapping)
- Performing differential expression analysis on single-cell data
- Conducting cell type annotation or transfer learning tasks
- Working with specialized single-cell modalities (methylation, cytometry, RNA velocity)
- Building custom probabilistic models for single-cell analysis

## Core Capabilities

scvi-tools provides models organized by data modality:

### 1. Single-Cell RNA-seq Analysis
Core models for expression analysis, batch correction, and integration. See `references/models-scrna-seq.md` for:
- **scVI**: Unsupervised dimensionality reduction and batch correction
- **scANVI**: Semi-supervised cell type annotation and integration
- **AUTOZI**: Zero-inflation detection and modeling
- **VeloVI**: RNA velocity analysis
- **contrastiveVI**: Perturbation effect isolation

### 2. Chromatin Accessibility (ATAC-seq)
Models for analyzing single-cell chromatin data. See `references/models-atac-seq.md` for:
- **PeakVI**: Peak-based ATAC-seq analysis and integration
- **PoissonVI**: Quantitative fragment count modeling
- **scBasset**: Deep learning approach with motif analysis

### 3. Multimodal & Multi-omics Integration
Joint analysis of multiple data types. See `references/models-multimodal.md` for:
- **totalVI**: CITE-seq protein and RNA joint modeling
- **MultiVI**: Paired and unpaired multi-omic integration
- **MrVI**: Multi-resolution cross-sample analysis

### 4. Spatial Transcriptomics
Spatially-resolved transcriptomics analysis. See `references/models-spatial.md` for:
- **DestVI**: Multi-resolution spatial deconvolution
- **Stereoscope**: Cell type deconvolution
- **Tangram**: Spatial mapping and integration
- **scVIVA**: Cell-environment relationship analysis

### 5. Specialized Modalities
Additional specialized analysis tools. See `references/models-specialized.md` for:
- **MethylVI/MethylANVI**: Single-cell methylation analysis
- **CytoVI**: Flow/mass cytometry batch correction
- **Solo**: Doublet detection
- **CellAssign**: Marker-based cell type annotation

## Typical Workflow

All scvi-tools models follow a consistent API pattern:

```python
# 1. Load and preprocess data (AnnData format)
import scvi
import scanpy as sc

adata = scvi.data.heart_cell_atlas_subsampled()
sc.pp.filter_genes(adata, min_counts=3)

# Preserve raw counts in a dedicated layer before any normalization
adata.layers["counts"] = adata.X.copy()

# Select highly variable genes directly on counts (seurat_v3 flavor)
sc.pp.highly_variable_genes(
    adata, n_top_genes=1200, subset=True, layer="counts", flavor="seurat_v3"
)

# 2. Register data with the model (specify layers, covariates)
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",  # Use raw counts, not log-normalized values
    categorical_covariate_keys=["cell_source", "donor"],
    continuous_covariate_keys=["percent_mito", "percent_ribo"],
)

# 3. Create and train the model
model = scvi.model.SCVI(adata)
model.train()

# 4. Extract latent representations and normalized expression
latent = model.get_latent_representation()
normalized = model.get_normalized_expression(library_size=1e4)

# 5. Store results in AnnData for downstream analysis
adata.obsm["X_scVI"] = latent
adata.layers["scvi_normalized"] = normalized

# 6. Downstream analysis with scanpy
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.tl.leiden(adata)
```

**Key Design Principles:**
- **Raw counts required**: Models fit count likelihoods (e.g., negative binomial), so pass unnormalized counts rather than log-normalized values
- **Unified API**: Consistent interface across all models (setup → train → extract)
- **AnnData-centric**: Seamless integration with the scanpy ecosystem
- **GPU acceleration**: Automatic utilization of available GPUs
- **Batch correction**: Handle technical variation through covariate registration

## Common Analysis Tasks

### Differential Expression
Probabilistic DE analysis using the learned generative models:

```python
de_results = model.differential_expression(
    groupby="cell_type",
    group1="TypeA",
    group2="TypeB",
    mode="change",  # Use composite hypothesis testing
    delta=0.25      # Minimum effect size threshold
)
```
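The exact columns of the returned DataFrame depend on the scvi-tools version, but in `change` mode they typically include a DE probability (`proba_de`), a mean log fold change (`lfc_mean`), and a Bayes factor. A hypothetical filtering step, shown on a toy results table so the column names and thresholds are explicit (gene names and values are made up):

```python
import pandas as pd

# Toy stand-in for model.differential_expression output in "change" mode
# (column names typical of recent versions; may vary)
de_results = pd.DataFrame(
    {
        "proba_de": [0.99, 0.42, 0.97],
        "lfc_mean": [1.8, 0.1, -2.3],
        "bayes_factor": [4.6, -0.3, 3.1],
    },
    index=["GeneA", "GeneB", "GeneC"],
)

# Keep genes that are confidently DE with a meaningful effect size
hits = de_results[
    (de_results["proba_de"] > 0.95) & (de_results["lfc_mean"].abs() > 0.5)
]
```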

See `references/differential-expression.md` for detailed methodology and interpretation.

### Model Persistence
Save and load trained models:

```python
# Save model
model.save("./model_directory", overwrite=True)

# Load model
model = scvi.model.SCVI.load("./model_directory", adata=adata)
```

### Batch Correction and Integration
Integrate datasets across batches or studies:

```python
# Register batch information
scvi.model.SCVI.setup_anndata(adata, batch_key="study")

# Model automatically learns batch-corrected representations
model = scvi.model.SCVI(adata)
model.train()
latent = model.get_latent_representation()  # Batch-corrected
```

## Theoretical Foundations

scvi-tools is built on:
- **Variational inference**: Approximate posterior distributions for scalable Bayesian inference
- **Deep generative models**: VAE architectures that learn complex data distributions
- **Amortized inference**: Shared neural networks for efficient learning across cells
- **Probabilistic modeling**: Principled uncertainty quantification and statistical testing
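In simplified form (notation mine; scVI's library-size and batch variables are suppressed), training maximizes the evidence lower bound on the data likelihood:

```latex
% Simplified ELBO for a VAE-style model such as scVI:
% encoder q_phi(z | x) approximates the posterior over latent cell state z,
% decoder p_theta(x | z) is a count likelihood (e.g., negative binomial)
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
\;-\; \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```

The reconstruction term rewards a decoder that explains the observed counts, while the KL term regularizes the per-cell latent representation toward the prior.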

See `references/theoretical-foundations.md` for detailed background on the mathematical framework.

## Additional Resources

- **Workflows**: `references/workflows.md` contains common workflows, best practices, hyperparameter tuning, and GPU optimization
- **Model References**: Detailed documentation for each model category in the `references/` directory
- **Official Documentation**: https://docs.scvi-tools.org/en/stable/
- **Tutorials**: https://docs.scvi-tools.org/en/stable/tutorials/index.html
- **API Reference**: https://docs.scvi-tools.org/en/stable/api/index.html

## Installation

```bash
pip install scvi-tools

# Optional GPU extras in recent releases (quote the brackets for shells like zsh);
# GPU support ultimately depends on a CUDA-enabled PyTorch build
pip install "scvi-tools[cuda]"
```

## Best Practices

1. **Use raw counts**: Always provide unnormalized count data to models
2. **Filter genes**: Remove low-count genes before analysis (e.g., `min_counts=3`)
3. **Register covariates**: Include known technical factors (batch, donor, etc.) in `setup_anndata`
4. **Feature selection**: Use highly variable genes for improved performance
5. **Model saving**: Always save trained models to avoid retraining
6. **GPU usage**: Enable GPU acceleration for large datasets (`accelerator="gpu"`)
7. **Scanpy integration**: Store outputs in AnnData objects for downstream analysis
