home / skills / anthropics / life-sciences / scvi-tools

scvi-tools skill

/scvi-tools

This skill guides deep learning-based single-cell analysis with scvi-tools, enabling data integration, multi-modal workflows, and label transfer for scalable

This is most likely a fork of the scvi-tools skill from anthropics
npx playbooks add skill anthropics/life-sciences --skill scvi-tools

Review the files below or copy the command above to add this skill to your agents.

Files (21)
SKILL.md
7.2 KB
---
name: scvi-tools
description: Deep learning for single-cell analysis using scvi-tools. This skill should be used when users need (1) data integration and batch correction with scVI/scANVI, (2) ATAC-seq analysis with PeakVI, (3) CITE-seq multi-modal analysis with totalVI, (4) multiome RNA+ATAC analysis with MultiVI, (5) spatial transcriptomics deconvolution with DestVI, (6) label transfer and reference mapping with scANVI/scArches, (7) RNA velocity with veloVI, or (8) any deep learning-based single-cell method. Triggers include mentions of scVI, scANVI, totalVI, PeakVI, MultiVI, DestVI, veloVI, sysVI, scArches, variational autoencoder, VAE, batch correction, data integration, multi-modal, CITE-seq, multiome, reference mapping, latent space.
---

# scvi-tools Deep Learning Skill

This skill provides guidance for deep learning-based single-cell analysis using scvi-tools, the leading framework for probabilistic models in single-cell genomics.

## How to Use This Skill

1. Identify the appropriate workflow from the model/workflow tables below
2. Read the corresponding reference file for detailed steps and code
3. Use scripts in `scripts/` to avoid rewriting common code
4. For installation or GPU issues, consult `references/environment_setup.md`
5. For debugging, consult `references/troubleshooting.md`

## When to Use This Skill

- When scvi-tools, scVI, scANVI, or related models are mentioned
- When deep learning-based batch correction or integration is needed
- When working with multi-modal data (CITE-seq, multiome)
- When reference mapping or label transfer is required
- When analyzing ATAC-seq or spatial transcriptomics data
- When learning latent representations of single-cell data

## Model Selection Guide

| Data Type | Model | Primary Use Case |
|-----------|-------|------------------|
| scRNA-seq | **scVI** | Unsupervised integration, DE, imputation |
| scRNA-seq + labels | **scANVI** | Label transfer, semi-supervised integration |
| CITE-seq (RNA+protein) | **totalVI** | Multi-modal integration, protein denoising |
| scATAC-seq | **PeakVI** | Chromatin accessibility analysis |
| Multiome (RNA+ATAC) | **MultiVI** | Joint modality analysis |
| Spatial + scRNA reference | **DestVI** | Cell type deconvolution |
| RNA velocity | **veloVI** | Transcriptional dynamics |
| Cross-technology | **sysVI** | System-level batch correction |

## Workflow Reference Files

| Workflow | Reference File | Description |
|----------|---------------|-------------|
| Environment Setup | `references/environment_setup.md` | Installation, GPU, version info |
| Data Preparation | `references/data_preparation.md` | Formatting data for any model |
| scRNA Integration | `references/scrna_integration.md` | scVI/scANVI batch correction |
| ATAC-seq Analysis | `references/atac_peakvi.md` | PeakVI for accessibility |
| CITE-seq Analysis | `references/citeseq_totalvi.md` | totalVI for protein+RNA |
| Multiome Analysis | `references/multiome_multivi.md` | MultiVI for RNA+ATAC |
| Spatial Deconvolution | `references/spatial_deconvolution.md` | DestVI spatial analysis |
| Label Transfer | `references/label_transfer.md` | scANVI reference mapping |
| scArches Mapping | `references/scarches_mapping.md` | Query-to-reference mapping |
| Batch Correction | `references/batch_correction_sysvi.md` | Advanced batch methods |
| RNA Velocity | `references/rna_velocity_velovi.md` | veloVI dynamics |
| Troubleshooting | `references/troubleshooting.md` | Common issues and solutions |

## CLI Scripts

Modular scripts for common workflows. Chain together or modify as needed.

### Pipeline Scripts

| Script | Purpose | Usage |
|--------|---------|-------|
| `prepare_data.py` | QC, filter, HVG selection | `python scripts/prepare_data.py raw.h5ad prepared.h5ad --batch-key batch` |
| `train_model.py` | Train any scvi-tools model | `python scripts/train_model.py prepared.h5ad results/ --model scvi` |
| `cluster_embed.py` | Neighbors, UMAP, Leiden | `python scripts/cluster_embed.py adata.h5ad results/` |
| `differential_expression.py` | DE analysis | `python scripts/differential_expression.py model/ adata.h5ad de.csv --groupby leiden` |
| `transfer_labels.py` | Label transfer with scANVI | `python scripts/transfer_labels.py ref_model/ query.h5ad results/` |
| `integrate_datasets.py` | Multi-dataset integration | `python scripts/integrate_datasets.py results/ data1.h5ad data2.h5ad` |
| `validate_adata.py` | Check data compatibility | `python scripts/validate_adata.py data.h5ad --batch-key batch` |

### Example Workflow

```bash
# 1. Validate input data
python scripts/validate_adata.py raw.h5ad --batch-key batch --suggest

# 2. Prepare data (QC, HVG selection)
python scripts/prepare_data.py raw.h5ad prepared.h5ad --batch-key batch --n-hvgs 2000

# 3. Train model
python scripts/train_model.py prepared.h5ad results/ --model scvi --batch-key batch

# 4. Cluster and visualize
python scripts/cluster_embed.py results/adata_trained.h5ad results/ --resolution 0.8

# 5. Differential expression
python scripts/differential_expression.py results/model results/adata_clustered.h5ad results/de.csv --groupby leiden
```

### Python Utilities

The `scripts/model_utils.py` provides importable functions for custom workflows:

| Function | Purpose |
|----------|---------|
| `prepare_adata()` | Data preparation (QC, HVG, layer setup) |
| `train_scvi()` | Train scVI or scANVI |
| `evaluate_integration()` | Compute integration metrics |
| `get_marker_genes()` | Extract DE markers |
| `save_results()` | Save model, data, plots |
| `auto_select_model()` | Suggest best model |
| `quick_clustering()` | Neighbors + UMAP + Leiden |

## Critical Requirements

1. **Raw counts required**: scvi-tools models require integer count data
   ```python
   adata.layers["counts"] = adata.X.copy()  # Before normalization
   scvi.model.SCVI.setup_anndata(adata, layer="counts")
   ```

2. **HVG selection**: Use 2000-4000 highly variable genes
   ```python
   sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="batch", layer="counts", flavor="seurat_v3")
   adata = adata[:, adata.var['highly_variable']].copy()
   ```

3. **Batch information**: Specify batch_key for integration
   ```python
   scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
   ```

## Quick Decision Tree

```
Need to integrate scRNA-seq data?
├── Have cell type labels? → scANVI (references/label_transfer.md)
└── No labels? → scVI (references/scrna_integration.md)

Have multi-modal data?
├── CITE-seq (RNA + protein)? → totalVI (references/citeseq_totalvi.md)
├── Multiome (RNA + ATAC)? → MultiVI (references/multiome_multivi.md)
└── scATAC-seq only? → PeakVI (references/atac_peakvi.md)

Have spatial data?
└── Need cell type deconvolution? → DestVI (references/spatial_deconvolution.md)

Have pre-trained reference model?
└── Map query to reference? → scArches (references/scarches_mapping.md)

Need RNA velocity?
└── veloVI (references/rna_velocity_velovi.md)

Strong cross-technology batch effects?
└── sysVI (references/batch_correction_sysvi.md)
```

## Key Resources

- [scvi-tools Documentation](https://docs.scvi-tools.org/)
- [scvi-tools Tutorials](https://docs.scvi-tools.org/en/stable/tutorials/index.html)
- [Model Hub](https://huggingface.co/scvi-tools)
- [GitHub Issues](https://github.com/scverse/scvi-tools/issues)

Overview

This skill guides deep learning-based single-cell analysis using scvi-tools. It covers model selection and end-to-end workflows for scVI, scANVI, totalVI, PeakVI, MultiVI, DestVI, veloVI, and scArches. Use it to integrate datasets, correct batch effects, analyze multi-modal data, perform label transfer, and run downstream tasks like clustering and differential expression. Practical scripts and utilities are included to accelerate reproducible analyses.

How this skill works

The skill inspects your analysis goal (integration, multi-modal, ATAC, spatial, velocity, or reference mapping) and recommends the appropriate scvi-tools model and pipeline steps. It expects raw count matrices and annotated AnnData objects and provides scripts for data prep, model training, embedding, clustering, and DE testing. Utility functions automate common setup tasks (HVG selection, anndata validation, model setup) and save trained models for reference mapping. Troubleshooting and environment setup guidance address GPU and dependency issues.

When to use it

  • You need robust batch correction or dataset integration for scRNA-seq.
  • You have multi-modal data (CITE-seq, multiome RNA+ATAC) requiring joint modeling.
  • You want ATAC-seq chromatin accessibility analysis or spatial deconvolution.
  • You need reference mapping or label transfer between datasets.
  • You plan to learn latent representations or run RNA velocity with deep models.

Best practices

  • Provide raw integer counts in adata.layers['counts'] before normalization.
  • Select 2,000–4,000 HVGs and subset adata before training for stability.
  • Specify a batch_key during setup_anndata to enable proper batch correction.
  • Validate AnnData compatibility with provided validation script before training.
  • Use GPU when available and follow environment setup notes to avoid dependency issues.

Example use cases

  • Integrate multiple scRNA-seq batches with scVI and compare clustering across conditions.
  • Perform label transfer from a well-annotated reference to a new query with scANVI or scArches.
  • Denoise protein measurements and analyze joint RNA+protein variation with totalVI for CITE-seq.
  • Analyze scATAC-seq peak accessibility and latent structure with PeakVI.
  • Deconvolve spatial transcriptomics spots using DestVI and a single-cell reference.

FAQ

Do I need raw counts?

Yes. scvi-tools models require integer raw counts; store them in adata.layers['counts'] before normalization.

Which model for CITE-seq?

Use totalVI for joint RNA+protein modeling, denoising, and protein-level analyses.

Can I map a query dataset to a pretrained reference?

Yes. Use scArches or scANVI reference mapping workflows to map query data to a saved reference model.