home / skills / gptomics / bioskills / atlas-mapping

atlas-mapping skill

/machine-learning/atlas-mapping

This skill maps single-cell data onto reference atlases using scArches, enabling label transfer with surgical fine-tuning and no retraining.

npx playbooks add skill gptomics/bioskills --skill atlas-mapping

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
5.8 KB
---
name: bio-machine-learning-atlas-mapping
description: Maps query single-cell data to reference atlases using scArches transfer learning with scVI and scANVI models. Transfers cell type labels without retraining on combined data. Use when annotating new single-cell datasets using pre-trained reference models.
tool_type: python
primary_tool: scvi-tools
---

## Version Compatibility

Reference examples tested with: anndata 0.10+, scanpy 1.10+, scvi-tools 1.1+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Transfer Learning for Single-Cell Data

**"Map my scRNA-seq data onto a reference atlas"** → Transfer cell type labels from a pre-trained reference model to query cells using architectural surgery (scArches) without retraining.
- Python: `scvi.model.SCVI.load_query_data()` → `get_latent_representation()` → `scanpy.tl.ingest()`

## scVI Reference Mapping (scArches)

**Goal:** Map query single-cell data onto a pre-trained reference model to obtain a shared latent embedding.

**Approach:** Load a pre-trained scVI model, prepare query data with matching gene sets, then perform surgical fine-tuning that updates only query-specific parameters.

```python
import scvi
import scanpy as sc

# Load pre-trained reference model
adata_ref = sc.read_h5ad('reference.h5ad')
# Model must have been saved with save_anndata=True
scvi.model.SCVI.setup_anndata(adata_ref, layer='counts', batch_key='batch')
ref_model = scvi.model.SCVI.load('reference_model/', adata=adata_ref)

# Prepare query data
adata_query = sc.read_h5ad('query.h5ad')
# Subset to reference genes
adata_query = adata_query[:, adata_ref.var_names].copy()

# Set up query AnnData using reference setup
scvi.model.SCVI.prepare_query_anndata(adata_query, ref_model)

# Load query into model (creates "surgical" fine-tuned model)
query_model = scvi.model.SCVI.load_query_data(adata_query, ref_model)

# Surgical training: update only query-specific parameters
# weight_decay=0.0: Standard for surgery; prevents reference drift
query_model.train(max_epochs=200, plan_kwargs={'weight_decay': 0.0})

# Get latent representation
adata_query.obsm['X_scVI'] = query_model.get_latent_representation()
```

## scANVI for Label Transfer

**Goal:** Transfer cell type labels from a labeled reference atlas to unlabeled query data.

**Approach:** Train a semi-supervised scANVI model on the reference, then map query cells via surgical fine-tuning and predict labels with confidence scores.

```python
import scvi
import scanpy as sc

# Reference with cell type labels
adata_ref = sc.read_h5ad('reference_labeled.h5ad')
scvi.model.SCVI.setup_anndata(adata_ref, layer='counts', batch_key='batch')
ref_vae = scvi.model.SCVI(adata_ref, n_latent=30)
ref_vae.train(max_epochs=100)

# Convert to scANVI (semi-supervised)
scvi.model.SCANVI.setup_anndata(adata_ref, layer='counts', batch_key='batch', labels_key='cell_type', unlabeled_category='Unknown')
ref_scanvi = scvi.model.SCANVI.from_scvi_model(ref_vae, labels_key='cell_type', unlabeled_category='Unknown')
ref_scanvi.train(max_epochs=50)
ref_scanvi.save('reference_scanvi/')

# Map query data
adata_query = sc.read_h5ad('query.h5ad')
adata_query = adata_query[:, adata_ref.var_names].copy()

scvi.model.SCANVI.prepare_query_anndata(adata_query, ref_scanvi)
query_scanvi = scvi.model.SCANVI.load_query_data(adata_query, ref_scanvi)
query_scanvi.train(max_epochs=100, plan_kwargs={'weight_decay': 0.0})

# Transfer labels
adata_query.obs['predicted_cell_type'] = query_scanvi.predict()
adata_query.obsm['X_scANVI'] = query_scanvi.get_latent_representation()
```

## Prediction Confidence

**Goal:** Assess reliability of transferred labels and flag cells that may represent novel types.

**Approach:** Extract soft prediction probabilities from the scANVI model and identify low-confidence assignments below a threshold.

```python
# Get prediction probabilities
soft_predictions = query_scanvi.predict(soft=True)
adata_query.obs['prediction_confidence'] = soft_predictions.max(axis=1)

# Flag low-confidence predictions
# confidence < 0.5: May be novel cell type or poor mapping
low_conf = adata_query.obs['prediction_confidence'] < 0.5
print(f'Low confidence predictions: {low_conf.sum()} ({low_conf.mean():.1%})')
```

## Joint Embedding Visualization

**Goal:** Visualize reference and query cells together to assess integration quality.

**Approach:** Concatenate reference and query datasets, compute UMAP from the shared latent representation, and color by dataset and cell type.

```python
import scanpy as sc

# Combine reference and query for visualization
adata_combined = adata_ref.concatenate(adata_query, batch_key='dataset', batch_categories=['reference', 'query'])

# Use latent space for neighbors/UMAP
sc.pp.neighbors(adata_combined, use_rep='X_scVI')
sc.tl.umap(adata_combined)
sc.pl.umap(adata_combined, color=['dataset', 'cell_type'], save='_transfer.png')
```

## Pre-trained Reference Atlases

| Atlas | Model | URL |
|-------|-------|-----|
| Human Lung Cell Atlas | scANVI | cellxgene.cziscience.com |
| Tabula Sapiens | scVI | tabula-sapiens-portal.ds.czbiohub.org |
| Mouse Cell Atlas | scVI | bis.zju.edu.cn/MCA |

## Training Parameters

| Parameter | Surgical | Full Retrain | Notes |
|-----------|----------|--------------|-------|
| weight_decay | 0.0 | 0.001 | 0.0 preserves reference |
| max_epochs | 100-200 | 200-400 | Less for surgery |
| early_stopping | True | True | Prevents overfitting |

## Related Skills

- single-cell/cell-annotation - Manual annotation methods
- single-cell/batch-integration - Batch effect correction
- single-cell/preprocessing - Data preparation before transfer

Overview

This skill maps query single-cell data onto pre-trained reference atlases using scArches transfer learning with scVI and scANVI models. It transfers cell type labels and produces a shared latent embedding without retraining on combined datasets. The workflow supports surgical fine-tuning to avoid reference drift and returns prediction confidences and joint embeddings for visualization.

How this skill works

Load a saved scVI or scANVI reference model and prepare the query AnnData to match the reference gene set. Use scvi.model.SCVI.prepare_query_anndata / SCANVI.prepare_query_anndata and load the query with load_query_data to perform architectural surgery that updates only query-specific parameters. Train briefly with weight_decay=0.0 to preserve the reference, then extract latent embeddings and perform label transfer using scANVI.predict with soft probabilities to assess confidence.

When to use it

  • Annotating a new scRNA-seq dataset using an existing labeled reference atlas.
  • Projecting query cells into a shared latent space for joint visualization with reference cells.
  • Transferring cell type labels rapidly without retraining a combined model.
  • Identifying low-confidence or potentially novel cell states via prediction probabilities.
  • Integrating query data when maintaining the integrity of a curated reference is important.

Best practices

  • Ensure query AnnData is subset to the reference var_names and uses the same preprocessing (counts or normalized layer).
  • Verify package versions (anndata, scanpy, scvi-tools) and adapt calls if APIs differ.
  • Use weight_decay=0.0 during surgical fine-tuning to prevent reference parameter drift.
  • Train query model for fewer epochs than full retrain (100–200) and enable early stopping.
  • Inspect soft prediction probabilities and flag cells below a chosen confidence threshold (e.g., 0.5).

Example use cases

  • Map a new lung scRNA-seq cohort to a Human Lung Cell Atlas scANVI reference to transfer cell type labels.
  • Project patient biopsy samples into a Tabula Sapiens scVI latent space for comparative visualization.
  • Annotate a mouse dataset against a pre-trained Mouse Cell Atlas and identify candidate novel clusters by low-confidence predictions.
  • Create joint UMAPs of reference + query to evaluate integration quality and batch alignment.

FAQ

Do I need to retrain the reference model with my query data?

No. Architectural surgery (load_query_data + brief fine-tuning) updates only query-specific parameters, preserving the reference.

How do I decide if a transferred label is reliable?

Use scANVI.predict(soft=True) to get per-cell probabilities and flag cells below a confidence threshold (commonly 0.5) as low-confidence or potentially novel.