
single-popv-annotation skill

/.claude/skills/single-popv-annotation

This skill consolidates up to 10 cell-type classifiers with consensus voting to annotate single-cell data robustly.

npx playbooks add skill starlitnightly/omicverse --skill single-popv-annotation

---
name: single-popv-annotation
title: PopV population-level cell type annotation
description: "PopV population-level cell annotation: 10 algorithms (SCVI, SCANVI, CellTypist, OnClass, RF, SVM, XGBoost, BBKNN, HARMONY, SCANORAMA), consensus voting, pretrained hub models."
---

# PopV Population-Level Cell Type Annotation

PopV (Popular Vote) annotates cell types by running up to 10 classification algorithms and aggregating their predictions by majority vote. Unlike single-method annotation (SCSA, MetaTiME, CellTypist alone), PopV produces a consensus prediction that is more robust to the failure of any individual algorithm. The module also supports ontology-aware voting via the Cell Ontology (CL) for hierarchical label resolution.
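The voting idea can be illustrated with a minimal, library-free sketch. This is not PopV's implementation: the `majority_vote` helper, the method names, and the toy labels are hypothetical, and ties here simply go to the label seen first, which may differ from how PopV resolves them.

```python
from collections import Counter

def majority_vote(predictions_per_method):
    """Return one (label, n_votes) pair per cell from a dict of method -> labels."""
    methods = list(predictions_per_method)
    n_cells = len(predictions_per_method[methods[0]])
    consensus = []
    for i in range(n_cells):
        # Count how many methods predicted each label for cell i
        votes = Counter(predictions_per_method[m][i] for m in methods)
        label, count = votes.most_common(1)[0]
        consensus.append((label, count))
    return consensus

# Toy predictions from three hypothetical methods on four cells
preds = {
    "method_a": ["T cell", "B cell", "NK cell", "T cell"],
    "method_b": ["T cell", "B cell", "T cell", "monocyte"],
    "method_c": ["T cell", "monocyte", "T cell", "B cell"],
}
consensus = majority_vote(preds)
```

The vote count per cell plays the same role as a per-cell agreement score: a cell where all three methods agree is a safer call than one where every method disagrees.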

## Defensive Validation

```python
# Before PopV: verify reference has the cell type column
assert ref_labels_key in ref_adata.obs.columns, \
    f"ref_adata.obs['{ref_labels_key}'] not found. Available: {list(ref_adata.obs.columns)}"

# Verify no NaN in reference labels
assert ref_adata.obs[ref_labels_key].notna().all(), \
    f"NaN values in ref_adata.obs['{ref_labels_key}']. Use fillna() or drop these cells."

# Verify gene overlap
overlap = query_adata.var_names.intersection(ref_adata.var_names)
assert len(overlap) > 100, \
    f"Only {len(overlap)} overlapping genes between query and reference. Check var_names format (ENSEMBL vs symbol)."
```
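When the gene-overlap assertion fails, a common cause is that one dataset uses ENSEMBL IDs and the other uses gene symbols. The following is a rough sketch for detecting that situation; the helper names and the regular expression are assumptions made for illustration (the pattern covers typical Ensembl gene IDs such as `ENSG...` or `ENSMUSG...`, not every identifier scheme).

```python
import re

# Assumed pattern: "ENS", optional species letters, "G", 11 digits, optional version
ENSEMBL_PATTERN = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")

def ensembl_fraction(var_names):
    """Fraction of names that look like Ensembl gene IDs."""
    names = list(var_names)
    if not names:
        return 0.0
    hits = sum(1 for name in names if ENSEMBL_PATTERN.match(str(name)))
    return hits / len(names)

def same_id_style(query_names, ref_names, threshold=0.5):
    """True when query and reference appear to use the same identifier style."""
    query_is_ensembl = ensembl_fraction(query_names) >= threshold
    ref_is_ensembl = ensembl_fraction(ref_names) >= threshold
    return query_is_ensembl == ref_is_ensembl
```

In practice this could be called as `same_id_style(query_adata.var_names, ref_adata.var_names)` before running the overlap check.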

## Stage 1: Data Preparation

```python
import omicverse as ov

# Process_Query preprocesses and concatenates query + reference
process_obj = ov.popv.Process_Query(
    query_adata=query_adata,
    ref_adata=ref_adata,
    ref_labels_key='cell_type',           # REQUIRED: column in ref_adata.obs
    ref_batch_key='batch',                # batch column in ref_adata.obs
    query_batch_key='batch',              # batch column in query_adata.obs (optional)
    cl_obo_folder=False,                  # False to skip ontology, or path to CL .obo file
    prediction_mode='retrain',            # 'retrain' | 'inference' | 'fast'
    unknown_celltype_label='unknown',     # label for query cells
    n_samples_per_label=300,              # subsample reference per cell type
    hvg=4000,                             # number of highly variable genes
    save_path_trained_models='tmp/',      # where to save models
    pretrained_scvi_path=None,            # path to pretrained scVI model (optional)
)
```

**prediction_mode choices:**
- `'retrain'` — Train all models from scratch on reference+query. Most accurate, slowest.
- `'inference'` — Load previously saved models. Requires `save_path_trained_models` from prior run.
- `'fast'` — Skip integration-heavy algorithms. Uses FAST_ALGORITHMS subset.
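One way to choose between `'retrain'` and `'inference'` is to check whether a previous run left anything in the model directory. The helper below is a hypothetical convenience, not part of omicverse; it only checks that the directory is non-empty and does not verify that the files are valid PopV models. Choosing `'fast'` is left to the caller.

```python
import tempfile
from pathlib import Path

def choose_prediction_mode(model_dir):
    """Suggest 'inference' if model_dir holds files from a prior run, else 'retrain'."""
    path = Path(model_dir)
    has_saved_files = path.is_dir() and any(path.iterdir())
    return "inference" if has_saved_files else "retrain"

# Demonstration with a temporary directory standing in for save_path_trained_models
with tempfile.TemporaryDirectory() as tmp:
    mode_empty = choose_prediction_mode(tmp)           # nothing saved yet
    (Path(tmp) / "model.pt").write_bytes(b"")          # placeholder file
    mode_populated = choose_prediction_mode(tmp)       # something to load

mode_missing = choose_prediction_mode("path/that/does/not/exist")
```

The returned string can then be passed as `prediction_mode=` to `Process_Query`.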

**Preprocessing applied automatically:**
- Filters cells with < 30 total counts
- Total-count normalization (target_sum=1e4) followed by log1p
- PCA on reference (50 components)
- Stores raw counts in `layers['scvi_counts']`
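To make the filtering and normalization steps concrete, here is a plain-Python sketch of what they do to a small count matrix. It assumes the usual scanpy-style order (scale each cell to `target_sum`, then apply `log1p`) and is not the code PopV runs; `preprocess_counts` and the toy matrix are illustrative only.

```python
import math

def preprocess_counts(cells, min_counts=30, target_sum=1e4):
    """Drop low-count cells, scale each remaining cell to target_sum, apply log1p."""
    processed = []
    for counts in cells:
        total = sum(counts)
        if total < min_counts:
            continue  # cells with fewer than min_counts total counts are removed
        scale = target_sum / total
        processed.append([math.log1p(c * scale) for c in counts])
    return processed

cells = [
    [10, 5, 5],      # total 20: below the threshold, filtered out
    [50, 25, 25],    # total 100: kept and normalized
]
result = preprocess_counts(cells)
```

After this transform, undoing the log with `math.expm1` recovers values that sum to `target_sum` for every retained cell, which is what makes cells with different sequencing depth comparable.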

## Stage 2: Annotation

```python
# Run the default algorithm set and compute consensus
ov.popv.annotate_data(
    process_obj.adata,
    methods='all',                        # 'all' uses CURRENT_ALGORITHMS; or pass a list of names
    save_path='results/popv/',            # saves predictions.csv here
    methods_kwargs=None,                  # dict of per-method overrides
)
```

### Available Algorithms (10 total)

| Algorithm | Result Key | Type | Speed |
|-----------|-----------|------|-------|
| `KNN_SCVI` | `popv_knn_on_scvi_prediction` | Deep learning + KNN | Medium |
| `SCANVI_POPV` | `popv_scanvi_prediction` | Semi-supervised DL | Medium |
| `CELLTYPIST` | `popv_celltypist_prediction` | Logistic regression | Fast |
| `ONCLASS` | `popv_onclass_prediction` | Ontology-guided | Medium |
| `Support_Vector` | `popv_svm_prediction` | SVM | Fast |
| `XGboost` | `popv_xgboost_prediction` | Gradient boosting | Fast |
| `KNN_HARMONY` | `popv_knn_harmony_prediction` | Harmony + KNN | Fast |
| `KNN_BBKNN` | `popv_knn_bbknn_prediction` | BBKNN + KNN | Fast |
| `Random_Forest` | `popv_rf_prediction` | Random forest | Fast |
| `KNN_SCANORAMA` | `popv_knn_scanorama_prediction` | Scanorama + KNN | Medium |

**Algorithm subsets:**
- `FAST_ALGORITHMS`: KNN_SCVI, SCANVI_POPV, Support_Vector, XGboost, ONCLASS, CELLTYPIST (used with `prediction_mode='fast'`)
- `CURRENT_ALGORITHMS`: All except Random_Forest and KNN_SCANORAMA, which are treated as outdated
- `'all'` or `None`: Uses CURRENT_ALGORITHMS (or FAST_ALGORITHMS in fast mode)
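The resolution rules above can be sketched as a small validator that also catches miscased names before any training starts. The lists below are copied from this document rather than imported from omicverse, and `resolve_methods` is a hypothetical helper, so treat it as an illustration of the rules and not as the library's own logic.

```python
ALL_ALGORITHMS = [
    "KNN_SCVI", "SCANVI_POPV", "CELLTYPIST", "ONCLASS", "Support_Vector",
    "XGboost", "KNN_HARMONY", "KNN_BBKNN", "Random_Forest", "KNN_SCANORAMA",
]
FAST_ALGORITHMS = [
    "KNN_SCVI", "SCANVI_POPV", "Support_Vector", "XGboost", "ONCLASS", "CELLTYPIST",
]
CURRENT_ALGORITHMS = [
    name for name in ALL_ALGORITHMS if name not in ("Random_Forest", "KNN_SCANORAMA")
]

def resolve_methods(methods="all", prediction_mode="retrain"):
    """Expand 'all'/None to the documented default set, or validate an explicit list."""
    if methods is None or methods == "all":
        default = FAST_ALGORITHMS if prediction_mode == "fast" else CURRENT_ALGORITHMS
        return list(default)
    if isinstance(methods, str):
        methods = [methods]
    by_lowercase = {name.lower(): name for name in ALL_ALGORITHMS}
    problems = []
    for name in methods:
        if name not in ALL_ALGORITHMS:
            # Offer the correctly cased name when only the case is wrong
            hint = by_lowercase.get(str(name).lower())
            problems.append(f"{name!r}" + (f" (did you mean {hint!r}?)" if hint else ""))
    if problems:
        raise ValueError("Unknown PopV method name(s): " + ", ".join(problems))
    return list(methods)
```

For example, `resolve_methods(['knn_scvi'])` raises a `ValueError` that suggests `'KNN_SCVI'`, while `resolve_methods('all')` returns the eight names in `CURRENT_ALGORITHMS`.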

### Selecting Specific Methods

```python
# Run only fast classical methods
ov.popv.annotate_data(
    process_obj.adata,
    methods=['CELLTYPIST', 'Support_Vector', 'XGboost'],
)

# Override per-method parameters
ov.popv.annotate_data(
    process_obj.adata,
    methods=['KNN_SCVI', 'SCANVI_POPV'],
    methods_kwargs={
        'KNN_SCVI': {'train_kwargs': {'max_epochs': 50}},
        'SCANVI_POPV': {'train_kwargs': {'max_epochs': 50}},
    },
)
```

## Stage 3: Consensus Results & Visualization

After `annotate_data()`, these columns appear in `adata.obs`:

| Column | Description |
|--------|-------------|
| `popv_majority_vote_prediction` | Majority vote across all methods |
| `popv_majority_vote_score` | Number of agreeing methods |
| `popv_prediction` | Ontology-aggregated consensus (if CL enabled) |
| `popv_prediction_score` | Ontology consensus score |

```python
# Agreement plots: confusion matrices per method vs consensus
ov.popv.make_agreement_plots(
    process_obj.adata,
    prediction_keys=process_obj.adata.uns['prediction_keys'],
    popv_prediction_key='popv_prediction',
    save_folder='results/popv/',
    show=True,
)

# Bar plot: agreement score per cell type
ov.popv.agreement_score_bar_plot(
    process_obj.adata,
    popv_prediction_key='popv_prediction',
    save_folder='results/popv/',
)

# Bar plot: prediction score distribution
ov.popv.prediction_score_bar_plot(
    process_obj.adata,
    popv_prediction_score='popv_prediction_score',
    save_folder='results/popv/',
)

# Bar plot: cell type proportions (ref vs query)
ov.popv.celltype_ratio_bar_plot(
    process_obj.adata,
    popv_prediction='popv_prediction',
    save_folder='results/popv/',
)
```

## Stage 4: Pretrained Hub Models (Optional)

For large references (e.g., Human Cell Atlas), use pretrained models to skip training:

```python
from omicverse.popv.hub import HubModel

# Pull pretrained model from HuggingFace
model = HubModel.pull_from_huggingface_hub(
    repo_name='popv/immune_all',
    cache_dir='models/popv/',
)

# Annotate query data directly (fast mode)
result_adata = model.annotate_data(
    query_adata=query_adata,
    query_batch_key='batch',
    prediction_mode='fast',
    methods=None,  # uses model's default methods
)
```

## Critical API Reference

```python
# CORRECT: methods as list of strings matching class names
ov.popv.annotate_data(adata, methods=['KNN_SCVI', 'CELLTYPIST', 'Support_Vector'])

# WRONG: passing class objects or lowercase names
# ov.popv.annotate_data(adata, methods=[KNN_SCVI, CELLTYPIST])  # TypeError
# ov.popv.annotate_data(adata, methods=['knn_scvi'])             # KeyError

# CORRECT: ref_labels_key must exist in ref_adata.obs before Process_Query
assert 'cell_type' in ref_adata.obs.columns
process_obj = ov.popv.Process_Query(ref_labels_key='cell_type', ...)

# WRONG: setting unknown_celltype_label to None causes NaN in voting
# process_obj = ov.popv.Process_Query(..., unknown_celltype_label=None)  # NaN errors

# CORRECT: access consensus results after annotation
final_labels = process_obj.adata.obs['popv_majority_vote_prediction']
# or ontology-refined:
final_labels = process_obj.adata.obs['popv_prediction']

# WRONG: looking for results on the original query_adata
# query_adata.obs['popv_prediction']  # KeyError: results are on process_obj.adata
```

## GPU Acceleration

```python
import omicverse.popv as popv
popv.settings.accelerator = 'gpu'   # for scVI/scANVI training
popv.settings.cuml = True           # for KNN/SVM/RF via cuML
popv.settings.n_jobs = 10           # parallel jobs for CPU methods
```
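If the same script must run on machines with and without a GPU, the accelerator can be chosen at runtime. The sketch below uses PyTorch's `torch.cuda.is_available()`; the `pick_accelerator` helper is hypothetical, and whether `popv.settings.accelerator` accepts `'cpu'` is an assumption, so the assignment is left commented out.

```python
def pick_accelerator():
    """Return 'gpu' if PyTorch reports a usable CUDA device, otherwise 'cpu'."""
    try:
        import torch
    except (ImportError, OSError):
        # torch is missing or its shared libraries failed to load
        return "cpu"
    return "gpu" if torch.cuda.is_available() else "cpu"

accelerator = pick_accelerator()

# Assumed usage, not verified against the omicverse API:
# import omicverse.popv as popv
# popv.settings.accelerator = accelerator
```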

## Troubleshooting

- **`RuntimeError: CUDA out of memory` during scVI/scANVI training**: Reduce `hvg` (try 2000), decrease `n_samples_per_label` (try 100), or switch to `prediction_mode='fast'` which uses fewer epochs.
- **CellTypist model download fails**: Set `methods_kwargs={'CELLTYPIST': {'method_kwargs': {'model': '/path/to/local/model.pkl'}}}` to use a local model file.
- **Low consensus agreement (<50% cells agree)**: Some algorithms may not suit your tissue. Exclude underperforming methods: check per-method predictions and drop outliers from the `methods` list.
- **`KeyError: 'gene_name'` (gene identifier mismatch)**: Harmonize var_names between reference and query before calling `Process_Query`. If one dataset has ENSEMBL IDs in var_names and the other uses symbols, switch it to symbols with `adata.var_names = adata.var['gene_symbols']`, assuming that column exists in `adata.var`.
- **`ValueError: batch_key contains NaN`**: Clean batch columns before PopV. Apply the batch validation pattern from the single-preprocessing skill: `adata.obs['batch'] = adata.obs['batch'].fillna('unknown').astype('category')`.
- **`FileNotFoundError` in inference mode**: Ensure `save_path_trained_models` points to the same directory used during the original `retrain` run. Check that model files (.pt, .pkl, .joblib) exist.
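For the low-consensus case, it helps to quantify how often each method matches the consensus before deciding what to drop. The following is a minimal sketch on plain lists; the helper names and the 0.5 cutoff are hypothetical, and the toy labels are invented. With real data, the per-method labels would come from the `adata.obs` columns listed in `adata.uns['prediction_keys']`.

```python
def agreement_with_consensus(per_method, consensus):
    """Fraction of cells where each method's label equals the consensus label."""
    n_cells = len(consensus)
    rates = {}
    for method, labels in per_method.items():
        matches = sum(1 for pred, ref in zip(labels, consensus) if pred == ref)
        rates[method] = matches / n_cells if n_cells else 0.0
    return rates

def low_agreement_methods(rates, threshold=0.5):
    """Names of methods whose agreement rate falls below the threshold."""
    return sorted(method for method, rate in rates.items() if rate < threshold)

# Toy example using result keys from the algorithm table
consensus_labels = ["T cell", "T cell", "B cell", "B cell"]
per_method = {
    "popv_celltypist_prediction": ["T cell", "T cell", "B cell", "B cell"],
    "popv_svm_prediction": ["T cell", "B cell", "B cell", "B cell"],
    "popv_xgboost_prediction": ["B cell", "B cell", "T cell", "B cell"],
}
rates = agreement_with_consensus(per_method, consensus_labels)
outliers = low_agreement_methods(rates)
```

A method flagged here is a candidate for removal from the `methods` list, though a low rate on a single rare cell type can also point to a gap in the reference and not to a bad algorithm.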

## Dependencies
- Core: `omicverse`, `scanpy`, `anndata`, `numpy`, `pandas`
- Deep learning: `scvi-tools`, `torch` (for KNN_SCVI, SCANVI_POPV)
- Classical ML: `scikit-learn`, `xgboost` (for RF, SVM, XGBoost)
- Integration: `harmonypy`, `bbknn`, `scanorama` (for respective KNN methods)
- Annotation: `celltypist`, `OnClass` (optional per method)
- Ontology: `obonet`, `pronto` (for ontology-aware voting)
- Hub: `huggingface_hub` (for pretrained models)
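Because most of these packages are only needed by specific methods, a quick availability check can prevent a failure partway through a long run. The mapping below is an assumption derived from the list above (import names such as `scvi` for `scvi-tools` and `sklearn` for `scikit-learn`), and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed mapping from PopV method name to the modules it needs
OPTIONAL_MODULES = {
    "KNN_SCVI": ["scvi", "torch"],
    "SCANVI_POPV": ["scvi", "torch"],
    "CELLTYPIST": ["celltypist"],
    "ONCLASS": ["OnClass"],
    "Support_Vector": ["sklearn"],
    "Random_Forest": ["sklearn"],
    "XGboost": ["xgboost"],
    "KNN_HARMONY": ["harmonypy"],
    "KNN_BBKNN": ["bbknn"],
    "KNN_SCANORAMA": ["scanorama"],
}

def missing_modules(methods):
    """Map each requested method to the modules that cannot be found, if any."""
    missing = {}
    for method in methods:
        absent = [
            module for module in OPTIONAL_MODULES.get(method, [])
            if find_spec(module) is None
        ]
        if absent:
            missing[method] = absent
    return missing

report = missing_modules(["CELLTYPIST", "Support_Vector", "XGboost"])
```

An empty `report` means every requested method has its modules installed; otherwise the dictionary shows which packages to install or which methods to leave out.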

## Examples
- "Annotate my PBMC query data against a reference atlas using PopV with all 10 algorithms and visualize the consensus."
- "Use a pretrained PopV hub model to quickly annotate my lung tissue scRNA-seq data."
- "Run PopV with only classical methods (SVM, XGBoost, CellTypist) to annotate my query cells without GPU."

## References
- Quick copy/paste commands: [`reference.md`](reference.md)

Overview

This skill provides population-level cell type annotation by running up to ten classification algorithms and aggregating their outputs into a robust consensus. It supports retraining, inference from saved models, and fast modes that skip heavy integration steps. Ontology-aware voting via the Cell Ontology is available, and pretrained hub models can be pulled to skip training on large references.

How this skill works

The workflow first validates inputs and prepares data by filtering low-count cells, normalizing, selecting highly variable genes, and performing PCA. It then runs a configurable set of algorithms (deep learning, semi-supervised, classical ML, and integration-based KNNs) and stores each method’s predictions. A majority-vote consensus and optional ontology-aggregated consensus are computed and saved in the AnnData object. Visualization helpers generate agreement matrices, score distributions, and cell type ratio plots.

When to use it

  • Annotating query single-cell datasets against a curated reference atlas
  • Seeking robust labels by combining multiple classification algorithms
  • Using pretrained models to annotate query data against large references without retraining
  • Quick annotation with classical methods when GPUs are unavailable
  • Resolving hierarchical labels using Cell Ontology awareness

Best practices

  • Ensure reference labels column exists and contains no NaN before processing
  • Harmonize gene identifiers (symbols vs ENSEMBL) to maximize gene overlap (the validation check expects more than 100 shared genes)
  • Choose prediction_mode='retrain' for best accuracy, 'inference' to reuse saved models, or 'fast' for quick runs
  • Subsample large references (n_samples_per_label) and reduce HVG if GPU memory is limited
  • Inspect per-method predictions and exclude poorly performing algorithms to raise consensus agreement

Example use cases

  • Annotate PBMC query cells against a reference atlas and produce consensus labels and agreement plots
  • Use a pretrained hub model to rapidly label lung scRNA-seq without local training
  • Run a fast classical-only pipeline (SVM, XGBoost, CellTypist) when GPU training is not available
  • Compare method-level predictions to identify algorithms that disagree on specific cell types

FAQ

What prediction_mode should I pick?

Use 'retrain' for highest accuracy if you can train models; choose 'inference' to reuse previously saved models; pick 'fast' to skip integration-heavy methods and get results quickly.

Why do I get low consensus agreement?

Low agreement often means some algorithms are mismatched to the tissue. Inspect per-method outputs, remove underperforming methods, or improve reference quality (more samples per label, clearer annotations).

How do I avoid CUDA out-of-memory errors?

Reduce HVG (e.g., to 2000), lower n_samples_per_label, switch to 'fast' mode, or run on CPU by disabling GPU in settings.