This skill consolidates up to 10 cell-type classifiers with consensus voting to annotate single-cell data robustly.
Install: `npx playbooks add skill starlitnightly/omicverse --skill single-popv-annotation`
---
name: single-popv-annotation
title: PopV population-level cell type annotation
description: "PopV population-level cell annotation: 10 algorithms (SCVI, SCANVI, CellTypist, OnClass, RF, SVM, XGBoost, BBKNN, HARMONY, SCANORAMA), consensus voting, pretrained hub models."
---
# PopV Population-Level Cell Type Annotation
PopV (Population Voting) annotates cell types by running up to 10 classification algorithms and aggregating predictions via majority voting. Unlike single-method annotation (SCSA, MetaTiME, CellTypist alone), PopV produces a consensus prediction that is more robust to individual algorithm failures. The module also supports ontology-aware voting via the Cell Ontology (CL) for hierarchical label resolution.
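The voting step can be pictured with a minimal sketch in plain Python. This is not PopV's implementation: the function name `majority_vote` and the tie-breaking rule (the label seen first wins) are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return (label, n_agreeing) for one cell's per-method predictions.

    `predictions` maps a method name to its predicted label. Ties are
    broken by first occurrence, which is an assumption of this sketch.
    """
    counts = Counter(predictions.values())
    label, n_agreeing = counts.most_common(1)[0]
    return label, n_agreeing


# Hypothetical predictions for a single cell from three methods
cell = {
    "KNN_SCVI": "T cell",
    "CELLTYPIST": "T cell",
    "Support_Vector": "NK cell",
}
label, score = majority_vote(cell)
```

Here two of three methods agree, so a single misprediction does not change the final label.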
## Defensive Validation
```python
# Column in ref_adata.obs that holds the reference cell type labels
ref_labels_key = 'cell_type'
# Before PopV: verify reference has the cell type column
assert ref_labels_key in ref_adata.obs.columns, \
    f"ref_adata.obs['{ref_labels_key}'] not found. Available: {list(ref_adata.obs.columns)}"
# Verify no NaN in reference labels
assert ref_adata.obs[ref_labels_key].notna().all(), \
    f"NaN values in ref_adata.obs['{ref_labels_key}']. Use fillna() or drop these cells."
# Verify gene overlap
overlap = query_adata.var_names.intersection(ref_adata.var_names)
assert len(overlap) > 100, \
    f"Only {len(overlap)} overlapping genes between query and reference. Check var_names format (ENSEMBL vs symbol)."
```
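When the gene overlap check fails, a quick look at the identifier format on each side often explains why. The helper below is a hypothetical sketch, not part of omicverse: `guess_id_format` and its 90% threshold are assumptions, and the regular expression only covers the usual ENSEMBL gene ID shape (for example `ENSG` or `ENSMUSG` followed by 11 digits and an optional version suffix).

```python
import re

# ENSEMBL gene IDs such as ENSG00000141510 or ENSMUSG00000059552.3
ENSEMBL_RE = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def guess_id_format(var_names, min_fraction=0.9):
    """Classify gene identifiers as 'ensembl', 'symbol', 'mixed', or 'empty'."""
    names = [str(n) for n in var_names]
    if not names:
        return "empty"
    n_ensembl = sum(bool(ENSEMBL_RE.match(n)) for n in names)
    fraction = n_ensembl / len(names)
    if fraction >= min_fraction:
        return "ensembl"
    if fraction <= 1 - min_fraction:
        return "symbol"
    return "mixed"
```

Calling `guess_id_format(query_adata.var_names)` and `guess_id_format(ref_adata.var_names)` and comparing the two results shows whether one side needs converting before `Process_Query`.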
## Stage 1: Data Preparation
```python
import omicverse as ov
# Process_Query preprocesses and concatenates query + reference
process_obj = ov.popv.Process_Query(
    query_adata=query_adata,
    ref_adata=ref_adata,
    ref_labels_key='cell_type',        # REQUIRED: column in ref_adata.obs
    ref_batch_key='batch',             # batch column in ref_adata.obs
    query_batch_key='batch',           # batch column in query_adata.obs (optional)
    cl_obo_folder=False,               # False to skip ontology, or path to CL .obo file
    prediction_mode='retrain',         # 'retrain' | 'inference' | 'fast'
    unknown_celltype_label='unknown',  # label for query cells
    n_samples_per_label=300,           # subsample reference per cell type
    hvg=4000,                          # number of highly variable genes
    save_path_trained_models='tmp/',   # where to save models
    pretrained_scvi_path=None,         # path to pretrained scVI model (optional)
)
```
**prediction_mode choices:**
- `'retrain'` — Train all models from scratch on reference+query. Most accurate, slowest.
- `'inference'` — Load previously saved models. Requires `save_path_trained_models` from prior run.
- `'fast'` — Skip integration-heavy algorithms. Uses FAST_ALGORITHMS subset.
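Because `'inference'` depends on models saved by an earlier run, a small pre-flight check can avoid a failed start. The sketch below is an assumption-laden helper, not an omicverse function: `pick_prediction_mode` is a hypothetical name, and it treats any non-empty directory as "models are present" without assuming which files PopV writes there.

```python
from pathlib import Path


def pick_prediction_mode(model_dir):
    """Return 'inference' if `model_dir` looks populated, else 'retrain'."""
    path = Path(model_dir)
    # A non-empty directory is used as a crude proxy for saved models.
    has_models = path.is_dir() and any(path.iterdir())
    return "inference" if has_models else "retrain"
```

The result can be passed as `prediction_mode`, with `save_path_trained_models` set to the same directory.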
**Preprocessing applied automatically:**
- Filters cells with < 30 total counts
- Total-count normalization (target_sum=1e4) followed by log1p transformation
- PCA on reference (50 components)
- Stores raw counts in `layers['scvi_counts']`
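To make the first two steps concrete, here is a dependency-free sketch of the arithmetic they describe. It is not PopV's code: `preprocess_counts` is a hypothetical name, the input is a plain list of per-cell count lists rather than an AnnData object, and PCA and raw-count storage are left out. In scanpy the same operations correspond roughly to `sc.pp.filter_cells(adata, min_counts=30)`, `sc.pp.normalize_total(adata, target_sum=1e4)`, and `sc.pp.log1p(adata)`.

```python
import math


def preprocess_counts(cells, min_counts=30, target_sum=1e4):
    """Filter low-count cells, scale each cell to `target_sum`, apply log1p.

    Returns (kept_indices, normalized_rows).
    """
    kept, normalized = [], []
    for i, counts in enumerate(cells):
        total = sum(counts)
        if total < min_counts:
            continue  # drop cells with fewer than `min_counts` total counts
        scale = target_sum / total
        normalized.append([math.log1p(c * scale) for c in counts])
        kept.append(i)
    return kept, normalized


# Toy data: the first cell has only 20 counts and is filtered out
kept, normalized = preprocess_counts([[10, 5, 5], [40, 40, 20]])
```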
## Stage 2: Annotation
```python
# Run all algorithms and compute consensus
ov.popv.annotate_data(
    process_obj.adata,
    methods='all',              # or list of specific algorithms
    save_path='results/popv/',  # saves predictions.csv here
    methods_kwargs=None,        # dict of per-method overrides
)
```
### Available Algorithms (10 total)
| Algorithm | Result Key | Type | Speed |
|-----------|-----------|------|-------|
| `KNN_SCVI` | `popv_knn_on_scvi_prediction` | Deep learning + KNN | Medium |
| `SCANVI_POPV` | `popv_scanvi_prediction` | Semi-supervised DL | Medium |
| `CELLTYPIST` | `popv_celltypist_prediction` | Logistic regression | Fast |
| `ONCLASS` | `popv_onclass_prediction` | Ontology-guided | Medium |
| `Support_Vector` | `popv_svm_prediction` | SVM | Fast |
| `XGboost` | `popv_xgboost_prediction` | Gradient boosting | Fast |
| `KNN_HARMONY` | `popv_knn_harmony_prediction` | Harmony + KNN | Fast |
| `KNN_BBKNN` | `popv_knn_bbknn_prediction` | BBKNN + KNN | Fast |
| `Random_Forest` | `popv_rf_prediction` | Random forest | Fast |
| `KNN_SCANORAMA` | `popv_knn_scanorama_prediction` | Scanorama + KNN | Medium |
**Algorithm subsets:**
- `FAST_ALGORITHMS`: KNN_SCVI, SCANVI_POPV, Support_Vector, XGboost, ONCLASS, CELLTYPIST (used with `prediction_mode='fast'`)
- `CURRENT_ALGORITHMS`: All except Random_Forest and KNN_SCANORAMA, which are considered outdated
- `'all'` or `None`: Uses CURRENT_ALGORITHMS (or FAST_ALGORITHMS in fast mode)
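The selection rules above can be summarized in a short sketch. It mirrors only what this section states and is not omicverse's own resolution logic: `resolve_methods` is a hypothetical helper, and the lists are copied from the table and subsets described here.

```python
FAST_ALGORITHMS = [
    "KNN_SCVI", "SCANVI_POPV", "Support_Vector",
    "XGboost", "ONCLASS", "CELLTYPIST",
]
CURRENT_ALGORITHMS = [
    "KNN_SCVI", "SCANVI_POPV", "CELLTYPIST", "ONCLASS",
    "Support_Vector", "XGboost", "KNN_HARMONY", "KNN_BBKNN",
]
ALL_KNOWN = CURRENT_ALGORITHMS + ["Random_Forest", "KNN_SCANORAMA"]


def resolve_methods(methods=None, prediction_mode="retrain"):
    """Expand 'all'/None to a default subset and validate explicit names."""
    if methods is None or methods == "all":
        default = FAST_ALGORITHMS if prediction_mode == "fast" else CURRENT_ALGORITHMS
        return list(default)
    unknown = [m for m in methods if m not in ALL_KNOWN]
    if unknown:
        raise KeyError(f"Unknown method name(s): {unknown}. Names are case-sensitive.")
    return list(methods)
```

Validating names up front surfaces typos such as `'knn_scvi'` before any model training starts.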
### Selecting Specific Methods
```python
# Run only fast classical methods
ov.popv.annotate_data(
    process_obj.adata,
    methods=['CELLTYPIST', 'Support_Vector', 'XGboost'],
)
# Override per-method parameters
ov.popv.annotate_data(
    process_obj.adata,
    methods=['KNN_SCVI', 'SCANVI_POPV'],
    methods_kwargs={
        'KNN_SCVI': {'train_kwargs': {'max_epochs': 50}},
        'SCANVI_POPV': {'train_kwargs': {'max_epochs': 50}},
    },
)
```
## Stage 3: Consensus Results & Visualization
After `annotate_data()`, these columns appear in `adata.obs`:
| Column | Description |
|--------|-------------|
| `popv_majority_vote_prediction` | Majority vote across all methods |
| `popv_majority_vote_score` | Number of agreeing methods |
| `popv_prediction` | Ontology-aggregated consensus (if CL enabled) |
| `popv_prediction_score` | Ontology consensus score |
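Since `popv_majority_vote_score` counts agreeing methods, it can be turned into a simple confidence filter. The sketch below is illustrative only: `flag_low_confidence`, the 50% threshold, and the `'unknown'` fallback label are assumptions, and it works on plain sequences so it can be tried without an AnnData object.

```python
def flag_low_confidence(labels, scores, n_methods, min_fraction=0.5, fallback="unknown"):
    """Replace labels whose agreement fraction is below `min_fraction`."""
    if n_methods <= 0:
        raise ValueError("n_methods must be positive")
    flagged = []
    for label, score in zip(labels, scores):
        agreement = score / n_methods  # fraction of methods that agree
        flagged.append(label if agreement >= min_fraction else fallback)
    return flagged


# Hypothetical example: 8 methods were run
filtered = flag_low_confidence(["T cell", "B cell"], [7, 2], n_methods=8)
```

With real results, `labels` and `scores` would come from `adata.obs['popv_majority_vote_prediction']` and `adata.obs['popv_majority_vote_score']`, and `n_methods` from `len(adata.uns['prediction_keys'])`.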
```python
# Agreement plots: confusion matrices per method vs consensus
ov.popv.make_agreement_plots(
    process_obj.adata,
    prediction_keys=process_obj.adata.uns['prediction_keys'],
    popv_prediction_key='popv_prediction',
    save_folder='results/popv/',
    show=True,
)
# Bar plot: agreement score per cell type
ov.popv.agreement_score_bar_plot(
    process_obj.adata,
    popv_prediction_key='popv_prediction',
    save_folder='results/popv/',
)
# Bar plot: prediction score distribution
ov.popv.prediction_score_bar_plot(
    process_obj.adata,
    popv_prediction_score='popv_prediction_score',
    save_folder='results/popv/',
)
# Bar plot: cell type proportions (ref vs query)
ov.popv.celltype_ratio_bar_plot(
    process_obj.adata,
    popv_prediction='popv_prediction',
    save_folder='results/popv/',
)
```
## Stage 4: Pretrained Hub Models (Optional)
For large references (e.g., Human Cell Atlas), use pretrained models to skip training:
```python
from omicverse.popv.hub import HubModel
# Pull pretrained model from HuggingFace
model = HubModel.pull_from_huggingface_hub(
    repo_name='popv/immune_all',
    cache_dir='models/popv/',
)
# Annotate query data directly (fast mode)
result_adata = model.annotate_data(
    query_adata=query_adata,
    query_batch_key='batch',
    prediction_mode='fast',
    methods=None,  # uses model's default methods
)
```
## Critical API Reference
```python
# CORRECT: methods as list of strings matching class names
ov.popv.annotate_data(adata, methods=['KNN_SCVI', 'CELLTYPIST', 'Support_Vector'])
# WRONG: passing class objects or lowercase names
# ov.popv.annotate_data(adata, methods=[KNN_SCVI, CELLTYPIST]) # TypeError
# ov.popv.annotate_data(adata, methods=['knn_scvi']) # KeyError
# CORRECT: ref_labels_key must exist in ref_adata.obs before Process_Query
assert 'cell_type' in ref_adata.obs.columns
process_obj = ov.popv.Process_Query(..., ref_labels_key='cell_type')  # schematic: other arguments as in Stage 1
# WRONG: forgetting to set unknown_celltype_label causes NaN in voting
# process_obj = ov.popv.Process_Query(..., unknown_celltype_label=None) # NaN errors
# CORRECT: access consensus results after annotation
final_labels = process_obj.adata.obs['popv_majority_vote_prediction']
# or ontology-refined:
final_labels = process_obj.adata.obs['popv_prediction']
# WRONG: looking for results on the original query_adata
# query_adata.obs['popv_prediction'] # KeyError: results are on process_obj.adata
```
## GPU Acceleration
```python
import omicverse.popv as popv
popv.settings.accelerator = 'gpu' # for scVI/scANVI training
popv.settings.cuml = True # for KNN/SVM/RF via cuML
popv.settings.n_jobs = 10 # parallel jobs for CPU methods
```
## Troubleshooting
- **`RuntimeError: CUDA out of memory` during scVI/scANVI training**: Reduce `hvg` (try 2000), decrease `n_samples_per_label` (try 100), or switch to `prediction_mode='fast'` which uses fewer epochs.
- **CellTypist model download fails**: Set `methods_kwargs={'CELLTYPIST': {'method_kwargs': {'model': '/path/to/local/model.pkl'}}}` to use a local model file.
- **Low consensus agreement (<50% cells agree)**: Some algorithms may not suit your tissue. Exclude underperforming methods: check per-method predictions and drop outliers from the `methods` list.
- **`KeyError: 'gene_name'` — gene identifier mismatch**: Harmonize var_names between reference and query before calling `Process_Query`. Use `adata.var_names = adata.var['gene_symbols']` if ENSEMBL IDs are in var_names.
- **`ValueError: batch_key contains NaN`**: Clean batch columns before PopV. Apply the batch validation pattern from the single-preprocessing skill: `adata.obs['batch'] = adata.obs['batch'].fillna('unknown').astype('category')`.
- **`FileNotFoundError` in inference mode**: Ensure `save_path_trained_models` points to the same directory used during the original `retrain` run. Check that model files (.pt, .pkl, .joblib) exist.
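For the low-agreement case, one way to find outlier methods is to measure how often each method matches the consensus label. The sketch below is a plain-Python illustration rather than a PopV utility; `method_agreement` is a hypothetical name and the inputs are ordinary lists.

```python
def method_agreement(per_method, consensus):
    """Fraction of cells where each method's label equals the consensus label."""
    n_cells = len(consensus)
    if n_cells == 0:
        raise ValueError("consensus must contain at least one cell")
    return {
        method: sum(p == c for p, c in zip(preds, consensus)) / n_cells
        for method, preds in per_method.items()
    }


# Hypothetical labels for four cells
consensus = ["T", "T", "B", "B"]
per_method = {
    "popv_svm_prediction": ["T", "T", "B", "B"],
    "popv_xgboost_prediction": ["T", "B", "T", "B"],
}
agreement = method_agreement(per_method, consensus)
```

With real output, the keys would be the result columns listed in `adata.uns['prediction_keys']` and the consensus would be `adata.obs['popv_majority_vote_prediction']`. Methods with markedly lower fractions are candidates to drop from `methods`.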
## Dependencies
- Core: `omicverse`, `scanpy`, `anndata`, `numpy`, `pandas`
- Deep learning: `scvi-tools`, `torch` (for KNN_SCVI, SCANVI_POPV)
- Classical ML: `scikit-learn`, `xgboost` (for RF, SVM, XGBoost)
- Integration: `harmonypy`, `bbknn`, `scanorama` (for respective KNN methods)
- Annotation: `celltypist`, `OnClass` (optional per method)
- Ontology: `obonet`, `pronto` (for ontology-aware voting)
- Hub: `huggingface_hub` (for pretrained models)
## Examples
- "Annotate my PBMC query data against a reference atlas using PopV with all 10 algorithms and visualize the consensus."
- "Use a pretrained PopV hub model to quickly annotate my lung tissue scRNA-seq data."
- "Run PopV with only classical methods (SVM, XGBoost, CellTypist) to annotate my query cells without GPU."
## References
- Quick copy/paste commands: [`reference.md`](reference.md)
## Overview
This skill provides population-level cell type annotation by running up to ten classification algorithms and aggregating their outputs into a robust consensus. It supports retraining, inference from saved models, and a fast mode that skips heavy integration steps. Ontology-aware voting via the Cell Ontology is available, and pretrained hub models can be pulled to skip training on large references.
The workflow first validates inputs and prepares data by filtering low-count cells, normalizing, selecting highly variable genes, and performing PCA. It then runs a configurable set of algorithms (deep learning, semi-supervised, classical ML, and integration-based KNNs) and stores each method’s predictions. A majority-vote consensus and an optional ontology-aggregated consensus are computed and saved in the AnnData object. Visualization helpers generate agreement matrices, score distributions, and cell type ratio plots.
## FAQ
- **What `prediction_mode` should I pick?** Use `'retrain'` for highest accuracy if you can train models; choose `'inference'` to reuse previously saved models; pick `'fast'` to skip integration-heavy methods and get results quickly.
- **Why do I get low consensus agreement?** Low agreement often means some algorithms are mismatched to the tissue. Inspect per-method outputs, remove underperforming methods, or improve reference quality (more samples per label, clearer annotations).
- **How do I avoid CUDA out-of-memory errors?** Reduce `hvg` (e.g., to 2000), lower `n_samples_per_label`, switch to `'fast'` mode, or run on CPU by disabling GPU in settings.