fm-foundation-models skill

/.claude/skills/fm-foundation-models

This skill helps you run foundation model workflows for single-cell analysis, from embedding to annotation and integration across 22 models with a unified API.

npx playbooks add skill starlitnightly/omicverse --skill fm-foundation-models

Review the files below or copy the command above to add this skill to your agents.

SKILL.md

---
name: foundation-model-analysis
title: Foundation model analysis
description: "Foundation model workflows: scGPT, Geneformer, UCE, CellPLM cell embedding, annotation, integration via ov.fm unified API. 22 models."
---

# Foundation Model Analysis

Use this skill when a user wants to generate cell embeddings, annotate cell types, integrate batches, or predict perturbation effects using single-cell foundation models. The `ov.fm` module provides a unified 6-step API that works identically across all 22 supported models.

## Model Selection Guide

Pick a model based on your task, species, and hardware. The 5 skill-ready models have full adapter support:

| Model | Tasks | Species | Gene IDs | Min VRAM | CPU? | Best when |
|-------|-------|---------|----------|----------|------|-----------|
| **scGPT** | embed, integrate | human, mouse | symbol | 8 GB | Yes | General RNA, multi-modal (RNA+ATAC+Spatial) |
| **Geneformer** | embed, integrate | human | **ensembl** | 4 GB | **Yes** | Ensembl IDs, CPU-only environments, network biology |
| **UCE** | embed, integrate | 7 species | symbol | 16 GB | No | Cross-species (zebrafish, macaque, pig, frog, lemur) |
| **scFoundation** | embed, integrate | human | custom | 16 GB | No | xTrimoGene architecture, perturbation tasks |
| **CellPLM** | embed, integrate | human | symbol | 8 GB | Yes | Fastest inference (batch_size=128), cell-centric |

12 additional partial models (scBERT, GeneCompass, Nicheformer, scMulan, tGPT, CellFM, scCello, scPrint, AiDocell, Pulsar, Atacformer, scPlantLLM) and 5+ experimental models are also registered.

### Quick decision tree
- Cross-species → **UCE**
- Ensembl gene IDs or CPU-only → **Geneformer**
- ATAC-seq data → **Atacformer** (partial) or **scGPT**
- Multi-omics (RNA+ATAC+Protein) → **scMulan** (partial)
- Spatial transcriptomics → **Nicheformer** (partial) or **scGPT**
- Fastest throughput on large datasets → **CellPLM**
- General RNA, no special needs → **scGPT**
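
The decision tree above can be written as a small function. This is only a sketch: `pick_model` and its parameters are hypothetical names, not part of `ov.fm`, and evaluating the branches in the listed order is an assumption about priority.

```python
def pick_model(cross_species=False, ensembl_ids=False, cpu_only=False,
               modality="rna", need_throughput=False):
    """Return a model name by walking the quick decision tree top to bottom."""
    if cross_species:
        return "UCE"
    if ensembl_ids or cpu_only:
        return "Geneformer"
    if modality == "atac":
        return "Atacformer"   # partial support; scGPT is the listed alternative
    if modality == "multiomics":
        return "scMulan"      # partial support
    if modality == "spatial":
        return "Nicheformer"  # partial support; scGPT is the listed alternative
    if need_throughput:
        return "CellPLM"
    return "scGPT"            # general RNA, no special needs


print(pick_model(cross_species=True))   # UCE
print(pick_model(modality="spatial"))   # Nicheformer
print(pick_model())                     # scGPT
```

In practice `ov.fm.select_model()` makes this choice from the profiled data; the function only makes the rules of thumb explicit.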

## 6-Step Unified Workflow

Every FM analysis follows the same pipeline regardless of model choice:

### Step 1: Discover available models
```python
import omicverse as ov
models = ov.fm.list_models(task="embed", skill_ready_only=True)
# Returns: {"count": int, "models": [{name, version, tasks, species, ...}]}
```

### Step 2: Profile your data
```python
profile = ov.fm.profile_data("pbmc3k.h5ad")
# Auto-detects: species, gene_scheme (symbol/ensembl), modality, batch/celltype columns
# Returns model_compatibility per model (compatible: bool, issues: [], recommendations: [])
```
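
To act on the profile, it can help to split models into compatible and blocked ones. The sketch below assumes `model_compatibility` is a dict keyed by model name whose values carry the `compatible`, `issues`, and `recommendations` fields listed above; the exact shape is not confirmed here, and `compatible_models` is a hypothetical helper. The example profile is hand-written, not real output.

```python
def compatible_models(profile):
    """Return (sorted compatible model names, {blocked model: issues})."""
    compat = profile.get("model_compatibility", {})
    ok = sorted(name for name, info in compat.items() if info.get("compatible"))
    blocked = {name: info.get("issues", [])
               for name, info in compat.items() if not info.get("compatible")}
    return ok, blocked


# Hand-written stand-in for a profile_data() result (assumed shape)
example_profile = {
    "species": "human",
    "gene_scheme": "ensembl",
    "model_compatibility": {
        "geneformer": {"compatible": True, "issues": [], "recommendations": []},
        "scgpt": {"compatible": False,
                  "issues": ["gene ID mismatch"],
                  "recommendations": ["convert Ensembl IDs to symbols"]},
    },
}

ok, blocked = compatible_models(example_profile)
print(ok)       # ['geneformer']
print(blocked)  # {'scgpt': ['gene ID mismatch']}
```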

### Step 3: Select best model
```python
selection = ov.fm.select_model(
    "pbmc3k.h5ad",
    task="embed",
    prefer_zero_shot=True,   # No labeled data needed
    max_vram_gb=8,            # Hardware constraint
)
model_name = selection['recommended']['name']
# Returns: recommended model with rationale, fallbacks list, preprocessing_notes
```
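
If the recommended model cannot be used (for example, its checkpoint is unavailable), the fallbacks list gives the next candidates. The sketch below assumes each fallback is either a dict with a `name` field or a plain string; that shape is an assumption, and `candidate_models` is a hypothetical helper.

```python
def candidate_models(selection):
    """Return model names to try, recommended first, then fallbacks in order."""
    names = []
    recommended = selection.get("recommended")
    if recommended:
        names.append(recommended["name"])
    for fallback in selection.get("fallbacks", []):
        name = fallback["name"] if isinstance(fallback, dict) else fallback
        if name not in names:  # skip duplicates, keep first position
            names.append(name)
    return names


# Hand-written stand-in for a select_model() result (assumed shape)
example_selection = {
    "recommended": {"name": "scgpt", "rationale": "fits 8 GB VRAM"},
    "fallbacks": [{"name": "cellplm"}, {"name": "geneformer"}],
}
print(candidate_models(example_selection))  # ['scgpt', 'cellplm', 'geneformer']
```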

### Step 4: Validate compatibility
```python
validation = ov.fm.preprocess_validate("pbmc3k.h5ad", model_name, "embed")
# Returns: status ("ready"/"needs_preprocessing"), diagnostics, auto_fixes suggestions
```

### Step 5: Execute
```python
result = ov.fm.run(
    task="embed",
    model_name=model_name,
    adata_path="pbmc3k.h5ad",
    output_path="pbmc3k_embedded.h5ad",
    device="auto",          # auto-detects cuda/mps/cpu
    batch_size=64,          # CellPLM can use 128
    batch_key=None,         # For integration tasks
    label_key=None,         # For annotation tasks
    checkpoint_dir=None,    # Auto-resolved from env vars or cache
)
# Returns: {output_path, output_keys, statistics} or {error: str}
```
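
Because a failed run returns a dict with an `error` field rather than the usual keys, it is worth checking the result before reading `output_keys`. A minimal sketch, assuming `output_keys` is a list of `adata.obsm` key names; `embedding_keys` is a hypothetical helper:

```python
def embedding_keys(result):
    """Return the obsm keys written by a run, or raise if the run failed."""
    if "error" in result:
        raise RuntimeError(f"FM run failed: {result['error']}")
    return list(result.get("output_keys", []))


# Hand-written stand-in for a successful ov.fm.run() result (assumed shape)
ok_result = {"output_path": "pbmc3k_embedded.h5ad",
             "output_keys": ["X_scGPT"],
             "statistics": {}}
print(embedding_keys(ok_result))  # ['X_scGPT']
```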

### Step 6: Interpret results
```python
metrics = ov.fm.interpret_results(
    "pbmc3k_embedded.h5ad",
    task="embed",
    generate_umap=True,
    color_by=["cell_type"],
)
# Returns: metrics (silhouette scores), visualizations, embedding_keys
```

## Gene ID Resolution

Gene ID mismatch is the most common failure mode. The `profile_data()` function detects your data's gene scheme automatically.

| Model | Expected IDs | Example | Auto-convert? |
|-------|-------------|---------|---------------|
| scGPT | HGNC symbols | TP53, CD4 | No — data must use symbols |
| Geneformer | Ensembl IDs | ENSG00000141510 | No — data must use Ensembl |
| UCE | HGNC symbols | TP53, CD4 | No |
| scFoundation | Custom vocab | 19,264 gene vocab | Adapter handles mapping |

If `profile_data()` reports `gene_scheme: "ensembl"` but you selected scGPT (which needs symbols), either convert gene IDs first or switch to Geneformer.
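
For a quick check outside `ov.fm`, the gene scheme can be guessed from the gene names themselves. The sketch below is a simple heuristic, not the logic `profile_data()` uses: it treats names matching the Ensembl gene ID pattern (such as `ENSG00000141510`) as Ensembl and everything else as symbols. `guess_gene_scheme` and the 0.8 threshold are assumptions.

```python
import re

# Ensembl gene IDs: "ENS", optional species prefix, "G", 11 digits, optional version
ENSEMBL_GENE = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def guess_gene_scheme(var_names, threshold=0.8):
    """Classify gene names as 'ensembl', 'symbol', 'mixed', or 'unknown'."""
    names = list(var_names)
    if not names:
        return "unknown"
    frac = sum(bool(ENSEMBL_GENE.match(n)) for n in names) / len(names)
    if frac >= threshold:
        return "ensembl"
    if frac <= 1 - threshold:
        return "symbol"
    return "mixed"


print(guess_gene_scheme(["ENSG00000141510", "ENSMUSG00000059552"]))  # ensembl
print(guess_gene_scheme(["TP53", "CD4"]))                            # symbol
```

With an AnnData object, the call would be `guess_gene_scheme(adata.var_names)`.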

## Hardware Requirements

| Model | GPU Required | Min VRAM | CPU Fallback | Default Batch Size |
|-------|-------------|----------|--------------|-------------------|
| scGPT | Recommended | 8 GB | Yes (slow) | 64 |
| Geneformer | No | 4 GB | **Yes (full speed)** | 64 |
| UCE | Yes | 16 GB | No | 64 |
| scFoundation | Yes | 16 GB | No | 64 |
| CellPLM | Recommended | 8 GB | Yes | 128 |

Device auto-detection priority: CUDA → MPS (Apple Silicon) → CPU.
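
The same priority order can be reproduced with PyTorch when you want to know in advance which device `device="auto"` is likely to pick. This is a sketch of the order stated above, not the `ov.fm` implementation; `pick_device` is a hypothetical helper, and it falls back to CPU when PyTorch is not installed.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' following the CUDA -> MPS -> CPU order."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available, nothing to accelerate with
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing on older PyTorch builds
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```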

Checkpoint resolution priority:
1. `checkpoint_dir` parameter in `ov.fm.run()`
2. Model-specific env var: `OV_FM_CHECKPOINT_DIR_SCGPT`
3. Base env var + subfolder: `OV_FM_CHECKPOINT_DIR/scgpt/`
4. OmicVerse model cache (auto-download if available)
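
The first three steps of this order can be sketched in plain Python. The environment variable names come from the list above; `resolve_checkpoint_dir` is a hypothetical helper, and building the model-specific variable by upper-casing the model name is an assumption generalized from the `SCGPT` example. Returning `None` stands for step 4, where the model cache takes over.

```python
import os


def resolve_checkpoint_dir(model_name, checkpoint_dir=None, env=None):
    """Resolve a checkpoint directory: argument, model env var, base env var."""
    env = os.environ if env is None else env
    if checkpoint_dir:                       # 1. explicit parameter wins
        return checkpoint_dir
    specific = env.get(f"OV_FM_CHECKPOINT_DIR_{model_name.upper()}")
    if specific:                             # 2. model-specific env var
        return specific
    base = env.get("OV_FM_CHECKPOINT_DIR")
    if base:                                 # 3. base env var + subfolder
        return os.path.join(base, model_name.lower())
    return None                              # 4. leave it to the model cache


print(resolve_checkpoint_dir("scgpt", env={"OV_FM_CHECKPOINT_DIR": "/data/weights"}))
```

Passing `env` explicitly keeps the function testable without touching the real environment.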

## Critical API Reference

### TaskType values
`"embed"`, `"annotate"`, `"integrate"`, `"perturb"`, `"spatial"`, `"drug_response"`

### Output keys are model-specific
```python
# After running, embeddings are stored in adata.obsm with model-specific keys:
# scGPT → adata.obsm['X_scGPT']
# Geneformer → adata.obsm['X_geneformer']
# UCE → adata.obsm['X_uce']
# Check result['output_keys'] for the exact keys written
```

### Provenance tracking
```python
# Every run writes provenance to adata.uns['fm']:
# {"runs_json": [...], "latest_json": "..."}
# Contains: model_name, version, task, timestamp, output_keys
```
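
To read the provenance back, a small helper can normalize the run records. The `_json` suffix suggests the entries may be stored as JSON strings, but that is an assumption, so the sketch accepts both strings and already-decoded dicts. `read_fm_runs` is a hypothetical helper and the example `uns` is hand-written.

```python
import json


def read_fm_runs(uns):
    """Return the list of run records stored under uns['fm']['runs_json']."""
    runs = uns.get("fm", {}).get("runs_json", [])
    if isinstance(runs, str):      # whole list serialized as one JSON string
        runs = json.loads(runs)
    # individual entries may also be JSON strings
    return [json.loads(r) if isinstance(r, str) else r for r in runs]


# Hand-written stand-in for adata.uns after one run (assumed shape)
example_uns = {"fm": {"runs_json": [
    json.dumps({"model_name": "scgpt", "task": "embed",
                "output_keys": ["X_scGPT"]}),
]}}

for run in read_fm_runs(example_uns):
    print(run["model_name"], run["task"], run["output_keys"])
```

With a loaded AnnData object, the call would be `read_fm_runs(adata.uns)`.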

## Defensive Validation Patterns

```python
import os

import omicverse as ov

adata_path = "pbmc3k.h5ad"  # path to your input file

# Before any FM workflow: verify input file exists
assert os.path.isfile(adata_path), f"Input file not found: {adata_path}"

# After profile_data: check species was detected
profile = ov.fm.profile_data(adata_path)
assert 'unknown' not in profile.get('species', 'unknown').lower(), \
    "Species not detected. Check gene names: use HGNC symbols (human) or standard names (mouse)."

# After select_model: verify a model was recommended
selection = ov.fm.select_model(adata_path, task="embed")
assert 'recommended' in selection and selection['recommended'], \
    "No compatible model found. Check species, gene IDs, and hardware constraints."
model_name = selection['recommended']['name']

# Before run: flag models that need a checkpoint download (local weights)
desc = ov.fm.describe_model(model_name)
if desc.get('resources', {}).get('checkpoint_url'):
    print("Model may need checkpoint download. Check OV_FM_CHECKPOINT_DIR env var.")
```

## Troubleshooting

- **`Gene ID mismatch` warning in profile**: Your data uses Ensembl IDs but the selected model expects symbols (or vice versa). If `adata.var` has a symbol column, convert with `adata.var_names = adata.var['gene_symbols']` (the column name depends on your dataset), or switch to a compatible model.
- **`CUDA out of memory`**: Reduce `batch_size` (try 32 or 16). UCE and scFoundation need 16 GB of VRAM, so also ensure no other GPU processes are running.
- **`Model not installed` or `ImportError`**: Some models need separate packages. Install via `pip install scgpt` / `pip install geneformer` or use conda isolation (`OV_FM_DISABLE_CONDA_SUBPROCESS=0`).
- **`Species unsupported`**: Most models only support human. For mouse, use scGPT or UCE. For zebrafish/pig/frog, only UCE works.
- **Empty embeddings (all zeros)**: Input data may have constant or near-zero expression. Filter genes with scanpy (`import scanpy as sc`) using `sc.pp.filter_genes(adata, min_cells=10)` before running.
- **`Checkpoint not found`**: Set the environment variable `OV_FM_CHECKPOINT_DIR_<MODEL>=/path/to/weights` or pass `checkpoint_dir` directly to `ov.fm.run()`.
- **`device='auto'` picks CPU despite GPU available**: Check `torch.cuda.is_available()`. If False, verify CUDA drivers and PyTorch CUDA build.
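
The out-of-memory advice above can be automated by halving `batch_size` until the run succeeds. This is a sketch under two assumptions: that a failed run reports the problem in `result['error']` with the text "out of memory", and that `run_fn` is any callable accepting `batch_size` (for example `functools.partial(ov.fm.run, ...)` with the other arguments bound). `run_with_backoff` is a hypothetical helper, demonstrated here with a fake runner.

```python
def run_with_backoff(run_fn, batch_size=64, min_batch_size=8):
    """Call run_fn, halving batch_size after each out-of-memory error."""
    while True:
        result = run_fn(batch_size=batch_size)
        error = str(result.get("error", ""))
        out_of_memory = "out of memory" in error.lower()
        if not out_of_memory or batch_size <= min_batch_size:
            return result, batch_size  # success, other error, or gave up
        batch_size = max(min_batch_size, batch_size // 2)


def fake_run(batch_size):
    """Stand-in for a real run: pretends anything above 16 exhausts VRAM."""
    if batch_size > 16:
        return {"error": "CUDA out of memory"}
    return {"output_keys": ["X_scGPT"]}


result, used = run_with_backoff(fake_run, batch_size=64)
print(used, result)  # 16 {'output_keys': ['X_scGPT']}
```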

## Examples
- "Generate scGPT embeddings for my PBMC dataset and visualize on UMAP."
- "Which foundation model works best for my mouse brain scRNA-seq data?"
- "Embed my ATAC-seq data using a foundation model — I only have 8GB VRAM."
- "Profile my h5ad file and tell me which models are compatible."

## References
- Quick copy/paste commands: [`reference.md`](reference.md)
- FM API source: `omicverse/fm/api.py`
- Model registry: `omicverse/fm/registry.py`

Overview

This skill provides a unified API to run 22 foundation models for single-cell analysis, including spatial and ATAC-seq data. It streamlines common workflows: generating cell embeddings, annotating cell types, integrating batches, and predicting perturbation or drug responses. The API uses a consistent 6-step pipeline so the same calls work across models and hardware constraints.

How this skill works

The skill inspects your input AnnData or h5ad file and auto-detects species, gene ID scheme (symbol vs Ensembl), modality, and batch/label columns. It lists compatible models, recommends the best model given task and hardware, validates preprocessing requirements, runs the selected model (embedding/annotation/integration/perturbation), and writes outputs and provenance into the AnnData object. Device selection, checkpoint resolution, and output key naming are handled automatically with clear diagnostics and auto-fix suggestions.

When to use it

- You need high-quality cell embeddings for downstream clustering or visualization.
- You want automated cell-type annotation using a foundation model.
- You must integrate multiple batches or modalities with minimal manual tuning.
- You want to predict perturbation or drug-response effects at single-cell resolution.
- You need cross-species embedding or analysis for non-human samples.

Best practices

- Run `profile_data()` first to detect gene ID scheme and species before selecting a model.
- Match gene IDs to model expectations (symbols vs Ensembl) or choose Geneformer for Ensembl IDs.
- Check device and min VRAM reported by model descriptions; reduce `batch_size` if encountering OOM.
- Keep provenance: every run writes metadata to `adata.uns['fm']` for reproducibility.
- Filter low-quality genes/cells (e.g., filter genes seen in <10 cells) to avoid empty embeddings.

Example use cases

- Generate scGPT or CellPLM embeddings for PBMC scRNA-seq and visualize on UMAP.
- Profile an h5ad and automatically recommend a model that fits a CPU-only environment (e.g., Geneformer).
- Integrate multiple batches from a multi-center study using scGPT or UCE and validate silhouette scores.
- Predict single-cell response to a perturbation with scFoundation when a matching adapter is available.
- Embed cross-species single-cell data (zebrafish, macaque, pig) using UCE for comparative analysis.

FAQ

What if profile_data reports gene ID mismatch?

Either convert gene IDs to the required scheme or select a model that expects your IDs (e.g., Geneformer for Ensembl). The profile output lists recommended fixes.

How do I avoid CUDA out-of-memory errors?

Lower batch_size (try 32 or 16), ensure no other GPU processes run, or select a model with lower VRAM requirements or CPU fallback (Geneformer or CellPLM).

Where are embeddings stored after a run?

Embeddings are written to adata.obsm with model-specific keys (e.g., 'X_scGPT', 'X_geneformer', 'X_uce'). Check result['output_keys'] for exact names.