datasets-loading skill

This skill provides ready-to-use omicverse built-in datasets and mock data generation to accelerate demos, testing, and signature analyses.

npx playbooks add skill starlitnightly/omicverse --skill datasets-loading

Review the files below or copy the command above to add this skill to your agents.

---
name: datasets-loading
title: OmicVerse built-in datasets and mock data
description: "OmicVerse built-in datasets: pbmc3k, pancreas, dentategyrus, zebrafish, immune, spatial, multiome, plus create_mock_dataset() and predefined_signatures GMT gene sets."
---

# OmicVerse Built-in Datasets

`ov.datasets` provides 30+ ready-to-use datasets with automatic download, caching, and fallback to mock data. Use these instead of manually downloading files or relying on `scanpy.datasets`.

## When to Use This Module

- **Tutorials/demos**: Load standard benchmarks (PBMC3k, Paul15, dentate gyrus) with one function call
- **Testing pipelines**: Use `create_mock_dataset()` to generate synthetic data without downloads
- **Gene set analysis**: Use `predefined_signatures` for curated GMT gene sets (cell cycle, gender, mitochondrial, tissue-specific)
- **Velocity workflows**: Load pre-formatted datasets with spliced/unspliced layers

## Dataset Catalog

### Single-Cell

| Function | Cells | Genes | Description |
|----------|-------|-------|-------------|
| `ov.datasets.pbmc3k()` | 2,700 | 32,738 | 10x PBMC3k (raw or processed) |
| `ov.datasets.pbmc8k()` | ~8,000 | — | 10x PBMC 8k |
| `ov.datasets.paul15()` | 2,730 | 3,451 | Myeloid progenitors |
| `ov.datasets.krumsiek11()` | 640 | 11 | Myeloid differentiation simulation |
| `ov.datasets.bone_marrow()` | 5,780 | 27,876 | Bone marrow hematopoietic |
| `ov.datasets.hematopoiesis()` | — | — | Processed hematopoiesis |
| `ov.datasets.hematopoiesis_raw()` | — | — | Raw hematopoiesis |
| `ov.datasets.sc_ref_Lymph_Node()` | ~10,000 | ~15,000 | Lymph node reference |
| `ov.datasets.bhattacherjee()` | ~5,000 | ~2,000 | Mouse PFC cocaine study |
| `ov.datasets.human_tfs()` | — | — | Human TF list (DataFrame) |

### RNA Velocity & Trajectories

| Function | Cells | Genes | Description |
|----------|-------|-------|-------------|
| `ov.datasets.dentate_gyrus()` | 18,213 | 27,998 | Dentate gyrus (loom) |
| `ov.datasets.dentate_gyrus_scvelo()` | 2,930 | 13,913 | DG subset from scVelo |
| `ov.datasets.zebrafish()` | 4,181 | 16,940 | Zebrafish developmental |
| `ov.datasets.pancreatic_endocrinogenesis()` | — | — | Pancreatic epithelial |
| `ov.datasets.pancreas_cellrank()` | 2,930 | 13,913 | Pancreas cellrank benchmark |
| `ov.datasets.scnt_seq_neuron_splicing()` | 13,476 | 44,021 | scNT-seq neuron splicing |
| `ov.datasets.scnt_seq_neuron_labeling()` | 3,060 | 24,078 | scNT-seq neuron labeling |
| `ov.datasets.sceu_seq_rpe1()` | ~2,930 | ~13,913 | scEU-seq RPE1 |
| `ov.datasets.sceu_seq_organoid()` | 3,831 | 9,157 | scEU-seq organoid |
| `ov.datasets.haber()` | 7,216 | 27,998 | Intestinal epithelium |
| `ov.datasets.chromaffin()` | — | — | Chromaffin cell lineage |
| `ov.datasets.hg_forebrain_glutamatergic()` | 1,720 | 32,738 | Human forebrain |
| `ov.datasets.toggleswitch()` | 200 | 2 | Two-gene simulation |

### Spatial & Multiome

| Function | Description |
|----------|-------------|
| `ov.datasets.seqfish()` | SeqFISH spatial transcriptomics |
| `ov.datasets.multi_brain_5k()` | 10x E18 mouse brain multiome (MuData) |

### Bulk RNA-seq & Deconvolution

| Function | Description |
|----------|-------------|
| `ov.datasets.burczynski06()` | UC/CD PBMC bulk (127 samples) |
| `ov.datasets.moignard15()` | Embryo hematopoiesis qRT-PCR |
| `ov.datasets.decov_bulk_covid_bulk()` | COVID-19 PBMC bulk |
| `ov.datasets.decov_bulk_covid_single()` | COVID-19 PBMC single-cell ref |

### Synthetic

| Function | Description |
|----------|-------------|
| `ov.datasets.create_mock_dataset()` | Configurable synthetic scRNA-seq |
| `ov.datasets.blobs()` | Gaussian blob clusters |

## Mock Data Generation

Use `create_mock_dataset()` when you need data without network access or for pipeline testing:

```python
import omicverse as ov

# Basic mock dataset
adata = ov.datasets.create_mock_dataset(
    n_cells=2000,
    n_genes=1500,
    n_cell_types=6,
    with_clustering=False,
    random_state=42,
)
# adata.obs: cell_type, sample_id, condition, tissue
# adata.var: gene_symbols, highly_variable

# With full preprocessing (normalized, PCA, UMAP, leiden)
adata = ov.datasets.create_mock_dataset(
    n_cells=5000,
    n_genes=3000,
    n_cell_types=10,
    with_clustering=True,
)
```

**Features:**
- Negative binomial expression distribution
- Cell-type-specific marker genes (2-5x expression multiplier)
- Gene names: `Gene_0001`, `Gene_0002`, ...
- `with_clustering=True` adds: normalization, HVG, scaling, PCA, UMAP, leiden
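
To make the listed features concrete, here is a minimal, standard-library-only sketch of negative binomial counts with cell-type marker genes. It is not the `create_mock_dataset()` implementation: `sample_poisson`, `sample_negative_binomial`, `mock_counts`, the dispersion value, the baseline mean range, and the "one disjoint block of markers per cell type" layout are all assumptions made for illustration. Only the 2-5x multiplier and the `Gene_0001` naming come from the list above.

```python
import math
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_negative_binomial(mean: float, dispersion: float, rng: random.Random) -> int:
    """Gamma-Poisson mixture: rate ~ Gamma(shape=dispersion, scale=mean/dispersion)."""
    rate = rng.gammavariate(dispersion, mean / dispersion)
    return sample_poisson(rate, rng)


def mock_counts(n_cells, n_genes, n_cell_types, markers_per_type=3, seed=42):
    """Return (counts, labels, gene_names) as plain Python lists (hypothetical helper)."""
    rng = random.Random(seed)
    labels = [rng.randrange(n_cell_types) for _ in range(n_cells)]
    base_mean = [rng.uniform(0.5, 2.0) for _ in range(n_genes)]  # assumed baseline range

    # Assumption: cell type t owns the marker genes in block t.
    markers = {
        t: set(range(t * markers_per_type, (t + 1) * markers_per_type))
        for t in range(n_cell_types)
    }
    # One 2-5x multiplier per marker gene, as described in the feature list.
    fold = {g: rng.uniform(2.0, 5.0) for genes in markers.values() for g in genes}

    counts = []
    for label in labels:
        row = []
        for g in range(n_genes):
            mean = base_mean[g] * (fold[g] if g in markers[label] else 1.0)
            row.append(sample_negative_binomial(mean, dispersion=2.0, rng=rng))
        counts.append(row)

    gene_names = [f"Gene_{i + 1:04d}" for i in range(n_genes)]
    return counts, labels, gene_names
```

In practice a vectorized sampler such as NumPy's `Generator.negative_binomial` would be the usual choice; the pure-Python version above only keeps the sketch dependency-free.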

## Predefined Gene Set Signatures

Pre-loaded GMT files for common scoring tasks:

```python
from omicverse.datasets import predefined_signatures, load_signatures_from_file

# Available signature keys
print(list(predefined_signatures.keys()))
# ['cell_cycle_human', 'cell_cycle_mouse', 'gender_human', 'gender_mouse',
#  'mitochondrial_genes_human', 'mitochondrial_genes_mouse',
#  'ribosomal_genes_human', 'ribosomal_genes_mouse',
#  'apoptosis_human', 'apoptosis_mouse',
#  'human_lung', 'mouse_lung', 'mouse_brain', 'mouse_liver', 'emt_human']

# Load a signature → dict[str, list[str]]
cell_cycle = load_signatures_from_file(predefined_signatures['cell_cycle_human'])
# {'S_genes': ['MCM5', 'PCNA', ...], 'G2M_genes': ['HMGB2', 'CDK1', ...]}

# Use with scoring (assumes `adata` is an AnnData you have already loaded and normalized)
import scanpy as sc
sc.tl.score_genes_cell_cycle(adata, s_genes=cell_cycle['S_genes'],
                              g2m_genes=cell_cycle['G2M_genes'])
```
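
For readers unfamiliar with the GMT format, the following sketch shows how a tab-separated GMT file maps onto the `dict[str, list[str]]` shape described above. Each GMT line holds a set name, a description, and then the member genes. `parse_gmt` is a hypothetical helper written for illustration, not the code behind `load_signatures_from_file`, and the toy file contains only the four genes quoted in the example output.

```python
import tempfile
from pathlib import Path


def parse_gmt(path):
    """Parse a GMT file into {set_name: [gene, ...]} (illustrative helper)."""
    signatures = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            name, _description, *genes = fields
            signatures[name] = [gene for gene in genes if gene]
    return signatures


# Write a two-line toy GMT file and parse it back.
with tempfile.TemporaryDirectory() as tmp:
    gmt_path = Path(tmp) / "toy_cell_cycle.gmt"
    gmt_path.write_text(
        "S_genes\tna\tMCM5\tPCNA\nG2M_genes\tna\tHMGB2\tCDK1\n",
        encoding="utf-8",
    )
    toy = parse_gmt(gmt_path)

print(toy)
```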

## Critical API Reference

```python
import omicverse as ov

# CORRECT: use ov.datasets for standard benchmarks
adata = ov.datasets.pbmc3k()

# WRONG: manually downloading what's already built-in
# import urllib.request
# urllib.request.urlretrieve('https://...', 'pbmc3k.h5ad')  # unnecessary!
# adata = ov.read('pbmc3k.h5ad')

# CORRECT: pbmc3k(processed=True) for pre-processed version
adata = ov.datasets.pbmc3k(processed=True)

# WRONG: loading raw then manually preprocessing for a demo
# adata = ov.datasets.pbmc3k()
# sc.pp.normalize_total(adata)  # unnecessary if you just need a quick demo

# CORRECT: mock data for testing (no network needed)
adata = ov.datasets.create_mock_dataset(n_cells=500, n_genes=200)

# WRONG: creating synthetic data manually with numpy
# X = np.random.poisson(1, (500, 200))  # missing metadata, layers, etc.
```

## Caching Behavior

- **Default cache directory:** `./data/` (relative to working directory)
- **Skip if exists:** All functions check for existing files before downloading
- **Mirror fallback:** Stanford and Figshare mirrors for reliability
- **Mock fallback:** Most functions generate mock data if download fails (network issues)
- **`var_names_make_unique()`** called automatically after loading
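
The cache, mirror, and mock steps above can be read as one control flow. The sketch below is an assumption about how such a loader could be structured, not OmicVerse's actual code: `load_with_fallback` and its `download`, `read`, and `make_mock` callables are hypothetical names, and the `bad://` and `ok://` URLs are placeholders.

```python
import tempfile
from pathlib import Path


def load_with_fallback(filename, mirrors, download, read, make_mock, cache_dir="./data"):
    """Use the cached file if present, else try each mirror, else return mock data."""
    cache_path = Path(cache_dir) / filename
    if not cache_path.exists():
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        for url in mirrors:
            try:
                download(url, cache_path)
                break  # first mirror that succeeds wins
            except OSError:
                cache_path.unlink(missing_ok=True)  # drop any partial file
        else:
            return make_mock()  # every mirror failed (or none were given)
    return read(cache_path)


# Demo with a fake downloader so no network access is needed.
attempted = []


def fake_download(url, dest):
    attempted.append(url)
    if url.startswith("bad://"):
        raise OSError("mirror unreachable")
    dest.write_text("payload")


with tempfile.TemporaryDirectory() as tmp:
    args = dict(download=fake_download, read=Path.read_text,
                make_mock=lambda: "mock", cache_dir=tmp)
    first = load_with_fallback("demo.txt", ["bad://a", "ok://b"], **args)
    cached = load_with_fallback("demo.txt", ["bad://a"], **args)   # served from cache
    mocked = load_with_fallback("other.txt", ["bad://a"], **args)  # falls back to mock
```

Under this design, a file placed manually in the cache directory is picked up by the same `exists()` check, which is why dropping a download into `./data/` works as a workaround.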

## Troubleshooting

- **Download timeout / 403 error**: Some datasets use `download_data_requests()` with custom headers. If persistent, manually download the file to `./data/` with the expected filename and the function will find it.
- **`ModuleNotFoundError: No module named 'muon'`** when calling `multi_brain_5k()`: Install muon: `pip install muon`. This function returns MuData, not AnnData.
- **Mock dataset has no `.raw` or `layers['counts']`**: Add manually after creation: `ov.utils.store_layers(adata, layers='counts')` and `adata.raw = adata`.
- **`load_signatures_from_file` returns empty dict**: Verify the GMT file path. Use `predefined_signatures['key']` which resolves to the bundled file via `importlib.resources`.
- **Dentate gyrus loom download is slow**: The loom file is large (~200MB). Use `ov.datasets.dentate_gyrus_scvelo()` for the smaller pre-processed subset (2,930 cells).

## Dependencies
- Core: `omicverse`, `scanpy`, `anndata`, `numpy`, `pandas`
- Downloads: `tqdm`, `requests` (for mirror fallback)
- Multiome: `muon` (only for `multi_brain_5k()`)
- Signatures: `importlib.resources` (stdlib)
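
Because `muon` is only needed for `multi_brain_5k()`, it can help to check for optional packages before calling a loader. This is a small sketch using the standard library's `importlib.util.find_spec`; `missing_modules` is a hypothetical helper name, and it is meant for top-level package names only.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example: decide whether the multiome loader is usable before calling it.
missing = missing_modules(["muon"])
if missing:
    print(f"Install before calling multi_brain_5k(): {', '.join(missing)}")
```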

## Examples
- "Load the PBMC3k dataset and run the standard preprocessing pipeline."
- "Create a mock dataset with 5000 cells and 8 cell types for testing my clustering workflow."
- "Load cell cycle gene signatures and score my adata for S and G2M phase genes."

## References
- Quick copy/paste commands: [`reference.md`](reference.md)

Overview

This skill exposes OmicVerse built-in datasets and utilities for bulk, single-cell, spatial, and multiome RNA-seq workflows. It provides one-call access to standard benchmarks (pbmc3k, pancreas, dentategyrus, zebrafish, immune, spatial, multiome), a configurable mock data generator, and bundled GMT gene-set signatures. Downloads are cached and fall back to mirrors, and most loaders generate mock data if the download still fails.

How this skill works

Call ov.datasets.<dataset_name>() to download and load curated AnnData/MuData objects with automatic caching and var-name deduplication. Use ov.datasets.create_mock_dataset() to synthesize negative-binomial count matrices with cell-type markers, metadata, and optional preprocessing (normalization, HVG, PCA, UMAP, leiden). Load gene sets via predefined_signatures and load_signatures_from_file() to get dicts of gene lists for scoring.
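
To illustrate what var-name deduplication means, here is a simplified sketch in the style of anndata's `var_names_make_unique()`, where the first occurrence keeps its name and later duplicates receive `-1`, `-2`, and so on. `make_names_unique` is a hypothetical helper and not the library's implementation.

```python
def make_names_unique(names, join="-"):
    """Suffix repeated names so every entry is unique; first occurrences are unchanged."""
    taken = set(names)
    next_suffix = {}
    result = []
    for name in names:
        if name not in next_suffix:
            next_suffix[name] = 1  # first occurrence keeps its original name
            result.append(name)
            continue
        candidate = f"{name}{join}{next_suffix[name]}"
        while candidate in taken:  # avoid colliding with a name that already exists
            next_suffix[name] += 1
            candidate = f"{name}{join}{next_suffix[name]}"
        next_suffix[name] += 1
        taken.add(candidate)
        result.append(candidate)
    return result


deduplicated = make_names_unique(["A", "B", "A", "A"])
```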

When to use it

- Quick demos or tutorials where you need a reproducible benchmark dataset (pbmc3k, paul15, dentate gyrus).
- Unit tests and CI for pipelines when network access is restricted: use create_mock_dataset().
- Trajectory and RNA velocity workflows that require pre-formatted spliced/unspliced layers (dentate_gyrus, zebrafish).
- Spatial or multiome analyses using seqFISH and multi_brain_5k MuData.
- Gene set scoring tasks using bundled GMTs (cell cycle, mitochondrial, tissue-specific).

Best practices

- Prefer ov.datasets functions over manual downloads; caching avoids repeated transfers and ensures consistent filenames.
- Use processed=True when you want ready-to-use preprocessed data for demos; call the loader without it when you need raw counts.
- For reproducible tests, set random_state in create_mock_dataset(); add with_clustering=True when you also want the preprocessing steps (normalization, HVG, scaling, PCA, UMAP, leiden) applied.
- Check that muon is installed before calling multi_brain_5k(), and expect MuData rather than AnnData.
- If a download fails, place the expected file in ./data/ with the expected filename so the loader will find it.

Example use cases

- Load pbmc3k(processed=True) and run a tutorial pipeline without writing download code.
- Generate a 5k-cell mock dataset with 10 cell types to benchmark clustering and marker detection.
- Load dentate_gyrus_scvelo() for a lightweight RNA velocity demo instead of the full loom file.
- Score cell cycle phase using predefined_signatures['cell_cycle_human'] and scanpy.tl.score_genes_cell_cycle.
- Use the seqfish() dataset to test spatial plotting and downstream spatial analyses.

FAQ

What happens if a dataset download fails?

Functions check the cache, try mirrors, and will return mock data for many datasets if downloads fail; you can also manually place the file in ./data/.

How do I get gene signatures for scoring?

Use predefined_signatures to access bundled GMTs and load_signatures_from_file() to load them as dicts of gene lists (e.g., S_genes, G2M_genes).