home / skills / starlitnightly / omicverse / single-cellfate-analysis
This skill identifies pseudotime-associated genes driving lineage decisions by adaptive ridge regression and Mellon-based density scoring.
npx playbooks add skill starlitnightly/omicverse --skill single-cellfate-analysisReview the files below or copy the command above to add this skill to your agents.
---
name: cellfate-pseudotime-gene-analysis
title: CellFateGenie pseudotime gene analysis
description: "CellFateGenie: Adaptive Threshold Regression for pseudotime-associated gene discovery, Mellon density, lineage scoring via ov.single.Fate."
---
# CellFateGenie Analysis
Use this skill when the user wants to identify genes that drive cell fate decisions along a developmental trajectory. CellFateGenie discovers pseudotime-associated genes using adaptive ridge regression and then scores lineage-specific fate-driving genes via manifold density estimation.
This skill is used **after trajectory inference** — the user must already have pseudotime values computed (e.g., from Palantir, VIA, or diffusion pseudotime).
## Overview
CellFateGenie answers: "Which genes change most significantly along pseudotime, and which are specifically driving a particular lineage?" It works in two phases:
1. **Gene selection** — Adaptive Threshold Regression (ATR) iteratively removes low-coefficient genes while monitoring R² to find the minimal gene set that explains pseudotime
2. **Lineage scoring** — Mellon density estimation on the manifold identifies low-density transition regions, and lineage-specific variability scoring pinpoints fate-driving genes
## Prerequisites
- **Pseudotime**: Must exist as a column in `adata.obs`. Compute first using Palantir, VIA, DPT, or any trajectory method.
- **Mellon** (optional but important): `pip install mellon` for density estimation. Without it, `low_density()` will fail.
- **Expression data**: Works on the `.X` matrix. Log-normalized data is fine (unlike SCENIC which needs raw counts).
## Pipeline Steps
### 1. Initialize Fate object
```python
import omicverse as ov
# pseudotime: column name in adata.obs containing pseudotime values
fate = ov.single.Fate(adata, pseudotime='dpt_pseudotime')
# Automatically uses GPU (PyTorchRidge) if CUDA available, else sklearn Ridge on CPU
```
### 2. Initial ridge regression (model_init)
```python
coef_df = fate.model_init(
test_size=0.3, # Train/test split ratio
alpha=0.1, # Ridge regularization strength
use_data_augmentation=False, # Enable for noisy pseudotime
)
# Returns: DataFrame of gene coefficients
# Stores: fate.coef (all coefficients), fate.raw_r2, fate.raw_mse
```
### 3. Adaptive Threshold Regression (ATR) — feature selection
This is the core innovation. ATR iteratively removes genes with the smallest coefficients and monitors when R² starts dropping significantly:
```python
threshold_df = fate.ATR(
test_size=0.4,
alpha=0.1,
stop=500, # Maximum iterations. Increase for more genes (default 100).
flux=0.01, # R² drop tolerance. When R² drops by more than flux from max, stop.
)
# Sets fate.coef_threshold internally
# Visualize the filtering curve:
fate.plot_filtering() # Shows R² vs iteration, marks optimal threshold
```
### 4. Refit on selected genes (model_fit)
```python
filter_coef_df = fate.model_fit(
test_size=0.3,
alpha=0.1,
)
# Returns: DataFrame of coefficients for genes above threshold only
# Stores: fate.filter_coef
# Compare: fate.get_r2('raw') vs fate.get_r2('filter') — filter R² should be close to raw
```
### 5. Statistical validation (kendalltau_filter)
```python
kendall_df = fate.kendalltau_filter()
# Computes Kendall's tau rank correlation for each filtered gene vs pseudotime
# Returns: DataFrame with kendalltau_sta and pvalue per gene
# Confirms monotonic relationship — genes with high |tau| are truly pseudotime-associated
```
### 6. Mellon density estimation (low_density)
```python
fate.low_density(
n_components=10, # Diffusion map components for manifold representation
knn=30, # k-nearest neighbors for density estimation
alpha=0.0, # Mellon regularization
seed=0,
pca_key='X_pca', # PCA embedding to use
)
# Stores: adata.obs['mellon_log_density_lowd']
# Low-density regions = developmental transition points (branching, commitment)
```
### 7. Lineage-specific scoring (lineage_score)
```python
fate.lineage_score(
cluster_key='leiden', # Clustering column in adata.obs
lineage=['20', '17'], # Cluster labels defining the lineage of interest
cell_mask='specification', # How to select cells: 'specification' uses lineage list
density_key='mellon_log_density_lowd',
)
# Stores: adata.var['change_scores_lineage']
# High scores = genes with high expression variability specifically in that lineage
```
### 8. Identify fate-driving genes
```python
# Intersect ATR-selected genes with lineage-specific scores
fate_genes = adata.var.loc[fate.filter_coef.index, 'change_scores_lineage']
top_fate_genes = fate_genes.sort_values(ascending=False).head(20)
print(top_fate_genes)
```
## Data Augmentation
For noisy pseudotime estimates, enable augmentation to improve robustness:
```python
fate.model_init(
use_data_augmentation=True,
augmentation_strategy='jitter_pseudotime_noise', # or 'gene_expression_noise', 'both'
augmentation_intensity=0.05, # Noise magnitude (fraction of range)
)
# Same parameters available in ATR() and model_fit()
```
## ATAC Mode
CellFateGenie also works with scATAC-seq data:
```python
fate.atac_init(...) # Initialize for ATAC peak data
fate.get_related_peak(...) # Find peaks associated with fate genes
```
## Visualization
```python
# ATR filtering curve — shows R² vs iteration
fate.plot_filtering(figsize=(3, 3))
# Model fit quality
fate.plot_fitting(type='raw') # All genes
fate.plot_fitting(type='filter') # ATR-selected genes only
# Color-coded by cluster
fate.plot_color_fitting(type='filter', cluster_key='leiden')
```
## Critical API Reference
### Pseudotime column must exist in adata.obs
```python
# CORRECT
fate = ov.single.Fate(adata, pseudotime='dpt_pseudotime')
# WRONG — column doesn't exist
# fate = ov.single.Fate(adata, pseudotime='pseudotime') # KeyError if not in adata.obs
```
### ATR flux controls sensitivity
The `flux` parameter (default 0.01) determines when ATR stops removing genes. Lower flux = more genes retained (stricter R² preservation). Higher flux = fewer genes (more aggressive filtering).
### low_density requires mellon package
```python
# WRONG — mellon not installed
# fate.low_density() # ImportError: No module named 'mellon'
# FIX
# pip install mellon
```
## Defensive Validation Patterns
```python
# Verify pseudotime column exists
assert pseudotime_col in adata.obs.columns, \
f"Pseudotime column '{pseudotime_col}' not in adata.obs. Compute trajectory first."
# Verify pseudotime has valid values (no NaN)
import numpy as np
assert not adata.obs[pseudotime_col].isna().any(), \
f"Pseudotime column contains NaN. Filter cells or impute missing values."
# Verify mellon is installed (before low_density)
try:
import mellon
except ImportError:
print("WARNING: mellon not installed. Run: pip install mellon")
# Verify PCA exists (needed for low_density)
assert 'X_pca' in adata.obsm, "PCA required for low_density(). Run ov.pp.pca(adata) first."
# Verify lineage clusters exist (before lineage_score)
for cl in lineage_list:
assert cl in adata.obs[cluster_key].values, \
f"Cluster '{cl}' not found in adata.obs['{cluster_key}']. Available: {adata.obs[cluster_key].unique()}"
```
## Troubleshooting
- **`KeyError` on pseudotime column**: The column name passed to `Fate()` doesn't exist in `adata.obs`. Check with `adata.obs.columns.tolist()`.
- **`ImportError: No module named 'mellon'`**: Install with `pip install mellon`. This is required for `low_density()` but not for ATR/model_fit.
- **Low R² after ATR (<0.3)**: Pseudotime may not be well-correlated with gene expression. Try a different trajectory method or increase `stop` iterations.
- **`plot_filtering()` shows flat curve**: The dataset may have too few variable genes. Ensure HVG selection was done before CellFateGenie.
- **GPU out of memory**: CellFateGenie uses PyTorchRidge when CUDA is available. For large datasets, it falls back to CPU automatically, but you can force CPU by setting `CUDA_VISIBLE_DEVICES=""`.
- **`lineage_score` returns all zeros**: The specified lineage clusters may have too few cells. Check cluster sizes with `adata.obs[cluster_key].value_counts()`.
## Examples
- "Find genes driving erythroid differentiation along my Palantir pseudotime."
- "Run CellFateGenie to identify fate-associated genes and plot the ATR filtering curve."
- "Score lineage-specific genes for the monocyte branch in my trajectory."
## References
- Tutorials: `t_cellfate.ipynb`, `t_cellfate_gene.ipynb`, `t_cellfate_genesets.ipynb`
- Quick copy/paste commands: [`reference.md`](reference.md)
This skill implements CellFateGenie: an end-to-end workflow to discover pseudotime-associated and lineage-driving genes using adaptive threshold ridge regression and manifold density scoring. It combines Adaptive Threshold Regression (ATR) for compact gene selection with Mellon-based low-density detection to prioritize fate-driving genes along trajectories. Use it after you have computed pseudotime for single-cell or bulk trajectory analyses.
First, a ridge regression model is fit to predict pseudotime from gene expression, producing gene coefficients and baseline fit metrics. ATR then iteratively removes low-coefficient genes while monitoring R² to find the smallest gene set that preserves fit. Optionally, Mellon density estimation on a diffusion manifold detects low-density transition regions, and a lineage_score routine computes lineage-specific expression variability to flag fate-driving genes.
What happens if mellon is not installed?
low_density() will fail; ATR and model fitting still work. Install mellon with pip install mellon to enable density-based steps.
How do I choose the ATR flux parameter?
flux controls allowed R² drop from the maximum. Lower flux preserves more genes (safer); higher flux yields a smaller gene set. Start with default 0.01 and adjust based on desired parsimony.