home / skills / starlitnightly / omicverse / single-cellfate-analysis

single-cellfate-analysis skill

safe

/.claude/skills/single-cellfate-analysis

This skill identifies pseudotime-associated genes driving lineage decisions by adaptive ridge regression and Mellon-based density scoring.

npx playbooks add skill starlitnightly/omicverse --skill single-cellfate-analysis

Review the files below or copy the command above to add this skill to your agents.

Files (2)

SKILL.md

8.6 KB

---
name: cellfate-pseudotime-gene-analysis
title: CellFateGenie pseudotime gene analysis
description: "CellFateGenie: Adaptive Threshold Regression for pseudotime-associated gene discovery, Mellon density, lineage scoring via ov.single.Fate."
---

# CellFateGenie Analysis

Use this skill when the user wants to identify genes that drive cell fate decisions along a developmental trajectory. CellFateGenie discovers pseudotime-associated genes using adaptive ridge regression and then scores lineage-specific fate-driving genes via manifold density estimation.

This skill is used **after trajectory inference** — the user must already have pseudotime values computed (e.g., from Palantir, VIA, or diffusion pseudotime).

## Overview

CellFateGenie answers: "Which genes change most significantly along pseudotime, and which are specifically driving a particular lineage?" It works in two phases:

1. **Gene selection** — Adaptive Threshold Regression (ATR) iteratively removes low-coefficient genes while monitoring R² to find the minimal gene set that explains pseudotime
2. **Lineage scoring** — Mellon density estimation on the manifold identifies low-density transition regions, and lineage-specific variability scoring pinpoints fate-driving genes

## Prerequisites

- **Pseudotime**: Must exist as a column in `adata.obs`. Compute first using Palantir, VIA, DPT, or any trajectory method.
- **Mellon** (optional but important): `pip install mellon` for density estimation. Without it, `low_density()` will fail.
- **Expression data**: Works on the `.X` matrix. Log-normalized data is fine (unlike SCENIC which needs raw counts).

## Pipeline Steps

### 1. Initialize Fate object

```python
import omicverse as ov

# pseudotime: column name in adata.obs containing pseudotime values
fate = ov.single.Fate(adata, pseudotime='dpt_pseudotime')
# Automatically uses GPU (PyTorchRidge) if CUDA available, else sklearn Ridge on CPU
```

### 2. Initial ridge regression (model_init)

```python
coef_df = fate.model_init(
    test_size=0.3,           # Train/test split ratio
    alpha=0.1,               # Ridge regularization strength
    use_data_augmentation=False,  # Enable for noisy pseudotime
)
# Returns: DataFrame of gene coefficients
# Stores: fate.coef (all coefficients), fate.raw_r2, fate.raw_mse
```

### 3. Adaptive Threshold Regression (ATR) — feature selection

This is the core innovation. ATR iteratively removes genes with the smallest coefficients and monitors when R² starts dropping significantly:

```python
threshold_df = fate.ATR(
    test_size=0.4,
    alpha=0.1,
    stop=500,    # Maximum iterations. Increase for more genes (default 100).
    flux=0.01,   # R² drop tolerance. When R² drops by more than flux from max, stop.
)
# Sets fate.coef_threshold internally
# Visualize the filtering curve:
fate.plot_filtering()  # Shows R² vs iteration, marks optimal threshold
```

### 4. Refit on selected genes (model_fit)

```python
filter_coef_df = fate.model_fit(
    test_size=0.3,
    alpha=0.1,
)
# Returns: DataFrame of coefficients for genes above threshold only
# Stores: fate.filter_coef
# Compare: fate.get_r2('raw') vs fate.get_r2('filter') — filter R² should be close to raw
```

### 5. Statistical validation (kendalltau_filter)

```python
kendall_df = fate.kendalltau_filter()
# Computes Kendall's tau rank correlation for each filtered gene vs pseudotime
# Returns: DataFrame with kendalltau_sta and pvalue per gene
# Confirms monotonic relationship — genes with high |tau| are truly pseudotime-associated
```

### 6. Mellon density estimation (low_density)

```python
fate.low_density(
    n_components=10,     # Diffusion map components for manifold representation
    knn=30,              # k-nearest neighbors for density estimation
    alpha=0.0,           # Mellon regularization
    seed=0,
    pca_key='X_pca',     # PCA embedding to use
)
# Stores: adata.obs['mellon_log_density_lowd']
# Low-density regions = developmental transition points (branching, commitment)
```

### 7. Lineage-specific scoring (lineage_score)

```python
fate.lineage_score(
    cluster_key='leiden',              # Clustering column in adata.obs
    lineage=['20', '17'],              # Cluster labels defining the lineage of interest
    cell_mask='specification',         # How to select cells: 'specification' uses lineage list
    density_key='mellon_log_density_lowd',
)
# Stores: adata.var['change_scores_lineage']
# High scores = genes with high expression variability specifically in that lineage
```

### 8. Identify fate-driving genes

```python
# Intersect ATR-selected genes with lineage-specific scores
fate_genes = adata.var.loc[fate.filter_coef.index, 'change_scores_lineage']
top_fate_genes = fate_genes.sort_values(ascending=False).head(20)
print(top_fate_genes)
```

## Data Augmentation

For noisy pseudotime estimates, enable augmentation to improve robustness:

```python
fate.model_init(
    use_data_augmentation=True,
    augmentation_strategy='jitter_pseudotime_noise',  # or 'gene_expression_noise', 'both'
    augmentation_intensity=0.05,  # Noise magnitude (fraction of range)
)
# Same parameters available in ATR() and model_fit()
```

## ATAC Mode

CellFateGenie also works with scATAC-seq data:

```python
fate.atac_init(...)          # Initialize for ATAC peak data
fate.get_related_peak(...)   # Find peaks associated with fate genes
```

## Visualization

```python
# ATR filtering curve — shows R² vs iteration
fate.plot_filtering(figsize=(3, 3))

# Model fit quality
fate.plot_fitting(type='raw')     # All genes
fate.plot_fitting(type='filter')  # ATR-selected genes only

# Color-coded by cluster
fate.plot_color_fitting(type='filter', cluster_key='leiden')
```

## Critical API Reference

### Pseudotime column must exist in adata.obs

```python
# CORRECT
fate = ov.single.Fate(adata, pseudotime='dpt_pseudotime')

# WRONG — column doesn't exist
# fate = ov.single.Fate(adata, pseudotime='pseudotime')  # KeyError if not in adata.obs
```

### ATR flux controls sensitivity

The `flux` parameter (default 0.01) determines when ATR stops removing genes. Lower flux = more genes retained (stricter R² preservation). Higher flux = fewer genes (more aggressive filtering).

### low_density requires mellon package

```python
# WRONG — mellon not installed
# fate.low_density()  # ImportError: No module named 'mellon'

# FIX
# pip install mellon
```

## Defensive Validation Patterns

```python
# Verify pseudotime column exists
assert pseudotime_col in adata.obs.columns, \
    f"Pseudotime column '{pseudotime_col}' not in adata.obs. Compute trajectory first."

# Verify pseudotime has valid values (no NaN)
import numpy as np
assert not adata.obs[pseudotime_col].isna().any(), \
    f"Pseudotime column contains NaN. Filter cells or impute missing values."

# Verify mellon is installed (before low_density)
try:
    import mellon
except ImportError:
    print("WARNING: mellon not installed. Run: pip install mellon")

# Verify PCA exists (needed for low_density)
assert 'X_pca' in adata.obsm, "PCA required for low_density(). Run ov.pp.pca(adata) first."

# Verify lineage clusters exist (before lineage_score)
for cl in lineage_list:
    assert cl in adata.obs[cluster_key].values, \
        f"Cluster '{cl}' not found in adata.obs['{cluster_key}']. Available: {adata.obs[cluster_key].unique()}"
```

## Troubleshooting

- **`KeyError` on pseudotime column**: The column name passed to `Fate()` doesn't exist in `adata.obs`. Check with `adata.obs.columns.tolist()`.
- **`ImportError: No module named 'mellon'`**: Install with `pip install mellon`. This is required for `low_density()` but not for ATR/model_fit.
- **Low R² after ATR (<0.3)**: Pseudotime may not be well-correlated with gene expression. Try a different trajectory method or increase `stop` iterations.
- **`plot_filtering()` shows flat curve**: The dataset may have too few variable genes. Ensure HVG selection was done before CellFateGenie.
- **GPU out of memory**: CellFateGenie uses PyTorchRidge when CUDA is available. For large datasets, it falls back to CPU automatically, but you can force CPU by setting `CUDA_VISIBLE_DEVICES=""`.
- **`lineage_score` returns all zeros**: The specified lineage clusters may have too few cells. Check cluster sizes with `adata.obs[cluster_key].value_counts()`.

## Examples
- "Find genes driving erythroid differentiation along my Palantir pseudotime."
- "Run CellFateGenie to identify fate-associated genes and plot the ATR filtering curve."
- "Score lineage-specific genes for the monocyte branch in my trajectory."

## References
- Tutorials: `t_cellfate.ipynb`, `t_cellfate_gene.ipynb`, `t_cellfate_genesets.ipynb`
- Quick copy/paste commands: [`reference.md`](reference.md)

Overview

This skill implements CellFateGenie: an end-to-end workflow to discover pseudotime-associated and lineage-driving genes using adaptive threshold ridge regression and manifold density scoring. It combines Adaptive Threshold Regression (ATR) for compact gene selection with Mellon-based low-density detection to prioritize fate-driving genes along trajectories. Use it after you have computed pseudotime for single-cell or bulk trajectory analyses.

How this skill works

First, a ridge regression model is fit to predict pseudotime from gene expression, producing gene coefficients and baseline fit metrics. ATR then iteratively removes low-coefficient genes while monitoring R² to find the smallest gene set that preserves fit. Optionally, Mellon density estimation on a diffusion manifold detects low-density transition regions, and a lineage_score routine computes lineage-specific expression variability to flag fate-driving genes.

When to use it

You have pseudotime values computed (Palantir, VIA, DPT, etc.) and want genes correlated with progression.
You need a compact set of genes that explains pseudotime with minimal loss of model fit.
You want to find genes specifically driving a particular lineage or branch of a trajectory.
Your dataset is single-cell or bulk RNA-seq where log-normalized expression is available.
You plan to combine trajectory results with downstream visualization or gene-set analysis.

Best practices

Ensure pseudotime exists in adata.obs and contains no NaNs before initializing Fate.
Run HVG selection and PCA (X_pca in adata.obsm) beforehand for more stable results.
Install mellon (pip install mellon) before running low_density(); otherwise skip density steps.
Tune ATR 'flux' and 'stop' to balance feature parsimony and R² preservation (lower flux keeps more genes).
Use data augmentation when pseudotime estimates are noisy (jitter or expression noise options).
If GPU memory is limited, force CPU or accept automatic fallback to sklearn Ridge.

Example use cases

Identify top pseudotime-associated genes driving erythroid differentiation from Palantir outputs.
Filter a genome-wide model to a parsimonious gene set that maintains high R² with ATR.
Locate low-density transition regions to time branch commitment events using Mellon.
Score and rank genes that vary specifically in a user-defined lineage (cluster list).
Apply the same pipeline to scATAC peak counts via atac_init and get_related_peak.

FAQ

What happens if mellon is not installed?

low_density() will fail; ATR and model fitting still work. Install mellon with pip install mellon to enable density-based steps.

How do I choose the ATR flux parameter?

flux controls allowed R² drop from the maximum. Lower flux preserves more genes (safer); higher flux yields a smaller gene set. Start with default 0.01 and adjust based on desired parsimony.