home / skills / starlitnightly / omicverse / single-cellfate-analysis

single-cellfate-analysis skill

/.claude/skills/single-cellfate-analysis

This skill identifies pseudotime-associated genes driving lineage decisions by adaptive ridge regression and Mellon-based density scoring.

npx playbooks add skill starlitnightly/omicverse --skill single-cellfate-analysis

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
8.6 KB
---
name: cellfate-pseudotime-gene-analysis
title: CellFateGenie pseudotime gene analysis
description: "CellFateGenie: Adaptive Threshold Regression for pseudotime-associated gene discovery, Mellon density, lineage scoring via ov.single.Fate."
---

# CellFateGenie Analysis

Use this skill when the user wants to identify genes that drive cell fate decisions along a developmental trajectory. CellFateGenie discovers pseudotime-associated genes using adaptive ridge regression and then scores lineage-specific fate-driving genes via manifold density estimation.

This skill is used **after trajectory inference** — the user must already have pseudotime values computed (e.g., from Palantir, VIA, or diffusion pseudotime).

## Overview

CellFateGenie answers: "Which genes change most significantly along pseudotime, and which are specifically driving a particular lineage?" It works in two phases:

1. **Gene selection** — Adaptive Threshold Regression (ATR) iteratively removes low-coefficient genes while monitoring R² to find the minimal gene set that explains pseudotime
2. **Lineage scoring** — Mellon density estimation on the manifold identifies low-density transition regions, and lineage-specific variability scoring pinpoints fate-driving genes

## Prerequisites

- **Pseudotime**: Must exist as a column in `adata.obs`. Compute first using Palantir, VIA, DPT, or any trajectory method.
- **Mellon** (optional but important): `pip install mellon` for density estimation. Without it, `low_density()` will fail.
- **Expression data**: Works on the `.X` matrix. Log-normalized data is fine (unlike SCENIC which needs raw counts).

## Pipeline Steps

### 1. Initialize Fate object

```python
import omicverse as ov

# pseudotime: column name in adata.obs containing pseudotime values
fate = ov.single.Fate(adata, pseudotime='dpt_pseudotime')
# Automatically uses GPU (PyTorchRidge) if CUDA available, else sklearn Ridge on CPU
```

### 2. Initial ridge regression (model_init)

```python
coef_df = fate.model_init(
    test_size=0.3,           # Train/test split ratio
    alpha=0.1,               # Ridge regularization strength
    use_data_augmentation=False,  # Enable for noisy pseudotime
)
# Returns: DataFrame of gene coefficients
# Stores: fate.coef (all coefficients), fate.raw_r2, fate.raw_mse
```

### 3. Adaptive Threshold Regression (ATR) — feature selection

This is the core innovation. ATR iteratively removes genes with the smallest coefficients and monitors when R² starts dropping significantly:

```python
threshold_df = fate.ATR(
    test_size=0.4,
    alpha=0.1,
    stop=500,    # Maximum iterations. Increase for more genes (default 100).
    flux=0.01,   # R² drop tolerance. When R² drops by more than flux from max, stop.
)
# Sets fate.coef_threshold internally
# Visualize the filtering curve:
fate.plot_filtering()  # Shows R² vs iteration, marks optimal threshold
```

### 4. Refit on selected genes (model_fit)

```python
filter_coef_df = fate.model_fit(
    test_size=0.3,
    alpha=0.1,
)
# Returns: DataFrame of coefficients for genes above threshold only
# Stores: fate.filter_coef
# Compare: fate.get_r2('raw') vs fate.get_r2('filter') — filter R² should be close to raw
```

### 5. Statistical validation (kendalltau_filter)

```python
kendall_df = fate.kendalltau_filter()
# Computes Kendall's tau rank correlation for each filtered gene vs pseudotime
# Returns: DataFrame with kendalltau_sta and pvalue per gene
# Confirms monotonic relationship — genes with high |tau| are truly pseudotime-associated
```

### 6. Mellon density estimation (low_density)

```python
fate.low_density(
    n_components=10,     # Diffusion map components for manifold representation
    knn=30,              # k-nearest neighbors for density estimation
    alpha=0.0,           # Mellon regularization
    seed=0,
    pca_key='X_pca',     # PCA embedding to use
)
# Stores: adata.obs['mellon_log_density_lowd']
# Low-density regions = developmental transition points (branching, commitment)
```

### 7. Lineage-specific scoring (lineage_score)

```python
fate.lineage_score(
    cluster_key='leiden',              # Clustering column in adata.obs
    lineage=['20', '17'],              # Cluster labels defining the lineage of interest
    cell_mask='specification',         # How to select cells: 'specification' uses lineage list
    density_key='mellon_log_density_lowd',
)
# Stores: adata.var['change_scores_lineage']
# High scores = genes with high expression variability specifically in that lineage
```

### 8. Identify fate-driving genes

```python
# Intersect ATR-selected genes with lineage-specific scores
fate_genes = adata.var.loc[fate.filter_coef.index, 'change_scores_lineage']
top_fate_genes = fate_genes.sort_values(ascending=False).head(20)
print(top_fate_genes)
```

## Data Augmentation

For noisy pseudotime estimates, enable augmentation to improve robustness:

```python
fate.model_init(
    use_data_augmentation=True,
    augmentation_strategy='jitter_pseudotime_noise',  # or 'gene_expression_noise', 'both'
    augmentation_intensity=0.05,  # Noise magnitude (fraction of range)
)
# Same parameters available in ATR() and model_fit()
```

## ATAC Mode

CellFateGenie also works with scATAC-seq data:

```python
fate.atac_init(...)          # Initialize for ATAC peak data
fate.get_related_peak(...)   # Find peaks associated with fate genes
```

## Visualization

```python
# ATR filtering curve — shows R² vs iteration
fate.plot_filtering(figsize=(3, 3))

# Model fit quality
fate.plot_fitting(type='raw')     # All genes
fate.plot_fitting(type='filter')  # ATR-selected genes only

# Color-coded by cluster
fate.plot_color_fitting(type='filter', cluster_key='leiden')
```

## Critical API Reference

### Pseudotime column must exist in adata.obs

```python
# CORRECT
fate = ov.single.Fate(adata, pseudotime='dpt_pseudotime')

# WRONG — column doesn't exist
# fate = ov.single.Fate(adata, pseudotime='pseudotime')  # KeyError if not in adata.obs
```

### ATR flux controls sensitivity

The `flux` parameter (default 0.01) determines when ATR stops removing genes. Lower flux = more genes retained (stricter R² preservation). Higher flux = fewer genes (more aggressive filtering).

### low_density requires mellon package

```python
# WRONG — mellon not installed
# fate.low_density()  # ImportError: No module named 'mellon'

# FIX
# pip install mellon
```

## Defensive Validation Patterns

```python
# Verify pseudotime column exists
assert pseudotime_col in adata.obs.columns, \
    f"Pseudotime column '{pseudotime_col}' not in adata.obs. Compute trajectory first."

# Verify pseudotime has valid values (no NaN)
import numpy as np
assert not adata.obs[pseudotime_col].isna().any(), \
    f"Pseudotime column contains NaN. Filter cells or impute missing values."

# Verify mellon is installed (before low_density)
try:
    import mellon
except ImportError:
    print("WARNING: mellon not installed. Run: pip install mellon")

# Verify PCA exists (needed for low_density)
assert 'X_pca' in adata.obsm, "PCA required for low_density(). Run ov.pp.pca(adata) first."

# Verify lineage clusters exist (before lineage_score)
for cl in lineage_list:
    assert cl in adata.obs[cluster_key].values, \
        f"Cluster '{cl}' not found in adata.obs['{cluster_key}']. Available: {adata.obs[cluster_key].unique()}"
```

## Troubleshooting

- **`KeyError` on pseudotime column**: The column name passed to `Fate()` doesn't exist in `adata.obs`. Check with `adata.obs.columns.tolist()`.
- **`ImportError: No module named 'mellon'`**: Install with `pip install mellon`. This is required for `low_density()` but not for ATR/model_fit.
- **Low R² after ATR (<0.3)**: Pseudotime may not be well-correlated with gene expression. Try a different trajectory method or increase `stop` iterations.
- **`plot_filtering()` shows flat curve**: The dataset may have too few variable genes. Ensure HVG selection was done before CellFateGenie.
- **GPU out of memory**: CellFateGenie uses PyTorchRidge when CUDA is available. For large datasets, it falls back to CPU automatically, but you can force CPU by setting `CUDA_VISIBLE_DEVICES=""`.
- **`lineage_score` returns all zeros**: The specified lineage clusters may have too few cells. Check cluster sizes with `adata.obs[cluster_key].value_counts()`.

## Examples
- "Find genes driving erythroid differentiation along my Palantir pseudotime."
- "Run CellFateGenie to identify fate-associated genes and plot the ATR filtering curve."
- "Score lineage-specific genes for the monocyte branch in my trajectory."

## References
- Tutorials: `t_cellfate.ipynb`, `t_cellfate_gene.ipynb`, `t_cellfate_genesets.ipynb`
- Quick copy/paste commands: [`reference.md`](reference.md)

Overview

This skill implements CellFateGenie: an end-to-end workflow to discover pseudotime-associated and lineage-driving genes using adaptive threshold ridge regression and manifold density scoring. It combines Adaptive Threshold Regression (ATR) for compact gene selection with Mellon-based low-density detection to prioritize fate-driving genes along trajectories. Use it after you have computed pseudotime for single-cell or bulk trajectory analyses.

How this skill works

First, a ridge regression model is fit to predict pseudotime from gene expression, producing gene coefficients and baseline fit metrics. ATR then iteratively removes low-coefficient genes while monitoring R² to find the smallest gene set that preserves fit. Optionally, Mellon density estimation on a diffusion manifold detects low-density transition regions, and a lineage_score routine computes lineage-specific expression variability to flag fate-driving genes.

When to use it

  • You have pseudotime values computed (Palantir, VIA, DPT, etc.) and want genes correlated with progression.
  • You need a compact set of genes that explains pseudotime with minimal loss of model fit.
  • You want to find genes specifically driving a particular lineage or branch of a trajectory.
  • Your dataset is single-cell or bulk RNA-seq where log-normalized expression is available.
  • You plan to combine trajectory results with downstream visualization or gene-set analysis.

Best practices

  • Ensure pseudotime exists in adata.obs and contains no NaNs before initializing Fate.
  • Run HVG selection and PCA (X_pca in adata.obsm) beforehand for more stable results.
  • Install mellon (pip install mellon) before running low_density(); otherwise skip density steps.
  • Tune ATR 'flux' and 'stop' to balance feature parsimony and R² preservation (lower flux keeps more genes).
  • Use data augmentation when pseudotime estimates are noisy (jitter or expression noise options).
  • If GPU memory is limited, force CPU or accept automatic fallback to sklearn Ridge.

Example use cases

  • Identify top pseudotime-associated genes driving erythroid differentiation from Palantir outputs.
  • Filter a genome-wide model to a parsimonious gene set that maintains high R² with ATR.
  • Locate low-density transition regions to time branch commitment events using Mellon.
  • Score and rank genes that vary specifically in a user-defined lineage (cluster list).
  • Apply the same pipeline to scATAC peak counts via atac_init and get_related_peak.

FAQ

What happens if mellon is not installed?

low_density() will fail; ATR and model fitting still work. Install mellon with pip install mellon to enable density-based steps.

How do I choose the ATR flux parameter?

flux controls allowed R² drop from the maximum. Lower flux preserves more genes (safer); higher flux yields a smaller gene set. Start with default 0.01 and adjust based on desired parsimony.