home / skills / gptomics / bioskills / trajectory-inference

trajectory-inference skill

/single-cell/trajectory-inference

This skill enables end-to-end inference of developmental trajectories and pseudotime from single-cell RNA-seq data using Monocle3, Slingshot, and scVelo.

npx playbooks add skill gptomics/bioskills --skill trajectory-inference

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
7.0 KB
---
name: bio-single-cell-trajectory-inference
description: Infer developmental trajectories and pseudotime from single-cell RNA-seq data using Monocle3, Slingshot, and scVelo for RNA velocity analysis. Use when inferring developmental trajectories or pseudotime.
tool_type: mixed
primary_tool: Monocle3
---

## Version Compatibility

Reference examples tested with: Cell Ranger 8.0+, scanpy 1.10+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Trajectory Inference

## Monocle3 (R)

**Goal:** Infer developmental trajectories and pseudotime ordering using Monocle3's principal graph approach.

**Approach:** Learn a principal graph through the data manifold, order cells along the graph from a root state, and extract pseudotime values.

**"Find the developmental trajectory in my data"** → Construct a tree-like graph through the cell embedding, assign pseudotime from a root population, and identify branch points.

```r
library(monocle3)

# Create cell_data_set from Seurat
cds <- as.cell_data_set(seurat_obj)

# Preprocess (if not already done)
cds <- preprocess_cds(cds, num_dim = 50)
cds <- reduce_dimension(cds, reduction_method = 'UMAP')

# Cluster cells
cds <- cluster_cells(cds)

# Learn trajectory graph
cds <- learn_graph(cds)

# Order cells (select root interactively or programmatically)
cds <- order_cells(cds, root_cells = root_cell_ids)

# Plot trajectory with pseudotime
plot_cells(cds, color_cells_by = 'pseudotime', label_branch_points = TRUE, label_leaves = TRUE)

# Get pseudotime values
pseudotime <- pseudotime(cds)
```

## Set Root Programmatically

**Goal:** Automatically select the trajectory root node based on a known progenitor cluster.

**Approach:** Identify the principal graph node closest to cells in the specified progenitor cluster and use it as the root for pseudotime ordering.

```r
# Find root by marker gene expression
get_earliest_principal_node <- function(cds, cluster_name) {
    cell_ids <- which(colData(cds)$seurat_clusters == cluster_name)
    closest_vertex <- cds@principal_graph_aux[['UMAP']]$pr_graph_cell_proj_closest_vertex
    closest_vertex <- as.matrix(closest_vertex[cell_ids, ])
    root_pr_nodes <- igraph::V(principal_graph(cds)[['UMAP']])$name[as.numeric(names(which.max(table(closest_vertex))))]
    root_pr_nodes
}

cds <- order_cells(cds, root_pr_nodes = get_earliest_principal_node(cds, 'stem_cluster'))
```

## Slingshot (R)

**Goal:** Infer smooth lineage trajectories and pseudotime using Slingshot's minimum spanning tree and principal curves.

**Approach:** Build a minimum spanning tree through cluster centroids to define lineage structure, then fit smooth principal curves for per-lineage pseudotime.

```r
library(slingshot)
library(SingleCellExperiment)

# From Seurat object
sce <- as.SingleCellExperiment(seurat_obj)
reducedDims(sce)$UMAP <- Embeddings(seurat_obj, 'umap')

# Run slingshot
sce <- slingshot(sce, clusterLabels = 'seurat_clusters', reducedDim = 'UMAP')

# Get pseudotime for each lineage
pseudotime_mat <- slingPseudotime(sce)

# Get lineage curves
curves <- slingCurves(sce)

# Plot trajectories
plot(reducedDims(sce)$UMAP, col = sce$seurat_clusters, pch = 16)
lines(SlingshotDataSet(sce), lwd = 2)
```

## Slingshot with Start/End Clusters

```r
# Specify starting cluster
sce <- slingshot(sce, clusterLabels = 'seurat_clusters', reducedDim = 'UMAP', start.clus = 'HSC')

# Specify start and end
sce <- slingshot(sce, clusterLabels = 'seurat_clusters', reducedDim = 'UMAP',
                 start.clus = 'HSC', end.clus = c('Erythroid', 'Myeloid'))
```

## scVelo RNA Velocity (Python)

**Goal:** Estimate RNA velocity to predict future cell states from spliced/unspliced transcript ratios.

**Approach:** Model the dynamics of splicing using stochastic or dynamical models, compute velocity vectors, and project directional flow onto UMAP.

```python
import scvelo as scv
import scanpy as sc

# Load data with spliced/unspliced counts
adata = scv.read('data.h5ad')

# Or merge loom files from velocyto
ldata = scv.read('velocyto_output.loom')
adata = scv.utils.merge(adata, ldata)

# Preprocess
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

# Compute velocity (stochastic model)
scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity_graph(adata)

# Visualize velocity streams
scv.pl.velocity_embedding_stream(adata, basis='umap', color='clusters')
```

## scVelo Dynamical Model

```python
# More accurate but slower
scv.tl.recover_dynamics(adata, n_jobs=8)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)

# Latent time (pseudotime)
scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', cmap='gnuplot')

# Velocity confidence
scv.tl.velocity_confidence(adata)
scv.pl.scatter(adata, color=['velocity_confidence', 'velocity_length'])
```

## Gene Dynamics Along Trajectory

```r
# Monocle3: Find genes varying over pseudotime
graph_test_res <- graph_test(cds, neighbor_graph = 'principal_graph', cores = 4)
sig_genes <- graph_test_res %>% filter(q_value < 0.05) %>% arrange(desc(morans_I))

# Plot gene expression over pseudotime
plot_genes_in_pseudotime(cds[rownames(cds) %in% top_genes, ], color_cells_by = 'cluster')
```

```python
# scVelo: Top likelihood genes
scv.tl.rank_velocity_genes(adata, groupby='clusters', min_corr=0.3)
top_genes = adata.uns['rank_velocity_genes']['names']

# Plot phase portraits
scv.pl.velocity(adata, var_names=['gene1', 'gene2'], basis='umap')
```

## Branch Point Analysis

```r
# Monocle3: Genes differentially expressed at branch points
branch_genes <- graph_test(cds, neighbor_graph = 'principal_graph', cores = 4)

# Slingshot + tradeSeq for branch analysis
library(tradeSeq)
sce <- fitGAM(sce, nknots = 6)
branch_res <- earlyDETest(sce, knots = c(3, 4))
```

## Velocyto Preprocessing

```bash
# Generate loom file with spliced/unspliced counts
velocyto run10x -m repeat_mask.gtf /path/to/cellranger_output annotation.gtf

# For SmartSeq2
velocyto run_smartseq2 -o output -m repeat_mask.gtf -e sample bam_files/*.bam annotation.gtf
```

## PAGA Trajectory (Scanpy)

```python
import scanpy as sc

# Compute PAGA
sc.tl.paga(adata, groups='leiden')
sc.pl.paga(adata, color='leiden', threshold=0.03)

# PAGA-initialized UMAP
sc.tl.draw_graph(adata, init_pos='paga')
sc.pl.draw_graph(adata, color='leiden')

# Diffusion pseudotime
adata.uns['iroot'] = np.flatnonzero(adata.obs['leiden'] == 'root_cluster')[0]
sc.tl.dpt(adata)
sc.pl.draw_graph(adata, color='dpt_pseudotime')
```

## Related Skills

- single-cell/clustering - Prerequisite clustering
- single-cell/cell-communication - Downstream signaling analysis
- differential-expression/deseq2-basics - DE along trajectory

Overview

This skill infers developmental trajectories and pseudotime from single-cell RNA-seq data using Monocle3, Slingshot, and scVelo. It supports principal-graph, principal-curve, and RNA-velocity approaches so you can order cells, find branch points, and predict future states. Use it to combine complementary methods for robust trajectory inference and dynamic gene analysis.

How this skill works

The skill provides ready-to-run patterns: Monocle3 (R) to learn a principal graph and assign pseudotime from a chosen root; Slingshot (R) to build MST-based lineages and fit smooth principal curves per lineage; and scVelo (Python) to estimate RNA velocity from spliced/unspliced counts using stochastic or dynamical models and compute latent time. It also includes examples for programmatic root selection, branch tests, velocity visualization, and integrating loom/velocyto outputs.

When to use it

  • You need a pseudotemporal ordering of cells to study differentiation or development.
  • You want to detect lineage branch points and identify branch-specific genes.
  • You have spliced/unspliced data and want to predict future cell states with RNA velocity.
  • You need both topology-based (Monocle3/Slingshot) and directionality-based (scVelo) evidence.
  • You plan downstream tests for genes varying along pseudotime or between branches.

Best practices

  • Verify package and tool versions (scanpy, scVelo, Monocle3, Slingshot) before running examples.
  • Preprocess consistently: normalized counts, dimensionality reduction (PCA/UMAP), and clustering first.
  • Choose the root with biological knowledge (marker clusters) or programmatically from progenitor cells.
  • Use dynamical scVelo for higher accuracy when computing latent time, but expect longer runtimes.
  • Validate inferred trajectories with known markers and compare methods (Monocle3 vs Slingshot vs scVelo).

Example use cases

  • Order hematopoietic single-cell data to reconstruct differentiation from HSCs to mature lineages.
  • Identify genes changing at a branch point using Monocle3 graph_test and plot_genes_in_pseudotime.
  • Estimate RNA velocity on 10x data to visualize directional flows on UMAP and compute latent time.
  • Fit Slingshot curves to clustered UMAP embeddings to obtain lineage-specific pseudotime matrices.
  • Combine velocyto-generated loom files with scVelo to add spliced/unspliced information into analysis.

FAQ

What input formats are supported?

Seurat objects and SingleCellExperiment for R workflows; AnnData (.h5ad) and velocyto loom files for Python/scVelo.

How do I choose root cells for pseudotime?

Prefer known progenitor clusters or marker-expressing cells; examples show programmatic selection of principal graph nodes closest to a progenitor cluster.

When should I use dynamical scVelo vs stochastic?

Use stochastic for quick exploratory analysis; use dynamical for more accurate velocity and latent time estimates when compute resources permit.