home / skills / starlitnightly / omicverse / single-scenic-grn

single-scenic-grn skill

/.claude/skills/single-scenic-grn

This skill infers gene regulatory networks from scRNA-seq, prunes regulons, and scores regulon activity for cell-type resolution.

npx playbooks add skill starlitnightly/omicverse --skill single-scenic-grn

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
8.6 KB
---
name: scenic-gene-regulatory-network
title: SCENIC gene regulatory network
description: "SCENIC gene regulatory network: RegDiffusion GRN inference, cisTarget regulon pruning, AUCell scoring, RSS, regulon embeddings in OmicVerse."
---

# SCENIC Gene Regulatory Network Analysis

Use this skill when the user wants to infer transcription factor (TF) regulatory networks, identify regulons (TF + target gene sets), score regulon activity per cell, or find master regulators for specific cell types. SCENIC reconstructs gene regulatory networks from scRNA-seq data using a 3-stage pipeline.

## Overview: 3-Stage Pipeline

1. **GRN inference** — Predict TF → target gene links using RegDiffusion (deep learning, 10x faster than legacy GRNBoost2)
2. **Regulon pruning** — Validate links with cisTarget motif enrichment databases, keeping only direct targets
3. **AUCell scoring** — Quantify regulon activity per cell, enabling regulon-based clustering and cell type characterization

## Prerequisites

### Data requirements
- **Raw counts** (NOT log-transformed). RegDiffusion needs count-level variance structure.
- HVG-filtered to ~3000 genes for tractable runtime.
- Cell type annotations in `adata.obs` (for downstream RSS analysis).

### Database downloads (CRITICAL — most common failure point)

cisTarget ranking databases and motif annotations must be downloaded before analysis. These are species-specific and ~1-2 GB each.

**Mouse (mm10)**:
```bash
mkdir -p scenic_db
# Ranking databases (two resolution variants)
wget -P scenic_db/ https://resources.aertslab.org/cistarget/databases/mus_musculus/mm10/refseq_r80/mc_v10_clust/gene_based/mm10_500bp_up_100bp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather
wget -P scenic_db/ https://resources.aertslab.org/cistarget/databases/mus_musculus/mm10/refseq_r80/mc_v10_clust/gene_based/mm10_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather
# Motif annotations
wget -P scenic_db/ https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.mgi-m0.001-o0.0.tbl
```

**Human (hg38)**:
```bash
mkdir -p scenic_db
wget -P scenic_db/ https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/hg38_500bp_up_100bp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather
wget -P scenic_db/ https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather
wget -P scenic_db/ https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl
```

## Pipeline Steps

### 1. Initialize SCENIC

```python
import omicverse as ov
import glob

db_glob = 'scenic_db/*.feather'       # Glob pattern matching ranking DBs
motif_path = 'scenic_db/motifs-*.tbl' # Path to motif annotation file
# n_jobs: CPU workers for parallel processing. Match to available cores.
scenic = ov.single.SCENIC(adata, db_glob=db_glob, motif_path=motif_path, n_jobs=8)
```

### 2. GRN inference with RegDiffusion

```python
edgelist = scenic.cal_grn(method='regdiffusion', layer='counts')
# method: 'regdiffusion' (recommended, deep learning) or legacy methods
# layer: must contain RAW counts — not log-normalized. Use 'counts', 'raw_count', or 'X' if X has raw counts.
# Returns: DataFrame with columns [TF, target, importance]
```

### 3. Regulon discovery and AUCell scoring

```python
regulon_ad = scenic.cal_regulons(rho_mask_dropouts=True, seed=42)
# rho_mask_dropouts: ignore zero entries in correlation (handles dropout noise)
# Internally: builds co-expression modules → cisTarget motif pruning → AUCell activity scoring
# Typical compression: ~10k modules → ~70 regulons
```

After this step, `scenic` stores:
- `scenic.adjacencies` — TF→target edge list with importance scores
- `scenic.regulons` — list of Regulon objects (TF, target genes, weights)
- `scenic.auc_mtx` — cells × regulons activity matrix (AUCell scores)
- `scenic.modules` — raw co-expression modules before pruning

### 4. Downstream analysis

**Regulon Specificity Scores (RSS)** — identify master regulators per cell type:
```python
from pyscenic.utils import modules_to_regulons
from pyscenic.binarize import binarize
rss = ov.single.regulon_specificity_scores(scenic.auc_mtx, adata.obs['cell_type'])
# Returns: (cell_types × regulons) DataFrame, Jensen-Shannon divergence scores (0-1)
# Higher = more specific to that cell type
```

**Binary activity matrix** — convert continuous AUCell to on/off:
```python
binary_mtx, thresholds = binarize(scenic.auc_mtx, num_workers=8)
```

**Visualization**:
```python
# Regulon activity on embedding
ov.pl.embedding(regulon_ad, basis='X_umap', color=['Ets1(+)', 'E2f8(+)'])

# GRN network graph
ov.single.plot_grn(G, pos, tf_list, temporal_df, tf_gene_dict, top_tf_target_num=5)
```

## Critical API Reference

### `db_glob` must be a glob pattern matching .feather files

```python
# CORRECT — glob pattern
scenic = ov.single.SCENIC(adata, db_glob='scenic_db/*.feather', motif_path='scenic_db/motifs-v10nr_clust-nr.mgi-m0.001-o0.0.tbl', n_jobs=8)

# WRONG — single file path (needs at least one .feather match)
# scenic = ov.single.SCENIC(adata, db_glob='scenic_db/mm10_500bp.feather', ...)  # May work but misses second DB
```

### `layer` in cal_grn must contain RAW counts

```python
# CORRECT — specify raw count layer
edgelist = scenic.cal_grn(layer='counts')      # If raw counts in adata.layers['counts']
edgelist = scenic.cal_grn(layer='raw_count')   # Alternative layer name

# WRONG — using log-normalized data
# edgelist = scenic.cal_grn(layer='X')  # If X is log-normalized, GRN inference quality degrades severely
```

### Gene names must match the species database

Mouse databases expect mixed-case gene symbols (e.g., `Tp53`, `Cd4`). Human databases expect uppercase (e.g., `TP53`, `CD4`). A mismatch → most genes unmatched → very few regulons recovered.

## Defensive Validation Patterns

```python
import os, glob as gl

# Verify cisTarget databases exist
db_files = gl.glob(db_glob)
assert len(db_files) >= 1, f"No cisTarget .feather files found matching '{db_glob}'. Download databases first."

# Verify motif annotation file exists
assert os.path.isfile(motif_path), f"Motif file not found: {motif_path}. Download species-specific .tbl file."

# Verify raw counts (not log-transformed)
import numpy as np
if hasattr(adata.X, 'toarray'):
    max_val = adata.X.toarray().max()
else:
    max_val = adata.X.max()
if max_val < 20:
    print("WARNING: Max expression value is low — data may be log-transformed. SCENIC needs raw counts.")

# Verify gene name format matches database species
sample_genes = list(adata.var_names[:5])
if 'mm10' in db_glob or 'mgi' in motif_path:
    # Mouse DB expects mixed-case (Tp53)
    if all(g.isupper() for g in sample_genes):
        print("WARNING: Gene names are all uppercase but using mouse database. Check gene name format.")
elif 'hg38' in db_glob or 'hgnc' in motif_path:
    # Human DB expects uppercase (TP53)
    if any(g[0].isupper() and g[1:].islower() for g in sample_genes if len(g) > 1):
        print("WARNING: Gene names look like mouse format but using human database.")
```

## Troubleshooting

- **`No ranking databases found`**: The `db_glob` pattern doesn't match any `.feather` files. Check the path and ensure databases are downloaded. Use `glob.glob(db_glob)` to debug.
- **`Empty regulons (0 regulons after pruning)`**: Usually means gene names don't match the database species. Mouse genes are mixed-case (Actb), human are uppercase (ACTB). Also check: are you using raw counts?
- **`Log-transformed data passed to RegDiffusion`**: RegDiffusion needs variance structure from raw counts. If `adata.X.max() < 20`, the data is likely log-transformed. Use `adata.layers['counts']` or expm1 to recover raw counts.
- **`MemoryError during regulon pruning`**: cisTarget enrichment is memory-intensive. Reduce HVG count (2000 instead of 3000) or increase swap. Also try reducing `n_jobs`.
- **`Low regulon recovery rate (10k modules → <10 regulons)`**: Pruning thresholds may be too strict. Adjust `cal_regulons(thresholds=(0.5, 0.75))` for more permissive filtering. Or increase `top_n_targets=(100,)`.
- **`GUROBI license required` (CEFCON only)**: CEFCON's integer linear programming prefers GUROBI (academic license free). Fallback to SCIP is slower but works without license.

## Examples
- "Run SCENIC on my mouse hematopoiesis data to find master regulators per cell type."
- "Infer gene regulatory networks from scRNA-seq and visualize TF-target relationships."
- "Score regulon activity per cell and identify cell-type-specific transcription factors."

## References
- Tutorial: `t_scenic.ipynb`
- Quick copy/paste commands: [`reference.md`](reference.md)

Overview

This skill implements SCENIC gene regulatory network analysis inside OmicVerse for bulk, single-cell, and spatial RNA-seq. It infers TF→target links with RegDiffusion, prunes regulons by cisTarget motif enrichment, and scores regulon activity per cell with AUCell. The output includes regulons, AUCell activity matrices, RSS for cell-type specificity, and visualizations for embedding and GRN graphs.

How this skill works

The pipeline runs three stages: (1) GRN inference using RegDiffusion on raw count data to predict TF→target importance, (2) regulon pruning with cisTarget motif rankings and motif-to-TF annotations to keep direct targets, and (3) AUCell scoring to quantify regulon activity in each cell. The skill validates database presence, checks raw-count layers and gene-name conventions, and stores adjacencies, regulons, AUCell matrices, and intermediate modules for downstream analysis.

When to use it

  • Inferring transcription factor regulatory networks from single-cell or bulk RNA-seq counts.
  • Identifying regulons (TF + direct target gene sets) and master regulators for cell types.
  • Scoring regulon activity per cell for regulon-based clustering or cell-type annotation.
  • Comparing regulon specificity across annotated cell types using RSS.
  • Visualizing TF activity on embeddings and plotting TF→target networks for interpretation.

Best practices

  • Provide raw (non-log) counts in a named layer (e.g., adata.layers['counts']) — RegDiffusion requires count-level variance.
  • Download species-appropriate cisTarget .feather ranking databases and motif .tbl files before running; missing DBs are the most common failure.
  • Filter HVGs to ~2000–3000 genes for tractable runtime and memory; reduce HVGs if pruning causes MemoryError.
  • Ensure gene name case matches the database species (mouse mixed-case, human uppercase) to avoid empty regulons.
  • Set n_jobs to match available CPU cores; reduce parallelism if memory is constrained.
  • Validate existence of db_glob with glob.glob and check motif_path file before starting.

Example use cases

  • Run SCENIC on mouse hematopoiesis scRNA-seq to find master regulators per annotated cell type.
  • Infer GRNs from spatial transcriptomics counts and map regulon activity onto tissue embeddings.
  • Score and binarize regulon activity to identify on/off programs for downstream differential activity tests.
  • Prune co-expression modules to direct regulons and visualize top TF→target edges for publication figures.
  • Compute RSS to rank TFs specific to rare cell populations for follow-up experiments.

FAQ

What input layer should I use for GRN inference?

Use a layer containing raw counts (not log-transformed). Typical names: 'counts' or 'raw_count'. If max expression <20, your data may be log-transformed.

Why do I get zero or very few regulons after pruning?

Most cases are due to gene-name mismatch with the cisTarget database or using log-transformed inputs. Check gene case (mouse mixed-case, human uppercase) and ensure raw counts are supplied.