home / skills / starlitnightly / omicverse / bulk-deg-analysis

bulk-deg-analysis skill

safe

This skill guides you through bulk RNA-seq differential expression analysis in omicverse, from gene ID mapping to visualization and pathway enrichment.

npx playbooks add skill starlitnightly/omicverse --skill bulk-deg-analysis

Review the files below or copy the command above to add this skill to your agents.

Files (2)

SKILL.md

5.0 KB

---
name: bulk-rna-seq-differential-expression-with-omicverse
title: Bulk RNA-seq differential expression with omicverse
description: Guide Claude through omicverse's bulk RNA-seq DEG pipeline, from gene ID mapping and DESeq2 normalization to statistical testing, visualization, and pathway enrichment. Use when a user has bulk count matrices and needs differential expression analysis in omicverse.
---

# Bulk RNA-seq differential expression with omicverse

## Overview
Follow this skill to run the end-to-end differential expression (DEG) workflow showcased in [`t_deg.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_deg.ipynb). It assumes the user provides a raw gene-level count matrix (e.g., from featureCounts) and wants to analyse bulk RNA-seq cohorts inside omicverse.

## Instructions
1. **Set up the session**
   - Import `omicverse as ov`, `scanpy as sc`, and `matplotlib.pyplot as plt`.
   - Call `ov.plot_set()` so downstream plots adopt omicverse styling.
2. **Prepare ID mapping assets**
   - When gene IDs must be converted to gene symbols, instruct the user to download mapping pairs via `ov.utils.download_geneid_annotation_pair()` and store them under `genesets/`.
   - Mention the available prebuilt genomes (T2T-CHM13, GRCh38, GRCh37, GRCm39, danRer7, danRer11) and that users can generate their own mapping from GTF files if needed.
3. **Load the raw counts**
   - Read tab-delimited featureCounts output with `ov.pd.read_csv(..., sep='\t', header=1, index_col=0)`.
   - Strip trailing `.bam` segments from column names using list comprehension so sample IDs are clean.
4. **Map gene identifiers**
   - Run `ov.bulk.Matrix_ID_mapping(counts_df, 'genesets/pair_<GENOME>.tsv')` to replace `gene_id` entries with gene symbols.
5. **Initialise the DEG object**
   - Create `dds = ov.bulk.pyDEG(mapped_counts)`.
   - Handle duplicate gene symbols with `dds.drop_duplicates_index()` to keep the highest expressed version.
6. **Normalise and estimate size factors**
   - Execute `dds.normalize()` to calculate DESeq2 size factors, correcting for library size and batch differences.
7. **Run differential testing**
   - Collect treatment and control replicate labels into lists.
   - Call `dds.deg_analysis(treatment_groups, control_groups, method='ttest')` for the default Welch t-test.
   - Offer optional alternatives: `method='edgepy'` for edgeR-like tests and `method='limma'` for limma-style modelling.
8. **Filter and threshold results**
   - Note that lowly expressed genes are retained by default; filter using `dds.result.loc[dds.result['log2(BaseMean)'] > 1]` when needed.
   - Set dynamic fold-change and significance cutoffs via `dds.foldchange_set(fc_threshold=-1, pval_threshold=0.05, logp_max=6)` (`fc_threshold=-1` auto-selects based on log2FC distribution).
9. **Visualise differential expression**
   - Produce volcano plots with `dds.plot_volcano(title=..., figsize=..., plot_genes=... or plot_genes_num=...)` to highlight key genes.
   - Generate per-gene boxplots using `dds.plot_boxplot(genes=[...], treatment_groups=..., control_groups=..., figsize=..., legend_bbox=...)`; adjust y-axis tick labels if required.
10. **Perform pathway enrichment (optional)**
    - Download curated pathway libraries through `ov.utils.download_pathway_database()`.
    - Load genesets with `ov.utils.geneset_prepare(<path>, organism='Mouse'|'Human'|...)`.
    - Build the DEG gene list from `dds.result.loc[dds.result['sig'] != 'normal'].index`.
    - Run enrichment with `ov.bulk.geneset_enrichment(gene_list=deg_genes, pathways_dict=..., pvalue_type='auto', organism=...)`. Encourage users without internet access to provide a `background` gene list.
    - Visualise single-library results via `ov.bulk.geneset_plot(...)` and combine multiple ontologies using `ov.bulk.geneset_plot_multi(enr_dict, colors_dict, num=...)`.
11. **Document outputs**
    - Suggest exporting `dds.result` and enrichment tables to CSV for downstream reporting.
    - Encourage users to save figures generated by matplotlib (`plt.savefig(...)`) when running outside notebooks.
12. **Troubleshooting tips**
    - Ensure sample labels in `treatment_groups`/`control_groups` exactly match column names post-cleanup.
    - Verify required packages (`omicverse`, `pyComplexHeatmap`, `gseapy`) are installed for enrichment visualisations.
    - Remind users that internet access is required the first time they download gene mappings or pathway databases.

## Examples
- "I have a featureCounts matrix for mouse tumour samples—normalize it with DESeq2, run t-test DEG, and highlight the top 8 genes in a volcano plot."
- "Use omicverse to compute edgeR-style differential expression between treated and control replicates, then run GO enrichment on significant genes."
- "Guide me through converting Ensembl IDs to symbols, performing limma DEG, and plotting boxplots for Krtap9-5 and Lef1."

## References
- Detailed walkthrough notebook: [`t_deg.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_deg.ipynb)
- Sample count matrix for testing: [`sample/counts.txt`](../../sample/counts.txt)
- Quick copy/paste commands: [`reference.md`](reference.md)

Overview

This skill guides Claude through omicverse's end-to-end bulk RNA-seq differential expression (DEG) pipeline, from raw gene-level counts to normalized statistics, visualization, and pathway enrichment. It assumes you have featureCounts-style count matrices and want reproducible DEG results inside omicverse. The instructions cover ID mapping, DESeq2-style normalization, statistical testing options, plotting, and enrichment workflows.

How this skill works

The skill walks through preparing mapping assets, loading and cleaning raw counts, and converting gene identifiers to symbols. It shows how to build a pyDEG object, normalize using DESeq2 size factors, run DEG tests (Welch t-test by default, or edgepy/limma alternatives), filter results, and generate volcano and per-gene boxplots. Optional pathway enrichment and multi-library visualization steps use downloadable gene sets and built-in plotting functions.

When to use it

You have raw gene-level count matrices (featureCounts) and need DEG analysis.
You must convert Ensembl/Ensembl-like IDs to gene symbols before analysis.
You want DESeq2-style normalization but flexible testing (ttest, edgepy, limma).
You need publication-ready volcano plots and per-gene boxplots from omicverse.
You want to run pathway enrichment on DEG results using curated libraries.

Best practices

Download and store gene ID mapping pairs under genesets/ before analysis to ensure reproducible mappings.
Clean sample column names (strip .bam suffixes) so group labels exactly match counts columns.
Drop duplicate gene symbols keeping the highest expressed entry to avoid ambiguous rows.
Run dds.normalize() to compute size factors prior to any differential testing.
Filter lowly expressed genes (e.g., log2(BaseMean) > 1) before setting fold-change and p-value thresholds.
Save dds.result and enrichment tables to CSV and export figures with plt.savefig() when running non-interactively.

Example use cases

Normalize a mouse tumour featureCounts matrix with DESeq2, run t-test DEG, and plot the top 8 genes on a volcano plot.
Convert Ensembl IDs to symbols, run limma-style modelling for a complex design, and plot boxplots for selected marker genes.
Run edgepy-style DEG between treated and control replicates and perform GO/pathway enrichment on significant genes.
Download pathway databases, prepare genesets for Human or Mouse, and generate combined multi-ontology enrichment plots.

FAQ

What genome mappings are available by default?

Prebuilt mappings include T2T-CHM13, GRCh38, GRCh37, GRCm39, danRer7, and danRer11; you can also generate mappings from local GTF files.

Which DEG methods can I use?

Default is Welch t-test (method='ttest'). Alternatives are method='edgepy' for edgeR-like tests and method='limma' for limma-style modelling.

Do I need internet access to run enrichment?

Internet is required the first time to download gene mappings or pathway databases; subsequent runs can use local copies. Provide a background gene list if internet is unavailable.