home / skills / starlitnightly / omicverse / bulk-deseq2-analysis

bulk-deseq2-analysis skill

safe

This skill guides you through PyDESeq2-based bulk RNA-seq differential expression analysis with ID mapping, filtering, visualization, and enrichment in

npx playbooks add skill starlitnightly/omicverse --skill bulk-deseq2-analysis

Review the files below or copy the command above to add this skill to your agents.

Files (2)

SKILL.md

3.6 KB

---
name: bulk-rna-seq-deseq2-analysis-with-omicverse
title: Bulk RNA-seq DESeq2 analysis with omicverse
description: Walk Claude through PyDESeq2-based differential expression, including ID mapping, DE testing, fold-change thresholding, and enrichment visualisation.
---

# Bulk RNA-seq DESeq2 analysis with omicverse

## Overview
Use this skill when a user wants to reproduce the DESeq2 workflow showcased in [`t_deseq2.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_deseq2.ipynb). It covers loading raw featureCounts matrices, mapping Ensembl IDs to symbols, running PyDESeq2 via `ov.bulk.pyDEG`, and exploring downstream enrichment plots.

## Instructions
1. **Import and format the expression matrix**
   - Call `import omicverse as ov` and `ov.utils.ov_plot_set()` to standardise visuals.
   - Read tab-separated count data from featureCounts using `ov.utils.read(..., index_col=0, header=1)`.
   - Strip trailing `.bam` from column names with `[c.split('/')[-1].replace('.bam', '') for c in data.columns]`.
2. **Map gene identifiers**
   - Ensure the appropriate mapping pair exists by running `ov.utils.download_geneid_annotation_pair()`.
   - Replace `gene_id` with gene symbols using `ov.bulk.Matrix_ID_mapping(data, 'genesets/pair_<GENOME>.tsv')`.
3. **Initialise the DEG object**
   - Create `dds = ov.bulk.pyDEG(data)` from the mapped counts.
   - Resolve duplicate gene names with `dds.drop_duplicates_index()` and confirm success in logs.
4. **Define contrasts and run DESeq2**
   - Collect sample labels into `treatment_groups` and `control_groups` lists that match column names exactly.
   - Execute `dds.deg_analysis(treatment_groups, control_groups, method='DEseq2')` to invoke PyDESeq2.
5. **Filter and tune thresholds**
   - Inspect result shape (`dds.result.shape`) and optionally filter low-expression genes, e.g. `dds.result.loc[dds.result['log2(BaseMean)'] > 1]`.
   - Set thresholds via `dds.foldchange_set(fc_threshold=-1, pval_threshold=0.05, logp_max=6)` to auto-pick fold-change cutoffs.
6. **Visualise differential genes**
   - Draw volcano plots with `dds.plot_volcano(...)` and summarise key genes.
   - Produce per-gene boxplots: `dds.plot_boxplot(genes=[...], treatment_groups=..., control_groups=..., figsize=(2, 3))`.
7. **Run enrichment analyses (optional)**
   - Download enrichment libraries using `ov.utils.download_pathway_database()` and load them through `ov.utils.geneset_prepare`.
   - Rank genes for GSEA with `rnk = dds.ranking2gsea()`.
   - Instantiate `gsea_obj = ov.bulk.pyGSEA(rnk, pathway_dict)` and call `gsea_obj.enrichment()` to compute terms.
   - Plot enrichment bubble charts via `gsea_obj.plot_enrichment(...)` and GSEA curves with `gsea_obj.plot_gsea(term_num=..., ...)`.
8. **Troubleshooting**
   - If PyDESeq2 raises errors about size factors, remind users to provide raw counts (not log-transformed data).
   - `gene_id` mapping depends on species; direct them to download the correct genome pair when results look sparse.
   - Large pathway libraries may require raising recursion limits or filtering to the top N terms before plotting.

## Examples
- "Run PyDESeq2 on treated vs control replicates and highlight the top enriched WikiPathways terms."
- "Filter DEGs to genes with log2(BaseMean) > 1, auto-select fold-change cutoffs, and create volcano and boxplots."
- "Generate the ranked gene list for GSEA and plot the enrichment curve for the top pathway."

## References
- Tutorial notebook: [`t_deseq2.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_deseq2.ipynb)
- Sample featureCounts matrix: [`sample/counts.txt`](../../sample/counts.txt)
- Quick copy/paste commands: [`reference.md`](reference.md)

Overview

This skill walks Claude through a reproducible PyDESeq2-based bulk RNA-seq differential expression workflow using the omicverse library. It covers loading featureCounts matrices, mapping Ensembl IDs to gene symbols, running DE testing, tuning fold-change and p-value thresholds, and visualising results and enrichment. The notes emphasise practical commands, common pitfalls, and downstream GSEA plotting.

How this skill works

Start by importing omicverse, reading raw count tables, and cleaning sample column names. Map gene identifiers to symbols, initialise a pyDEG object from the counts, and run DESeq2 contrasts with treatment and control group lists. Filter results by expression and significance, create volcano and per-gene boxplots, and optionally prepare ranked lists for GSEA and plot enrichment summaries.

When to use it

You have raw featureCounts count matrices and want a reproducible DESeq2 workflow in Python.
You need to map Ensembl IDs to gene symbols before differential testing.
You want automated fold-change threshold selection and visual QC (volcano, boxplots).
You plan to run GSEA or pathway enrichment on ranked DE results.
You want code-ready snippets to integrate PyDESeq2 into a notebook pipeline.

Best practices

Provide raw, untransformed counts (not log-normalised) to PyDESeq2 to avoid size factor errors.
Download and use the correct gene ID mapping pair for your species before ID conversion.
Remove duplicate gene names with dds.drop_duplicates_index() and confirm via logs.
Inspect BaseMean or log2(BaseMean) to filter low-expression genes prior to thresholding.
Limit large pathway libraries or raise recursion limits before plotting many enrichment terms.

Example use cases

Run PyDESeq2 for treated vs control replicates and plot volcano and top gene boxplots.
Filter DEGs to BaseMean > threshold, auto-select fold-change cutoffs, and export significant genes.
Generate a ranked gene list (rnk) from DE results and run GSEA for WikiPathways or KEGG.
Download pathway databases, compute enrichment with pyGSEA, and plot bubble charts and GSEA curves.
Troubleshoot size-factor or mapping issues when results appear sparse or inconsistent.

FAQ

What input format does this workflow expect?

A tab-separated featureCounts matrix with genes as rows and raw counts per sample as columns; header row 1 and gene_id as index.

Why am I getting size factor or PyDESeq2 errors?

Most often because the input was transformed (log/CPM). Re-run using raw integer counts and ensure no prior normalisation.