home / skills / starlitnightly / omicverse / fastq-analysis

fastq-analysis skill

/.claude/skills/fastq-analysis

This skill guides end-to-end FASTQ-to-count analysis in OmicVerse, automating download, QC, alignment, quantification, and single-cell workflows.

npx playbooks add skill starlitnightly/omicverse --skill fastq-analysis

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
7.8 KB
---
name: fastq-analysis-pipeline
title: FASTQ analysis and RNA-seq alignment with omicverse
description: Guide through omicverse's alignment module for SRA downloading, FASTQ quality control, STAR alignment, gene quantification, and single-cell kallisto/bustools pipelines covering both bulk and single-cell RNA-seq workflows.
---

## Overview

OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the `ov.alignment` module. This skill covers:

- **SRA data acquisition**: `prefetch` and `fqdump` (fasterq-dump wrapper)
- **Quality control**: `fastp` for adapter trimming and QC reports
- **RNA-seq alignment**: `STAR` aligner with auto-index building
- **Gene quantification**: `featureCount` (subread featureCounts wrapper)
- **Single-cell path**: `ref` and `count` via kb-python (kallisto/bustools)
- **Parallel SRA download**: `parallel_fastq_dump`

All functions share a common CLI infrastructure (`_cli_utils.py`) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.

## Instructions

1. **Environment setup**
   - Bioinformatics tools are resolved automatically from PATH or the active conda environment.
   - If `auto_install=True` (default), missing tools are installed via mamba/conda on demand.
   - Supported tools: `prefetch`, `vdb-validate`, `fasterq-dump`, `fastp`, `STAR`, `samtools`, `featureCounts`, `pigz`, `gzip`.
   - For the single-cell path, ensure `kb-python` is installed: `pip install kb-python`.

2. **SRA data download** (`ov.alignment.prefetch` + `ov.alignment.fqdump`)
   - Use `prefetch` first for reliable downloads with integrity validation (`vdb-validate`).
   - Then convert to FASTQ with `fqdump`. It auto-detects single-end vs paired-end.
   - `fqdump` can also work directly from SRR accessions without prefetch.
   - Both support retry with exponential backoff for network errors.
   ```python
   import omicverse as ov

   # Step 1: Prefetch SRA files (optional but recommended)
   pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4)

   # Step 2: Convert to FASTQ
   fq = ov.alignment.fqdump(['SRR1234567', 'SRR1234568'],
                             output_dir='fastq', sra_dir='prefetch',
                             gzip=True, threads=8, jobs=4)
   ```

3. **FASTQ quality control** (`ov.alignment.fastp`)
   - Runs fastp for adapter trimming, quality filtering, and QC reporting.
   - Supports single-end and paired-end reads.
   - Produces per-sample JSON and HTML QC reports.
   - Sample format: tuple of `(sample_name, fq1_path, fq2_path_or_None)`.
   ```python
   samples = [
       ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'),
       ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'),
   ]
   clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2)
   ```

4. **STAR alignment** (`ov.alignment.STAR`)
   - Aligns FASTQ reads using the STAR aligner.
   - **Auto-index building**: set `auto_index=True` (default) with `genome_fasta_files` and `gtf` to build index automatically if missing.
   - Produces coordinate-sorted BAM files.
   - Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).
   - Use `strict=False` (default) for graceful error handling per sample.
   ```python
   # Prepare samples from fastp output
   star_samples = [
       ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'),
       ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'),
   ]
   bams = ov.alignment.STAR(
       star_samples,
       genome_dir='star_index',
       output_dir='star_out',
       gtf='genes.gtf',
       genome_fasta_files=['genome.fa'],
       threads=8,
       memory='50G',
   )
   ```

5. **Gene quantification** (`ov.alignment.featureCount`)
   - Counts aligned reads per gene using featureCounts (subread).
   - Auto-detects paired-end from BAM headers (via pysam or samtools).
   - `auto_fix=True` (default) retries with corrected paired-end flag on error.
   - `gene_mapping=True` maps gene_id to gene_name from the GTF.
   - `merge_matrix=True` produces a combined count matrix across all samples.
   ```python
   bam_items = [
       ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'),
       ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'),
   ]
   counts = ov.alignment.featureCount(
       bam_items,
       gtf='genes.gtf',
       output_dir='counts',
       gene_mapping=True,
       merge_matrix=True,
       threads=8,
   )
   # counts is a pandas DataFrame (gene_id x samples)
   ```

6. **Single-cell path** (`ov.alignment.ref` + `ov.alignment.count`)
   - Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.
   - `ref()` builds a kallisto index and transcript-to-gene mapping.
   - `count()` quantifies single-cell data with barcode/UMI handling.
   - Supports technologies: 10XV2, 10XV3, BULK, and custom.
   - Output formats: h5ad, loom, cellranger MTX.
   ```python
   # Build reference index
   ref_result = ov.alignment.ref(
       index_path='kb_ref/index.idx',
       t2g_path='kb_ref/t2g.txt',
       fasta_paths=['genome.fa'],
       gtf_paths=['genes.gtf'],
       threads=8,
   )

   # Quantify 10x v3 data
   count_result = ov.alignment.count(
       index_path='kb_ref/index.idx',
       t2g_path='kb_ref/t2g.txt',
       technology='10XV3',
       fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'],
       output_path='kb_out',
       h5ad=True,
       filter_barcodes=True,
       threads=8,
   )
   ```

7. **Wiring fastp output into STAR input**
   - fastp output is a list of dicts with keys: `sample`, `clean1`, `clean2`, `json`, `html`.
   - Convert to STAR sample tuples:
   ```python
   star_samples = [
       (r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None)
       for r in (clean if isinstance(clean, list) else [clean])
   ]
   ```

8. **Wiring STAR output into featureCount input**
   - STAR output is a list of dicts with keys: `sample`, `bam` (or `error`).
   - Convert to featureCount items:
   ```python
   bam_items = [
       (r['sample'], r['bam'])
       for r in (bams if isinstance(bams, list) else [bams])
       if 'bam' in r
   ]
   ```

9. **Skipping completed steps**
   - All functions check for existing outputs and skip if `overwrite=False` (default).
   - Set `overwrite=True` to force re-execution.

10. **Troubleshooting**
    - If a tool is not found, check `auto_install=True` and that conda/mamba is accessible.
    - For STAR index errors, ensure `genome_fasta_files` points to uncompressed or gzip FASTA files.
    - For featureCounts paired-end detection errors, `auto_fix=True` handles most cases automatically.
    - GTF files can be gzip-compressed; they are auto-decompressed as needed.

## Critical API Reference

### Sample Format Convention

All alignment functions use a consistent sample tuple format:
- **FASTQ samples**: `(sample_name, fq1_path, fq2_path_or_None)`
- **BAM items**: `(sample_name, bam_path)` or `(sample_name, bam_path, is_paired_bool)`
- Single samples can be passed as a single tuple; multiple as a list of tuples.
- When a single tuple is passed, the return value is a single dict; for a list, a list of dicts.

### Auto-installation

```python
# All functions support these parameters:
auto_install=True   # Auto-install missing tools via conda/mamba
overwrite=False     # Skip if outputs already exist
threads=8           # Per-tool thread count
jobs=None           # Concurrent job count (auto-detected from CPU count)
```

## Examples

- **Bulk RNA-seq from SRA**: `prefetch` -> `fqdump` -> `fastp` -> `STAR` -> `featureCount` -> pandas DataFrame
- **Single-cell 10x v3**: `ref` -> `count` with `technology='10XV3'` -> h5ad AnnData
- **Local FASTQ files**: Skip download steps, start directly with `fastp` -> `STAR` -> `featureCount`

## References

- See [reference.md](reference.md) for copy-paste-ready code templates.

Overview

This skill guides users through OmicVerse's alignment module to run end-to-end FASTQ-to-count workflows for bulk and single-cell RNA-seq. It covers SRA acquisition, FASTQ quality control, STAR alignment, gene quantification with featureCounts, and single-cell kallisto|bustools via kb-python. The skill emphasizes reproducible, parallel execution with automatic tool resolution and optional auto-installation.

How this skill works

The module exposes functions that wrap standard bioinformatics tools (prefetch, fasterq-dump, fastp, STAR, samtools, featureCounts, kb-python) and share a common CLI utility for resolving and auto-installing binaries. Workflows are composable: download SRA to FASTQ, run fastp for trimming/QC, align with STAR to produce coordinate-sorted BAMs, and quantify genes with featureCounts. For single-cell data, kb-python is used to build references and generate count matrices (h5ad/loom/MTX). Each function detects existing outputs and supports parallel jobs, retries, and streaming logs.

When to use it

  • Download and validate SRA datasets and convert to FASTQ before analysis.
  • Trim, filter, and generate per-sample QC reports with fastp.
  • Run STAR alignment with automatic genome index building for bulk RNA-seq.
  • Produce gene-level count matrices from BAMs using featureCounts for downstream differential analysis.
  • Build kallisto|bustools references and quantify single-cell 10x or custom scRNA-seq libraries.

Best practices

  • Run prefetch before fqdump for reliable SRA downloads and integrity checks.
  • Keep auto_install=True in new environments to let the module resolve missing tools via conda/mamba.
  • Use fastp outputs directly as STAR inputs by converting lists of result dicts to sample tuples.
  • Enable auto_index=True for STAR to avoid manual index steps; provide genome fasta and GTF paths.
  • Set overwrite=False to skip completed steps; use overwrite=True only to force re-runs when required.

Example use cases

  • Bulk RNA-seq from SRA: prefetch -> fqdump -> fastp -> STAR -> featureCount -> pandas DataFrame counts.
  • Local paired-end FASTQs: skip download, run fastp -> STAR -> featureCount for a lightweight local pipeline.
  • Single-cell 10x v3: build kb-python reference with ref(), then count() to produce h5ad AnnData for downstream analysis.
  • Parallel SRA processing: use fqdump and parallel_fastq_dump with jobs and threads to accelerate downloads and conversions.

FAQ

Do tools need to be preinstalled?

No. The module resolves tools from PATH or the active conda env and will auto-install missing tools via mamba/conda when auto_install=True.

How does paired-end detection work for featureCounts?

Paired-end status is auto-detected from BAM headers; if an error occurs, auto_fix=True triggers retries with corrected flags.