home / skills / starlitnightly / omicverse / fastq-analysis
This skill guides end-to-end FASTQ-to-count analysis in OmicVerse, automating download, QC, alignment, quantification, and single-cell workflows.
npx playbooks add skill starlitnightly/omicverse --skill fastq-analysisReview the files below or copy the command above to add this skill to your agents.
---
name: fastq-analysis-pipeline
title: FASTQ analysis and RNA-seq alignment with omicverse
description: Guide through omicverse's alignment module for SRA downloading, FASTQ quality control, STAR alignment, gene quantification, and single-cell kallisto/bustools pipelines covering both bulk and single-cell RNA-seq workflows.
---
## Overview
OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the `ov.alignment` module. This skill covers:
- **SRA data acquisition**: `prefetch` and `fqdump` (fasterq-dump wrapper)
- **Quality control**: `fastp` for adapter trimming and QC reports
- **RNA-seq alignment**: `STAR` aligner with auto-index building
- **Gene quantification**: `featureCount` (subread featureCounts wrapper)
- **Single-cell path**: `ref` and `count` via kb-python (kallisto/bustools)
- **Parallel SRA download**: `parallel_fastq_dump`
All functions share a common CLI infrastructure (`_cli_utils.py`) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.
## Instructions
1. **Environment setup**
- Bioinformatics tools are resolved automatically from PATH or the active conda environment.
- If `auto_install=True` (default), missing tools are installed via mamba/conda on demand.
- Supported tools: `prefetch`, `vdb-validate`, `fasterq-dump`, `fastp`, `STAR`, `samtools`, `featureCounts`, `pigz`, `gzip`.
- For the single-cell path, ensure `kb-python` is installed: `pip install kb-python`.
2. **SRA data download** (`ov.alignment.prefetch` + `ov.alignment.fqdump`)
- Use `prefetch` first for reliable downloads with integrity validation (`vdb-validate`).
- Then convert to FASTQ with `fqdump`. It auto-detects single-end vs paired-end.
- `fqdump` can also work directly from SRR accessions without prefetch.
- Both support retry with exponential backoff for network errors.
```python
import omicverse as ov
# Step 1: Prefetch SRA files (optional but recommended)
pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4)
# Step 2: Convert to FASTQ
fq = ov.alignment.fqdump(['SRR1234567', 'SRR1234568'],
output_dir='fastq', sra_dir='prefetch',
gzip=True, threads=8, jobs=4)
```
3. **FASTQ quality control** (`ov.alignment.fastp`)
- Runs fastp for adapter trimming, quality filtering, and QC reporting.
- Supports single-end and paired-end reads.
- Produces per-sample JSON and HTML QC reports.
- Sample format: tuple of `(sample_name, fq1_path, fq2_path_or_None)`.
```python
samples = [
('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'),
('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'),
]
clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2)
```
4. **STAR alignment** (`ov.alignment.STAR`)
- Aligns FASTQ reads using the STAR aligner.
- **Auto-index building**: set `auto_index=True` (default) with `genome_fasta_files` and `gtf` to build index automatically if missing.
- Produces coordinate-sorted BAM files.
- Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).
- Use `strict=False` (default) for graceful error handling per sample.
```python
# Prepare samples from fastp output
star_samples = [
('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'),
('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'),
]
bams = ov.alignment.STAR(
star_samples,
genome_dir='star_index',
output_dir='star_out',
gtf='genes.gtf',
genome_fasta_files=['genome.fa'],
threads=8,
memory='50G',
)
```
5. **Gene quantification** (`ov.alignment.featureCount`)
- Counts aligned reads per gene using featureCounts (subread).
- Auto-detects paired-end from BAM headers (via pysam or samtools).
- `auto_fix=True` (default) retries with corrected paired-end flag on error.
- `gene_mapping=True` maps gene_id to gene_name from the GTF.
- `merge_matrix=True` produces a combined count matrix across all samples.
```python
bam_items = [
('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'),
('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'),
]
counts = ov.alignment.featureCount(
bam_items,
gtf='genes.gtf',
output_dir='counts',
gene_mapping=True,
merge_matrix=True,
threads=8,
)
# counts is a pandas DataFrame (gene_id x samples)
```
6. **Single-cell path** (`ov.alignment.ref` + `ov.alignment.count`)
- Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.
- `ref()` builds a kallisto index and transcript-to-gene mapping.
- `count()` quantifies single-cell data with barcode/UMI handling.
- Supports technologies: 10XV2, 10XV3, BULK, and custom.
- Output formats: h5ad, loom, cellranger MTX.
```python
# Build reference index
ref_result = ov.alignment.ref(
index_path='kb_ref/index.idx',
t2g_path='kb_ref/t2g.txt',
fasta_paths=['genome.fa'],
gtf_paths=['genes.gtf'],
threads=8,
)
# Quantify 10x v3 data
count_result = ov.alignment.count(
index_path='kb_ref/index.idx',
t2g_path='kb_ref/t2g.txt',
technology='10XV3',
fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'],
output_path='kb_out',
h5ad=True,
filter_barcodes=True,
threads=8,
)
```
7. **Wiring fastp output into STAR input**
- fastp output is a list of dicts with keys: `sample`, `clean1`, `clean2`, `json`, `html`.
- Convert to STAR sample tuples:
```python
star_samples = [
(r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None)
for r in (clean if isinstance(clean, list) else [clean])
]
```
8. **Wiring STAR output into featureCount input**
- STAR output is a list of dicts with keys: `sample`, `bam` (or `error`).
- Convert to featureCount items:
```python
bam_items = [
(r['sample'], r['bam'])
for r in (bams if isinstance(bams, list) else [bams])
if 'bam' in r
]
```
9. **Skipping completed steps**
- All functions check for existing outputs and skip if `overwrite=False` (default).
- Set `overwrite=True` to force re-execution.
10. **Troubleshooting**
- If a tool is not found, check `auto_install=True` and that conda/mamba is accessible.
- For STAR index errors, ensure `genome_fasta_files` points to uncompressed or gzip FASTA files.
- For featureCounts paired-end detection errors, `auto_fix=True` handles most cases automatically.
- GTF files can be gzip-compressed; they are auto-decompressed as needed.
## Critical API Reference
### Sample Format Convention
All alignment functions use a consistent sample tuple format:
- **FASTQ samples**: `(sample_name, fq1_path, fq2_path_or_None)`
- **BAM items**: `(sample_name, bam_path)` or `(sample_name, bam_path, is_paired_bool)`
- Single samples can be passed as a single tuple; multiple as a list of tuples.
- When a single tuple is passed, the return value is a single dict; for a list, a list of dicts.
### Auto-installation
```python
# All functions support these parameters:
auto_install=True # Auto-install missing tools via conda/mamba
overwrite=False # Skip if outputs already exist
threads=8 # Per-tool thread count
jobs=None # Concurrent job count (auto-detected from CPU count)
```
## Examples
- **Bulk RNA-seq from SRA**: `prefetch` -> `fqdump` -> `fastp` -> `STAR` -> `featureCount` -> pandas DataFrame
- **Single-cell 10x v3**: `ref` -> `count` with `technology='10XV3'` -> h5ad AnnData
- **Local FASTQ files**: Skip download steps, start directly with `fastp` -> `STAR` -> `featureCount`
## References
- See [reference.md](reference.md) for copy-paste-ready code templates.
This skill guides users through OmicVerse's alignment module to run end-to-end FASTQ-to-count workflows for bulk and single-cell RNA-seq. It covers SRA acquisition, FASTQ quality control, STAR alignment, gene quantification with featureCounts, and single-cell kallisto|bustools via kb-python. The skill emphasizes reproducible, parallel execution with automatic tool resolution and optional auto-installation.
The module exposes functions that wrap standard bioinformatics tools (prefetch, fasterq-dump, fastp, STAR, samtools, featureCounts, kb-python) and share a common CLI utility for resolving and auto-installing binaries. Workflows are composable: download SRA to FASTQ, run fastp for trimming/QC, align with STAR to produce coordinate-sorted BAMs, and quantify genes with featureCounts. For single-cell data, kb-python is used to build references and generate count matrices (h5ad/loom/MTX). Each function detects existing outputs and supports parallel jobs, retries, and streaming logs.
Do tools need to be preinstalled?
No. The module resolves tools from PATH or the active conda env and will auto-install missing tools via mamba/conda when auto_install=True.
How does paired-end detection work for featureCounts?
Paired-end status is auto-detected from BAM headers; if an error occurs, auto_fix=True triggers retries with corrected flags.