home / skills / gptomics / bioskills / star-alignment

star-alignment skill

/read-alignment/star-alignment

This skill helps align RNA-seq reads with STAR in two-pass mode for improved splice-aware accuracy and downstream fusion detection.

npx playbooks add skill gptomics/bioskills --skill star-alignment

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
5.4 KB
---
name: bio-read-alignment-star-alignment
description: Align RNA-seq reads with STAR (Spliced Transcripts Alignment to a Reference). Supports two-pass mode for novel splice junction discovery. Use when aligning RNA-seq data requiring splice-aware alignment.
tool_type: cli
primary_tool: STAR
---

## Version Compatibility

Reference examples tested with: STAR 2.7.11+, Subread 2.0+, fastp 0.23+, kallisto 0.50+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# STAR RNA-seq Alignment

**"Align RNA-seq reads with STAR"** → Map RNA-seq reads to a reference genome with fast, sensitive splice-aware alignment. Preferred for large datasets and downstream fusion/chimeric read detection.
- CLI: `STAR --runMode alignReads --genomeDir index/ --readFilesIn R1.fq R2.fq --outSAMtype BAM SortedByCoordinate`

## Generate Genome Index

```bash
# Basic index generation
STAR --runMode genomeGenerate \
    --runThreadN 8 \
    --genomeDir star_index/ \
    --genomeFastaFiles reference.fa \
    --sjdbGTFfile annotation.gtf \
    --sjdbOverhang 100    # Read length - 1
```

## Index with Specific Read Length

```bash
# For 150bp reads, use sjdbOverhang=149
STAR --runMode genomeGenerate \
    --runThreadN 8 \
    --genomeDir star_index_150/ \
    --genomeFastaFiles reference.fa \
    --sjdbGTFfile annotation.gtf \
    --sjdbOverhang 149
```

## Basic Alignment

```bash
# Paired-end alignment
STAR --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn reads_1.fq.gz reads_2.fq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate
```

## Single-End Alignment

```bash
STAR --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn reads.fq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate
```

## Two-Pass Mode

```bash
# Two-pass mode for better novel junction detection
STAR --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn r1.fq.gz r2.fq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate \
    --twopassMode Basic
```

## Quantification Mode

```bash
# Output gene counts (like featureCounts)
STAR --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn r1.fq.gz r2.fq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode GeneCounts
```

Output: `sample_ReadsPerGene.out.tab` with columns:
1. Gene ID
2. Unstranded counts
3. Forward strand counts
4. Reverse strand counts

## ENCODE Options

```bash
# ENCODE recommended settings
STAR --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn r1.fq.gz r2.fq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMunmapped Within \
    --outSAMattributes NH HI AS NM MD \
    --outFilterType BySJout \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 999 \
    --outFilterMismatchNoverReadLmax 0.04 \
    --alignIntronMin 20 \
    --alignIntronMax 1000000 \
    --alignMatesGapMax 1000000 \
    --alignSJoverhangMin 8 \
    --alignSJDBoverhangMin 1
```

## Fusion Detection

```bash
# For chimeric/fusion detection
STAR --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn r1.fq.gz r2.fq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate \
    --chimSegmentMin 12 \
    --chimJunctionOverhangMin 8 \
    --chimOutType Junctions WithinBAM SoftClip \
    --chimMainSegmentMultNmax 1
```

## Output Files

| File | Description |
|------|-------------|
| *Aligned.sortedByCoord.out.bam | Sorted BAM file |
| *Log.final.out | Alignment summary statistics |
| *Log.out | Detailed log |
| *SJ.out.tab | Splice junctions |
| *ReadsPerGene.out.tab | Gene counts (if --quantMode) |
| *Chimeric.out.junction | Fusion candidates (if chimeric) |

## Memory Requirements

```bash
# Reduce memory for limited systems
STAR --genomeLoad NoSharedMemory \
    --limitBAMsortRAM 10000000000 \  # 10GB for sorting
    ...

# For very large genomes, limit during index generation
STAR --runMode genomeGenerate \
    --limitGenomeGenerateRAM 31000000000 \  # 31GB
    ...
```

## Shared Memory Mode

```bash
# Load genome into shared memory (for multiple samples)
STAR --genomeLoad LoadAndExit --genomeDir star_index/

# Run alignments (faster startup)
STAR --genomeLoad LoadAndKeep --genomeDir star_index/ ...

# Remove from memory when done
STAR --genomeLoad Remove --genomeDir star_index/
```

## Key Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| --runThreadN | 1 | Number of threads |
| --sjdbOverhang | 100 | Read length - 1 |
| --outFilterMultimapNmax | 10 | Max multi-mapping |
| --alignIntronMax | 0 | Max intron size |
| --outFilterMismatchNmax | 10 | Max mismatches |
| --outSAMtype | SAM | Output format |
| --quantMode | - | GeneCounts for counting |
| --twopassMode | None | Basic for two-pass |

## Related Skills

- rna-quantification/featurecounts-counting - Alternative counting
- rna-quantification/alignment-free-quant - Salmon/kallisto alternative
- differential-expression/deseq2-basics - Downstream DE analysis
- read-qc/fastp-workflow - Preprocess reads

Overview

This skill aligns RNA-seq reads to a reference genome using STAR (Spliced Transcripts Alignment to a Reference). It supports single- and paired-end data, two-pass mode for novel splice junction discovery, and options for quantification and fusion detection. The skill is optimized for large datasets and downstream analyses that require splice-aware mapping.

How this skill works

The skill generates or uses an existing STAR genome index tuned to read length (sjdbOverhang) and runs STAR with thread and memory controls. It can perform single-pass or two-pass alignments, produce sorted BAM output, splice junction tables, gene counts, and chimeric junction reports. It exposes ENCODE-recommended flags, shared-memory loading for multi-sample runs, and parameters to reduce memory footprint on constrained systems.

When to use it

  • Mapping RNA-seq reads that require splice-aware alignment
  • Detecting novel splice junctions (use two-pass mode)
  • Preparing BAMs for fusion/chimeric read detection
  • Generating gene-level counts directly from alignments
  • Scaling alignment across many samples using shared memory

Best practices

  • Build genome index with sjdbOverhang = read_length - 1 to improve junction mapping
  • Use two-pass mode for experiments expected to contain novel splice sites
  • Enable --quantMode GeneCounts when you want quick gene-level counts alongside alignments
  • Tune --runThreadN and --limitBAMsortRAM to match available CPU and memory
  • Use shared memory (--genomeLoad) when aligning many samples on the same node to reduce startup time

Example use cases

  • Paired-end RNA-seq alignment with sorted BAM output for differential expression pipelines
  • Two-pass alignment to improve discovery of novel splice junctions in tumor samples
  • Run with chimeric options to detect candidate gene fusions for cancer fusion discovery
  • Generate ReadsPerGene.out.tab during alignment for a fast QC of gene-level counts
  • Load genome into shared memory and run many sample alignments sequentially to save time and RAM

FAQ

What read length should I set for sjdbOverhang?

Set sjdbOverhang equal to read_length - 1 (for 150 bp reads use 149).

How do I reduce memory usage on a small server?

Use --genomeLoad NoSharedMemory, lower --limitBAMsortRAM, and set --limitGenomeGenerateRAM during index generation to fit available RAM.