
hisat2-alignment skill

/read-alignment/hisat2-alignment

This skill helps align RNA-seq reads using HISAT2, a memory-efficient, splice-aware aligner suited to gene expression workflows.

npx playbooks add skill gptomics/bioskills --skill hisat2-alignment

SKILL.md
---
name: bio-read-alignment-hisat2-alignment
description: Align RNA-seq reads with HISAT2, a memory-efficient splice-aware aligner. Use when STAR's memory requirements are too high or for general RNA-seq alignment.
tool_type: cli
primary_tool: HISAT2
---

## Version Compatibility

Reference examples tested with: samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
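
The CLI version check above can be sketched as a small shell helper. This is a minimal sketch, not part of HISAT2 or samtools: `check_tool` is a hypothetical name, and it assumes the tool accepts `--version`, as the check above already does.

```bash
# check_tool (hypothetical helper): confirm a CLI is on PATH and
# print the first line of its version output.
check_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "missing: $1" >&2
        return 1
    fi
    "$1" --version 2>&1 | head -n 1
}

# Example usage (run once the tools are installed):
# check_tool hisat2 && check_tool samtools
```

Running the check before a long alignment job catches a missing tool early rather than partway through a pipeline.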

# HISAT2 RNA-seq Alignment

**"Align RNA-seq reads with HISAT2"** → Map RNA-seq reads to a reference genome with splice-aware alignment. Suitable for gene expression quantification workflows.
- CLI: `hisat2 -x index -1 R1.fq -2 R2.fq | samtools sort -o aligned.bam`

## Build Index

```bash
# Basic index (no annotation)
hisat2-build -p 8 reference.fa hisat2_index

# Index with splice sites and exons (recommended)
hisat2_extract_splice_sites.py annotation.gtf > splice_sites.txt
hisat2_extract_exons.py annotation.gtf > exons.txt

hisat2-build -p 8 \
    --ss splice_sites.txt \
    --exon exons.txt \
    reference.fa hisat2_index
```
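
Index building is slow for large genomes, so it can help to skip it when an index already exists. The sketch below assumes the usual `hisat2-build` output naming (`<basename>.1.ht2` through `<basename>.8.ht2`, or `.ht2l` for large indexes); `index_exists` is a hypothetical helper name.

```bash
# index_exists (hypothetical helper): succeed if a HISAT2 index with
# this basename appears to be present, judged by its first index file.
index_exists() {
    [ -e "$1.1.ht2" ] || [ -e "$1.1.ht2l" ]
}

# Example usage: build only when the index is missing.
# index_exists hisat2_index || hisat2-build -p 8 reference.fa hisat2_index
```

The check looks only at the first index file, so it will not detect a partially written index from an interrupted build.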

## Basic Alignment

```bash
# Paired-end reads
hisat2 -p 8 -x hisat2_index \
    -1 reads_1.fq.gz -2 reads_2.fq.gz \
    -S aligned.sam

# Single-end reads
hisat2 -p 8 -x hisat2_index \
    -U reads.fq.gz \
    -S aligned.sam
```

## Direct to Sorted BAM

```bash
# Pipe to samtools
hisat2 -p 8 -x hisat2_index \
    -1 r1.fq.gz -2 r2.fq.gz | \
    samtools sort -@ 4 -o aligned.sorted.bam -

samtools index aligned.sorted.bam
```

## Stranded Libraries

```bash
# Forward stranded (e.g., Ligation)
hisat2 -p 8 -x hisat2_index \
    --rna-strandness FR \
    -1 r1.fq.gz -2 r2.fq.gz -S aligned.sam

# Reverse stranded (e.g., dUTP, TruSeq - most common)
hisat2 -p 8 -x hisat2_index \
    --rna-strandness RF \
    -1 r1.fq.gz -2 r2.fq.gz -S aligned.sam

# Single-end stranded (use R instead of F for reverse-stranded libraries)
hisat2 -p 8 -x hisat2_index \
    --rna-strandness F \
    -U reads.fq.gz -S aligned.sam
```
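
When a pipeline handles several library types, the choice of `--rna-strandness` value can be kept in one place. The helper below is a minimal sketch of the mapping shown above; `strandness_flag` and its argument names (`forward`, `reverse`, `unstranded`, `paired`, `single`) are hypothetical.

```bash
# strandness_flag (hypothetical helper): map a library description to
# a --rna-strandness value.
#   $1 = forward | reverse | unstranded
#   $2 = paired | single
strandness_flag() {
    case "$1:$2" in
        forward:paired) echo "FR" ;;
        reverse:paired) echo "RF" ;;
        forward:single) echo "F" ;;
        reverse:single) echo "R" ;;
        unstranded:*)   echo "" ;;   # unstranded: leave the flag off
        *) echo "unknown library type: $1 $2" >&2; return 1 ;;
    esac
}

# Example usage: the flag is added only when the value is non-empty.
# strand=$(strandness_flag reverse paired)
# hisat2 -p 8 -x hisat2_index ${strand:+--rna-strandness "$strand"} \
#     -1 r1.fq.gz -2 r2.fq.gz -S aligned.sam
```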

## Novel Splice Junction Discovery

```bash
# Output novel splice junctions
hisat2 -p 8 -x hisat2_index \
    --novel-splicesite-outfile novel_splices.txt \
    -1 r1.fq.gz -2 r2.fq.gz -S aligned.sam

# Use known + novel junctions for subsequent alignments
hisat2 -p 8 -x hisat2_index \
    --novel-splicesite-infile novel_splices.txt \
    -1 r1.fq.gz -2 r2.fq.gz -S aligned.sam
```

## Two-Pass Alignment (Manual)

**Goal:** Improve splice junction sensitivity by discovering novel junctions across all samples in a first pass, then realigning with the combined junction set.

**Approach:** Run HISAT2 on each sample to extract novel splice sites, merge and deduplicate junctions across samples, then realign all samples using the combined junction catalog.

```bash
# Pass 1: Discover junctions from all samples
for r1 in *_R1.fq.gz; do
    r2=${r1/_R1/_R2}
    base=$(basename "$r1" _R1.fq.gz)
    hisat2 -p 8 -x hisat2_index \
        --novel-splicesite-outfile "${base}_splices.txt" \
        -1 "$r1" -2 "$r2" -S /dev/null
done

# Combine and deduplicate junctions
# (the output name must not match the *_splices.txt glob, or a rerun would read it back in)
sort -u *_splices.txt > combined_junctions.txt

# Pass 2: Realign with all junctions
for r1 in *_R1.fq.gz; do
    r2=${r1/_R1/_R2}
    base=$(basename "$r1" _R1.fq.gz)
    hisat2 -p 8 -x hisat2_index \
        --novel-splicesite-infile combined_junctions.txt \
        -1 "$r1" -2 "$r2" | \
        samtools sort -@ 4 -o "${base}.sorted.bam" -
done
```
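
The merge step between the two passes can also be wrapped in a small function that takes its inputs explicitly and reports how many unique junctions were kept. This is a minimal sketch: `merge_junctions` is a hypothetical helper, and it only removes exact duplicate lines, as `sort -u` does in the loop above.

```bash
# merge_junctions (hypothetical helper): write the unique lines from the
# given per-sample junction files to $1 and print how many were kept.
# Usage: merge_junctions OUT IN1 [IN2 ...]
merge_junctions() {
    merged=$1
    shift
    # LC_ALL=C gives a locale-independent byte order; -u drops exact duplicates.
    LC_ALL=C sort -u "$@" > "$merged"
    wc -l < "$merged" | tr -d ' '
}

# Example usage:
# merge_junctions junctions.merged.txt *_splices.txt
```

One thing to watch with any merge of this kind: an output file whose name also matches the input glob will be read back in on a rerun, so the example writes to a name that `*_splices.txt` does not match.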

## Read Group Information

```bash
hisat2 -p 8 -x hisat2_index \
    --rg-id sample1 \
    --rg SM:sample1 \
    --rg PL:ILLUMINA \
    --rg LB:lib1 \
    -1 r1.fq.gz -2 r2.fq.gz -S aligned.sam
```

## Downstream Quantification

```bash
# Output name-sorted BAM for htseq-count
hisat2 -p 8 -x hisat2_index -1 r1.fq.gz -2 r2.fq.gz | \
    samtools sort -n -@ 4 -o aligned.namesorted.bam -

# Or coordinate-sorted for featureCounts
hisat2 -p 8 -x hisat2_index -1 r1.fq.gz -2 r2.fq.gz | \
    samtools sort -@ 4 -o aligned.sorted.bam -
```

## Key Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| -p | 1 | Number of threads |
| -x | - | Index basename |
| --rna-strandness | unstranded | Library strandness: FR/RF (paired-end) or F/R (single-end) |
| --dta | off | Report alignments tailored for transcript assemblers such as StringTie |
| --dta-cufflinks | off | Report alignments tailored for Cufflinks |
| --min-intronlen | 20 | Minimum intron length |
| --max-intronlen | 500000 | Maximum intron length |
| -k | 5 | Maximum number of alignments reported per read |

## For StringTie/Cufflinks

```bash
# Use --dta for StringTie (use --dta-cufflinks instead for Cufflinks)
hisat2 -p 8 -x hisat2_index \
    --dta \
    -1 r1.fq.gz -2 r2.fq.gz | \
    samtools sort -@ 4 -o aligned.sorted.bam -
```

## Alignment Summary

```bash
# HISAT2 prints summary to stderr
hisat2 -p 8 -x hisat2_index -1 r1.fq.gz -2 r2.fq.gz -S aligned.sam 2> summary.txt
```

Example (abridged):
```
50000000 reads; of these:
  50000000 (100.00%) were paired; of these:
    2500000 (5.00%) aligned concordantly 0 times
    45000000 (90.00%) aligned concordantly exactly 1 time
    2500000 (5.00%) aligned concordantly >1 times
95.00% overall alignment rate
```
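
Once the summary is saved to a file, the overall alignment rate can be pulled out for QC. The sketch below assumes the summary ends with a line of the form `95.00% overall alignment rate`, as in the example above; `overall_rate` and `check_rate` are hypothetical helpers, and any threshold passed to `check_rate` is the user's choice, not a HISAT2 recommendation.

```bash
# overall_rate (hypothetical helper): print the overall alignment rate,
# without the % sign, from a saved HISAT2 summary file.
overall_rate() {
    awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }' "$1"
}

# check_rate (hypothetical helper): succeed if the rate in summary file $1
# is at least the minimum percentage given in $2.
check_rate() {
    rate=$(overall_rate "$1")
    awk -v r="$rate" -v min="$2" 'BEGIN { exit !(r + 0 >= min + 0) }'
}

# Example usage:
# check_rate summary.txt 80 || echo "low alignment rate: $(overall_rate summary.txt)%" >&2
```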

## Memory Comparison

| Aligner | Human Genome Memory |
|---------|-------------------|
| STAR | ~30GB |
| HISAT2 | ~8GB |

## Related Skills

- read-alignment/star-alignment - Alternative with more features
- rna-quantification/featurecounts-counting - Count aligned reads
- rna-quantification/alignment-free-quant - Skip alignment entirely
- differential-expression/deseq2-basics - Downstream DE analysis

Overview

This skill aligns RNA-seq reads with HISAT2, a memory-efficient splice-aware aligner suited to gene expression workflows. It provides command patterns for indexing (with optional splice/exon annotation), single- and paired-end alignment, piping to samtools for sorted BAM output, and strategies for stranded libraries and two-pass junction discovery. Use this when STAR’s memory needs are prohibitive or when you need a compact, junction-aware aligner.

How this skill works

The skill shows how to build a HISAT2 index from a reference FASTA, optionally incorporating splice site and exon lists extracted from a GTF to improve alignment of spliced reads. It demonstrates direct alignment commands for paired and single reads, streaming output into samtools to produce sorted and indexed BAMs, and flags for stranded libraries, novel splice discovery, and read-group tagging. It also describes a manual two-pass workflow: discover novel junctions across samples, merge them, and realign using the combined junction set to boost splice sensitivity.

When to use it

  • When STAR requires more memory than is available (about 30 GB for a human genome, versus about 8 GB for HISAT2).
  • For standard RNA-seq mapping where splice-aware alignment is needed.
  • When you want to discover novel splice junctions and reuse them across samples.
  • When preparing BAMs for downstream tools like featureCounts, StringTie, or htseq-count.
  • For workflows that require read-group information or specific strandedness handling.

Best practices

  • Build the index with splice_sites and exons from a trusted GTF to improve alignments.
  • Pipe HISAT2 output directly to samtools sort to avoid large intermediate SAM files.
  • Run a two-pass approach across samples to increase novel junction sensitivity when studying unannotated splicing.
  • Include appropriate --rna-strandness and read-group (--rg) tags for accurate quantification and metadata.
  • Check tool versions (hisat2, samtools) and adapt CLI flags if versions differ; capture HISAT2 stderr for the alignment summary.

Example use cases

  • Align paired-end human RNA-seq on a machine with ~8 GB RAM using a HISAT2 index built with GTF-derived splice sites.
  • Produce coordinate-sorted BAMs for featureCounts or name-sorted BAMs for htseq-count by piping to samtools with the appropriate sort mode.
  • Run a manual two-pass pipeline: discover novel splice junctions in pass 1, merge them, and realign all samples in pass 2.
  • Detect novel splicing events with --novel-splicesite-outfile and feed the combined catalog back with --novel-splicesite-infile.
  • Prepare alignments for transcript assembly by enabling --dta when planning to run StringTie downstream, or --dta-cufflinks for Cufflinks.

FAQ

How much memory does HISAT2 need compared to STAR?

HISAT2 typically requires around 8 GB for a human genome index, whereas STAR usually needs around 30 GB.

Should I always build the index with splice and exon files?

It is recommended when you have a reliable annotation: adding splice_sites and exons improves spliced alignment sensitivity and accuracy.