alignment-indexing skill

This skill helps you enable random access to BAM/CRAM files by creating and using BAI/CSI indices with pysam and samtools.

npx playbooks add skill gptomics/bioskills --skill alignment-indexing

---
name: bio-alignment-indexing
description: Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
tool_type: cli
primary_tool: samtools
---

## Version Compatibility

Reference examples tested with: pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
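
The version check described above could be scripted roughly as follows. This is a minimal standard-library sketch, not part of the skill itself: the helper names (`version_tuple`, `meets_minimum`, `installed_pysam_version`, `installed_samtools_version`) are illustrative, and the minimums in the usage note are simply the versions listed as tested.

```python
import re
import shutil
import subprocess
from importlib import metadata

def version_tuple(text):
    """Extract the first dotted version number from text as a tuple of ints."""
    match = re.search(r'\d+(?:\.\d+)+', text)
    if match is None:
        raise ValueError(f'no version number found in {text!r}')
    return tuple(int(part) for part in match.group().split('.'))

def meets_minimum(found, minimum):
    """Compare two dotted version strings, e.g. '1.19.2' against '1.19'."""
    return version_tuple(found) >= version_tuple(minimum)

def installed_pysam_version():
    """Return the installed pysam version string, or None if it is absent."""
    try:
        return metadata.version('pysam')
    except metadata.PackageNotFoundError:
        return None

def installed_samtools_version():
    """Return the first line of `samtools --version`, or None if not on PATH."""
    if shutil.which('samtools') is None:
        return None
    result = subprocess.run(['samtools', '--version'],
                            capture_output=True, text=True, check=False)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None
```

For example, `meets_minimum(installed_pysam_version() or '0.0', '0.22')` reports whether the installed pysam is at least the tested version.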

# Alignment Indexing

Create indices for random access to alignment files using samtools and pysam.

**"Index a BAM file"** → Create a .bai/.csi index enabling random access to genomic regions.
- CLI: `samtools index file.bam`
- Python: `pysam.index('file.bam')`

## Index Types

| Index | Extension | Use Case |
|-------|-----------|----------|
| BAI | `.bai` | Standard BAM index; contigs up to 512 Mbp (2^29 bp) |
| CSI | `.csi` | Contigs longer than 512 Mbp; adjustable minimum interval size (`-m`) |
| CRAI | `.crai` | CRAM index |
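
The choice between BAI and CSI comes down to one comparison against 2^29 bp. The sketch below is illustrative only; `contigs_needing_csi` and `BAI_MAX_CONTIG_LENGTH` are hypothetical names, and the input is assumed to be a plain mapping of contig name to length.

```python
BAI_MAX_CONTIG_LENGTH = 2 ** 29  # 536,870,912 bp, i.e. 512 Mbp

def contigs_needing_csi(contig_lengths):
    """Return the names of contigs that are too long for a BAI index."""
    return [name for name, length in contig_lengths.items()
            if length > BAI_MAX_CONTIG_LENGTH]

# Hypothetical lengths: one human-sized contig and one oversized contig
lengths = {'chr1': 248_956_422, 'big_contig': 600_000_000}
print(contigs_needing_csi(lengths))
```

With pysam, such a mapping can be built from an open file as `dict(zip(bam.references, bam.lengths))`. If the returned list is non-empty, index with `samtools index -c`.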

## samtools index

### Create BAI Index
```bash
samtools index input.bam
# Creates input.bam.bai
```

### Create CSI Index
```bash
samtools index -c input.bam
# Creates input.bam.csi
```

### Specify Output Name
```bash
samtools index input.bam output.bai
```

### Multi-threaded Indexing
```bash
samtools index -@ 4 input.bam
```

### Index CRAM
```bash
samtools index input.cram
# Creates input.cram.crai
```

## Index Requirements

Indexing requires coordinate-sorted files:
```bash
# Check sort order
samtools view -H input.bam | grep "^@HD"
# Should show SO:coordinate

# Sort if needed, then index
samtools sort -o sorted.bam input.bam
samtools index sorted.bam
```
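
The same sort-order check can be done in Python on the header text. This is a sketch under the assumption that the header is available as a string (for instance the output of `samtools view -H`); `header_sort_order` is an illustrative helper, not a pysam function.

```python
def header_sort_order(header_text):
    """Return the SO value from the @HD line of a SAM header, or None."""
    for line in header_text.splitlines():
        if line.startswith('@HD'):
            for field in line.split('\t')[1:]:
                tag, _, value = field.partition(':')
                if tag == 'SO':
                    return value
            return None  # @HD line present but without an SO tag
    return None  # no @HD line at all

# Toy header for illustration
header = '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n'
print(header_sort_order(header))  # prints: coordinate
```

Any value other than `coordinate` (including `None`) means the file should be sorted before indexing.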

## Using Indices for Region Access

**Goal:** Extract reads overlapping specific genomic coordinates from an indexed BAM.

**Approach:** With the index present, `samtools view` or pysam's `AlignmentFile.fetch()` can seek directly to the file offsets that hold the requested region instead of scanning the entire file.

### samtools view with Region
```bash
# Requires index file present
samtools view input.bam chr1:1000000-2000000
```

### Multiple Regions
```bash
samtools view input.bam chr1:1000-2000 chr2:3000-4000
```

### Regions from BED File
```bash
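# -M (--use-index) makes samtools seek via the index;
# without it, -L filters reads while scanning the whole file
samtools view -M -L regions.bed input.bam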
```
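
BED files and samtools region strings use different coordinate conventions: BED intervals are 0-based and half-open, while `chr:start-end` regions are 1-based and inclusive. The conversion is a one-line shift, sketched here with an illustrative helper (`bed_to_regions` is not part of samtools or pysam).

```python
def bed_to_regions(bed_text):
    """Convert BED intervals (0-based, half-open) to samtools region
    strings (1-based, inclusive)."""
    regions = []
    for line in bed_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and optional BED header lines
        if not line or line.startswith(('#', 'track', 'browser')):
            continue
        chrom, start, end = line.split()[:3]
        regions.append(f'{chrom}:{int(start) + 1}-{int(end)}')
    return regions

bed = 'chr1\t999\t2000\nchr2\t2999\t4000\n'
print(bed_to_regions(bed))  # prints: ['chr1:1000-2000', 'chr2:3000-4000']
```

The resulting strings can be passed to `samtools view` as positional region arguments.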

## pysam Python Alternative

### Create Index
```python
import pysam

pysam.index('input.bam')
# Creates input.bam.bai
```

### Create CSI Index
```python
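# pysam.index() forwards its arguments to `samtools index`,
# so the CSI format is requested with the CLI flag -c
pysam.index('-c', 'input.bam')
# Creates input.bam.csi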
```

### Fetch with Index
```python
with pysam.AlignmentFile('input.bam', 'rb') as bam:
    # fetch() requires index
    for read in bam.fetch('chr1', 1000000, 2000000):
        print(read.query_name)
```
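
`fetch()` takes 0-based, half-open coordinates, whereas samtools region strings are 1-based and inclusive. The sketch below converts one to the other. `region_to_fetch_args` is an illustrative helper, and it assumes contig names without a colon when no coordinates are given (names such as `HLA-A*01:01` would need extra handling).

```python
def region_to_fetch_args(region):
    """Convert 'chr1:1,000-2,000' (1-based, inclusive) to (contig, start, stop)
    in the 0-based, half-open convention used by fetch()."""
    contig, sep, span = region.rpartition(':')
    if not sep:
        return region, None, None  # whole contig, no coordinates given
    start_text, _, end_text = span.replace(',', '').partition('-')
    start = int(start_text) - 1           # shift the start to 0-based
    stop = int(end_text) if end_text else None  # inclusive end == exclusive stop
    return contig, start, stop

print(region_to_fetch_args('chr1:1000000-2000000'))
# prints: ('chr1', 999999, 2000000)
```

The tuple can be unpacked straight into the call, as in `bam.fetch(*region_to_fetch_args('chr1:1000-2000'))`. pysam's `fetch()` also accepts a samtools-style string through its `region` argument, which avoids the manual conversion.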

### Check if Indexed
```python
import pysam
from pathlib import Path

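def is_indexed(bam_path):
    """Return True if a BAI or CSI index sits next to the BAM file."""
    bam_path = Path(bam_path)
    candidates = [
        Path(str(bam_path) + '.bai'),  # input.bam.bai
        bam_path.with_suffix('.bai'),  # input.bai
        Path(str(bam_path) + '.csi'),  # input.bam.csi
    ]
    return any(path.exists() for path in candidates)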

if not is_indexed('input.bam'):
    pysam.index('input.bam')
```

### Fetch Multiple Regions
```python
regions = [('chr1', 1000, 2000), ('chr1', 5000, 6000), ('chr2', 1000, 2000)]

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for chrom, start, end in regions:
        count = sum(1 for _ in bam.fetch(chrom, start, end))
        print(f'{chrom}:{start}-{end}: {count} reads')
```

### Count Reads in Region
```python
with pysam.AlignmentFile('input.bam', 'rb') as bam:
    count = bam.count('chr1', 1000000, 2000000)
    print(f'Reads in region: {count}')
```

### Get Reads Covering Position
```python
with pysam.AlignmentFile('input.bam', 'rb') as bam:
    # fetch() coordinates are 0-based and half-open: this requests one base
    for read in bam.fetch('chr1', 1000000, 1000001):
        if read.reference_start <= 1000000 < read.reference_end:
            print(f'{read.query_name} covers position 1000000')
```

## Index File Locations

samtools and pysam look for the index next to the alignment file. For BAM:
```
input.bam.bai   # Standard location
input.bai       # Alternative location
input.bam.csi   # CSI index created with -c
```

For CRAM:
```
input.cram.crai
```
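
A small lookup over these file names might look like the following. It is a sketch that checks only the names this document mentions (`.bam.bai`, `.bai`, `.bam.csi`, `.cram.crai`); `find_index` is an illustrative helper, and the tools themselves may accept further locations.

```python
from pathlib import Path

def find_index(alignment_path):
    """Return the first existing index file beside alignment_path, or None."""
    path = Path(alignment_path)
    if path.suffix == '.cram':
        candidates = [Path(str(path) + '.crai')]      # input.cram.crai
    else:
        candidates = [
            Path(str(path) + '.bai'),                 # input.bam.bai
            path.with_suffix('.bai'),                 # input.bai
            Path(str(path) + '.csi'),                 # input.bam.csi
        ]
    return next((c for c in candidates if c.exists()), None)
```

A `None` result means no index was found under the expected names, so region queries will fail until one is created.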

## idxstats - Index Statistics

### Get Per-Chromosome Counts
```bash
samtools idxstats input.bam
```

Output format:
```
chr1    248956422    5000000    0
chr2    242193529    4500000    0
*       0            0          10000
```

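Columns (tab-separated): reference name, sequence length, mapped read count, unmapped read count. The final `*` row counts reads that are not assigned to any reference.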

### Sum Total Mapped Reads
```bash
samtools idxstats input.bam | awk '{sum += $3} END {print sum}'
```
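
The same totals can be computed in Python from captured `idxstats` text. This sketch assumes the four-column, tab-separated layout shown above and reuses the sample numbers from this section; `parse_idxstats` is an illustrative helper.

```python
def parse_idxstats(text):
    """Parse `samtools idxstats` output into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split()
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows

sample = """chr1\t248956422\t5000000\t0
chr2\t242193529\t4500000\t0
*\t0\t0\t10000
"""
rows = parse_idxstats(sample)
total_mapped = sum(row[2] for row in rows)
total_unmapped = sum(row[3] for row in rows)
print(total_mapped, total_unmapped)  # prints: 9500000 10000
```

Summing the fourth column as well captures the unplaced reads on the `*` row, which the mapped-only total leaves out.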

### pysam idxstats
```python
with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for stat in bam.get_index_statistics():
        print(f'{stat.contig}: {stat.mapped} mapped, {stat.unmapped} unmapped')
```

## FASTA Index (faidx)

A related but separate index: `samtools faidx` indexes a reference FASTA so that subsequences can be read without loading the whole file:

```bash
samtools faidx reference.fa
# Creates reference.fa.fai

# Fetch region from indexed FASTA
samtools faidx reference.fa chr1:1000-2000
```

### pysam FastaFile
```python
with pysam.FastaFile('reference.fa') as ref:
    seq = ref.fetch('chr1', 1000, 2000)
    print(seq)
```
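
To see why the `.fai` file enables random access: each line stores the sequence name, its length, the byte offset of its first base, the number of bases per line, and the number of bytes per line. From the last three, the byte position of any base follows by arithmetic. The sketch below uses a toy in-memory FASTA; `fai_byte_offset` is an illustrative helper, not a pysam function.

```python
def fai_byte_offset(offset, linebases, linewidth, pos):
    """Byte offset in the FASTA file of 0-based base `pos` of one sequence.

    offset, linebases and linewidth are columns 3-5 of that sequence's
    .fai line: first-base offset, bases per line, bytes per line."""
    return offset + (pos // linebases) * linewidth + pos % linebases

fasta = b'>chr1\nACGT\nTGCA\n'  # toy reference with 4 bases per line
# The matching .fai line would read: chr1  8  6  4  5
where = fai_byte_offset(offset=6, linebases=4, linewidth=5, pos=5)
print(fasta[where:where + 1])  # prints: b'G'
```

The fixed line width is what makes this arithmetic possible, which is why FASTA files with inconsistent line lengths cannot be indexed.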

## Quick Reference

| Task | samtools | pysam |
|------|----------|-------|
| Create BAI | `samtools index file.bam` | `pysam.index('file.bam')` |
| Create CSI | `samtools index -c file.bam` | `pysam.index('-c', 'file.bam')` |
| Fetch region | `samtools view file.bam chr1:1-1000` | `bam.fetch('chr1', 0, 1000)` |
| Count in region | `samtools view -c file.bam chr1:1-1000` | `bam.count('chr1', 0, 1000)` |
| Index stats | `samtools idxstats file.bam` | `bam.get_index_statistics()` |
| Index FASTA | `samtools faidx ref.fa` | Automatic with FastaFile |

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `random alignment retrieval only works for indexed BAM` | Missing index | Run `samtools index file.bam` |
| `file is not sorted` | Unsorted BAM | Sort first with `samtools sort` |
| `chromosome not found` | Wrong chromosome name | Check names with `samtools view -H` |

## Related Skills

- sam-bam-basics - View and convert alignment files
- alignment-sorting - Sort BAM files (required before indexing)
- alignment-filtering - Filter by regions using index
- bam-statistics - Use idxstats for quick counts
- sequence-io/read-sequences - Index FASTA with SeqIO.index_db()

Overview

This skill creates and uses BAI/CSI/CRAI indices for BAM and CRAM files using samtools and pysam to enable fast random access to alignment data. It explains when to choose BAI vs CSI, how to generate indices on the CLI and in Python, and how to fetch reads or compute per-contig statistics using those indices. The guidance covers sorting requirements, index file locations, and common errors to watch for.

How this skill works

An index records, for each reference sequence, the file offsets of the compressed blocks that hold reads from each genomic interval, so tools can seek directly to reads overlapping a region rather than scanning the whole file. samtools index and pysam.index produce .bai, .csi, or .crai files depending on the input format and flags. Once an index exists, samtools view and idxstats, or pysam's AlignmentFile.fetch() and get_index_statistics(), use it to retrieve reads and counts quickly.
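
To make the idea of interval-based lookup concrete, the BAI format assigns every alignment to a bin in a fixed hierarchy of interval sizes (16 kbp up to 512 Mbp). The function below is a Python transcription of the `reg2bin` routine given in the SAM/BAM specification, offered as an illustration of the scheme rather than as code taken from samtools or pysam.

```python
def reg2bin(beg, end):
    """Smallest BAI bin fully containing the 0-based, half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins
    return 0                                       # the single 512 Mbp bin

print(reg2bin(0, 100))      # prints: 4681 (first 16 kbp bin)
print(reg2bin(0, 2 ** 29))  # prints: 0 (the top-level bin)
```

The top-level bin spans 2^29 bp, which is where the 512 Mbp limit of BAI comes from; CSI lifts it by allowing a different minimum interval size and depth.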

When to use it

  • Enable random-access queries to BAM/CRAM regions (e.g., view or fetch by chr:start-end).
  • Work with very large chromosomes or assemblies that exceed BAI limits (use CSI).
  • Run region-based read counting or per-contig statistics (idxstats).
  • Prepare files for downstream tools that require indexed inputs (many callers and visualizers).
  • Speed up repeated region queries during interactive analysis or pipelines.

Best practices

  • Ensure files are coordinate-sorted before indexing; sort with samtools sort if needed.
  • Choose CSI for large chromosomes or nonstandard bin sizes; pass -c to samtools index, or call pysam.index('-c', 'input.bam') from Python.
  • Keep index next to the alignment file (input.bam.bai or input.bai) so samtools/pysam find it.
  • Verify tool and library versions (samtools and pysam) and adjust calls to match installed APIs.
  • Use multi-threading (-@ N) for faster indexing on large files, and re-index after any file rewrite.
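
One way to catch an index that was not rebuilt after a rewrite is to compare modification times. This is a heuristic sketch only: `index_is_stale` is a hypothetical helper, and timestamps can be misleading after copying or restoring files.

```python
import os

def index_is_stale(alignment_path, index_path):
    """Return True if the index is missing or older than the alignment file."""
    if not os.path.exists(index_path):
        return True
    return os.path.getmtime(index_path) < os.path.getmtime(alignment_path)
```

A pipeline step could call `index_is_stale('input.bam', 'input.bam.bai')` and re-run `samtools index` when it returns True.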

Example use cases

  • Index a sorted BAM with samtools: samtools index sorted.bam, then samtools view sorted.bam chr1:1-1000000.
  • Create a CSI for a large BAM: samtools index -c input.bam or pysam.index('-c', 'input.bam').
  • Fetch reads from Python: with pysam.AlignmentFile('input.bam','rb') as bam: for r in bam.fetch('chr1',1000,2000): process(r).
  • Compute per-contig mapped counts: samtools idxstats input.bam or bam.get_index_statistics() in pysam.
  • Detect missing index and create it programmatically before region queries using a small is_indexed() helper.

FAQ

What index type should I pick, BAI or CSI?

Use BAI for standard BAMs and typical chromosomes; choose CSI for very large chromosomes or custom binning when BAI limits are exceeded.

My fetch call says the BAM is not indexed. What now?

Confirm the file is coordinate-sorted, then create an index with samtools index or pysam.index and place the .bai/.csi next to the BAM.