home / skills / gptomics / bioskills / alignment-validation

alignment-validation skill

/alignment-files/alignment-validation

This skill validates alignment quality and post-alignment metrics to ensure reliable variant calling and quantification.

npx playbooks add skill gptomics/bioskills --skill alignment-validation

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
10.1 KB
---
name: bio-alignment-validation
description: Validate alignment quality with insert size distribution, proper pairing rates, GC bias, strand balance, and other post-alignment metrics. Use when verifying alignment data quality before variant calling or quantification.
tool_type: mixed
primary_tool: samtools
---

## Version Compatibility

Reference examples tested with: matplotlib 3.8+, numpy 1.26+, picard 3.1+, pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Alignment Validation

Post-alignment quality control to verify alignment quality and identify issues.

**"Check alignment quality"** → Compute post-alignment QC metrics (mapping rate, pairing, insert size, strand balance) to identify issues before downstream analysis.
- CLI: `samtools flagstat`, `samtools stats`, Picard `CollectAlignmentSummaryMetrics`
- Python: `pysam.AlignmentFile` iteration with metric calculations

## Insert Size Distribution

**Goal:** Verify that the fragment length distribution matches the library preparation protocol.

**Approach:** Extract template_length from properly paired reads and compare the distribution to expected values for the library type.

### samtools stats

```bash
samtools stats input.bam > stats.txt
grep "^IS" stats.txt | cut -f2,3 > insert_sizes.txt
```

### Picard CollectInsertSizeMetrics

```bash
java -jar picard.jar CollectInsertSizeMetrics \
    I=input.bam \
    O=insert_metrics.txt \
    H=insert_histogram.pdf
```

### Expected Insert Sizes

| Library Type | Expected Size |
|--------------|---------------|
| Standard WGS | 300-500 bp |
| PCR-free | 350-550 bp |
| RNA-seq | 150-300 bp |
| ChIP-seq | 150-300 bp |
| ATAC-seq | Multimodal |

### Python Insert Size Analysis

```python
import pysam
import numpy as np
import matplotlib.pyplot as plt

def get_insert_sizes(bam_file, max_reads=100000):
    sizes = []
    bam = pysam.AlignmentFile(bam_file, 'rb')
    for i, read in enumerate(bam.fetch()):
        if i >= max_reads:
            break
        if read.is_proper_pair and not read.is_secondary and read.template_length > 0:
            sizes.append(read.template_length)
    bam.close()
    return sizes

sizes = get_insert_sizes('sample.bam')
print(f'Median insert size: {np.median(sizes):.0f}')
print(f'Mean insert size: {np.mean(sizes):.0f}')
print(f'Std dev: {np.std(sizes):.0f}')

plt.hist(sizes, bins=100, range=(0, 1000))
plt.xlabel('Insert Size')
plt.ylabel('Count')
plt.savefig('insert_size_dist.pdf')
```

## Proper Pairing Rate

Percentage of reads correctly paired.

### samtools flagstat

```bash
samtools flagstat input.bam

samtools flagstat input.bam | grep "properly paired"
```

### Calculate Pairing Rate

```bash
proper=$(samtools view -c -f 2 input.bam)
mapped=$(samtools view -c -F 4 input.bam)
rate=$(echo "scale=4; $proper / $mapped * 100" | bc)
echo "Proper pairing rate: ${rate}%"
```

### Expected Rates

| Metric | Good | Marginal | Poor |
|--------|------|----------|------|
| Proper pair | > 90% | 80-90% | < 80% |
| Mapped | > 95% | 90-95% | < 90% |
| Singletons | < 5% | 5-10% | > 10% |

## GC Bias

GC content correlation with coverage.

### Picard CollectGcBiasMetrics

```bash
java -jar picard.jar CollectGcBiasMetrics \
    I=input.bam \
    O=gc_bias_metrics.txt \
    CHART=gc_bias_chart.pdf \
    S=gc_summary.txt \
    R=reference.fa
```

### deepTools computeGCBias

```bash
computeGCBias \
    -b input.bam \
    --effectiveGenomeSize 2913022398 \
    -g hg38.2bit \
    -o gc_bias.txt \
    --biasPlot gc_bias.pdf
```

### Interpret GC Bias

| Issue | Symptom |
|-------|---------|
| Under-representation | Low GC coverage drops |
| Over-representation | High GC coverage elevated |
| PCR bias | Strong correlation |

## Strand Balance

Forward and reverse strand should be balanced.

### Calculate Strand Ratio

```bash
forward=$(samtools view -c -F 16 input.bam)
reverse=$(samtools view -c -f 16 input.bam)
echo "Forward: $forward"
echo "Reverse: $reverse"
ratio=$(echo "scale=4; $forward / $reverse" | bc)
echo "F/R ratio: $ratio"
```

### Check Strand Bias per Chromosome

```bash
for chr in chr1 chr2 chr3; do
    fwd=$(samtools view -c -F 16 input.bam $chr)
    rev=$(samtools view -c -f 16 input.bam $chr)
    echo "$chr: F=$fwd R=$rev ratio=$(echo "scale=2; $fwd/$rev" | bc)"
done
```

## Mapping Quality Distribution

### Extract MAPQ Distribution

```bash
samtools view input.bam | cut -f5 | sort -n | uniq -c | sort -k2 -n
```

### Calculate Mean MAPQ

```bash
samtools view input.bam | awk '{sum+=$5; count++} END {print "Mean MAPQ:", sum/count}'
```

### MAPQ Thresholds

| MAPQ | Meaning |
|------|---------|
| 0 | Multi-mapper |
| 1-10 | Low confidence |
| 20-30 | Moderate |
| 40+ | High confidence |
| 60 | Unique (BWA) |

## Chromosome Coverage Balance

### Calculate Per-Chromosome Coverage

```bash
samtools idxstats input.bam | awk '{print $1, $3/$2}' | head -25
```

### Check for Aneuploidy/Contamination

```bash
samtools idxstats input.bam | awk '$3 > 0 {
    sum += $3
    len[$1] = $2
    reads[$1] = $3
} END {
    for (chr in reads) {
        expected = len[chr] / sum * reads[chr]
        ratio = reads[chr] / expected
        if (ratio < 0.8 || ratio > 1.2) print chr, ratio
    }
}'
```

## Mismatch Rate

### Picard CollectAlignmentSummaryMetrics

```bash
java -jar picard.jar CollectAlignmentSummaryMetrics \
    I=input.bam \
    R=reference.fa \
    O=alignment_summary.txt
```

### Key Metrics

| Metric | Description | Good Value |
|--------|-------------|------------|
| PCT_PF_READS_ALIGNED | Mapped % | > 95% |
| PF_MISMATCH_RATE | Mismatches | < 1% |
| PF_INDEL_RATE | Indels | < 0.1% |
| STRAND_BALANCE | Strand ratio | ~0.5 |

## Comprehensive Validation Script

**Goal:** Run all key alignment QC checks in a single pass and generate a summary report.

**Approach:** Combine samtools flagstat, stats, idxstats, and strand counts into one script that outputs pass/warn/fail calls.

```bash
#!/bin/bash
BAM=$1
REF=$2
NAME=$(basename $BAM .bam)
OUTDIR=${3:-qc}

mkdir -p $OUTDIR

echo "=== Alignment Validation: $NAME ===" | tee $OUTDIR/report.txt

echo -e "\n--- Flagstat ---" | tee -a $OUTDIR/report.txt
samtools flagstat $BAM | tee -a $OUTDIR/report.txt

echo -e "\n--- Mapping Rate ---" | tee -a $OUTDIR/report.txt
mapped=$(samtools view -c -F 4 $BAM)
total=$(samtools view -c $BAM)
rate=$(echo "scale=2; $mapped / $total * 100" | bc)
echo "Mapping rate: ${rate}%" | tee -a $OUTDIR/report.txt

echo -e "\n--- Proper Pairing ---" | tee -a $OUTDIR/report.txt
proper=$(samtools view -c -f 2 $BAM)
pair_rate=$(echo "scale=2; $proper / $mapped * 100" | bc)
echo "Proper pairing: ${pair_rate}%" | tee -a $OUTDIR/report.txt

echo -e "\n--- Insert Size ---" | tee -a $OUTDIR/report.txt
samtools stats $BAM | grep "insert size average" | tee -a $OUTDIR/report.txt

echo -e "\n--- Strand Balance ---" | tee -a $OUTDIR/report.txt
fwd=$(samtools view -c -F 16 $BAM)
rev=$(samtools view -c -f 16 $BAM)
strand_ratio=$(echo "scale=3; $fwd / $rev" | bc)
echo "Forward: $fwd, Reverse: $rev, Ratio: $strand_ratio" | tee -a $OUTDIR/report.txt

echo -e "\n--- Chromosome Coverage ---" | tee -a $OUTDIR/report.txt
samtools idxstats $BAM | head -25 | tee -a $OUTDIR/report.txt

echo -e "\nReport: $OUTDIR/report.txt"
```

## Python Validation Module

```python
import pysam
import numpy as np
from collections import Counter

class AlignmentValidator:
    def __init__(self, bam_file):
        self.bam = pysam.AlignmentFile(bam_file, 'rb')

    def mapping_rate(self):
        stats = self.bam.get_index_statistics()
        mapped = sum(s.mapped for s in stats)
        unmapped = sum(s.unmapped for s in stats)
        return mapped / (mapped + unmapped) * 100

    def proper_pair_rate(self, sample_size=100000):
        proper = 0
        paired = 0
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if read.is_paired:
                paired += 1
                if read.is_proper_pair:
                    proper += 1
        return proper / paired * 100 if paired > 0 else 0

    def mapq_distribution(self, sample_size=100000):
        mapqs = []
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if not read.is_unmapped:
                mapqs.append(read.mapping_quality)
        return Counter(mapqs)

    def strand_balance(self, sample_size=100000):
        forward = 0
        reverse = 0
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if not read.is_unmapped:
                if read.is_reverse:
                    reverse += 1
                else:
                    forward += 1
        return forward / (forward + reverse) if (forward + reverse) > 0 else 0.5

    def report(self):
        print(f'Mapping rate: {self.mapping_rate():.1f}%')
        print(f'Proper pairing: {self.proper_pair_rate():.1f}%')
        print(f'Strand balance: {self.strand_balance():.3f}')

        mapq_dist = self.mapq_distribution()
        high_qual = sum(v for k, v in mapq_dist.items() if k >= 30)
        total = sum(mapq_dist.values())
        print(f'High MAPQ (>=30): {high_qual/total*100:.1f}%')

    def close(self):
        self.bam.close()

validator = AlignmentValidator('sample.bam')
validator.report()
validator.close()
```

## Quality Thresholds Summary

| Metric | Good | Warning | Fail |
|--------|------|---------|------|
| Mapping rate | > 95% | 90-95% | < 90% |
| Proper pairing | > 90% | 80-90% | < 80% |
| Duplicate rate | < 10% | 10-20% | > 20% |
| Strand balance | 0.48-0.52 | 0.45-0.55 | Outside |
| Mean MAPQ | > 40 | 30-40 | < 30 |
| GC bias | < 1.2x | 1.2-1.5x | > 1.5x |

## Related Skills

- bam-statistics - Basic flagstat and depth
- duplicate-handling - Mark/remove duplicates
- alignment-filtering - Filter low-quality reads
- chip-seq/chipseq-qc - ChIP-specific metrics

Overview

This skill validates post-alignment quality by computing key metrics like insert size distribution, proper pairing rate, GC bias, strand balance, mapping quality, chromosome coverage, and mismatch rates. It bundles CLI recipes (samtools, Picard, deepTools) and Python examples (pysam) to produce pass/warn/fail calls and summary reports. Use it to gate data before variant calling, quantification, or downstream QC-sensitive analyses.

How this skill works

The skill inspects BAM alignment files to extract metrics: template_length for insert sizes, SAM flags for proper pairing and strand counts, MAPQ values for mapping quality, idxstats for per-chromosome balance, and Picard/deepTools outputs for GC bias and mismatch rates. It provides both one-shot shell scripts that call samtools and Picard and a lightweight Python validator class that samples reads with pysam and computes summarizing statistics and distributions.

When to use it

  • Before variant calling or expression quantification
  • After alignment to verify library prep expectations (insert size, pairing)
  • When investigating unexpected coverage or suspected contamination
  • When automated pipelines need QC gating
  • Before publishing or sharing alignment data

Best practices

  • Confirm tool and library versions (samtools, Picard, pysam, matplotlib) before running examples
  • Sample a fixed number of reads for fast estimates, then run full metrics for final reports
  • Compare insert size and GC bias to expected library-type ranges
  • Flag marginal metrics (mapping rate 90–95%, proper pairs 80–90%) and inspect alignments interactively
  • Use per-chromosome strand and coverage checks to find contamination or aneuploidy

Example use cases

  • Run a consolidated bash validator that outputs mapping rate, proper pairing, insert size summary, strand ratio, and idxstats report
  • Use the Python AlignmentValidator to get quick mapping rate, proper pair %, strand balance, and MAPQ distribution for CI checks
  • Generate Picard CollectInsertSizeMetrics and GC bias charts for formal QC reports
  • Iteratively debug a low-call-rate variant pipeline by checking MAPQ distribution and PF_MISMATCH_RATE
  • Detect sample swaps or contamination by comparing per-chromosome coverage ratios to expectations

FAQ

What thresholds should I use to pass or fail an alignment?

Use the provided thresholds as guidance: mapping rate >95% good, proper pairing >90% good, mean MAPQ >40 good, strand balance ~0.5. Treat 80–95% ranges as warnings requiring inspection.

How many reads should I sample for Python-based checks?

Start with 50k–100k reads for fast, stable estimates. Increase to full-file scans for final reports if compute time permits.