home / skills / gptomics / bioskills / haplotype-phasing

haplotype-phasing skill

/phasing-imputation/haplotype-phasing

This skill helps you phase genotypes into haplotypes using Beagle or SHAPEIT for accurate downstream imputation and haplotype analysis.

npx playbooks add skill gptomics/bioskills --skill haplotype-phasing

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
5.1 KB
---
name: bio-phasing-imputation-haplotype-phasing
description: Phase genotypes into haplotypes using Beagle or SHAPEIT. Resolves which alleles are inherited together on each chromosome. Use when preparing VCF files for imputation, HLA typing, or population genetic analyses requiring phased haplotypes.
tool_type: cli
primary_tool: beagle
---

## Version Compatibility

Reference examples tested with: bcftools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Haplotype Phasing

**"Phase my genotypes into haplotypes"** → Resolve which alleles are inherited together on each chromosome for downstream imputation, HLA typing, or population genetic analyses.
- CLI: `java -jar beagle.jar gt=input.vcf out=phased` (Beagle 5.4)
- CLI: `shapeit4 --input input.vcf --output phased.vcf` (SHAPEIT)

## Beagle 5.4 Phasing (Recommended)

```bash
# Download Beagle 5.4
wget https://faculty.washington.edu/browning/beagle/beagle.22Jul22.46e.jar

# Basic phasing
java -jar beagle.22Jul22.46e.jar \
    gt=input.vcf.gz \
    out=phased

# Output: phased.vcf.gz (phased genotypes)

# With genetic map (improves accuracy)
java -jar beagle.22Jul22.46e.jar \
    gt=input.vcf.gz \
    map=plink.chr22.GRCh38.map \
    out=phased
```

## Beagle Options

```bash
java -jar beagle.22Jul22.46e.jar \
    gt=input.vcf.gz \
    out=phased \
    map=genetic_map.txt \
    nthreads=8 \
    window=40 \
    overlap=4 \
    ne=20000 \              # Effective population size
    seed=12345              # For reproducibility
```

## Phase Per Chromosome

```bash
# Process each chromosome separately
for chr in {1..22}; do
    java -Xmx16g -jar beagle.jar \
        gt=input.chr${chr}.vcf.gz \
        map=genetic_maps/plink.chr${chr}.GRCh38.map \
        out=phased.chr${chr} \
        nthreads=8
done

# Concatenate chromosomes
bcftools concat phased.chr*.vcf.gz -Oz -o phased.all.vcf.gz
bcftools index phased.all.vcf.gz
```

## SHAPEIT5 Phasing (for Large Datasets)

**Goal:** Phase large biobank-scale datasets using SHAPEIT5's two-stage approach for accurate haplotype resolution of both common and rare variants.

**Approach:** First phase common variants to build a haplotype scaffold, then phase rare variants onto that scaffold using the common-variant structure as a guide.

```bash
# Phase common variants first
shapeit5_phase_common \
    --input input.vcf.gz \
    --map genetic_map.txt \
    --output phased_common.bcf \
    --thread 8 \
    --log phased.log

# Then phase rare variants
shapeit5_phase_rare \
    --input input.vcf.gz \
    --scaffold phased_common.bcf \
    --map genetic_map.txt \
    --output phased.bcf \
    --thread 8
```

## SHAPEIT5 with Reference Panel

```bash
# Improves phasing using reference haplotypes
shapeit5_phase_common \
    --input input.vcf.gz \
    --reference reference_panel.bcf \
    --map genetic_map.txt \
    --output phased.bcf \
    --thread 8
```

## Beagle with Reference Panel

```bash
# Use reference panel for better phasing
java -jar beagle.22Jul22.46e.jar \
    gt=input.vcf.gz \
    ref=reference.vcf.gz \
    map=genetic_map.txt \
    out=phased \
    nthreads=8
```

## Input Preparation

```bash
# Filter variants before phasing
bcftools view -m2 -M2 -v snps input.vcf.gz -Oz -o biallelic_snps.vcf.gz

# Remove missing genotypes (optional)
bcftools view -g ^miss biallelic_snps.vcf.gz -Oz -o no_missing.vcf.gz

# Normalize (important!)
bcftools norm -f reference.fa -Oz -o normalized.vcf.gz input.vcf.gz
```

## Check Phasing Results

```bash
# View phased genotypes (| instead of /)
bcftools query -f '%CHROM\t%POS\t[%GT\t]\n' phased.vcf.gz | head

# Unphased: 0/1
# Phased: 0|1 or 1|0

# Count phased vs unphased
bcftools query -f '[%GT\n]' phased.vcf.gz | grep -c '|'
```

## Genetic Maps

```bash
# Download genetic maps (GRCh38)
wget https://faculty.washington.edu/browning/beagle/genetic_maps/plink.GRCh38.map.zip
unzip plink.GRCh38.map.zip

# Format: chromosome position rate(cM/Mb) genetic_position(cM)
# chr1 55550 2.981822 0.000000
```

## Key Parameters

| Parameter | Beagle | SHAPEIT5 | Description |
|-----------|--------|----------|-------------|
| Threads | nthreads | --thread | CPU threads |
| Window | window | --window | Analysis window size |
| Eff. pop size | ne | --effective-size | For LD modeling |
| Seed | seed | --seed | Random seed |

## Memory Requirements

| Dataset Size | Beagle Memory | SHAPEIT5 Memory |
|--------------|--------------|-----------------|
| 1,000 samples | 8 GB | 4 GB |
| 10,000 samples | 32 GB | 16 GB |
| 100,000 samples | 64+ GB | 32 GB |

## Phasing Accuracy Metrics

- **Switch error rate**: Rate of phase switches vs truth
- **Mismatch error rate**: Overall haplotype differences
- Measure using trio data or known haplotypes

## Related Skills

- phasing-imputation/genotype-imputation - Impute after phasing
- phasing-imputation/reference-panels - Get reference data
- variant-calling/filtering-best-practices - Prepare input VCF
- population-genetics/linkage-disequilibrium - LD analysis

Overview

This skill phases genotype VCFs into haplotypes using Beagle or SHAPEIT family tools to resolve which alleles are inherited together on each chromosome. It provides recommended command patterns, options for reference-panel or genetic-map assisted phasing, and practical steps for per-chromosome processing and output checks. Use it to prepare data for imputation, HLA typing, or population-genetic analyses that require phased haplotypes.

How this skill works

The skill runs phasing engines (Beagle 5.4 or SHAPEIT5) on normalized, filtered VCF/BCF inputs to produce phased outputs (| notation). It supports using genetic maps and reference panels to improve accuracy, per-chromosome batching for parallelization, and two-stage SHAPEIT5 workflows for large datasets (common scaffold then rare-variant phasing). It also includes example commands to inspect and validate phased genotypes.

When to use it

  • Preparing VCFs for genotype imputation workflows
  • HLA typing workflows that require phased alleles
  • Population genetics analyses that need haplotypes (LD, IBD, selection scans)
  • Large-cohort phasing where memory and speed trade-offs matter
  • When you have a suitable reference panel or genetic map to improve accuracy

Best practices

  • Normalize and filter input VCFs first (biallelic SNPs, remove missing if needed)
  • Phase per chromosome to parallelize and reduce memory load, then concat outputs with bcftools
  • Provide a genetic map and, if available, a reference panel to improve phasing accuracy
  • Set seed and document parameters (nthreads, window, ne) for reproducibility
  • Check outputs for phased genotype notation (0|1 vs 0/1) and compute switch error rates if truth sets exist

Example use cases

  • Run Beagle for a medium cohort with a genetic map: java -jar beagle.jar gt=input.vcf.gz map=genetic_map.txt out=phased nthreads=8
  • Phase biobank-scale data with SHAPEIT5: phase common variants to build a scaffold, then phase rare variants onto that scaffold
  • Phase each chromosome in parallel then concatenate with bcftools concat and index with bcftools index
  • Use a reference panel with Beagle or SHAPEIT5 to improve phasing of rare alleles before downstream imputation
  • Validate phased output by querying GT fields and counting '|' phased genotypes using bcftools query

FAQ

Which tool should I choose: Beagle or SHAPEIT5?

Beagle 5.4 is a solid general-purpose phaser and easy to run; SHAPEIT5 scales better for very large datasets and supports a two-stage common+rare phasing strategy. Choose based on dataset size, memory, and available reference panels.

Do I always need a genetic map or reference panel?

No, phasing works without them but accuracy improves with a genetic map and especially with a matched reference panel. Use maps and panels when available for better phasing of both common and rare variants.

How do I check that phasing succeeded?

Inspect genotype fields with bcftools query and confirm phased GTs contain '|' characters. For accuracy, compare against trio truth data or compute switch error rates when truth is available.