home / skills / gptomics / bioskills / colocalization-analysis

colocalization-analysis skill

Q: What PP.H4 cutoff indicates colocalization?

PP.H4 > 0.8 is commonly used as strong evidence for a shared causal variant; 0.5–0.8 is suggestive and may need more data or SuSiE.

safe

/causal-genomics/colocalization-analysis

This skill guides users through Bayesian colocalization to assess if GWAS and eQTL signals share a causal variant, with actionable interpretation.

npx playbooks add skill gptomics/bioskills --skill colocalization-analysis

Review the files below or copy the command above to add this skill to your agents.

Files (5)

SKILL.md

8.6 KB

---
name: bio-causal-genomics-colocalization-analysis
description: Test whether two traits share a causal variant at a genomic locus using Bayesian colocalization with coloc. Computes posterior probabilities for shared vs distinct causal variants between GWAS and eQTL signals. Use when determining if a GWAS signal and an eQTL share the same causal variant.
tool_type: r
primary_tool: coloc
---

## Version Compatibility

Reference examples tested with: ggplot2 3.5+

Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Colocalization Analysis

**"Test whether my GWAS signal and eQTL share the same causal variant"** → Compute Bayesian posterior probabilities for five colocalization hypotheses (no association, trait-1-only, trait-2-only, distinct causal variants, shared causal variant) to distinguish true causal overlap from LD-driven coincidence.
- R: `coloc::coloc.abf()` for approximate Bayes factor colocalization

## Overview

Colocalization tests whether two association signals at the same locus are driven by the
same causal variant. This distinguishes shared causality from coincidental overlap due to LD.

Five hypotheses tested by coloc:
- H0: No association with either trait
- H1: Association with trait 1 only
- H2: Association with trait 2 only
- H3: Both associated, different causal variants
- H4: Both associated, shared causal variant

## coloc.abf Analysis

**Goal:** Test whether two traits share a causal variant at a GWAS locus using Bayesian colocalization.

**Approach:** Format summary statistics for each trait as named lists, run coloc.abf to compute posterior probabilities for five hypotheses (H0-H4), and interpret PP.H4 as evidence for a shared causal variant.

```r
library(coloc)

# --- Input format: named list with GWAS summary stats ---
# Required fields: beta, varbeta, snp, position, type, N
# type = 'quant' (continuous) or 'cc' (case-control)

gwas_data <- list(
  beta = gwas_df$BETA,
  varbeta = gwas_df$SE^2,
  snp = gwas_df$SNP,
  position = gwas_df$POS,
  type = 'cc',           # Case-control study
  s = 0.3,               # Proportion of cases (required for cc)
  N = 50000              # Total sample size
)

eqtl_data <- list(
  beta = eqtl_df$BETA,
  varbeta = eqtl_df$SE^2,
  snp = eqtl_df$SNP,
  position = eqtl_df$POS,
  type = 'quant',        # Quantitative trait (expression)
  N = 500,               # eQTL sample size
  sdY = 1                # SD of trait (1 if already normalized)
)

# --- Run colocalization ---
result <- coloc.abf(dataset1 = gwas_data, dataset2 = eqtl_data)

# Posterior probabilities
# PP.H4 > 0.8: Strong evidence for colocalization (shared variant)
# PP.H3 > 0.8: Distinct causal variants at the locus
# PP.H4 between 0.5-0.8: Suggestive but inconclusive
print(result$summary)
```

## Prior Sensitivity

```r
# Default priors: p1 = 1e-4, p2 = 1e-4, p12 = 1e-5
# p1: Prior probability a SNP is associated with trait 1
# p2: Prior probability a SNP is associated with trait 2
# p12: Prior probability a SNP is associated with both traits
#
# Ratio p12/p1 represents prior belief in colocalization
# Default: p12/p1 = 0.1 (10% of trait 1 SNPs also affect trait 2)

result_sensitive <- coloc.abf(
  dataset1 = gwas_data,
  dataset2 = eqtl_data,
  p1 = 1e-4,
  p2 = 1e-4,
  p12 = 5e-6    # More conservative prior for shared association
)

# Sensitivity analysis across prior values
sensitivity(result, 'H4 > 0.8')
```

## Using P-values (No Beta/SE)

```r
# When only p-values are available, use MAF to approximate
gwas_pval <- list(
  pvalues = gwas_df$P,
  MAF = gwas_df$MAF,
  snp = gwas_df$SNP,
  position = gwas_df$POS,
  type = 'cc',
  s = 0.3,
  N = 50000
)

result <- coloc.abf(dataset1 = gwas_pval, dataset2 = eqtl_data)
```

## SuSiE-Coloc (Multiple Causal Variants)

**Goal:** Test colocalization at loci with multiple independent causal signals.

**Approach:** Run SuSiE fine-mapping on each dataset to identify credible sets, then test colocalization between all pairs of credible sets using coloc.susie.

```r
library(coloc)
library(susieR)

# coloc.abf assumes a single causal variant per locus
# SuSiE-coloc handles multiple causal variants

# LD matrix required (correlation matrix from reference panel)
ld_matrix <- as.matrix(read.table('ld_matrix.txt'))

# Run SuSiE on each dataset
susie_gwas <- runsusie(
  list(beta = gwas_df$BETA, varbeta = gwas_df$SE^2,
       snp = gwas_df$SNP, position = gwas_df$POS,
       type = 'cc', s = 0.3, N = 50000, LD = ld_matrix),
  L = 10       # Max number of causal variants to search for
)

susie_eqtl <- runsusie(
  list(beta = eqtl_df$BETA, varbeta = eqtl_df$SE^2,
       snp = eqtl_df$SNP, position = eqtl_df$POS,
       type = 'quant', N = 500, sdY = 1, LD = ld_matrix),
  L = 10
)

# Coloc using SuSiE credible sets
result_susie <- coloc.susie(susie_gwas, susie_eqtl)
print(result_susie$summary)
# Each row tests colocalization between a pair of credible sets
# hit1, hit2: Credible set indices from dataset 1 and 2
```

## HyPrColoc (Multi-Trait)

**Goal:** Test colocalization across three or more traits simultaneously to identify shared causal variant clusters.

**Approach:** Provide beta and SE matrices (SNPs x traits) to hyprcoloc, which clusters traits sharing a causal variant using a branch-and-bound algorithm.

```r
# install.packages('remotes')
# remotes::install_github('jrs95/hyprcoloc')
library(hyprcoloc)

# Test colocalization across multiple traits simultaneously
# Input: matrices of betas and SEs (rows = SNPs, columns = traits)
betas <- cbind(gwas_df$BETA, eqtl1_df$BETA, eqtl2_df$BETA)
ses <- cbind(gwas_df$SE, eqtl1_df$SE, eqtl2_df$SE)
colnames(betas) <- colnames(ses) <- c('GWAS', 'eQTL_gene1', 'eQTL_gene2')
rownames(betas) <- rownames(ses) <- gwas_df$SNP

result_hypr <- hyprcoloc(
  effect.est = betas,
  effect.se = ses,
  trait.names = colnames(betas),
  snp.id = rownames(betas)
)

# Output: clusters of traits sharing a causal variant
print(result_hypr$results)
```

## Input Preparation

```r
# --- Extract a locus (1 Mb window around lead SNP) ---
extract_locus <- function(sumstats, lead_snp_pos, chr, window = 500000) {
  locus <- sumstats[sumstats$CHR == chr &
                    sumstats$POS >= (lead_snp_pos - window) &
                    sumstats$POS <= (lead_snp_pos + window), ]
  locus[order(locus$POS), ]
}

# --- Generate LD matrix from plink ---
# plink --bfile ref_panel --chr 6 --from-bp 30000000 --to-bp 31000000 \
#   --r square --out ld_matrix
# Read into R:
ld <- as.matrix(read.table('ld_matrix.ld'))
```

## Visualization

```r
library(ggplot2)

plot_coloc_locus <- function(gwas_df, eqtl_df, result) {
  pp4 <- round(result$summary['PP.H4.abf'], 3)

  p1 <- ggplot(gwas_df, aes(x = POS / 1e6, y = -log10(P))) +
    geom_point(alpha = 0.6) +
    labs(x = 'Position (Mb)', y = '-log10(P)', title = paste('GWAS | PP.H4 =', pp4)) +
    theme_minimal()

  p2 <- ggplot(eqtl_df, aes(x = POS / 1e6, y = -log10(P))) +
    geom_point(alpha = 0.6, color = 'steelblue') +
    labs(x = 'Position (Mb)', y = '-log10(P)', title = 'eQTL') +
    theme_minimal()

  library(patchwork)
  p1 / p2
}
```

## LocusCompare Plot

```r
# LocusCompare: scatter of -log10(P) for GWAS vs eQTL at shared SNPs
plot_locuscompare <- function(gwas_df, eqtl_df) {
  merged <- merge(
    gwas_df[, c('SNP', 'P')],
    eqtl_df[, c('SNP', 'P')],
    by = 'SNP', suffixes = c('.gwas', '.eqtl')
  )

  ggplot(merged, aes(x = -log10(P.gwas), y = -log10(P.eqtl))) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = 'lm', se = FALSE, linetype = 'dashed', color = 'grey50') +
    labs(x = '-log10(P) GWAS', y = '-log10(P) eQTL', title = 'LocusCompare') +
    theme_minimal()
}
```

## Decision Framework

```
PP.H4 > 0.8:  Strong colocalization -- traits share a causal variant
PP.H3 > 0.8:  Distinct causal variants -- LD-driven overlap, not shared causality
PP.H4 0.5-0.8: Suggestive -- increase sample size, try SuSiE-coloc
PP.H0/H1/H2 dominant: Insufficient signal at this locus
```

Common pitfalls:
- Small eQTL sample sizes reduce power (N < 200 is problematic)
- LD can inflate PP.H3 when two nearby causal variants exist -- use SuSiE-coloc
- Always check that both traits have significant signals at the locus before running coloc

## Related Skills

- mendelian-randomization - Test causal effects using genetic instruments
- fine-mapping - Identify causal variants and credible sets
- population-genetics/linkage-disequilibrium - LD reference panels for SuSiE-coloc
- differential-expression/deseq2-basics - Generate eQTL data for colocalization

Overview

This skill runs Bayesian colocalization analyses to test whether a GWAS signal and an eQTL share the same causal variant at a genomic locus. It computes posterior probabilities for five hypotheses (no association, trait1-only, trait2-only, distinct causal variants, shared causal variant) using coloc and related tools. Results guide whether overlap is true shared causality or coincidental LD. The skill also includes workflows for SuSiE-coloc (multiple causal variants) and multi-trait HyPrColoc.

How this skill works

Prepare locus-level summary statistics for each trait (beta and SE or p-values with MAF), then format inputs as the coloc datasets and run coloc.abf to get PP.H0-PP.H4. For loci with multiple signals, perform SuSiE fine-mapping with an LD matrix and run coloc.susie to test credible-set pairs. For multi-trait scenarios, supply matrices of effect estimates and SEs to hyprcoloc to detect clusters of traits sharing a causal variant. Sensitivity analysis over priors (p1, p2, p12) and visualization (locus plots, LocusCompare) are included to aid interpretation.

When to use it

You have summary statistics for a GWAS locus and an eQTL for the same region and want to test shared causality.
You need to distinguish LD-driven coincidence from a true shared causal variant.
Locus shows association in both traits and you want probabilistic evidence (PP.H4) for colocalization.
You suspect multiple causal variants and require SuSiE-coloc to handle multilocus signals.
You want to compare colocalization across three or more traits (HyPrColoc).

Best practices

Ensure both datasets contain the same SNP set and genomic coordinates; extract a consistent locus window (e.g., ±500 kb).
Use beta and SE when available; if only p-values are present provide MAF to approximate effect sizes.
Provide an appropriate LD matrix from a matched reference panel when running SuSiE; mismatched LD biases results.
Run prior sensitivity (vary p12) and interpret PP.H4 thresholds: >0.8 strong, 0.5–0.8 suggestive.
Check sample sizes and power; small eQTL cohorts (N < 200) reduce sensitivity and increase inconclusive results.

Example use cases

Test whether a schizophrenia GWAS hit and brain eQTL for a nearby gene are driven by the same variant using coloc.abf.
Run SuSiE on GWAS and eQTL summary stats with an LD matrix to resolve multiple causal signals and assess per-credible-set colocalization.
Use hyprcoloc to find clusters of immune traits and eQTLs that share a single causal variant across several tissues.
Visualize locus-level association and LocusCompare plots to validate coloc results before functional follow-up.
Perform prior sensitivity to confirm that strong PP.H4 is robust to alternative p12 settings.

FAQ

What PP.H4 cutoff indicates colocalization?

PP.H4 > 0.8 is commonly used as strong evidence for a shared causal variant; 0.5–0.8 is suggestive and may need more data or SuSiE.

When should I use SuSiE-coloc instead of coloc.abf?

Use SuSiE-coloc when there is evidence of multiple independent causal signals in a locus or when PP.H3 is high, and provide an LD matrix from a matching reference panel.