home / skills / gptomics / bioskills / chipseq-qc

chipseq-qc skill

/chip-seq/chipseq-qc

This skill evaluates ChIP-seq data quality by computing FRiP, NSC/RSC, library complexity, and IDR concordance to guide downstream analysis.

npx playbooks add skill gptomics/bioskills --skill chipseq-qc

Review the files below or copy the command above to add this skill to your agents.

Files (5)
SKILL.md
10.7 KB
---
name: bio-chipseq-qc
description: ChIP-seq quality control metrics including FRiP (Fraction of Reads in Peaks), cross-correlation analysis (NSC/RSC), library complexity, and IDR (Irreproducibility Discovery Rate) for replicate concordance. Use to assess experiment quality before downstream analysis. Use when assessing ChIP-seq data quality metrics.
tool_type: mixed
primary_tool: deepTools
---

## Version Compatibility

Reference examples tested with: MACS3 3.0+, Subread 2.0+, bedtools 2.31+, deepTools 3.5+, pybedtools 0.9+, pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# ChIP-seq Quality Control

**"Assess the quality of my ChIP-seq experiment"** → Compute FRiP, cross-correlation (NSC/RSC), library complexity, and IDR replicate concordance to evaluate enrichment success.
- CLI: `deeptools plotFingerprint`, `phantompeakqualtools run_spp.R`
- Python: `pysam` + `pybedtools` for custom QC metrics

Quality metrics for assessing ChIP-seq experiment success and replicate reproducibility.

## FRiP (Fraction of Reads in Peaks)

**Goal:** Quantify enrichment strength by measuring the proportion of reads falling within called peaks.

**Approach:** Count reads overlapping peak regions and divide by total mapped reads.

### Calculate FRiP with bedtools

```bash
# Count reads in peaks
reads_in_peaks=$(bedtools intersect -a chip.bam -b peaks.narrowPeak -u | samtools view -c -)
total_reads=$(samtools view -c -F 260 chip.bam)

# Calculate FRiP
frip=$(echo "scale=4; $reads_in_peaks / $total_reads" | bc)
echo "FRiP: $frip"
```

### Calculate FRiP with featureCounts

```bash
# Convert peaks to SAF format
awk 'BEGIN{OFS="\t"} {print $4, $1, $2, $3, "."}' peaks.narrowPeak > peaks.saf

# Count reads in peaks
featureCounts -a peaks.saf -F SAF -o peak_counts.txt chip.bam

# FRiP from summary
grep -v "^#" peak_counts.txt.summary
```

### Calculate FRiP with pysam

```python
import pysam
import pybedtools

def calculate_frip(bam_file, peak_file):
    bam = pysam.AlignmentFile(bam_file, 'rb')
    total_reads = bam.count(read_callback=lambda r: not r.is_unmapped and not r.is_secondary)

    peaks = pybedtools.BedTool(peak_file)
    reads_in_peaks = 0
    for peak in peaks:
        reads_in_peaks += bam.count(peak.chrom, peak.start, peak.end)

    frip = reads_in_peaks / total_reads
    return frip

frip = calculate_frip('chip.bam', 'peaks.narrowPeak')
print(f'FRiP: {frip:.4f}')
```

### FRiP Thresholds

| Target | Minimum FRiP | Good FRiP |
|--------|--------------|-----------|
| TF (narrow) | 0.01 | > 0.05 |
| Histone (broad) | 0.10 | > 0.20 |
| H3K4me3 | 0.05 | > 0.15 |
| H3K27ac | 0.05 | > 0.10 |

## Cross-Correlation Analysis (NSC/RSC)

**Goal:** Assess ChIP enrichment quality by measuring strand cross-correlation signal.

**Approach:** Calculate correlation between forward and reverse strand read coverage at varying shifts to detect fragment-length enrichment.

### Run phantompeakqualtools

```bash
# Run SPP cross-correlation analysis
Rscript run_spp.R \
    -c=chip.bam \
    -savp=chip_cc.pdf \
    -out=chip_cc.txt \
    -odir=qc/

# Output columns:
# 1: filename
# 2: numReads
# 3: estFragLen (estimated fragment length)
# 4: corr_estFragLen
# 5: phantomPeak
# 6: corr_phantomPeak
# 7: argmin_corr (minimum strand shift)
# 8: min_corr
# 9: NSC (Normalized Strand Coefficient)
# 10: RSC (Relative Strand Coefficient)
# 11: QualityTag
```

### Interpret NSC and RSC

```bash
# Parse results
awk -F'\t' '{
    print "Fragment length:", $3
    print "NSC:", $9
    print "RSC:", $10
    print "Quality:", $11
}' chip_cc.txt
```

### NSC/RSC Thresholds

| Metric | Marginal | Acceptable | Ideal |
|--------|----------|------------|-------|
| NSC | < 1.05 | 1.05 - 1.1 | > 1.1 |
| RSC | < 0.8 | 0.8 - 1.0 | > 1.0 |
| QualityTag | -2 | 0 | 1 or 2 |

### Plot Cross-Correlation in R

```r
library(spp)

chip_data <- read.bam.tags('chip.bam')
binding_characteristics <- get.binding.characteristics(chip_data, srange=c(50, 500), bin=5)

# Cross-correlation plot
pdf('cc_plot.pdf')
plot(binding_characteristics$cross.correlation, type='l',
     xlab='Strand shift', ylab='Cross-correlation')
abline(v=binding_characteristics$peak$x, col='red')
dev.off()

# Extract metrics
print(paste('Fragment length:', binding_characteristics$peak$x))
```

## Library Complexity (NRF, PBC1, PBC2)

**Goal:** Detect PCR amplification artifacts by measuring library complexity metrics.

**Approach:** Calculate the fraction of unique reads and positional redundancy to assess PCR bottlenecking.

### Calculate with bedtools

```bash
# NRF: Non-Redundant Fraction (unique reads / total reads)
total=$(samtools view -c -F 260 chip.bam)
unique=$(samtools view -F 260 chip.bam | cut -f1-4 | sort -u | wc -l)
nrf=$(echo "scale=4; $unique / $total" | bc)
echo "NRF: $nrf"

# PBC1: PCR Bottleneck Coefficient 1 (M1/Mdistinct)
# M1 = locations with exactly 1 read
# Mdistinct = distinct genomic locations

bedtools bamtobed -i chip.bam | \
    awk '{print $1":"$2"-"$3}' | \
    sort | uniq -c | \
    awk '{
        if($1==1) m1++
        mdist++
    } END {
        print "M1:", m1
        print "Mdistinct:", mdist
        print "PBC1:", m1/mdist
    }'
```

### Library Complexity Thresholds

| Metric | Severe | Mild | None |
|--------|--------|------|------|
| NRF | < 0.5 | 0.5 - 0.8 | > 0.8 |
| PBC1 | < 0.5 | 0.5 - 0.8 | > 0.8 |
| PBC2 | < 1 | 1 - 3 | > 3 |

## IDR (Irreproducibility Discovery Rate)

**Goal:** Assess replicate concordance by measuring consistency of ranked peak lists.

**Approach:** Compare signal-ranked peaks from two replicates using IDR statistical framework to identify reproducible peaks.

### Run IDR Analysis

```bash
# Call peaks on each replicate
macs3 callpeak -t rep1.bam -c input.bam -n rep1 -g hs
macs3 callpeak -t rep2.bam -c input.bam -n rep2 -g hs

# Sort by signal value (column 7)
sort -k7,7nr rep1_peaks.narrowPeak > rep1_sorted.narrowPeak
sort -k7,7nr rep2_peaks.narrowPeak > rep2_sorted.narrowPeak

# Run IDR
idr --samples rep1_sorted.narrowPeak rep2_sorted.narrowPeak \
    --input-file-type narrowPeak \
    --rank signal.value \
    --output-file idr_output.txt \
    --plot idr_plot.pdf \
    --log-output-file idr.log
```

### IDR with Pooled Peaks

```bash
# Call peaks on pooled data
samtools merge -f pooled.bam rep1.bam rep2.bam
macs3 callpeak -t pooled.bam -c input.bam -n pooled -g hs

# Run IDR with oracle
idr --samples rep1_sorted.narrowPeak rep2_sorted.narrowPeak \
    --peak-list pooled_peaks.narrowPeak \
    --input-file-type narrowPeak \
    --rank signal.value \
    --output-file idr_oracle.txt
```

### Interpret IDR Results

```bash
# Count peaks at different IDR thresholds
awk '$5 >= 540' idr_output.txt | wc -l  # IDR < 0.05 (conservative)
awk '$5 >= 415' idr_output.txt | wc -l  # IDR < 0.1 (optimal)

# IDR output columns:
# 1-3: chr, start, end
# 4: name
# 5: scaled IDR (-125 * log2(IDR))
# 6: strand
# 7: signal (from rep1)
# 8: signal (from rep2)
# 9: local IDR
# 10: global IDR
```

### IDR Self-Consistency Check

```bash
# Split one sample and check self-consistency
samtools view -s 0.5 chip.bam -b > pseudo_rep1.bam
samtools view -s 2.5 chip.bam -b > pseudo_rep2.bam

# Call peaks on pseudo-replicates
macs3 callpeak -t pseudo_rep1.bam -c input.bam -n pseudo1 -g hs
macs3 callpeak -t pseudo_rep2.bam -c input.bam -n pseudo2 -g hs

# Run IDR
idr --samples pseudo1_peaks.narrowPeak pseudo2_peaks.narrowPeak \
    --input-file-type narrowPeak \
    --output-file self_idr.txt
```

### IDR Quality Guidelines

| Comparison | Expected IDR Peaks | Notes |
|------------|-------------------|-------|
| True replicates | > 70% of pooled | Biological concordance |
| Pseudo-replicates | > 80% of sample | Technical consistency |
| Rep vs Pooled | ~100% of rep peaks | Subset relationship |

## deepTools QC Metrics

**Goal:** Visualize ChIP enrichment and sample correlation using deepTools fingerprint and correlation plots.

**Approach:** Generate cumulative read coverage curves and pairwise sample correlation matrices from BAM files.

### plotFingerprint

```bash
# Assess enrichment with fingerprint plot
plotFingerprint \
    -b chip.bam input.bam \
    --labels ChIP Input \
    -o fingerprint.pdf \
    --outRawCounts fingerprint.tab \
    --outQualityMetrics fingerprint_qc.txt

# Good ChIP shows curve shifted right of diagonal
# Input follows diagonal
```

### computeMatrix and plotProfile

```bash
# TSS enrichment
computeMatrix reference-point \
    -S chip.bw \
    -R genes.bed \
    --referencePoint TSS \
    -a 3000 -b 3000 \
    -o matrix.gz

plotProfile \
    -m matrix.gz \
    -o tss_enrichment.pdf \
    --perGroup
```

### plotCorrelation

```bash
# Sample correlation
multiBamSummary bins \
    -b rep1.bam rep2.bam rep3.bam \
    -o results.npz

plotCorrelation \
    -in results.npz \
    --corMethod spearman \
    --whatToPlot heatmap \
    -o correlation.pdf \
    --outFileCorMatrix correlation.tab
```

## Complete QC Pipeline

**Goal:** Run all major ChIP-seq QC metrics in a single automated script.

**Approach:** Combine FRiP, cross-correlation, library complexity, and fingerprint analysis into one pipeline.

```bash
#!/bin/bash
sample=$1
input=$2
peaks=$3

echo "=== ChIP-seq QC Report: $sample ===" > qc_report.txt

# FRiP
reads_in_peaks=$(bedtools intersect -a $sample -b $peaks -u | samtools view -c -)
total_reads=$(samtools view -c -F 260 $sample)
frip=$(echo "scale=4; $reads_in_peaks / $total_reads" | bc)
echo "FRiP: $frip" >> qc_report.txt

# Cross-correlation
Rscript run_spp.R -c=$sample -out=cc.txt -odir=.
nsc=$(cut -f9 cc.txt)
rsc=$(cut -f10 cc.txt)
echo "NSC: $nsc" >> qc_report.txt
echo "RSC: $rsc" >> qc_report.txt

# Library complexity
unique=$(samtools view -F 260 $sample | cut -f1-4 | sort -u | wc -l)
nrf=$(echo "scale=4; $unique / $total_reads" | bc)
echo "NRF: $nrf" >> qc_report.txt

# Fingerprint
plotFingerprint -b $sample $input -o fingerprint.pdf --outQualityMetrics fingerprint_qc.txt

cat qc_report.txt
```

## QC Summary Table

| Metric | Tool | Ideal Value |
|--------|------|-------------|
| FRiP | bedtools/featureCounts | > 0.05 (TF), > 0.1 (histone) |
| NSC | phantompeakqualtools | > 1.1 |
| RSC | phantompeakqualtools | > 1.0 |
| NRF | samtools/bedtools | > 0.8 |
| PBC1 | bedtools | > 0.8 |
| IDR (replicates) | idr | > 70% concordance |

## Related Skills

- peak-calling - Call peaks before QC analysis
- alignment-files - BAM statistics and filtering
- differential-binding - Compare conditions after QC
- atac-seq/atac-qc - Similar QC for ATAC-seq

Overview

This skill provides a compact ChIP-seq quality-control toolkit that computes standard metrics: FRiP, cross-correlation (NSC/RSC), library complexity (NRF, PBC1/PBC2) and IDR replicate concordance. It packages CLI and Python patterns to assess enrichment strength and replicate reproducibility before downstream analysis. Use it to generate QC summaries, plots, and thresholds-based guidance for decision making.

How this skill works

The skill inspects mapped BAM files and peak calls to calculate FRiP by counting reads in peak regions relative to total mapped reads. It runs strand cross-correlation (via SPP/phantompeakqualtools) to estimate fragment length and compute NSC/RSC. Library complexity is derived from unique mapping positions to compute NRF and PBC metrics. For replicate concordance it runs the IDR framework on ranked peak lists and reports reproducible peak counts and thresholds.

When to use it

  • Right after alignment and peak calling to verify enrichment before differential or downstream analysis.
  • When you have biological replicates and need an objective reproducibility check (IDR).
  • To diagnose PCR bottlenecking or low-complexity libraries during QC.
  • When comparing ChIP samples to matching input controls to confirm signal over background.
  • Before investing compute/time in analyses like motif discovery or differential binding.

Best practices

  • Call peaks per replicate and on pooled data, then run IDR to identify reproducible sites.
  • Filter BAMs for properly mapped, non-secondary reads (e.g., samtools -F 260) before counting.
  • Use bedtools/featureCounts or pysam+pybedtools consistently to compute FRiP for comparability.
  • Run phantompeakqualtools (SPP) with a sensible shift range (e.g., 50–500 bp) to get reliable NSC/RSC.
  • Compare QC metrics to recommended thresholds (FRiP, NSC/RSC, NRF/PBC) and re-sequence or reprocess if severe failures occur.

Example use cases

  • Generate a one-file QC report (FRiP, NSC, RSC, NRF) for each ChIP replicate prior to peak calling.
  • Assess whether two biological replicates are concordant using IDR and produce a reproducible peak set for downstream analysis.
  • Diagnose a low-enrichment TF ChIP by combining FRiP and fingerprint plots to distinguish experiment failure from broad signal.
  • Detect PCR bottlenecking in low-input libraries by computing NRF and PBC metrics and deciding on library re-prep.
  • Automate QC in a pipeline: compute fingerprint plots (deepTools), cross-correlation, FRiP, and IDR in one pass.

FAQ

What FRiP values indicate success?

Expect >0.05 for narrow TFs and >0.1 for many histone marks; use experiment-specific thresholds (see H3K4me3/H3K27ac examples).

Which tools produce NSC/RSC?

phantompeakqualtools / SPP are standard; they output estimated fragment length, NSC, RSC and a quality tag.

When should I run IDR?

Run IDR on paired replicate peak lists after calling peaks separately; use pooled peaks as a reference for an oracle analysis if needed.