home / skills / gptomics / bioskills / clip-peak-calling

clip-peak-calling skill

/clip-seq/clip-peak-calling

This skill helps identify protein-RNA binding sites from CLIP-seq data by applying CLIPper, PureCLIP, or Piranha peak callers.

npx playbooks add skill gptomics/bioskills --skill clip-peak-calling

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
7.3 KB
---
name: bio-clip-seq-clip-peak-calling
description: Call protein-RNA binding site peaks from CLIP-seq data using CLIPper, PureCLIP, or Piranha. Use when identifying RBP binding sites from aligned CLIP reads.
tool_type: cli
primary_tool: CLIPper
---

## Version Compatibility

Reference examples tested with: MACS3 3.0+, bedtools 2.31+, pandas 2.2+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# CLIP-seq Peak Calling

**"Call binding sites from my CLIP-seq data"** → Identify protein-RNA interaction sites from aligned CLIP reads using specialized peak callers that model crosslink-induced signatures.
- CLI: `clipper -b aligned.bam -s hg38 -o peaks.bed` (CLIPper)
- CLI: `pureclip -i aligned.bam -bai aligned.bam.bai -g genome.fa -o sites.bed` (PureCLIP)

## CLIPper (Recommended)

**Goal:** Identify protein-RNA binding sites from aligned CLIP-seq reads.

**Approach:** Run CLIPper on deduplicated BAM with species-specific gene annotations to call peaks with FDR control and superlocal background correction.

```bash
# Basic peak calling
clipper \
    -b deduped.bam \
    -s hg38 \
    -o peaks.bed \
    --save-pickle

# With FDR threshold
clipper \
    -b deduped.bam \
    -s hg38 \
    -o peaks.bed \
    --FDR 0.05 \
    --superlocal

# Specify gene annotations
clipper \
    -b deduped.bam \
    -s hg38 \
    --gene genes.bed \
    -o peaks.bed
```

## CLIPper Options

| Option | Description |
|--------|-------------|
| -b | Input BAM file |
| -s | Species (hg38, mm10) |
| -o | Output BED file |
| --FDR | FDR threshold (default 0.05) |
| --superlocal | Use superlocal background |
| --gene | Custom gene annotation BED |
| --save-pickle | Save intermediate data |

## PureCLIP (HMM-Based)

**Goal:** Identify single-nucleotide crosslink sites using a hidden Markov model that jointly models enrichment and truncation signals.

**Approach:** Run PureCLIP with aligned BAM and genome FASTA to produce both single-nucleotide crosslink sites and broader binding regions, optionally using an input control BAM.

PureCLIP uses an HMM to model crosslink sites, incorporating enrichment and truncation signals.

```bash
# Installation
conda install -c bioconda pureclip

# Basic peak calling
pureclip \
    -i deduped.bam \
    -bai deduped.bam.bai \
    -g genome.fa \
    -o crosslink_sites.bed \
    -or binding_regions.bed \
    -nt 4

# -nt 4: Number of threads. Adjust based on CPU cores.
# -o: Single-nucleotide crosslink sites
# -or: Broader binding regions
```

### PureCLIP Options

| Option | Description |
|--------|-------------|
| -i | Input BAM file |
| -bai | BAM index file |
| -g | Reference genome FASTA |
| -o | Crosslink sites output |
| -or | Binding regions output |
| -nt | Number of threads |
| -iv | Interval file to restrict analysis |
| -dm | Min distance for merging |

### PureCLIP with Input Control

```bash
# With SMInput control BAM
pureclip \
    -i clip.bam \
    -bai clip.bam.bai \
    -g genome.fa \
    -ibam sminput.bam \
    -ibai sminput.bam.bai \
    -o crosslinks.bed \
    -or regions.bed \
    -nt 8

# -ibam/-ibai: Input control BAM for background modeling
```

### PureCLIP Output

```bash
# Crosslink sites BED contains:
# chr start end name score strand

# Score interpretation:
# Higher scores = more confident crosslink sites

# Filter by score
# score>=3: Medium confidence. Use 5+ for high confidence.
awk '$5 >= 3' crosslink_sites.bed > filtered_sites.bed
```

### PureCLIP for Different CLIP Types

```bash
# eCLIP (recommended settings)
pureclip -i eclip.bam -bai eclip.bam.bai -g genome.fa \
    -o sites.bed -or regions.bed -nt 4 -dm 8

# iCLIP (single-nucleotide resolution)
pureclip -i iclip.bam -bai iclip.bam.bai -g genome.fa \
    -o sites.bed -or regions.bed -nt 4

# PAR-CLIP (T-to-C transitions)
pureclip -i parclip.bam -bai parclip.bam.bai -g genome.fa \
    -o sites.bed -or regions.bed -nt 4
```

## Piranha

```bash
# Basic usage
Piranha -s deduped.bam -o peaks.bed

# With p-value threshold
Piranha -s deduped.bam -o peaks.bed -p 0.01

# Stranded analysis
Piranha -s deduped.bam -o peaks.bed -p 0.01 -u

# Zero-truncated negative binomial
Piranha -s deduped.bam -o peaks.bed -d ZeroTruncatedNegativeBinomial
```

## PEAKachu (for PAR-CLIP)

```bash
# PAR-CLIP specific caller
peakachu adaptive \
    -c control.bam \
    -t treatment.bam \
    -r reference.fa \
    -o peakachu_peaks.gff
```

## MACS3 for CLIP (Alternative)

```bash
# Use narrow peak calling mode
macs3 callpeak \
    -t deduped.bam \
    -f BAM \
    -g hs \
    -n clip_peaks \
    --nomodel \
    --extsize 50 \
    -q 0.01
```

## Strand-Specific Peak Calling

**Goal:** Call CLIP-seq peaks separately on each strand for strand-specific binding analysis.

**Approach:** Split the BAM into plus and minus strand reads using samtools flag filtering, call peaks on each strand independently, and merge.

```bash
# Split BAM by strand
samtools view -h -F 16 deduped.bam | samtools view -Sb - > plus_strand.bam
samtools view -h -f 16 deduped.bam | samtools view -Sb - > minus_strand.bam

# Call peaks on each strand
clipper -b plus_strand.bam -s hg38 -o peaks_plus.bed
clipper -b minus_strand.bam -s hg38 -o peaks_minus.bed

# Combine
cat peaks_plus.bed peaks_minus.bed | sort -k1,1 -k2,2n > peaks_all.bed
```

## Filter Peaks

```bash
# By score
awk '$5 >= 10' peaks.bed > peaks_filtered.bed

# By size
awk '($3 - $2) >= 20 && ($3 - $2) <= 200' peaks.bed > peaks_sized.bed

# By read count (if in name field)
awk '$5 >= 5' peaks.bed > peaks_min5reads.bed
```

## Merge Replicates

```bash
# Use bedtools to find consensus peaks
bedtools intersect -a rep1_peaks.bed -b rep2_peaks.bed -wa | \
    sort -u > consensus_peaks.bed

# Require overlap in N replicates
bedtools multiinter -i rep1.bed rep2.bed rep3.bed | \
    awk '$4 >= 2' | \
    bedtools merge > consensus_peaks.bed
```

## Peak Metrics

```python
import pandas as pd

def load_clip_peaks(bed_path):
    peaks = pd.read_csv(bed_path, sep='\t', header=None,
                        names=['chrom', 'start', 'end', 'name', 'score', 'strand'])
    return peaks

def peak_stats(peaks):
    stats = {
        'n_peaks': len(peaks),
        'mean_width': (peaks['end'] - peaks['start']).mean(),
        'median_score': peaks['score'].median(),
        'peaks_per_chrom': peaks.groupby('chrom').size().to_dict()
    }
    return stats

peaks = load_clip_peaks('peaks.bed')
print(peak_stats(peaks))
```

## Quality Metrics

| Metric | Good Value | Description |
|--------|------------|-------------|
| Peak count | 1,000-50,000 | Depends on RBP |
| Peak width | 20-100 nt | Typical for RBP footprint |
| FRiP | >0.1 | Fraction reads in peaks |

## Calculate FRiP

```bash
# Reads in peaks
reads_in_peaks=$(bedtools intersect -a deduped.bam -b peaks.bed -u | samtools view -c -)

# Total reads
total_reads=$(samtools view -c deduped.bam)

# FRiP
frip=$(echo "scale=4; $reads_in_peaks / $total_reads" | bc)
echo "FRiP: $frip"
```

## Related Skills

- clip-alignment - Generate aligned BAM
- clip-preprocessing - UMI deduplication
- binding-site-annotation - Annotate peaks with gene features
- clip-motif-analysis - Find enriched motifs in peaks

Overview

This skill calls protein–RNA binding site peaks from aligned CLIP-seq reads using CLIPper, PureCLIP, Piranha, and alternatives (MACS3, PEAKachu). It focuses on practical commands, parameter choices, strand-specific handling, filtering, replicate merging, and simple metrics to evaluate peak sets. Use it to convert deduplicated, aligned BAMs into high-confidence binding sites for downstream annotation and motif analysis.

How this skill works

The skill runs specialized peak callers that model CLIP-specific signals: CLIPper for region-level peaks with local background correction, PureCLIP for single-nucleotide crosslink sites using an HMM, and Piranha for count-based peak detection. It covers strand-splitting, control-aware calling, score-based filtering, replicate consensus, and basic quality metrics (peak counts, widths, FRiP). Examples include CLI invocations and light Python utilities to summarize BED outputs.

When to use it

  • You have deduplicated/aligned CLIP-seq BAMs and want binding-site calls
  • You need single-nucleotide crosslink sites (use PureCLIP)
  • You want region-level peaks with FDR control and superlocal background (use CLIPper)
  • You need strand-specific binding analysis
  • You want to merge replicates or filter peaks by score/size

Best practices

  • Verify tool versions and adjust flags to your installed CLI/API signatures
  • Deduplicate reads and build BAM indices before peak calling
  • Run strand-specific calls when binding is strand-biased; split BAMs and call per strand
  • Use input/control BAMs when supported (PureCLIP -ibam) to reduce false positives
  • Filter peaks by score, width, and read counts and require overlap across replicates for consensus

Example use cases

  • Call CLIPper peaks on deduped.bam with superlocal background and FDR 0.05
  • Run PureCLIP to get single-nucleotide crosslink sites and broader binding regions with genome.fa and multithreading
  • Split BAM by strand, call peaks separately, then merge strand-specific results into a final BED
  • Filter peaks by score>=10 and width 20–200 nt, then compute FRiP to assess enrichment
  • Use bedtools multiinter to generate consensus peaks present in at least N replicates

FAQ

Which caller should I choose for single-nucleotide resolution?

PureCLIP is designed for single-nucleotide crosslink sites using an HMM; use CLIPper for broader region-level peaks.

How do I assess peak quality?

Use peak counts, mean peak width, median score, and FRiP. Typical FRiP >0.1 and peak widths 20–100 nt are good starting benchmarks.