home / skills / gptomics / bioskills / clip-peak-calling

clip-peak-calling skill

safe

This skill helps identify protein-RNA binding sites from CLIP-seq data by applying CLIPper, PureCLIP, or Piranha peak callers.

npx playbooks add skill gptomics/bioskills --skill clip-peak-calling

Review the files below or copy the command above to add this skill to your agents.

Files (3)

SKILL.md

7.3 KB

---
name: bio-clip-seq-clip-peak-calling
description: Call protein-RNA binding site peaks from CLIP-seq data using CLIPper, PureCLIP, or Piranha. Use when identifying RBP binding sites from aligned CLIP reads.
tool_type: cli
primary_tool: CLIPper
---

## Version Compatibility

Reference examples tested with: MACS3 3.0+, bedtools 2.31+, pandas 2.2+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# CLIP-seq Peak Calling

**"Call binding sites from my CLIP-seq data"** → Identify protein-RNA interaction sites from aligned CLIP reads using specialized peak callers that model crosslink-induced signatures.
- CLI: `clipper -b aligned.bam -s hg38 -o peaks.bed` (CLIPper)
- CLI: `pureclip -i aligned.bam -bai aligned.bam.bai -g genome.fa -o sites.bed` (PureCLIP)

## CLIPper (Recommended)

**Goal:** Identify protein-RNA binding sites from aligned CLIP-seq reads.

**Approach:** Run CLIPper on deduplicated BAM with species-specific gene annotations to call peaks with FDR control and superlocal background correction.

```bash
# Basic peak calling
clipper \
    -b deduped.bam \
    -s hg38 \
    -o peaks.bed \
    --save-pickle

# With FDR threshold
clipper \
    -b deduped.bam \
    -s hg38 \
    -o peaks.bed \
    --FDR 0.05 \
    --superlocal

# Specify gene annotations
clipper \
    -b deduped.bam \
    -s hg38 \
    --gene genes.bed \
    -o peaks.bed
```

## CLIPper Options

| Option | Description |
|--------|-------------|
| -b | Input BAM file |
| -s | Species (hg38, mm10) |
| -o | Output BED file |
| --FDR | FDR threshold (default 0.05) |
| --superlocal | Use superlocal background |
| --gene | Custom gene annotation BED |
| --save-pickle | Save intermediate data |

## PureCLIP (HMM-Based)

**Goal:** Identify single-nucleotide crosslink sites using a hidden Markov model that jointly models enrichment and truncation signals.

**Approach:** Run PureCLIP with aligned BAM and genome FASTA to produce both single-nucleotide crosslink sites and broader binding regions, optionally using an input control BAM.

PureCLIP uses an HMM to model crosslink sites, incorporating enrichment and truncation signals.

```bash
# Installation
conda install -c bioconda pureclip

# Basic peak calling
pureclip \
    -i deduped.bam \
    -bai deduped.bam.bai \
    -g genome.fa \
    -o crosslink_sites.bed \
    -or binding_regions.bed \
    -nt 4

# -nt 4: Number of threads. Adjust based on CPU cores.
# -o: Single-nucleotide crosslink sites
# -or: Broader binding regions
```

### PureCLIP Options

| Option | Description |
|--------|-------------|
| -i | Input BAM file |
| -bai | BAM index file |
| -g | Reference genome FASTA |
| -o | Crosslink sites output |
| -or | Binding regions output |
| -nt | Number of threads |
| -iv | Interval file to restrict analysis |
| -dm | Min distance for merging |

### PureCLIP with Input Control

```bash
# With SMInput control BAM
pureclip \
    -i clip.bam \
    -bai clip.bam.bai \
    -g genome.fa \
    -ibam sminput.bam \
    -ibai sminput.bam.bai \
    -o crosslinks.bed \
    -or regions.bed \
    -nt 8

# -ibam/-ibai: Input control BAM for background modeling
```

### PureCLIP Output

```bash
# Crosslink sites BED contains:
# chr start end name score strand

# Score interpretation:
# Higher scores = more confident crosslink sites

# Filter by score
# score>=3: Medium confidence. Use 5+ for high confidence.
awk '$5 >= 3' crosslink_sites.bed > filtered_sites.bed
```

### PureCLIP for Different CLIP Types

```bash
# eCLIP (recommended settings)
pureclip -i eclip.bam -bai eclip.bam.bai -g genome.fa \
    -o sites.bed -or regions.bed -nt 4 -dm 8

# iCLIP (single-nucleotide resolution)
pureclip -i iclip.bam -bai iclip.bam.bai -g genome.fa \
    -o sites.bed -or regions.bed -nt 4

# PAR-CLIP (T-to-C transitions)
pureclip -i parclip.bam -bai parclip.bam.bai -g genome.fa \
    -o sites.bed -or regions.bed -nt 4
```

## Piranha

```bash
# Basic usage
Piranha -s deduped.bam -o peaks.bed

# With p-value threshold
Piranha -s deduped.bam -o peaks.bed -p 0.01

# Stranded analysis
Piranha -s deduped.bam -o peaks.bed -p 0.01 -u

# Zero-truncated negative binomial
Piranha -s deduped.bam -o peaks.bed -d ZeroTruncatedNegativeBinomial
```

## PEAKachu (for PAR-CLIP)

```bash
# PAR-CLIP specific caller
peakachu adaptive \
    -c control.bam \
    -t treatment.bam \
    -r reference.fa \
    -o peakachu_peaks.gff
```

## MACS3 for CLIP (Alternative)

```bash
# Use narrow peak calling mode
macs3 callpeak \
    -t deduped.bam \
    -f BAM \
    -g hs \
    -n clip_peaks \
    --nomodel \
    --extsize 50 \
    -q 0.01
```

## Strand-Specific Peak Calling

**Goal:** Call CLIP-seq peaks separately on each strand for strand-specific binding analysis.

**Approach:** Split the BAM into plus and minus strand reads using samtools flag filtering, call peaks on each strand independently, and merge.

```bash
# Split BAM by strand
samtools view -h -F 16 deduped.bam | samtools view -Sb - > plus_strand.bam
samtools view -h -f 16 deduped.bam | samtools view -Sb - > minus_strand.bam

# Call peaks on each strand
clipper -b plus_strand.bam -s hg38 -o peaks_plus.bed
clipper -b minus_strand.bam -s hg38 -o peaks_minus.bed

# Combine
cat peaks_plus.bed peaks_minus.bed | sort -k1,1 -k2,2n > peaks_all.bed
```

## Filter Peaks

```bash
# By score
awk '$5 >= 10' peaks.bed > peaks_filtered.bed

# By size
awk '($3 - $2) >= 20 && ($3 - $2) <= 200' peaks.bed > peaks_sized.bed

# By read count (if in name field)
awk '$5 >= 5' peaks.bed > peaks_min5reads.bed
```

## Merge Replicates

```bash
# Use bedtools to find consensus peaks
bedtools intersect -a rep1_peaks.bed -b rep2_peaks.bed -wa | \
    sort -u > consensus_peaks.bed

# Require overlap in N replicates
bedtools multiinter -i rep1.bed rep2.bed rep3.bed | \
    awk '$4 >= 2' | \
    bedtools merge > consensus_peaks.bed
```

## Peak Metrics

```python
import pandas as pd

def load_clip_peaks(bed_path):
    peaks = pd.read_csv(bed_path, sep='\t', header=None,
                        names=['chrom', 'start', 'end', 'name', 'score', 'strand'])
    return peaks

def peak_stats(peaks):
    stats = {
        'n_peaks': len(peaks),
        'mean_width': (peaks['end'] - peaks['start']).mean(),
        'median_score': peaks['score'].median(),
        'peaks_per_chrom': peaks.groupby('chrom').size().to_dict()
    }
    return stats

peaks = load_clip_peaks('peaks.bed')
print(peak_stats(peaks))
```

## Quality Metrics

| Metric | Good Value | Description |
|--------|------------|-------------|
| Peak count | 1,000-50,000 | Depends on RBP |
| Peak width | 20-100 nt | Typical for RBP footprint |
| FRiP | >0.1 | Fraction reads in peaks |

## Calculate FRiP

```bash
# Reads in peaks
reads_in_peaks=$(bedtools intersect -a deduped.bam -b peaks.bed -u | samtools view -c -)

# Total reads
total_reads=$(samtools view -c deduped.bam)

# FRiP
frip=$(echo "scale=4; $reads_in_peaks / $total_reads" | bc)
echo "FRiP: $frip"
```

## Related Skills

- clip-alignment - Generate aligned BAM
- clip-preprocessing - UMI deduplication
- binding-site-annotation - Annotate peaks with gene features
- clip-motif-analysis - Find enriched motifs in peaks

Overview

This skill calls protein–RNA binding site peaks from aligned CLIP-seq reads using CLIPper, PureCLIP, Piranha, and alternatives (MACS3, PEAKachu). It focuses on practical commands, parameter choices, strand-specific handling, filtering, replicate merging, and simple metrics to evaluate peak sets. Use it to convert deduplicated, aligned BAMs into high-confidence binding sites for downstream annotation and motif analysis.

How this skill works

The skill runs specialized peak callers that model CLIP-specific signals: CLIPper for region-level peaks with local background correction, PureCLIP for single-nucleotide crosslink sites using an HMM, and Piranha for count-based peak detection. It covers strand-splitting, control-aware calling, score-based filtering, replicate consensus, and basic quality metrics (peak counts, widths, FRiP). Examples include CLI invocations and light Python utilities to summarize BED outputs.

When to use it

You have deduplicated/aligned CLIP-seq BAMs and want binding-site calls
You need single-nucleotide crosslink sites (use PureCLIP)
You want region-level peaks with FDR control and superlocal background (use CLIPper)
You need strand-specific binding analysis
You want to merge replicates or filter peaks by score/size

Best practices

Verify tool versions and adjust flags to your installed CLI/API signatures
Deduplicate reads and build BAM indices before peak calling
Run strand-specific calls when binding is strand-biased; split BAMs and call per strand
Use input/control BAMs when supported (PureCLIP -ibam) to reduce false positives
Filter peaks by score, width, and read counts and require overlap across replicates for consensus

Example use cases

Call CLIPper peaks on deduped.bam with superlocal background and FDR 0.05
Run PureCLIP to get single-nucleotide crosslink sites and broader binding regions with genome.fa and multithreading
Split BAM by strand, call peaks separately, then merge strand-specific results into a final BED
Filter peaks by score>=10 and width 20–200 nt, then compute FRiP to assess enrichment
Use bedtools multiinter to generate consensus peaks present in at least N replicates

FAQ

Which caller should I choose for single-nucleotide resolution?

PureCLIP is designed for single-nucleotide crosslink sites using an HMM; use CLIPper for broader region-level peaks.

How do I assess peak quality?

Use peak counts, mean peak width, median score, and FRiP. Typical FRiP >0.1 and peak widths 20–100 nt are good starting benchmarks.