binding-site-annotation skill

This skill helps map CLIP-seq binding sites to transcript features such as 3'UTR, 5'UTR, CDS, introns, and ncRNAs, using R (ChIPseeker), bedtools, or Python.

npx playbooks add skill gptomics/bioskills --skill binding-site-annotation

---
name: bio-clip-seq-binding-site-annotation
description: Annotate CLIP-seq binding sites to genomic features including 3'UTR, 5'UTR, CDS, introns, and ncRNAs. Use when characterizing where an RBP binds in transcripts.
tool_type: mixed
primary_tool: ChIPseeker
---

## Version Compatibility

Reference examples tested with: bedtools 2.31+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Binding Site Annotation

**"Annotate where my RBP binds in transcripts"** → Map CLIP-seq peaks to genomic features (3'UTR, 5'UTR, CDS, introns, ncRNAs) to characterize RNA-binding protein target regions.
- R: `ChIPseeker::annotatePeak()` with transcript annotation databases
- CLI: `bedtools intersect` with gene model BED files

## Using ChIPseeker (R)

**Goal:** Classify CLIP-seq binding sites by genomic feature (3'UTR, 5'UTR, CDS, intron).

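**Approach:** Load peaks and a TxDb transcript database, annotate with `annotatePeak`, and visualize the feature distribution with a pie chart. ChIPseeker reports coding regions under its Exon category rather than as CDS, and it also assigns Promoter, Downstream, and Distal Intergenic labels that were designed for ChIP-seq.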

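```r
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

# Transcript models for hg38; must match the assembly the peaks were called on
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

# Read peaks (BED) and assign each one to a genomic feature.
# Strand is ignored by default; check ?annotatePeak (sameStrand) for strand-aware assignment.
peaks <- readPeakFile('peaks.bed')
anno <- annotatePeak(peaks, TxDb = txdb)

# Pie chart of the feature distribution
plotAnnoPie(anno)
```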

## Using BEDTools

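```bash
# Keep peaks overlapping 3'UTRs; -wa/-wb write the peak and the matching UTR record
# CLIP-seq is strand-specific: add -s if both files carry a strand column
bedtools intersect -a peaks.bed -b 3utr.bed -wa -wb > peaks_3utr.bed
```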
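
The `3utr.bed` file used above has to be derived from a gene model first. The following is a minimal sketch of one way to do that, assuming an Ensembl-style GTF whose third column labels 3'UTRs as `three_prime_utr` (GENCODE GTFs label both UTRs as `UTR`, so they would need a different approach). The helper name `gtf_features_to_bed` and the toy records are hypothetical.

```python
import io

def gtf_features_to_bed(gtf_lines, feature_type="three_prime_utr"):
    """Yield BED6 rows (as tuples) for GTF records of one feature type.

    Hypothetical helper. GTF coordinates are 1-based and inclusive, BED
    coordinates are 0-based and half-open, so only the start shifts by one.
    """
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue  # malformed record or a different feature type
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        yield (chrom, start - 1, end, feature_type, ".", strand)

# Toy GTF with invented coordinates, for illustration only
toy_gtf = io.StringIO(
    '#!genome-build toy\n'
    '1\ttoy\tCDS\t101\t200\t.\t+\t0\tgene_id "G1";\n'
    '1\ttoy\tthree_prime_utr\t201\t300\t.\t+\t.\tgene_id "G1";\n'
)
bed_rows = list(gtf_features_to_bed(toy_gtf))
for row in bed_rows:
    print("\t".join(map(str, row)))
```

Chromosome names in the resulting BED must match those in `peaks.bed` (Ensembl uses `1`, UCSC uses `chr1`), otherwise `bedtools intersect` silently reports no overlaps.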

## Python Annotation

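```python
import pandas as pd

def annotate_peaks(peaks_bed, annotation_gtf):
    '''Label each peak with the GTF feature types (column 3) it overlaps; unmatched peaks are dropped.'''
    peaks = pd.read_csv(peaks_bed, sep='\t', header=None, comment='#', dtype={0: str}).iloc[:, :3]
    peaks.columns = ['chrom', 'start', 'end']
    gtf = pd.read_csv(annotation_gtf, sep='\t', header=None, comment='#', dtype={0: str}).iloc[:, [0, 2, 3, 4]]
    gtf.columns = ['chrom', 'feature', 'f_start', 'f_end']
    gtf['f_start'] -= 1  # GTF is 1-based inclusive, BED is 0-based half-open
    # Naive per-chromosome join, strand ignored; memory-heavy, so prefer bedtools for full annotations
    hits = peaks.merge(gtf, on='chrom')
    hits = hits[(hits['start'] < hits['f_end']) & (hits['f_start'] < hits['end'])]
    labels = hits.groupby(['chrom', 'start', 'end'])['feature']
    return labels.agg(lambda s: ','.join(sorted(set(s)))).reset_index()
```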

## Related Skills

- clip-peak-calling - Get peaks
- genome-intervals/interval-arithmetic - Intersect peaks with genomic features

## Overview

This skill annotates CLIP-seq binding sites to transcript features such as 3'UTR, 5'UTR, CDS, introns, and noncoding RNAs. It provides simple, reproducible patterns for R, Python, and CLI workflows to classify where an RNA-binding protein (RBP) contacts transcripts. Use it to summarize binding distributions, generate feature-specific peak sets, or prepare inputs for downstream motif and enrichment analyses.

## How this skill works

The core approach intersects peak coordinates with a gene model or transcript annotation (TxDb/GTF/BED) and assigns each peak to a feature category. In R, ChIPseeker::annotatePeak uses a TxDb object to return feature annotations and offers quick visualizations. On the command line, bedtools intersect pairs peaks with pre-extracted feature BED files (3'UTR, CDS, intron, etc.). A Python pattern loads peak and annotation tables, performs interval joins, and collapses overlapping feature matches into a primary category per peak.
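
The interval join and the collapse to a primary category can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the skill's own implementation: the priority order, the `intergenic` fallback label, the function names, and the toy coordinates are all hypothetical, and intervals are taken to be 0-based half-open (BED-style) with strand-aware matching.

```python
from collections import defaultdict

# Assumed priority order (earlier wins); adjust to the rule you report
PRIORITY = ["CDS", "3UTR", "5UTR", "intron", "ncRNA"]

def overlaps(a_start, a_end, b_start, b_end):
    """True if two 0-based half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def annotate(peaks, features, priority=PRIORITY):
    """Assign each peak the highest-priority feature it overlaps.

    peaks:    list of (chrom, start, end, strand)
    features: list of (chrom, start, end, strand, label)
    Peaks with no same-strand overlap are labelled 'intergenic'.
    """
    # Group features by chromosome and strand so each peak scans only candidates
    by_key = defaultdict(list)
    for chrom, start, end, strand, label in features:
        by_key[(chrom, strand)].append((start, end, label))
    rank = {label: i for i, label in enumerate(priority)}
    annotated = []
    for chrom, start, end, strand in peaks:
        hits = {label for f_start, f_end, label in by_key[(chrom, strand)]
                if overlaps(start, end, f_start, f_end)}
        # Labels missing from the priority list rank last
        primary = min(sorted(hits), key=lambda l: rank.get(l, len(rank)),
                      default="intergenic")
        annotated.append((chrom, start, end, strand, primary))
    return annotated

# Toy data with invented coordinates
features = [
    ("chr1", 100, 200, "+", "CDS"),
    ("chr1", 200, 300, "+", "3UTR"),
    ("chr1", 0, 1000, "-", "intron"),
]
peaks = [
    ("chr1", 190, 210, "+"),  # spans the CDS/3'UTR boundary
    ("chr1", 250, 260, "+"),  # inside the 3'UTR
    ("chr1", 500, 520, "-"),  # intronic on the minus strand
    ("chr2", 10, 20, "+"),    # no annotated feature
]
annotated = annotate(peaks, features)
for row in annotated:
    print(*row, sep="\t")
```

The linear scan per peak is fine for small inputs; for genome-scale annotations, `bedtools intersect` or an interval-tree library is the more practical choice.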

## When to use it

- Characterizing transcript regions targeted by an RBP after peak calling
- Generating counts of peaks per feature class for publication or QC
- Extracting subsets of peaks in UTRs or CDS for motif discovery
- Creating input files for downstream differential binding or enrichment testing
- Comparing binding profiles across conditions or RBPs
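
Once every peak carries a single feature label, the per-class counts and a side-by-side comparison of two conditions take only a few lines. The sketch below assumes the labels are already available as Python lists; the function name `feature_fractions` and the label lists are invented for illustration and are not real data.

```python
from collections import Counter

def feature_fractions(labels):
    """Return {feature: (count, fraction)} for a list of per-peak labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {feat: (n, n / total) for feat, n in counts.most_common()}

# Invented per-peak labels for two conditions
control = ["3UTR"] * 6 + ["intron"] * 3 + ["CDS"]
knockdown = ["3UTR"] * 3 + ["intron"] * 6 + ["CDS"]

ctrl, kd = feature_fractions(control), feature_fractions(knockdown)

# Print a small comparison table; features absent in one condition show 0%
print(f"{'feature':<8}{'control':>10}{'knockdown':>12}")
for feat in sorted(set(ctrl) | set(kd)):
    c_frac = ctrl.get(feat, (0, 0.0))[1]
    k_frac = kd.get(feat, (0, 0.0))[1]
    print(f"{feat:<8}{c_frac:>10.1%}{k_frac:>12.1%}")
```

Comparing fractions rather than raw counts keeps conditions with different total peak numbers on the same scale, although it does not by itself test whether a shift is significant.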

## Best practices

- Match genome assemblies between peaks and annotation (e.g., hg38 peaks with hg38 TxDb/GTF)
- Pre-process gene models into non-overlapping feature BED files or use a consistent prioritization order (e.g., CDS > UTR > intron)
- Report how multi-feature overlaps are resolved (first hit, longest overlap, or hierarchical rules)
- Verify tool versions (bedtools, pandas, ChIPseeker) and adapt code to the installed API
- Validate results with a small subset visually (IGV) to confirm annotations
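
A quick sanity check related to the first practice is to compare chromosome names between the peak file and the annotation before intersecting. The sketch below is only a heuristic: matching names do not prove that the assemblies match, but a mismatch such as `chr1` versus `1` is a common reason for empty results. The function name and report keys are hypothetical.

```python
def chrom_name_report(peak_chroms, annot_chroms):
    """Compare chromosome names used by peaks and by the annotation."""
    peak_set, annot_set = set(peak_chroms), set(annot_chroms)
    missing = sorted(peak_set - annot_set)
    # Guess at the UCSC ('chr1') versus Ensembl ('1') naming difference:
    # every missing name must match once the 'chr' prefix is added or removed
    prefix_mismatch = bool(missing) and all(
        (c[3:] in annot_set) if c.startswith("chr") else ("chr" + c in annot_set)
        for c in missing
    )
    return {
        "shared": sorted(peak_set & annot_set),
        "missing_from_annotation": missing,
        "likely_chr_prefix_mismatch": prefix_mismatch,
    }

# Toy example: UCSC-style peak names against Ensembl-style annotation names
report = chrom_name_report(["chr1", "chr2", "chr1"], ["1", "2", "X"])
print(report)
```

If `likely_chr_prefix_mismatch` is true, rename one side consistently before running any intersection; names such as `chrM` versus `MT` are not caught by this simple prefix test.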

## Example use cases

- Run ChIPseeker::annotatePeak on hg38 TxDb to produce a pie chart of feature distribution for an RBP replicate
- Use bedtools intersect to create a BED file of peaks overlapping 3'UTRs and feed that to motif-finding software
- Write a Python script that annotates peaks with GTF-derived features, collapses multi-feature hits by priority, and outputs a summary table
- Compare intron versus 3'UTR binding between control and knockdown samples to infer regulatory shifts
- Filter peaks to ncRNA annotations to investigate noncoding transcript targeting

## FAQ

**How do I handle peaks overlapping multiple features?**

Choose one resolution rule and apply it throughout the analysis: hierarchical priority (e.g., CDS > UTR > intron), longest-overlap assignment, or reporting every overlapping label. Document the rule you chose.
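
As a rough illustration of how the non-hierarchical rules differ, the sketch below resolves one peak by longest overlap and by multi-label reporting. It assumes 0-based half-open coordinates on a single chromosome and strand; the function names, the `intergenic` fallback, and the coordinates are hypothetical.

```python
def overlap_len(a_start, a_end, b_start, b_end):
    """Number of bases shared by two 0-based half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def resolve(peak, hits, rule="longest"):
    """Resolve one peak against candidate features.

    peak: (start, end); hits: list of (start, end, label)
    rule: 'longest' keeps the label with the largest overlap,
          'multi' reports every overlapping label.
    """
    scored = [(overlap_len(peak[0], peak[1], s, e), label) for s, e, label in hits]
    scored = [(n, label) for n, label in scored if n > 0]
    if not scored:
        return "intergenic"
    if rule == "longest":
        # Largest overlap wins; ties are broken alphabetically for determinism
        return min(scored, key=lambda x: (-x[0], x[1]))[1]
    if rule == "multi":
        return ",".join(sorted({label for _, label in scored}))
    raise ValueError(f"unknown rule: {rule}")

# A peak covering 5 bases of a CDS and 10 bases of the adjacent 3'UTR
hits = [(100, 200, "CDS"), (200, 300, "3UTR")]
peak = (195, 210)
longest_label = resolve(peak, hits, rule="longest")
multi_label = resolve(peak, hits, rule="multi")
print(longest_label, multi_label)
```

For this peak a CDS > UTR hierarchy would report CDS while longest overlap reports 3'UTR, which is why the chosen rule needs to be stated alongside any feature distribution.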

**Which annotation source should I use?**

Prefer an annotation matched to your reference genome (TxDb for R, GTF from Ensembl/GENCODE). For reproducibility, record the annotation version and source.