home / skills / gptomics / bioskills / peptide-identification

peptide-identification skill

/proteomics/peptide-identification

This skill identifies peptides from MS/MS spectra by database search, spectral matching, and FDR estimation using target-decoy approaches.

npx playbooks add skill gptomics/bioskills --skill peptide-identification

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
3.8 KB
---
name: bio-proteomics-peptide-identification
description: Peptide-spectrum matching and protein identification from MS/MS data. Use when identifying peptides from tandem mass spectra. Covers database searching, spectral library matching, and FDR estimation using target-decoy approaches.
tool_type: mixed
primary_tool: pyOpenMS
---

## Version Compatibility

Reference examples tested with: MSnbase 2.28+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Peptide Identification

**"Identify peptides from my MS/MS spectra"** → Match tandem mass spectra against a protein database to identify peptide sequences, then control false discovery rate using target-decoy competition.
- Python: `pyopenms` for in-memory database search and PSM handling
- CLI: `comet`, `MSFragger`, `X!Tandem` for high-throughput database searching
- R: `MSnbase::readMSData()` for importing search results

## Database Search with pyOpenMS

**Goal:** Identify peptide sequences from tandem mass spectra by matching against a protein database.

**Approach:** Load a FASTA database, perform in-silico tryptic digestion to generate theoretical peptides, then match experimental spectra against theoretical fragment ion patterns to identify peptide-spectrum matches (PSMs).

```python
from pyopenms import MSExperiment, MzMLFile, FASTAFile, ProteaseDigestion
from pyopenms import ModificationsDB, AASequence

# Load FASTA database
fasta_entries = []
FASTAFile().load('uniprot_human.fasta', fasta_entries)

# In-silico digestion
digestion = ProteaseDigestion()
digestion.setEnzyme('Trypsin')
digestion.setMissedCleavages(2)

peptides = []
for entry in fasta_entries:
    seq = AASequence.fromString(entry.sequence)
    result = []
    digestion.digest(seq, result)
    peptides.extend([(entry.identifier, str(p)) for p in result])
```

## Working with Search Results (idXML)

```python
from pyopenms import IdXMLFile, ProteinIdentification, PeptideIdentification

protein_ids = []
peptide_ids = []
IdXMLFile().load('search_results.idXML', protein_ids, peptide_ids)

for pep_id in peptide_ids:
    rt = pep_id.getRT()
    mz = pep_id.getMZ()
    for hit in pep_id.getHits():
        sequence = hit.getSequence()
        score = hit.getScore()
        charge = hit.getCharge()
```

## FDR Estimation (Target-Decoy)

```python
def calculate_fdr(scores, is_decoy, score_threshold):
    above_threshold = scores >= score_threshold
    n_target = ((~is_decoy) & above_threshold).sum()
    n_decoy = (is_decoy & above_threshold).sum()
    fdr = n_decoy / n_target if n_target > 0 else 1.0
    return fdr

def find_score_at_fdr(scores, is_decoy, target_fdr=0.01):
    sorted_scores = np.sort(scores)[::-1]
    for threshold in sorted_scores:
        fdr = calculate_fdr(scores, is_decoy, threshold)
        if fdr <= target_fdr:
            return threshold
    return sorted_scores[-1]
```

## R: Search Result Processing

```r
library(MSnbase)

# Read mzIdentML results
psms <- readMzIdData('results.mzid')

# Filter to 1% FDR
psms_filtered <- psms[psms$qvalue <= 0.01, ]

# Unique peptides per protein
peptide_counts <- table(psms_filtered$accession)
```

## Spectral Library Search

```python
from pyopenms import SpectraSTSearchAlgorithm, MSExperiment

# Load spectral library
library = MSExperiment()
MzMLFile().load('spectral_library.mzML', library)

# Match query spectra against library
# Returns similarity scores and library matches
```

## Related Skills

- data-import - Load raw MS data before identification
- protein-inference - Group peptides to proteins
- ptm-analysis - Identify modified peptides

Overview

This skill performs peptide-spectrum matching and protein identification from MS/MS data using database search, spectral library matching, and target-decoy FDR estimation. It provides practical code patterns for in-silico digestion, running searches with pyOpenMS or CLI engines, parsing idXML/mzIdentML results, and computing FDR thresholds. The focus is reproducible, version-aware examples and straightforward result filtering for downstream protein inference.

How this skill works

The skill loads MS/MS spectra and a protein FASTA, generates theoretical peptides via in-silico digestion, and scores experimental spectra against predicted fragment ions or library spectra. It also shows how to read search results (idXML, mzIdentML), extract PSM attributes (retention time, m/z, sequence, score, charge), and apply target-decoy competition to estimate FDR. Utility functions compute score thresholds that meet a desired FDR and filter PSMs for downstream analysis.

When to use it

  • Identifying peptide sequences from tandem mass spectra (shotgun proteomics).
  • Validating search results and filtering PSMs to a target FDR (e.g., 1%).
  • Matching query spectra to a spectral library for faster or identity-rich identifications.
  • Parsing and inspecting search result files (idXML, mzIdentML) for QA or custom downstream processing.
  • Preparing filtered peptide lists for protein inference or PTM analysis.

Best practices

  • Confirm installed package versions and inspect function signatures before running examples (pip show, packageVersion, help).
  • Use in-silico digestion settings that match your experimental enzyme and allowed missed cleavages to reduce false matches.
  • Include a decoy strategy (reversed/shuffled) in the database and use target-decoy competition for FDR estimation.
  • Prefer spectral-library search when high-quality libraries exist; use database search for novel sequences or wide PTM searches.
  • Filter PSMs by score threshold determined empirically with FDR functions rather than arbitrary cutoffs.

Example use cases

  • Run pyOpenMS in-silico tryptic digestion of a Uniprot FASTA and generate candidate peptides for a small-scale proof-of-concept search.
  • Load search_results.idXML and extract top hits per spectrum to inspect scoring distribution and QC metrics (RT, m/z, charge).
  • Compute the score threshold at 1% FDR from a set of PSM scores labeled as target/decoy and filter PSMs for downstream protein grouping.
  • Match experimental mzML spectra against a spectral library to annotate abundant peptides in a DIA or DDA dataset.
  • Read mzIdentML in R via MSnbase to compute unique peptide counts per protein after FDR filtering.

FAQ

What search engines can I use with these patterns?

Use pyOpenMS for in-memory workflows or standard CLI engines like Comet, MSFragger, and X!Tandem for high-throughput searches; adapt parsing to the engine output format.

How do I pick a decoy strategy?

Common choices are sequence reversal or shuffling; ensure decoys mimic target composition and apply target-decoy competition consistently when estimating FDR.