
data-import skill

/proteomics/data-import

This skill loads and parses mass spectrometry data formats (mzML, mzXML, MaxQuant outputs) with filtering and missing-value assessment to bootstrap analyses.

npx playbooks add skill gptomics/bioskills --skill data-import

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
4.5 KB
---
name: bio-proteomics-data-import
description: Load and parse mass spectrometry data formats including mzML, mzXML, and quantification tool outputs like MaxQuant proteinGroups.txt. Use when starting a proteomics analysis with raw or processed MS data. Handles contaminant filtering and missing value assessment.
tool_type: mixed
primary_tool: pyOpenMS
---

## Version Compatibility

Reference examples tested with: MSnbase 2.28+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
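The pip and R commands above can also be done programmatically; a quick sketch using the Python 3.8+ stdlib `importlib.metadata` (the package list here is illustrative, adapt it to your environment):

```python
import importlib.metadata

# Collect installed versions of the packages the examples below rely on.
versions = {}
for pkg in ('pandas', 'pyopenms'):
    try:
        versions[pkg] = importlib.metadata.version(pkg)
    except importlib.metadata.PackageNotFoundError:
        versions[pkg] = None  # not installed in this environment

print(versions)
```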

# Mass Spectrometry Data Import

**"Load my mass spec data into Python"** → Parse mzML/mzXML raw files or MaxQuant proteinGroups.txt into data structures for programmatic access and downstream analysis.
- Python: `pyopenms.MzMLFile().load()` for raw spectra, `pandas.read_csv()` for search engine outputs
- R: `MSnbase::readMSData()` for raw, `read.delim()` for MaxQuant/Proteome Discoverer

## Loading mzML/mzXML Files with pyOpenMS

**Goal:** Parse raw mass spectrometry data files into memory for programmatic access.

**Approach:** Load mzML/mzXML into an MSExperiment object, then iterate spectra by MS level to access peaks and precursor info.

```python
from pyopenms import MSExperiment, MzMLFile, MzXMLFile

exp = MSExperiment()
MzMLFile().load('sample.mzML', exp)  # use MzXMLFile() for .mzXML input

for spectrum in exp:
    if spectrum.getMSLevel() == 1:
        # MS1 survey scan: paired arrays of m/z and intensity values
        mz, intensity = spectrum.get_peaks()
    elif spectrum.getMSLevel() == 2:
        # MS2 fragment scan: precursor metadata is attached to the spectrum
        precursor = spectrum.getPrecursors()[0]
        precursor_mz = precursor.getMZ()
```

## Loading MaxQuant Output

**Goal:** Import MaxQuant proteinGroups.txt with contaminant and decoy filtering.

**Approach:** Read the TSV file, remove reverse hits, contaminants, and site-only identifications, then extract intensity columns.

```python
import pandas as pd

protein_groups = pd.read_csv('proteinGroups.txt', sep='\t', low_memory=False)

# Filter contaminants and reverse hits
contam_col = 'Potential contaminant' if 'Potential contaminant' in protein_groups.columns else 'Contaminant'
protein_groups = protein_groups[
    (protein_groups.get(contam_col, '') != '+') &
    (protein_groups.get('Reverse', '') != '+') &
    (protein_groups.get('Only identified by site', '') != '+')
]

# Extract intensity columns (prefer LFQ, then iBAQ)
intensity_cols = [c for c in protein_groups.columns if c.startswith('LFQ intensity') or c.startswith('iBAQ ')]
if not intensity_cols:
    # Fall back to raw Intensity columns, skipping SILAC channel
    # columns such as 'Intensity L <experiment>'
    intensity_cols = [c for c in protein_groups.columns
                      if c.startswith('Intensity ')
                      and not c.startswith(('Intensity L ', 'Intensity M ', 'Intensity H '))]
intensities = protein_groups[['Protein IDs', 'Gene names'] + intensity_cols]
```
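One caveat worth handling at import time: MaxQuant writes 0, not NaN, for missing quantifications, so convert zeros before log-transforming or assessing missingness. A minimal sketch, with a toy two-protein table standing in for the filtered `intensities` frame above:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the filtered intensity table produced above.
intensities = pd.DataFrame({
    'Protein IDs': ['P1', 'P2'],
    'LFQ intensity A': [1500.0, 0.0],
    'LFQ intensity B': [0.0, 2300.0],
})
lfq_cols = [c for c in intensities.columns if c.startswith('LFQ intensity')]

# MaxQuant reports missing quantifications as 0; convert to NaN so that
# missing-value assessment is accurate and log2(0) = -inf never appears.
intensities[lfq_cols] = intensities[lfq_cols].replace(0, np.nan)
log2_lfq = np.log2(intensities[lfq_cols])
```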

## Loading Spectronaut/DIA-NN Output

**Goal:** Import DIA-NN long-format report and reshape into a protein-by-sample quantification matrix.

**Approach:** Pivot the report table on protein group and run columns, using MaxLFQ values.

```python
import pandas as pd

diann_report = pd.read_csv('report.tsv', sep='\t')

# Pivot to protein-level matrix
protein_matrix = diann_report.pivot_table(
    index='Protein.Group', columns='Run', values='PG.MaxLFQ', aggfunc='first'
)
```
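Spectronaut's long-format report can be reshaped the same way. A hedged sketch assuming the default export column names `R.FileName`, `PG.ProteinGroups`, and `PG.Quantity` (verify these against your export schema, which is configurable in Spectronaut):

```python
import pandas as pd

# Toy rows standing in for a Spectronaut long-format report.
spectronaut = pd.DataFrame({
    'R.FileName': ['run1', 'run1', 'run2', 'run2'],
    'PG.ProteinGroups': ['P1', 'P2', 'P1', 'P2'],
    'PG.Quantity': [100.0, 200.0, 110.0, 190.0],
})

# Pivot to a protein-by-run quantification matrix.
protein_matrix = spectronaut.pivot_table(
    index='PG.ProteinGroups', columns='R.FileName',
    values='PG.Quantity', aggfunc='first'
)
```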

## R: Loading with MSnbase

**Goal:** Load raw MS data in R for interactive exploration of spectra and metadata.

**Approach:** Use MSnbase's on-disk reading mode to access spectra and feature metadata without loading all data into memory.

```r
library(MSnbase)

raw_data <- readMSData('sample.mzML', mode = 'onDisk')
spectra <- spectra(raw_data)
header_info <- fData(raw_data)
```

## Missing Value Assessment

**Goal:** Quantify missing value patterns across proteins and samples in an intensity matrix.

**Approach:** Count NaN values per protein and per sample, then compute overall missing percentage.

```python
def assess_missing_values(df, intensity_cols):
    """Summarize NaN patterns: counts per protein (row) and per sample
    (column), plus the overall missing percentage."""
    missing_per_protein = df[intensity_cols].isna().sum(axis=1)
    missing_per_sample = df[intensity_cols].isna().sum(axis=0)

    total_missing = df[intensity_cols].isna().sum().sum()
    total_values = df[intensity_cols].size
    missing_pct = 100 * total_missing / total_values

    return {'per_protein': missing_per_protein, 'per_sample': missing_per_sample, 'total_pct': missing_pct}
```
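A usage sketch with a toy 3-protein, 2-sample matrix (the helper is repeated here so the snippet runs standalone; column names are illustrative):

```python
import numpy as np
import pandas as pd

def assess_missing_values(df, intensity_cols):
    # Same helper as above, repeated so this snippet is self-contained.
    missing_per_protein = df[intensity_cols].isna().sum(axis=1)
    missing_per_sample = df[intensity_cols].isna().sum(axis=0)
    missing_pct = 100 * df[intensity_cols].isna().sum().sum() / df[intensity_cols].size
    return {'per_protein': missing_per_protein,
            'per_sample': missing_per_sample,
            'total_pct': missing_pct}

# Toy matrix with two missing values out of six.
toy = pd.DataFrame({
    'Protein IDs': ['P1', 'P2', 'P3'],
    'LFQ intensity A': [1.0, np.nan, 3.0],
    'LFQ intensity B': [np.nan, 2.0, 6.0],
})
summary = assess_missing_values(toy, ['LFQ intensity A', 'LFQ intensity B'])
print(round(summary['total_pct'], 1))  # 2 of 6 values missing -> 33.3
```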

## Related Skills

- quantification - Process imported data for quantification
- peptide-identification - Identify peptides from raw spectra
- expression-matrix/counts-ingest - Similar data loading patterns

Overview

This skill loads and parses common mass spectrometry data formats (mzML, mzXML) and proteomics quantification outputs such as MaxQuant proteinGroups.txt, Spectronaut/DIA-NN reports, and long-format DIA exports. It prepares data for downstream proteomics workflows by handling contaminant/decoy filtering, reshaping reports into protein-by-sample matrices, and assessing missing-value patterns. Use it as the first step when starting a proteomics analysis with raw or processed MS data.

How this skill works

For raw files, the skill uses pyOpenMS to read mzML/mzXML into an MSExperiment and iterates spectra by MS level to access peaks and precursor info. For search-engine and quantification-tool outputs, it uses pandas to read TSV/CSV reports, filter out contaminants, reverse hits, and site-only IDs, and extract or pivot intensity columns into a protein-by-sample matrix. It also provides routines that quantify missing values per protein and per sample to guide downstream imputation or filtering decisions.

When to use it

  • When you need to import raw mzML or mzXML spectra into Python for programmatic access
  • When starting an analysis from MaxQuant proteinGroups.txt and you need filtered intensity matrices
  • When working with DIA search outputs (Spectronaut, DIA-NN) and need to pivot long reports into matrices
  • When you want quick contaminant/decoy filtering and identification of intensity columns
  • When you need a missing-value summary to decide filtering or imputation strategies

Best practices

  • Verify installed package versions (pyOpenMS, pandas, MSnbase) and adapt examples to the package API on your system
  • Prefer on-disk reading for large raw files to avoid memory exhaustion (MSnbase in R or pyOpenMS streaming patterns)
  • Always remove contaminants, reverse hits, and site-only IDs before quantification steps
  • Detect and standardize intensity column names (LFQ, iBAQ, Intensity) before downstream aggregation
  • Assess missing-value patterns per-protein and per-sample before normalization or imputation

Example use cases

  • Load sample.mzML to inspect MS1 and MS2 peaks and extract precursor m/z for targeted reprocessing
  • Read MaxQuant proteinGroups.txt, filter contaminants/reverse hits, and produce an LFQ intensity matrix for differential analysis
  • Pivot a DIA-NN long-format report into a protein-by-run PG.MaxLFQ matrix for statistical testing
  • Run a missing-value assessment to report percent missingness by protein and sample and guide filtering thresholds

FAQ

Which columns identify contaminants in MaxQuant files?

Common columns are 'Potential contaminant' (newer MaxQuant versions) or 'Contaminant' (older versions) for contaminants, plus 'Reverse' for decoy hits; many workflows also drop 'Only identified by site'. Check which names exist in your file and filter out rows marked '+'.

What tool should I use for very large mzML files?

Use on-disk reading (MSnbase in R) or streaming patterns in pyOpenMS; avoid loading full files into memory if data are large.