Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS$^2$-based Proteomics
- URL: http://arxiv.org/abs/2508.21076v1
- Date: Tue, 12 Aug 2025 20:39:50 GMT
- Title: Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS$^2$-based Proteomics
- Authors: Hao Xu, Zhichao Wang, Shengqi Sang, Pisit Wajanasara, Nuno Bandeira,
- Abstract summary: Pep2Prob is the first comprehensive dataset and benchmark designed for peptide-specific ion probability prediction.<n>Dataset contains fragment ion probability statistics for 608,780 precursors is a pair of peptide sequence and charge state.
- Score: 11.487378569927499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Proteins perform nearly all cellular functions and constitute most drug targets, making their analysis fundamental to understanding human biology in health and disease. Tandem mass spectrometry (MS$^2$) is the major analytical technique in proteomics that identifies peptides by ionizing them, fragmenting them, and using the resulting mass spectra to identify and quantify proteins in biological samples. In MS$^2$ analysis, peptide fragment ion probability prediction plays a critical role, enhancing the accuracy of peptide identification from mass spectra as a complement to the intensity information. Current approaches rely on global statistics of fragmentation, which assumes that a fragment's probability is uniform across all peptides. Nevertheless, this assumption is oversimplified from a biochemical principle point of view and limits accurate prediction. To address this gap, we present Pep2Prob, the first comprehensive dataset and benchmark designed for peptide-specific fragment ion probability prediction. The proposed dataset contains fragment ion probability statistics for 608,780 unique precursors (each precursor is a pair of peptide sequence and charge state), summarized from more than 183 million high-quality, high-resolution, HCD MS$^2$ spectra with validated peptide assignments and fragmentation annotations. We establish baseline performance using simple statistical rules and learning-based methods, and find that models leveraging peptide-specific information significantly outperform previous methods using only global fragmentation statistics. Furthermore, performance across benchmark models with increasing capacities suggests that the peptide-fragmentation relationship exhibits complex nonlinearities requiring sophisticated machine learning approaches.
Related papers
- Optimizing Mirror-Image Peptide Sequence Design for Data Storage via Peptide Bond Cleavage Prediction [10.510705826988952]
Mirror-image peptides composed of D-amino acids have emerged as a promising biological storage medium.<n>This study is the first to propose improving sequencing accuracy indirectly by optimizing the design of mirror-image peptide sequences.
arXiv Detail & Related papers (2025-10-29T14:40:13Z) - PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs [88.98041407783502]
PRING is the first benchmark that evaluates protein-protein interaction prediction from a graph-level perspective.<n> PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions.
arXiv Detail & Related papers (2025-07-07T15:21:05Z) - Foundation model for mass spectrometry proteomics [22.385489678681907]
We propose unifying various spectrum prediction tasks under a single foundation model for mass spectra.<n>We show that using these pre-trained spectrum representations improves our performance on the four downstream tasks.
arXiv Detail & Related papers (2025-05-16T04:40:07Z) - MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction [65.33218256339151]
Post-translational modifications (PTMs) profoundly expand the complexity and functionality of the proteome.
Existing computational approaches predominantly focus on protein sequences to predict PTM sites, driven by the recognition of sequence-dependent motifs.
We introduce the MeToken model, which tokenizes the micro-environment of each acid, integrating both sequence and structural information into unified discrete tokens.
arXiv Detail & Related papers (2024-11-04T07:14:28Z) - Peptide Sequencing Via Protein Language Models [0.0]
We introduce a protein language model for determining the complete sequence of a peptide based on measurement of a limited set of amino acids.
Our method simulates partial sequencing data by selectively masking amino acids that are experimentally difficult to identify.
We achieve per-amino-acid accuracy up to 90.5% when only four amino acids are known.
arXiv Detail & Related papers (2024-08-01T20:12:49Z) - NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics [58.03989832372747]
We present the first unified benchmark NovoBench for emphde novo peptide sequencing.
It comprises diverse mass spectrum data, integrated models, and comprehensive evaluation metrics.
Recent methods, including DeepNovo, PointNovo, Casanovo, InstaNovo, AdaNovo and $pi$-HelixNovo are integrated into our framework.
arXiv Detail & Related papers (2024-06-16T08:23:21Z) - AdaNovo: Adaptive \emph{De Novo} Peptide Sequencing with Conditional Mutual Information [46.23980841020632]
We propose AdaNovo, a novel framework that calculates conditional mutual information (CMI) between the spectrum and each amino acid/peptide.
AdaNovo excels in identifying amino acids with post-translational modifications (PTMs) and exhibits robustness against data noise.
arXiv Detail & Related papers (2024-03-09T11:54:58Z) - Transformer-based de novo peptide sequencing for data-independent acquisition mass spectrometry [1.338778493151964]
We introduce DiaTrans, a deep-learning model based on transformer architecture.
It deciphers peptide sequences from DIA mass spectrometry data.
Our results show significant improvements over existing STOA methods.
arXiv Detail & Related papers (2024-02-17T19:04:23Z) - PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for
Efficient and Generalizable Compound-Protein Interaction Prediction [63.50967073653953]
Compound-Protein Interaction prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery.
Existing deep learning-based methods utilize only the single modality of protein sequences or structures.
We propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction.
arXiv Detail & Related papers (2024-02-13T03:51:10Z) - Efficiently Predicting Protein Stability Changes Upon Single-point
Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in protein upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z) - Efficient Prediction of Peptide Self-assembly through Sequential and
Graphical Encoding [57.89530563948755]
This work provides a benchmark analysis of peptide encoding with advanced deep learning models.
It serves as a guide for a wide range of peptide-related predictions such as isoelectric points, hydration free energy, etc.
arXiv Detail & Related papers (2023-07-17T00:43:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.