Foundation model for mass spectrometry proteomics
- URL: http://arxiv.org/abs/2505.10848v2
- Date: Mon, 19 May 2025 03:28:14 GMT
- Title: Foundation model for mass spectrometry proteomics
- Authors: Justin Sanders, Melih Yilmaz, Jacob H. Russell, Wout Bittremieux, William E. Fondrie, Nicholas M. Riley, Sewoong Oh, William Stafford Noble,
- Abstract summary: We propose unifying various spectrum prediction tasks under a single foundation model for mass spectra.<n>We show that using these pre-trained spectrum representations improves our performance on the four downstream tasks.
- Score: 22.385489678681907
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Mass spectrometry is the dominant technology in the field of proteomics, enabling high-throughput analysis of the protein content of complex biological samples. Due to the complexity of the instrumentation and resulting data, sophisticated computational methods are required for the processing and interpretation of acquired mass spectra. Machine learning has shown great promise to improve the analysis of mass spectrometry data, with numerous purpose-built methods for improving specific steps in the data acquisition and analysis pipeline reaching widespread adoption. Here, we propose unifying various spectrum prediction tasks under a single foundation model for mass spectra. To this end, we pre-train a spectrum encoder using de novo sequencing as a pre-training task. We then show that using these pre-trained spectrum representations improves our performance on the four downstream tasks of spectrum quality prediction, chimericity prediction, phosphorylation prediction, and glycosylation status prediction. Finally, we perform multi-task fine-tuning and find that this approach improves the performance on each task individually. Overall, our work demonstrates that a foundation model for tandem mass spectrometry proteomics trained on de novo sequencing learns generalizable representations of spectra, improves performance on downstream tasks where training data is limited, and can ultimately enhance data acquisition and analysis in proteomics experiments.
Related papers
- SpecCLIP: Aligning and Translating Spectroscopic Measurements for Stars [1.4217538206528657]
We present SpecCLIP, a foundation model framework that extends LLM-inspired methodologies to stellar spectral analysis.<n>By training foundation models on large-scale spectral datasets, our goal is to learn robust and informative embeddings that support diverse downstream applications.<n>We demonstrate that fine-tuning these models on moderate-sized labeled datasets improves adaptability to tasks such as stellar- parameter estimation and chemical-abundance determination.
arXiv Detail & Related papers (2025-07-02T17:49:52Z) - CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis [75.25966323298003]
Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding.<n> variability in channel dimensionality and captured wavelengths among spectral cameras impede the development of AI-driven methodologies.<n>We introduce $textbfCARL$, a model for $textbfC$amera-$textbfA$gnostic $textbfR$esupervised $textbfL$ across RGB, multispectral, and hyperspectral imaging modalities.
arXiv Detail & Related papers (2025-04-27T13:06:40Z) - A Self-supervised Learning Method for Raman Spectroscopy based on Masked Autoencoders [3.9517125314802306]
We propose a self-supervised learning paradigm for Raman spectroscopy based on a Masked AutoEncoder, termed SMAE.<n> SMAE does not require any spectral annotations during pre-training. By randomly masking and then reconstructing the spectral information, the model learns essential spectral features.
arXiv Detail & Related papers (2025-04-21T10:44:06Z) - Deep Learning Domain Adaptation to Understand Physico-Chemical Processes from Fluorescence Spectroscopy Small Datasets: Application to Ageing of Olive Oil [4.14360329494344]
Fluorescence spectroscopy is a fundamental tool in life sciences and chemistry, widely used for applications such as environmental monitoring, food quality control, and biomedical diagnostics.
Analysis of spectroscopic data with deep learning, in particular of fluorescence excitation-emission matrices (EEMs), presents significant challenges due to the typically small and sparse datasets available.
This study proposes a new approach that exploits domain adaptation with pretrained vision models, alongside a novel interpretability algorithm to address these challenges.
arXiv Detail & Related papers (2024-06-14T13:41:21Z) - Efficient Prediction of Peptide Self-assembly through Sequential and
Graphical Encoding [57.89530563948755]
This work provides a benchmark analysis of peptide encoding with advanced deep learning models.
It serves as a guide for a wide range of peptide-related predictions such as isoelectric points, hydration free energy, etc.
arXiv Detail & Related papers (2023-07-17T00:43:33Z) - Spectrum-BERT: Pre-training of Deep Bidirectional Transformers for
Spectral Classification of Chinese Liquors [0.0]
We propose a pre-training method of deep bidirectional transformers for spectral classification of Chinese liquors, abbreviated as Spectrum-BERT.
We elaborately design two pre-training tasks, Next Curve Prediction (NCP) and Masked Curve Model (MCM), so that the model can effectively utilize unlabeled samples.
In the comparative experiments, the proposed Spectrum-BERT significantly outperforms the baselines in multiple metrics.
arXiv Detail & Related papers (2022-10-22T13:11:25Z) - Trustworthiness of Laser-Induced Breakdown Spectroscopy Predictions via
Simulation-based Synthetic Data Augmentation and Multitask Learning [4.633997895806144]
We consider quantitative analyses of spectral data using laser-induced breakdown spectroscopy.
We address the small size of training data available, and the validation of the predictions during inference on unknown data.
arXiv Detail & Related papers (2022-10-07T18:00:09Z) - Unsupervised Machine Learning for Exploratory Data Analysis of Exoplanet
Transmission Spectra [68.8204255655161]
We focus on unsupervised techniques for analyzing spectral data from transiting exoplanets.
We show that there is a high degree of correlation in the spectral data, which calls for appropriate low-dimensional representations.
We uncover interesting structures in the principal component basis, namely, well-defined branches corresponding to different chemical regimes.
arXiv Detail & Related papers (2022-01-07T22:26:33Z) - A probabilistic deep learning approach to automate the interpretation of
multi-phase diffraction spectra [4.240899165468488]
We develop an ensemble convolutional neural network trained on simulated diffraction spectra to identify complex multi-phase mixtures.
Our model is benchmarked on simulated and experimentally measured diffraction spectra, showing exceptional performance with accuracies exceeding those given by previously reported methods.
arXiv Detail & Related papers (2021-03-30T20:13:01Z) - Revisiting the Sample Complexity of Sparse Spectrum Approximation of
Gaussian Processes [60.479499225746295]
We introduce a new scalable approximation for Gaussian processes with provable guarantees which hold simultaneously over its entire parameter space.
Our approximation is obtained from an improved sample complexity analysis for sparse spectrum Gaussian processes (SSGPs)
arXiv Detail & Related papers (2020-11-17T05:41:50Z) - Spectral Analysis Network for Deep Representation Learning and Image
Clustering [53.415803942270685]
This paper proposes a new network structure for unsupervised deep representation learning based on spectral analysis.
It can identify the local similarities among images in patch level and thus more robust against occlusion.
It can learn more clustering-friendly representations and is capable to reveal the deep correlations among data samples.
arXiv Detail & Related papers (2020-09-11T05:07:15Z) - Meta-learning framework with applications to zero-shot time-series
forecasting [82.61728230984099]
This work provides positive evidence using a broad meta-learning framework.
residual connections act as a meta-learning adaptation mechanism.
We show that it is viable to train a neural network on a source TS dataset and deploy it on a different target TS dataset without retraining.
arXiv Detail & Related papers (2020-02-07T16:39:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.