FlexMS is a flexible framework for benchmarking deep learning-based mass spectrum prediction tools in metabolomics
- URL: http://arxiv.org/abs/2602.22822v1
- Date: Thu, 26 Feb 2026 10:05:01 GMT
- Title: FlexMS is a flexible framework for benchmarking deep learning-based mass spectrum prediction tools in metabolomics
- Authors: Yunhua Zhong, Yixuan Tang, Yifan Li, Jie Yang, Pan Liu, Jun Xia,
- Abstract summary: The identification and property prediction of chemical molecules is of central importance in the advancement of drug discovery and material science. Deep learning models appear promising for predicting molecular structure spectra, but overall assessment remains challenging. Our contribution is the creation of benchmark framework FlexMS for constructing and evaluating diverse model architectures in mass spectrum prediction.
- Score: 22.314786276794717
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The identification and property prediction of chemical molecules is of central importance in the advancement of drug discovery and material science, where the tandem mass spectrometry technology gives valuable fragmentation cues in the form of mass-to-charge ratio peaks. However, the lack of experimental spectra hinders the attachment of each molecular identification, and thus urges the establishment of prediction approaches for computational models. Deep learning models appear promising for predicting molecular structure spectra, but overall assessment remains challenging as a result of the heterogeneity in methods and the lack of well-defined benchmarks. To address this, our contribution is the creation of benchmark framework FlexMS for constructing and evaluating diverse model architectures in mass spectrum prediction. With its easy-to-use flexibility, FlexMS supports the dynamic construction of numerous distinct combinations of model architectures, while assessing their performance on preprocessed public datasets using different metrics. In this paper, we provide insights into factors influencing performance, including the structural diversity of datasets, hyperparameters like learning rate and data sparsity, pretraining effects, metadata ablation settings and cross-domain transfer learning analysis. This provides practical guidance in choosing suitable models. Moreover, retrieval benchmarks simulate practical identification scenarios and score potential matches based on predicted spectra.
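The retrieval scenario mentioned at the end of the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from FlexMS: the 1 Da binning, the use of cosine similarity as the scoring function, the function names (`bin_spectrum`, `cosine_similarity`, `rank_candidates`), and the toy peak lists. It shows only the general idea of ranking candidate molecules by how closely their predicted spectra match an observed query spectrum.

```python
import math


def bin_spectrum(peaks, bin_width=1.0, max_mz=1000.0):
    """Convert (m/z, intensity) pairs into a fixed-length intensity vector.

    bin_width and max_mz are illustrative defaults, not FlexMS settings.
    """
    n_bins = int(max_mz / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    return vec


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_peaks, predicted_spectra):
    """Score each candidate's predicted spectrum against the query, best first."""
    query_vec = bin_spectrum(query_peaks)
    scored = [
        (name, cosine_similarity(query_vec, bin_spectrum(peaks)))
        for name, peaks in predicted_spectra.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): an observed spectrum and two predicted candidate spectra.
query = [(91.05, 100.0), (119.08, 40.0), (147.04, 15.0)]
predicted = {
    "candidate_A": [(91.06, 90.0), (119.07, 50.0), (147.05, 10.0)],
    "candidate_B": [(77.04, 100.0), (105.03, 60.0)],
}

ranking = rank_candidates(query, predicted)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy case candidate_A shares all three binned peaks with the query and ranks first, while candidate_B shares none and scores zero. A real benchmark would typically report a top-k accuracy over many such queries.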
Related papers
- Generative structural elucidation from mass spectra as an iterative optimization problem [23.97077717251806]
We introduce a computational workflow that poses structure elucidation from LC-MS/MS as an iterative optimization problem. We demonstrate chromatography's performance on the NIST'20 and MassSpecGym datasets as both a standalone elucidation pipeline and as a complement to existing inverse models.
arXiv Detail & Related papers (2026-02-07T21:34:38Z)
- Foundation Models for Discovery and Exploration in Chemical Space [57.97784111110166]
MIST is a family of molecular foundation models trained on large unlabeled datasets. We demonstrate the ability of these models to solve real-world problems across chemical space.
arXiv Detail & Related papers (2025-10-20T17:56:01Z)
- Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models [68.57424628540907]
Large language models (LLMs) often develop learned mechanisms specialized to specific datasets. We introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance.
arXiv Detail & Related papers (2025-07-12T08:10:10Z)
- Spectra-to-Structure and Structure-to-Spectra Inference Across the Periodic Table [49.65586812435899]
XAStruct is a learning-based system capable of both predicting XAS spectra from crystal structures and inferring local structural descriptors from XAS input. XAStruct is trained on a large-scale dataset spanning over 70 elements across the periodic table.
arXiv Detail & Related papers (2025-06-13T15:58:05Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- Role of Structural and Conformational Diversity for Machine Learning Potentials [4.608732256350959]
We investigate the relationship between data biases and model generalization in Quantum Mechanics.
Our results reveal nuanced patterns in generalization metrics.
These findings provide valuable insights and guidelines for QM data generation efforts.
arXiv Detail & Related papers (2023-10-30T19:33:12Z)
- Improved prediction of ligand-protein binding affinities by meta-modeling [1.3859669037499769]
We develop a framework to integrate published force-field-based empirical docking and sequence-based deep learning models.
We show that many of our meta-models significantly improve affinity predictions over base models.
Our best meta-models achieve comparable performance to state-of-the-art deep learning tools exclusively based on 3D structures.
arXiv Detail & Related papers (2023-10-05T23:46:45Z)
- Trustworthiness of Laser-Induced Breakdown Spectroscopy Predictions via Simulation-based Synthetic Data Augmentation and Multitask Learning [4.633997895806144]
We consider quantitative analyses of spectral data using laser-induced breakdown spectroscopy.
We address the small size of training data available, and the validation of the predictions during inference on unknown data.
arXiv Detail & Related papers (2022-10-07T18:00:09Z)
- Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z)
- Unsupervised Machine Learning for Exploratory Data Analysis of Exoplanet Transmission Spectra [68.8204255655161]
We focus on unsupervised techniques for analyzing spectral data from transiting exoplanets.
We show that there is a high degree of correlation in the spectral data, which calls for appropriate low-dimensional representations.
We uncover interesting structures in the principal component basis, namely, well-defined branches corresponding to different chemical regimes.
arXiv Detail & Related papers (2022-01-07T22:26:33Z)
- BenchML: an extensible pipelining framework for benchmarking representations of materials and molecules at scale [0.0]
We introduce a machine-learning framework for benchmarking representations of chemical systems against datasets of materials and molecules.
The guiding principle is to evaluate raw descriptor performance by limiting model complexity to simple regression schemes.
The resulting models are intended as baselines that can inform future method development.
arXiv Detail & Related papers (2021-12-04T09:07:16Z)
- Meta-learning framework with applications to zero-shot time-series forecasting [82.61728230984099]
This work provides positive evidence using a broad meta-learning framework.
Residual connections act as a meta-learning adaptation mechanism.
We show that it is viable to train a neural network on a source TS dataset and deploy it on a different target TS dataset without retraining.
arXiv Detail & Related papers (2020-02-07T16:39:43Z)