Comparative Analysis of Formula and Structure Prediction from Tandem Mass Spectra
- URL: http://arxiv.org/abs/2601.00941v1
- Date: Fri, 02 Jan 2026 16:20:13 GMT
- Title: Comparative Analysis of Formula and Structure Prediction from Tandem Mass Spectra
- Authors: Xujun Che, Xiuxia Du, Depeng Xu
- Abstract summary: Liquid chromatography mass spectrometry (LC-MS)-based metabolomics and exposomics aim to measure detectable small molecules in biological samples. Findings have established realistic performance baselines, identified critical bottlenecks, and provided guidance to further improve compound predictions based on MS.
- Score: 3.2243643829769586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Liquid chromatography mass spectrometry (LC-MS)-based metabolomics and exposomics aim to measure detectable small molecules in biological samples. The results facilitate hypothesis-generating discovery of metabolic changes and disease mechanisms and provide information about environmental exposures and their effects on human health. Metabolomics and exposomics are made possible by the high resolving power of LC and the high mass measurement accuracy of MS. However, a majority of the signals from such studies still cannot be identified or annotated using conventional library searching because existing spectral libraries are far from covering the vast chemical space captured by LC-MS/MS. To address this challenge and unleash the full potential of metabolomics and exposomics, a number of computational approaches have been developed to predict compounds based on tandem mass spectra. Published assessments of these approaches used different datasets and evaluation criteria. To select prediction workflows for practical applications and identify areas for further improvement, we have carried out a systematic evaluation of the state-of-the-art prediction algorithms. Specifically, the accuracy of formula prediction and structure prediction was evaluated for different types of adducts. The resulting findings have established realistic performance baselines, identified critical bottlenecks, and provided guidance to further improve compound predictions based on MS.
Related papers
- FlexMS is a flexible framework for benchmarking deep learning-based mass spectrum prediction tools in metabolomics [22.314786276794717]
The identification and property prediction of chemical molecules is of central importance in the advancement of drug discovery and material science. Deep learning models appear promising for predicting molecular structure spectra, but overall assessment remains challenging. Our contribution is the creation of the benchmark framework FlexMS for constructing and evaluating diverse model architectures in mass spectrum prediction.
arXiv Detail & Related papers (2026-02-26T10:05:01Z)
- Conditional Generative Framework with Peak-Aware Attention for Robust Chemical Detection under Interferences [3.976291254896486]
In this paper, we propose an artificial intelligence discrimination framework based on a peak-aware conditional generative model. The framework is learned with a novel peak-aware mechanism that highlights the characteristic peaks of GC-MS data. In addition, chemical and solvent information is encoded in a latent vector embedded with it, allowing a conditional generative adversarial neural network to generate a synthetic GC-MS signal.
arXiv Detail & Related papers (2026-01-29T04:10:37Z)
- From Human Labels to Literature: Semi-Supervised Learning of NMR Chemical Shifts at Scale [17.999080555353157]
We propose a semi-supervised framework that learns NMR chemical shifts from millions of literature-extracted spectra without explicit atom-level assignments. Our results demonstrate that large-scale unlabeled spectra mined from the literature can serve as a practical and effective data source for training NMR shift models.
arXiv Detail & Related papers (2026-01-26T14:35:25Z)
- How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning? [51.286853421822705]
Large language models (LLMs) have shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear. We introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures. Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions.
arXiv Detail & Related papers (2026-01-09T20:08:42Z)
- SIGMA: Scalable Spectral Insights for LLM Collapse [51.863164847253366]
We introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework for model collapse. By deriving deterministic bounds on the matrix's spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. We demonstrate that SIGMA effectively captures the transition towards states, offering theoretical insights into the mechanics of collapse.
arXiv Detail & Related papers (2026-01-06T19:47:11Z)
- Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra [60.08608779794957]
We propose GLMR, a Generative Language Model-based Retrieval framework. In the pre-retrieval stage, a contrastive learning-based model identifies top candidate molecules as contextual priors for the input mass spectrum. In the generative retrieval stage, these candidate molecules are integrated with the input mass spectrum to guide a generative model in producing refined molecular structures.
arXiv Detail & Related papers (2025-11-09T07:25:53Z)
- LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation [0.179762320774136]
We present the latest generation of LSM-MS2, a large-scale deep learning foundation model trained on millions of spectra to learn a semantic chemical space. LSM-MS2 achieves state-of-the-art performance in spectral identification, improving on existing methods by 30% in accuracy of identifying challenging isomeric compounds.
arXiv Detail & Related papers (2025-10-30T17:13:58Z)
- MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation [20.973121120131875]
Large-scale pretraining has proven effective in addressing data scarcity in other domains. We propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1.
arXiv Detail & Related papers (2025-10-23T14:45:28Z)
- Foundation Models for Discovery and Exploration in Chemical Space [57.97784111110166]
MIST is a family of molecular foundation models trained on large unlabeled datasets. We demonstrate the ability of these models to solve real-world problems across chemical space.
arXiv Detail & Related papers (2025-10-20T17:56:01Z)
- Drug classification based on X-ray spectroscopy combined with machine learning [11.985793625437546]
X-ray absorption spectroscopy offers advantages such as ease of operation, penetrative observation, and strong substance differentiation capabilities. In this study, we constructed a classification model using Convolutional Neural Networks (CNN), Support Vector Machines (SVM), and Particle Swarm Optimization (PSO). The experimental results demonstrate that this model achieved higher classification accuracy compared to two other common methods, with a prediction accuracy of 99.14%.
arXiv Detail & Related papers (2025-05-04T04:49:55Z)
- Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language model (LLM) reasoning. Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z)
- ChemVise: Maximizing Out-of-Distribution Chemical Detection with the Novel Application of Zero-Shot Learning [60.02503434201552]
This research proposes learning approximations of complex exposures from training sets of simple ones.
We demonstrate that applying this approach to synthetic sensor responses surprisingly improves the detection of out-of-distribution obscured chemical analytes.
arXiv Detail & Related papers (2023-02-09T20:19:57Z)
- Combination of Raman spectroscopy and chemometrics: A review of recent studies published in the Spectrochimica Acta, Part A: Molecular and Biomolecular Spectroscopy Journal [0.0]
This review considers the application of Raman spectroscopy in combination with chemometrics to study samples and their changes caused by different factors.
We summarized the best strategies for creating classification models and highlighted some common drawbacks when it comes to the application of chemometrics techniques.
arXiv Detail & Related papers (2022-10-18T13:08:20Z)
- Low cost prediction of probability distributions of molecular properties for early virtual screening [0.8702432681310399]
This article applies the Hierarchical Correlation Reconstruction approach, previously applied in the analysis of demographic, financial and astronomical data.
The whole methodology therefore constitutes a great support for medicinal chemists, as it enables fast rejection of compounds with the lowest potential for the desired physicochemical/ADMET characteristics.
arXiv Detail & Related papers (2022-07-21T13:29:26Z)
- Unsupervised Machine Learning for Exploratory Data Analysis of Exoplanet Transmission Spectra [68.8204255655161]
We focus on unsupervised techniques for analyzing spectral data from transiting exoplanets.
We show that there is a high degree of correlation in the spectral data, which calls for appropriate low-dimensional representations.
We uncover interesting structures in the principal component basis, namely, well-defined branches corresponding to different chemical regimes.
arXiv Detail & Related papers (2022-01-07T22:26:33Z)