MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation
- URL: http://arxiv.org/abs/2510.20615v1
- Date: Thu, 23 Oct 2025 14:45:28 GMT
- Title: MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation
- Authors: Yang Han, Pengyu Wang, Kai Yu, Xin Chen, Lu Chen
- Abstract summary: Large-scale pretraining has proven effective in addressing data scarcity in other domains. We propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1.
- Score: 20.973121120131875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mass spectrometry (MS) plays a critical role in molecular identification, significantly advancing scientific discovery. However, structure elucidation from MS data remains challenging due to the scarcity of annotated spectra. While large-scale pretraining has proven effective in addressing data scarcity in other domains, applying this paradigm to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint-molecule datasets. Multi-task pretraining objectives further enhance MS-BART's generalization by jointly optimizing denoising and translation tasks. The pretrained model is subsequently transferred to experimental spectra through finetuning on fingerprint predictions generated with MIST, a pre-trained spectral inference model, thereby enhancing robustness to real-world spectral variability. While finetuning alleviates the distributional difference, MS-BART still suffers from molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1 and is faster by one order of magnitude than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model's effectiveness and robustness.
Related papers
- De Novo Molecular Generation from Mass Spectra via Many-Body Enhanced Diffusion [10.739105148401629]
We present MBGen, a Many-Body enhanced diffusion framework for de novo molecular structure Generation from mass spectra. By integrating a many-body attention mechanism and higher-order edge modeling, MBGen comprehensively leverages the rich structural information encoded in MS/MS spectra. Our approach effectively captures higher-order interactions and exhibits enhanced sensitivity to complex isomeric and non-local fragmentation information.
arXiv Detail & Related papers (2026-02-02T05:00:00Z)
- How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning? [51.286853421822705]
Large language models (LLMs) have shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear. We introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures. Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions.
arXiv Detail & Related papers (2026-01-09T20:08:42Z)
- SIGMA: Scalable Spectral Insights for LLM Collapse [51.863164847253366]
We introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework for model collapse. By utilizing benchmarks that deriving and deterministic bounds on the matrix's spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. We demonstrate that SIGMA effectively captures the transition towards states, offering both theoretical insights into the mechanics of collapse.
arXiv Detail & Related papers (2026-01-06T19:47:11Z)
- Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra [60.08608779794957]
We propose GLMR, a Generative Language Model-based Retrieval framework. In the pre-retrieval stage, a contrastive learning-based model identifies top candidate molecules as contextual priors for the input mass spectrum. In the generative retrieval stage, these candidate molecules are integrated with the input mass spectrum to guide a generative model in producing refined molecular structures.
arXiv Detail & Related papers (2025-11-09T07:25:53Z)
- Test-Time Tuned Language Models Enable End-to-end De Novo Molecular Structure Generation from MS/MS Spectra [31.563216077422084]
Tandem Mass Spectrometry enables the identification of unknown compounds in crucial fields such as metabolomics, natural product discovery and environmental analysis. We introduce a framework that, by leveraging test-time tuning, enhances the learning of a pre-trained transformer model to address this gap. We surpass the de-facto state-of-the-art approach DiffMS on two popular benchmarks NPLIB1 and MassSpecGym by 100% and 20%, respectively.
arXiv Detail & Related papers (2025-10-27T18:25:36Z)
- Language Models Can Understand Spectra: A Multimodal Model for Molecular Structure Elucidation [9.987376780022345]
We propose SpectraLLM, the first large language model designed to support multi-modal spectroscopic joint reasoning. By integrating continuous and discrete spectroscopic modalities into a shared semantic space, SpectraLLM learns to uncover substructural patterns that are consistent and complementary across spectra. We pretrain and fine-tune SpectraLLM in the domain of small molecules, and evaluate it on six standardized, publicly available chemical datasets.
arXiv Detail & Related papers (2025-08-04T13:33:38Z)
- SpectrumFM: Redefining Spectrum Cognition via Foundation Modeling [65.65474629224558]
We propose a spectrum foundation model, termed SpectrumFM, which provides a new paradigm for spectrum cognition. An innovative spectrum encoder that exploits convolutional neural networks is proposed to effectively capture both fine-grained local signal structures and high-level global dependencies in the spectrum data. Two novel self-supervised learning tasks, namely masked reconstruction and next-slot signal prediction, are developed for pre-training SpectrumFM, enabling the model to learn rich and transferable representations.
arXiv Detail & Related papers (2025-08-02T14:40:50Z)
- DiffSpectra: Molecular Structure Elucidation from Spectra using Diffusion Models [66.41802970528133]
Molecular structure elucidation from spectra is a foundational problem in chemistry. Traditional methods rely heavily on expert interpretation and lack scalability. We present DiffSpectra, a generative framework that directly infers both 2D and 3D molecular structures from multi-modal spectral data.
arXiv Detail & Related papers (2025-07-09T13:57:20Z)
- DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra [60.39311767532607]
We present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs. Experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation.
arXiv Detail & Related papers (2025-02-13T18:29:48Z)
- Mass Spectra Prediction with Structural Motif-based Graph Neural Networks [21.71309513265843]
MoMS-Net is a system that predicts mass spectra using the information derived from structural motifs and the implementation of Graph Neural Networks (GNNs).
We have tested our model across diverse mass spectra and have observed its superiority over other existing models.
arXiv Detail & Related papers (2023-06-28T10:33:57Z)
- Towards Predicting Equilibrium Distributions for Molecular Systems with Deep Learning [60.02391969049972]
We introduce a novel deep learning framework, called Distributional Graphormer (DiG), in an attempt to predict the equilibrium distribution of molecular systems.
DiG employs deep neural networks to transform a simple distribution towards the equilibrium distribution, conditioned on a descriptor of a molecular system.
arXiv Detail & Related papers (2023-06-08T17:12:08Z)
- MassFormer: Tandem Mass Spectrum Prediction for Small Molecules using Graph Transformers [3.2951121243459522]
Tandem mass spectra capture fragmentation patterns that provide key structural information about a molecule.
For over seventy years, spectrum prediction has remained a key challenge in the field.
We propose a new model, MassFormer, for accurately predicting tandem mass spectra.
arXiv Detail & Related papers (2021-11-08T20:55:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.