De Novo Molecular Generation from Mass Spectra via Many-Body Enhanced Diffusion
- URL: http://arxiv.org/abs/2602.01643v1
- Date: Mon, 02 Feb 2026 05:00:00 GMT
- Title: De Novo Molecular Generation from Mass Spectra via Many-Body Enhanced Diffusion
- Authors: Xichen Sun, Wentao Wei, Jiahua Rao, Jiancong Xie, Yuedong Yang
- Abstract summary: We present MBGen, a Many-Body enhanced diffusion framework for de novo molecular structure Generation from mass spectra. By integrating a many-body attention mechanism and higher-order edge modeling, MBGen comprehensively leverages the rich structural information encoded in MS/MS spectra. Our approach effectively captures higher-order interactions and exhibits enhanced sensitivity to complex isomeric and non-local fragmentation information.
- Score: 10.739105148401629
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Molecular structure generation from mass spectrometry is fundamental for understanding cellular metabolism and discovering novel compounds. Although tandem mass spectrometry (MS/MS) enables the high-throughput acquisition of fragment fingerprints, these spectra often reflect higher-order interactions involving the concerted cleavage of multiple atoms and bonds, which are crucial for resolving complex isomers and non-local fragmentation mechanisms. However, most existing methods adopt atom-centric and pairwise interaction modeling, overlooking higher-order edge interactions and lacking the capacity to systematically capture essential many-body characteristics for structure generation. To overcome these limitations, we present MBGen, a Many-Body enhanced diffusion framework for de novo molecular structure Generation from mass spectra. By integrating a many-body attention mechanism and higher-order edge modeling, MBGen comprehensively leverages the rich structural information encoded in MS/MS spectra, enabling accurate de novo generation and isomer differentiation for novel molecules. Experimental results on the NPLIB1 and MassSpecGym benchmarks demonstrate that MBGen achieves superior performance, with improvements of up to 230% over state-of-the-art methods, highlighting the scientific value and practical utility of many-body modeling for mass spectrometry-based molecular generation. Further analysis and ablation studies show that our approach effectively captures higher-order interactions and exhibits enhanced sensitivity to complex isomeric and non-local fragmentation information.
Related papers
- NMIRacle: Multi-modal Generative Molecular Elucidation from IR and NMR Spectra [13.594833907772783]
We introduce NMIRacle, a two-stage generative framework that builds upon recent paradigms in AI-driven spectroscopy with minimal assumptions. In the first stage, NMIRacle learns to reconstruct molecular structures from count-aware fragment encodings. In the second stage, a spectral encoder maps input spectroscopic measurements into a latent embedding. This formulation bridges fragment-level chemical modeling with spectral evidence, yielding accurate molecular predictions.
arXiv Detail & Related papers (2025-12-17T10:29:39Z)
- Learning Cell-Aware Hierarchical Multi-Modal Representations for Robust Molecular Modeling [74.25438319700929]
We propose CHMR (Cell-aware Hierarchical Multi-modal Representations), a robust framework that models local-global dependencies between molecules and cellular responses. Evaluated on nine public benchmarks spanning 728 tasks, CHMR outperforms state-of-the-art baselines. Results demonstrate the advantage of hierarchy-aware, multimodal learning for reliable and biologically grounded molecular representations.
arXiv Detail & Related papers (2025-11-26T07:15:00Z)
- Mamba-driven multi-perspective structural understanding for molecular ground-state conformation prediction [69.32436472760712]
We propose an approach of Mamba-driven multi-perspective structural understanding (MPSU-Mamba) to localize molecular ground-state conformation. For complex and diverse molecules, three different kinds of dedicated scanning strategies are explored to construct a comprehensive perception of corresponding molecular structures. Experimental results on QM9 and Molecule3D datasets indicate that MPSU-Mamba significantly outperforms existing methods.
arXiv Detail & Related papers (2025-11-10T11:18:32Z)
- Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra [60.08608779794957]
We propose GLMR, a Generative Language Model-based Retrieval framework. In the pre-retrieval stage, a contrastive learning-based model identifies top candidate molecules as contextual priors for the input mass spectrum. In the generative retrieval stage, these candidate molecules are integrated with the input mass spectrum to guide a generative model in producing refined molecular structures.
arXiv Detail & Related papers (2025-11-09T07:25:53Z)
- Test-Time Tuned Language Models Enable End-to-end De Novo Molecular Structure Generation from MS/MS Spectra [31.563216077422084]
Tandem Mass Spectrometry enables the identification of unknown compounds in crucial fields such as metabolomics, natural product discovery and environmental analysis. We introduce a framework that, by leveraging test-time tuning, enhances the learning of a pre-trained transformer model to address this gap. We surpass the de facto state-of-the-art approach DiffMS on two popular benchmarks, NPLIB1 and MassSpecGym, by 100% and 20%, respectively.
arXiv Detail & Related papers (2025-10-27T18:25:36Z)
- MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation [20.973121120131875]
Large-scale pretraining has proven effective in addressing data scarcity in other domains. We propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1.
arXiv Detail & Related papers (2025-10-23T14:45:28Z)
- $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models [59.125833618091846]
We propose a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view. Experiments demonstrate that $\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks.
arXiv Detail & Related papers (2025-08-12T05:46:47Z)
- DiffSpectra: Molecular Structure Elucidation from Spectra using Diffusion Models [68.19129717255053]
We present DiffSpectra, a generative framework that formulates molecular structure elucidation as a conditional generation process. Our experiments demonstrate that DiffSpectra accurately elucidates molecular structures, achieving 40.76% top-1 and 99.49% top-10 accuracy.
arXiv Detail & Related papers (2025-07-09T13:57:20Z)
- DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra [60.39311767532607]
We present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs. Experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation.
arXiv Detail & Related papers (2025-02-13T18:29:48Z)
- MADGEN: Mass-Spec attends to De Novo Molecular generation [16.89017809745962]
We propose a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval and spectra-conditioned molecular generation. We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym).
arXiv Detail & Related papers (2025-01-03T18:54:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.