MADGEN: Mass-Spec attends to De Novo Molecular generation
- URL: http://arxiv.org/abs/2501.01950v4
- Date: Tue, 29 Apr 2025 16:27:32 GMT
- Title: MADGEN: Mass-Spec attends to De Novo Molecular generation
- Authors: Yinkai Wang, Xiaohui Chen, Liping Liu, Soha Hassoun
- Abstract summary: We propose a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval and spectra-conditioned molecular generation. We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym).
- Score: 16.89017809745962
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The annotation (assigning structural chemical identities) of MS/MS spectra remains a significant challenge due to the enormous molecular diversity in biological samples and the limited scope of reference databases. Currently, the vast majority of spectral measurements remain in the "dark chemical space" without structural annotations. To improve annotation, we propose MADGEN (Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval and spectra-conditioned molecular generation starting with the scaffold. In the first stage, given an MS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ contrastive learning to align mass spectra with candidate molecular scaffolds. In the second stage, starting from the retrieved scaffold, we employ the MS/MS spectrum to guide an attention-based generative model to generate the final molecule. Our approach constrains the molecular generation search space, reducing its complexity and improving generation accuracy. We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym), measuring its performance with both a predictive scaffold retriever and an oracle retriever. We demonstrate the effectiveness of using attention to integrate spectral information throughout the generation process to achieve strong results with the oracle retriever.
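The first stage described in the abstract (scaffold retrieval posed as a ranking problem, with contrastive learning aligning spectra and scaffolds) can be illustrated with a minimal sketch. It assumes spectra and candidate scaffolds have already been encoded into fixed-length vectors. The function names, the temperature value, and the toy embeddings are hypothetical and not taken from the paper, and the InfoNCE-style loss shown is one common contrastive objective that may differ from the one MADGEN actually uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_scaffolds(spectrum_emb, scaffold_embs):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine(spectrum_emb, s) for s in scaffold_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def info_nce_loss(spectrum_emb, scaffold_embs, positive_idx, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the true scaffold's score."""
    logits = [cosine(spectrum_emb, s) / temperature for s in scaffold_embs]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_sum_exp - logits[positive_idx]

# Toy embeddings (hypothetical values, for illustration only).
spectrum = [1.0, 0.0, 0.5]
candidates = [
    [0.0, 1.0, 0.0],    # orthogonal to the spectrum embedding
    [0.9, 0.1, 0.4],    # nearly parallel to the spectrum embedding
    [-1.0, 0.0, 0.0],   # pointing away from the spectrum embedding
]

ranking = rank_scaffolds(spectrum, candidates)
loss_correct = info_nce_loss(spectrum, candidates, positive_idx=1)
loss_wrong = info_nce_loss(spectrum, candidates, positive_idx=2)
```

In this toy setup the loss is small when the labeled scaffold is the one closest to the spectrum embedding and large otherwise, which is the signal a contrastive objective of this kind uses to pull matching spectrum and scaffold embeddings together. At inference time only the ranking step is needed.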
Related papers
- De novo molecular structure elucidation from mass spectra via flow matching [5.274388013166468]
We develop MSFlow, a two-stage encoder-decoder flow-matching generative model that achieves state-of-the-art performance on the structure elucidation task for small molecules. MSFlow can accurately translate up to 45 percent of molecular mass spectra into their corresponding molecular representations - an improvement of up to fourteen-fold over the current state-of-the-art.
arXiv Detail & Related papers (2026-02-23T14:52:53Z)
- How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning? [51.286853421822705]
Large language models (LLMs) have shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear. We introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures. Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions.
arXiv Detail & Related papers (2026-01-09T20:08:42Z)
- NMIRacle: Multi-modal Generative Molecular Elucidation from IR and NMR Spectra [13.594833907772783]
We introduce NMIRacle, a two-stage generative framework that builds upon recent paradigms in AI-driven spectroscopy with minimal assumptions. In the first stage, NMIRacle learns to reconstruct molecular structures from count-aware fragment encodings. In the second stage, a spectral encoder maps input spectroscopic measurements into a latent embedding. This formulation bridges fragment-level chemical modeling with spectral evidence, yielding accurate molecular predictions.
arXiv Detail & Related papers (2025-12-17T10:29:39Z)
- Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra [60.08608779794957]
We propose GLMR, a Generative Language Model-based Retrieval framework. In the pre-retrieval stage, a contrastive learning-based model identifies top candidate molecules as contextual priors for the input mass spectrum. In the generative retrieval stage, these candidate molecules are integrated with the input mass spectrum to guide a generative model in producing refined molecular structures.
arXiv Detail & Related papers (2025-11-09T07:25:53Z)
- Test-Time Tuned Language Models Enable End-to-end De Novo Molecular Structure Generation from MS/MS Spectra [31.563216077422084]
Tandem Mass Spectrometry enables the identification of unknown compounds in crucial fields such as metabolomics, natural product discovery and environmental analysis. We introduce a framework that, by leveraging test-time tuning, enhances the learning of a pre-trained transformer model to address this gap. We surpass the de-facto state-of-the-art approach DiffMS on two popular benchmarks NPLIB1 and MassSpecGym by 100% and 20%, respectively.
arXiv Detail & Related papers (2025-10-27T18:25:36Z)
- $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models [59.125833618091846]
We propose a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view. Experiments demonstrate that $\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks.
arXiv Detail & Related papers (2025-08-12T05:46:47Z)
- DiffSpectra: Molecular Structure Elucidation from Spectra using Diffusion Models [66.41802970528133]
Molecular structure elucidation from spectra is a foundational problem in chemistry. Traditional methods rely heavily on expert interpretation and lack scalability. We present DiffSpectra, a generative framework that directly infers both 2D and 3D molecular structures from multi-modal spectral data.
arXiv Detail & Related papers (2025-07-09T13:57:20Z)
- DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra [60.39311767532607]
DiffMS is a formula-restricted encoder-decoder generative network.
We develop a robust decoder that bridges latent embeddings and molecular structures.
Experiments show DiffMS outperforms existing models on $\textit{de novo}$ molecule generation.
arXiv Detail & Related papers (2025-02-13T18:29:48Z)
- JESTR: Joint Embedding Space Technique for Ranking Candidate Molecules for the Annotation of Untargeted Metabolomics Data [8.964879518873591]
We introduce a novel paradigm (JESTR) for annotation.
Unlike prior approaches that explicitly construct molecular fingerprints or spectra, JESTR embeds their representations in a joint space.
We evaluate JESTR against mol-to-spec and spec-to-FP annotation tools on three datasets.
arXiv Detail & Related papers (2024-11-18T03:03:57Z)
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model, which randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- MassSpecGym: A benchmark for the discovery and identification of molecules [21.471140898806315]
We propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data.
Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra.
It defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation.
arXiv Detail & Related papers (2024-10-30T15:08:05Z)
- Data-Efficient Molecular Generation with Hierarchical Textual Inversion [48.816943690420224]
We introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method.
HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution.
Compared to the conventional textual inversion method in the image domain using a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution.
arXiv Detail & Related papers (2024-05-05T08:35:23Z)
- De-novo Identification of Small Molecules from Their GC-EI-MS Spectra [0.0]
Machine learning based de-novo methods, which derive molecular structure directly from its mass spectrum, gained attention recently.
We present a novel method in this family, addressing a specific use case of GC-EI-MS spectra, which is particularly hard due to lack of additional information from the first stage of MS/MS experiments.
arXiv Detail & Related papers (2023-04-04T08:46:00Z)
- Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation.
We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria.
Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
arXiv Detail & Related papers (2022-08-23T17:01:16Z)
- Graph-based Molecular Representation Learning [59.06193431883431]
Molecular representation learning (MRL) is a key step to build the connection between machine learning and chemical science.
Recently, MRL has achieved considerable progress, especially in methods based on deep molecular graph learning.
arXiv Detail & Related papers (2022-07-08T17:43:20Z)
- Ensemble Spectral Prediction (ESP) Model for Metabolite Annotation [10.640447979978436]
A key challenge in metabolomics is annotating measured spectra from a biological sample with chemical identities.
We propose a novel machine learning model, Ensemble Spectral Prediction (ESP), for metabolite annotation.
arXiv Detail & Related papers (2022-03-25T17:05:41Z)
- Unsupervised Machine Learning for Exploratory Data Analysis of Exoplanet Transmission Spectra [68.8204255655161]
We focus on unsupervised techniques for analyzing spectral data from transiting exoplanets.
We show that there is a high degree of correlation in the spectral data, which calls for appropriate low-dimensional representations.
We uncover interesting structures in the principal component basis, namely, well-defined branches corresponding to different chemical regimes.
arXiv Detail & Related papers (2022-01-07T22:26:33Z)
- MassFormer: Tandem Mass Spectrum Prediction for Small Molecules using Graph Transformers [3.2951121243459522]
Tandem mass spectra capture fragmentation patterns that provide key structural information about a molecule.
For over seventy years, spectrum prediction has remained a key challenge in the field.
We propose a new model, MassFormer, for accurately predicting tandem mass spectra.
arXiv Detail & Related papers (2021-11-08T20:55:15Z)