MADGEN: Mass-Spec attends to De Novo Molecular generation
- URL: http://arxiv.org/abs/2501.01950v2
- Date: Wed, 08 Jan 2025 20:09:16 GMT
- Title: MADGEN: Mass-Spec attends to De Novo Molecular generation
- Authors: Yinkai Wang, Xiaohui Chen, Liping Liu, Soha Hassoun
- Abstract summary: We propose a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data.
MADGEN operates in two stages: scaffold retrieval and spectra-conditioned molecular generation.
We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym).
- Score: 16.89017809745962
- Abstract: The annotation (assigning structural chemical identities) of MS/MS spectra remains a significant challenge due to the enormous molecular diversity in biological samples and the limited scope of reference databases. Currently, the vast majority of spectral measurements remain in the "dark chemical space" without structural annotations. To improve annotation, we propose MADGEN (Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval and spectra-conditioned molecular generation starting with the scaffold. In the first stage, given an MS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ contrastive learning to align mass spectra with candidate molecular scaffolds. In the second stage, starting from the retrieved scaffold, we employ the MS/MS spectrum to guide an attention-based generative model to generate the final molecule. Our approach constrains the molecular generation search space, reducing its complexity and improving generation accuracy. We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym) and evaluate MADGEN's performance with a predictive scaffold retriever and with an oracle retriever. We demonstrate the effectiveness of using attention to integrate spectral information throughout the generation process to achieve strong results with the oracle retriever.
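The first stage described in the abstract (scaffold retrieval posed as a ranking problem, trained with contrastive learning) can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings are hand-written toy vectors standing in for learned spectrum and scaffold encoders, the function names and candidate scaffolds are hypothetical, and the InfoNCE-style loss is one common contrastive objective chosen here as an assumption, since the abstract does not name the loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_scaffolds(spectrum_emb, scaffold_embs):
    """Rank candidate scaffolds by similarity to the spectrum embedding (best first)."""
    scores = {name: cosine(spectrum_emb, emb) for name, emb in scaffold_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def info_nce(spectrum_emb, scaffold_embs, positive, temperature=0.1):
    """InfoNCE-style loss (assumed objective): low when the true scaffold scores highest."""
    logits = {name: cosine(spectrum_emb, emb) / temperature
              for name, emb in scaffold_embs.items()}
    top = max(logits.values())
    # Numerically stable log-sum-exp over all candidates.
    log_z = top + math.log(sum(math.exp(l - top) for l in logits.values()))
    return log_z - logits[positive]

# Toy embeddings (hypothetical); a real system would produce these with trained encoders.
spectrum = [0.9, 0.1, 0.0]
candidates = {
    "benzene": [1.0, 0.0, 0.0],
    "pyridine": [0.6, 0.8, 0.0],
    "cyclohexane": [0.0, 0.0, 1.0],
}

ranking = rank_scaffolds(spectrum, candidates)
loss_good = info_nce(spectrum, candidates, positive="benzene")
loss_bad = info_nce(spectrum, candidates, positive="cyclohexane")
print(ranking)
print(loss_good < loss_bad)
```

In this toy setup the top-ranked scaffold would be passed to the second stage as the starting point for generation, and the loss is smaller when the labeled scaffold is the one the spectrum embedding is already closest to.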
Related papers
- DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra [60.39311767532607]
DiffMS is a formula-restricted encoder-decoder generative network.
We develop a robust decoder that bridges latent embeddings and molecular structures.
Experiments show DiffMS outperforms existing models on de novo molecule generation.
arXiv Detail & Related papers (2025-02-13T18:29:48Z)
- JESTR: Joint Embedding Space Technique for Ranking Candidate Molecules for the Annotation of Untargeted Metabolomics Data [8.964879518873591]
We introduce a novel paradigm (JESTR) for annotation.
Unlike prior approaches that explicitly construct molecular fingerprints or spectra, JESTR embeds their representations in a joint space.
We evaluate JESTR against mol-to-spec and spec-to-FP annotation tools on three datasets.
arXiv Detail & Related papers (2024-11-18T03:03:57Z)
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model, which randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- MassSpecGym: A benchmark for the discovery and identification of molecules [21.471140898806315]
We propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data.
Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra.
It defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation.
arXiv Detail & Related papers (2024-10-30T15:08:05Z)
- Data-Efficient Molecular Generation with Hierarchical Textual Inversion [48.816943690420224]
We introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method.
HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution.
Compared to the conventional textual inversion method in the image domain using a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution.
arXiv Detail & Related papers (2024-05-05T08:35:23Z)
- De-novo Identification of Small Molecules from Their GC-EI-MS Spectra [0.0]
Machine learning based de novo methods, which derive molecular structure directly from its mass spectrum, gained attention recently.
We present a novel method in this family, addressing a specific use case of GC-EI-MS spectra, which is particularly hard due to lack of additional information from the first stage of MS/MS experiments.
arXiv Detail & Related papers (2023-04-04T08:46:00Z)
- Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation.
We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria.
Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
arXiv Detail & Related papers (2022-08-23T17:01:16Z)
- Graph-based Molecular Representation Learning [59.06193431883431]
Molecular representation learning (MRL) is a key step to build the connection between machine learning and chemical science.
Recently, MRL has achieved considerable progress, especially in methods based on deep molecular graph learning.
arXiv Detail & Related papers (2022-07-08T17:43:20Z)
- Ensemble Spectral Prediction (ESP) Model for Metabolite Annotation [10.640447979978436]
A key challenge in metabolomics is annotating measured spectra from a biological sample with chemical identities.
We propose a novel machine learning model, Ensemble Spectral Prediction (ESP), for metabolite annotation.
arXiv Detail & Related papers (2022-03-25T17:05:41Z)
- Unsupervised Machine Learning for Exploratory Data Analysis of Exoplanet Transmission Spectra [68.8204255655161]
We focus on unsupervised techniques for analyzing spectral data from transiting exoplanets.
We show that there is a high degree of correlation in the spectral data, which calls for appropriate low-dimensional representations.
We uncover interesting structures in the principal component basis, namely, well-defined branches corresponding to different chemical regimes.
arXiv Detail & Related papers (2022-01-07T22:26:33Z)
- MassFormer: Tandem Mass Spectrum Prediction for Small Molecules using Graph Transformers [3.2951121243459522]
Tandem mass spectra capture fragmentation patterns that provide key structural information about a molecule.
For over seventy years, spectrum prediction has remained a key challenge in the field.
We propose a new model, MassFormer, for accurately predicting tandem mass spectra.
arXiv Detail & Related papers (2021-11-08T20:55:15Z)