2DNMRGym: An Annotated Experimental Dataset for Atom-Level Molecular Representation Learning in 2D NMR via Surrogate Supervision
- URL: http://arxiv.org/abs/2505.18181v1
- Date: Fri, 16 May 2025 18:02:05 GMT
- Title: 2DNMRGym: An Annotated Experimental Dataset for Atom-Level Molecular Representation Learning in 2D NMR via Surrogate Supervision
- Authors: Yunrui Li, Hao Xu, Pengyu Hong
- Abstract summary: We introduce 2DNMRGym, the first annotated experimental dataset designed for Machine Learning-based representation learning in 2D NMR. 2DNMRGym adopts a surrogate supervision setup: models are trained using algorithm-generated annotations from a previously validated method. We provide benchmark results using a series of 2D and 3D GNN and GNN transformer models, establishing a strong foundation for future work.
- Score: 7.470166291890153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Two-dimensional (2D) Nuclear Magnetic Resonance (NMR) spectroscopy, particularly Heteronuclear Single Quantum Coherence (HSQC) spectroscopy, plays a critical role in elucidating molecular structures, interactions, and electronic properties. However, accurately interpreting 2D NMR data remains labor-intensive and error-prone, requiring highly trained domain experts, especially for complex molecules. Machine Learning (ML) holds significant potential in 2D NMR analysis by learning molecular representations and recognizing complex patterns from data. However, progress has been limited by the lack of large-scale and high-quality annotated datasets. In this work, we introduce 2DNMRGym, the first annotated experimental dataset designed for ML-based molecular representation learning in 2D NMR. It includes over 22,000 HSQC spectra, along with the corresponding molecular graphs and SMILES strings. Uniquely, 2DNMRGym adopts a surrogate supervision setup: models are trained using algorithm-generated annotations derived from a previously validated method and evaluated on a held-out set of human-annotated gold-standard labels. This enables rigorous assessment of a model's ability to generalize from imperfect supervision to expert-level interpretation. We provide benchmark results using a series of 2D and 3D GNN and GNN transformer models, establishing a strong foundation for future work. 2DNMRGym supports scalable model training and introduces a chemically meaningful benchmark for evaluating atom-level molecular representations in NMR-guided structural tasks. Our data and code are open-source and available on Hugging Face and GitHub.
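The surrogate supervision setup described in the abstract can be illustrated with a minimal sketch: fit a model on algorithm-generated labels, then score it only against held-out human-annotated labels. Everything below is an assumption made for illustration. The record layout, the atom-environment names, the shift values, the per-environment mean baseline, and the use of mean absolute error are hypothetical and are not taken from 2DNMRGym or its benchmark code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (atom_environment, 13C shift in ppm, 1H shift in ppm).
# All values are made up for illustration; they are not taken from 2DNMRGym.
surrogate_train = [  # stands in for algorithm-generated (surrogate) annotations
    ("CH3", 14.0, 0.9),
    ("CH3", 16.0, 1.1),
    ("aromatic_CH", 128.0, 7.2),
    ("aromatic_CH", 130.0, 7.4),
]
gold_test = [  # stands in for held-out human-annotated labels
    ("CH3", 15.5, 1.0),
    ("aromatic_CH", 128.5, 7.3),
]


def fit_mean_baseline(records):
    """Fit a trivial baseline: mean (13C, 1H) shift per atom environment."""
    grouped = defaultdict(list)
    for env, c_shift, h_shift in records:
        grouped[env].append((c_shift, h_shift))
    return {
        env: (mean(c for c, _ in pairs), mean(h for _, h in pairs))
        for env, pairs in grouped.items()
    }


def mean_absolute_error(model, records):
    """Return (MAE on 13C, MAE on 1H) of the model against the given labels."""
    c_errors, h_errors = [], []
    for env, c_true, h_true in records:
        c_pred, h_pred = model[env]
        c_errors.append(abs(c_pred - c_true))
        h_errors.append(abs(h_pred - h_true))
    return mean(c_errors), mean(h_errors)


# Train on surrogate labels only; evaluate on gold labels only.
model = fit_mean_baseline(surrogate_train)
c_mae, h_mae = mean_absolute_error(model, gold_test)
print(f"13C MAE: {c_mae:.2f} ppm, 1H MAE: {h_mae:.2f} ppm")
```

Keeping the gold labels out of training is the point of the design: the reported error then measures how well a model trained on imperfect annotations matches expert interpretation, rather than how well it reproduces the annotating algorithm.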
Related papers
- DiffSpectra: Molecular Structure Elucidation from Spectra using Diffusion Models [66.41802970528133]
Molecular structure elucidation from spectra is a foundational problem in chemistry. Traditional methods rely heavily on expert interpretation and lack scalability. We present DiffSpectra, a generative framework that directly infers both 2D and 3D molecular structures from multi-modal spectral data.
arXiv Detail & Related papers (2025-07-09T13:57:20Z)
- Combining Graph Neural Networks and Mixed Integer Linear Programming for Molecular Inference under the Two-Layered Model [6.107266553770076]
We develop a molecular inference framework based on mol-infer, namely mol-infer-GNN, that utilizes GNN as the learning method. Our proposed GNN model can obtain satisfying learning performances for some properties despite its simple structure.
arXiv Detail & Related papers (2025-07-05T06:57:37Z)
- GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition [60.76623665324548]
GTR-Mol-VLM is a novel framework featuring two key innovations. It emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions. MolRec-Bench is the first benchmark designed for a fine-grained evaluation of graph-parsing accuracy in OCSR.
arXiv Detail & Related papers (2025-06-09T08:47:10Z)
- Using machine learning to map simulated noisy and laser-limited multidimensional spectra to molecular electronic couplings [0.0]
We show how factors associated with experimental 2D spectral data influence the ability of NNs to map simulated 2DES spectra onto intermolecular electronic couplings. In stark contrast to human-based analyses of 2DES data, we find that the NN accuracy improves significantly when the data are constrained by the bandwidth and center frequency of the pump pulses.
arXiv Detail & Related papers (2025-03-19T21:40:00Z)
- DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra [60.39311767532607]
We present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs. Experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation.
arXiv Detail & Related papers (2025-02-13T18:29:48Z)
- Graph-neural-network predictions of solid-state NMR parameters from spherical tensor decomposition [0.0]
Nuclear magnetic resonance (NMR) is a powerful spectroscopic technique that is sensitive to the local atomic structure of matter. Machine learning (ML) has emerged as an efficient route to making such predictions.
arXiv Detail & Related papers (2024-12-19T17:11:07Z)
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model, which randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- TransPeakNet: Solvent-Aware 2D NMR Prediction via Multi-Task Pre-Training and Unsupervised Learning [5.7279868722119325]
We introduce an unsupervised training framework for predicting cross-peaks in 2D NMR. Our approach pretrains an ML model on an annotated 1D dataset of 1H and 13C shifts, then finetunes it in an unsupervised manner. Evaluation on 479 expert-annotated HSQC spectra demonstrates our model's superiority over traditional methods.
arXiv Detail & Related papers (2024-03-17T21:52:51Z)
- Carbohydrate NMR chemical shift predictions using E(3) equivariant graph neural networks [0.0]
This work introduces a novel approach that leverages E(3) equivariant graph neural networks to predict carbohydrate NMR spectra.
Notably, our model achieves a substantial reduction in mean absolute error, up to threefold, compared to traditional models.
The implications are far-reaching and go beyond an advanced understanding of carbohydrate structures and spectral interpretation.
arXiv Detail & Related papers (2023-11-21T15:01:14Z)
- Confidence-guided Lesion Mask-based Simultaneous Synthesis of Anatomic and Molecular MR Images in Patients with Post-treatment Malignant Gliomas [65.64363834322333]
Confidence Guided SAMR (CG-SAMR) synthesizes data from lesion information to multi-modal anatomic sequences.
The module guides the synthesis based on a confidence measure of the intermediate results.
Experiments on real clinical data demonstrate that the proposed model can perform better than the state-of-the-art synthesis methods.
arXiv Detail & Related papers (2020-08-06T20:20:22Z)
- Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning.
GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data.
We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z)
- Multi-View Graph Neural Networks for Molecular Property Prediction [67.54644592806876]
We present Multi-View Graph Neural Network (MV-GNN), a multi-view message passing architecture.
In MV-GNN, we introduce a shared self-attentive readout component and disagreement loss to stabilize the training process.
We further boost the expressive power of MV-GNN by proposing a cross-dependent message passing scheme.
arXiv Detail & Related papers (2020-05-17T04:46:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.