Analysis of Atom-level pretraining with Quantum Mechanics (QM) data for Graph Neural Networks Molecular property models
- URL: http://arxiv.org/abs/2405.14837v2
- Date: Mon, 27 May 2024 07:56:06 GMT
- Title: Analysis of Atom-level pretraining with Quantum Mechanics (QM) data for Graph Neural Networks Molecular property models
- Authors: Jose Arjona-Medina, Ramil Nugmanov
- Abstract summary: We show how atom-level pretraining with quantum mechanics (QM) data can mitigate violations of assumptions regarding the distributional similarity between training and test data.
This is the first time that hidden-state molecular representations are analyzed to compare the effects of molecule-level and atom-level pretraining on QM data.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rapid and significant advancements in deep learning for Quantitative Structure-Activity Relationship (QSAR) models, learning robust molecular representations that generalize to novel compounds in real-world scenarios remains an elusive and unresolved task. This study examines how atom-level pretraining with quantum mechanics (QM) data can mitigate violations of the assumption that training and test data are distributed similarly, and thereby improve performance and generalization in downstream tasks. On the public Therapeutics Data Commons (TDC) benchmark, we show that pretraining on atom-level QM data improves overall performance and makes the distribution of feature activations more Gaussian-like, which yields a representation that is more robust to distribution shifts. To the best of our knowledge, this is the first time that hidden-state molecular representations are analyzed to compare the effects of molecule-level and atom-level pretraining on QM data.
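The pretraining recipe above can be pictured as a two-stage setup: first regress per-atom QM targets, then reuse the encoder for molecule-level prediction. Below is a minimal sketch assuming a PyTorch Geometric style GNN; the GCNConv layers, layer widths, and the single per-atom regression target (e.g., a DFT-derived partial charge) are illustrative assumptions, not the paper's exact architecture.
```python
# Minimal sketch of atom-level QM pretraining followed by molecule-level
# fine-tuning. Layer choices and the target are assumptions for illustration.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class AtomGNN(nn.Module):
    def __init__(self, in_dim=32, hidden=128):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.atom_head = nn.Linear(hidden, 1)  # per-atom QM target (pretraining)
        self.mol_head = nn.Linear(hidden, 1)   # per-molecule property (fine-tuning)

    def encode(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return torch.relu(self.conv2(h, edge_index))

    def forward_pretrain(self, x, edge_index):
        # One regression value per atom, e.g. a DFT-derived partial charge.
        return self.atom_head(self.encode(x, edge_index))

    def forward_finetune(self, x, edge_index, batch):
        # Pool atom states into a molecule embedding, then predict the property.
        return self.mol_head(global_mean_pool(self.encode(x, edge_index), batch))
```
After pretraining, the encoder weights are kept and only `mol_head` is trained anew, which is how the atom-level QM signal shapes the hidden-state representation the paper analyzes.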
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
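A minimal sketch of this masking idea, assuming a simple regex tokenizer over the SMILES organic subset and a 15% mask rate; neither the tokenizer nor the rate is taken from the paper.
```python
# Hypothetical atom-token masking over SMILES; tokenizer and rate are assumptions.
import random
import re

# Two-letter halogens and bracket atoms first, then one-letter (and aromatic) atoms.
ATOM = re.compile(r"Cl|Br|\[[^\]]+\]|[BCNOPSFIbcnops]")

def mask_atoms(smiles: str, rate: float = 0.15, mask: str = "<mask>") -> str:
    out, i = [], 0
    while i < len(smiles):
        m = ATOM.match(smiles, i)
        if m:
            out.append(mask if random.random() < rate else m.group(0))
            i = m.end()
        else:
            out.append(smiles[i])  # bonds, ring digits, branches pass through
            i += 1
    return "".join(out)

print(mask_atoms("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin with some atom tokens masked
```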
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- Pretraining Graph Transformers with Atom-in-a-Molecule Quantum Properties for Improved ADMET Modeling [38.53065398127086]
We evaluate the impact of pretraining Graph Transformer architectures on atom-level quantum-mechanical features.
We find that models pretrained on atomic quantum-mechanical properties capture more low-frequency Laplacian eigenmodes.
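To make "low-frequency Laplacian eigenmodes" concrete, here is a small NumPy sketch that extracts the smoothest eigenvectors of a molecular graph's combinatorial Laplacian; the adjacency-matrix input and the butane-like path graph are illustrative, not the paper's pipeline.
```python
# Low-frequency eigenmodes of a graph Laplacian (illustrative sketch).
import numpy as np

def laplacian_eigenmodes(adj: np.ndarray, k: int = 2):
    lap = np.diag(adj.sum(axis=1)) - adj  # combinatorial Laplacian L = D - A
    vals, vecs = np.linalg.eigh(lap)      # eigenvalues in ascending order
    return vals[:k], vecs[:, :k]          # the k smoothest (low-frequency) modes

# Butane-like path graph C-C-C-C: mode 0 is constant, higher modes oscillate.
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
print(laplacian_eigenmodes(adj)[0])  # [0.0, ~0.586]
```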
arXiv Detail & Related papers (2024-10-10T15:20:30Z)
- ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth.
Chemical QA plays a crucial role in both education and research by translating complex chemical information into a readily understandable format.
This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that can be potentially useful.
We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z)
- QKSAN: A Quantum Kernel Self-Attention Network [53.96779043113156]
A Quantum Kernel Self-Attention Mechanism (QKSAM) is introduced to combine the data representation merit of Quantum Kernel Methods (QKM) with the efficient information extraction capability of SAM.
A Quantum Kernel Self-Attention Network (QKSAN) framework is proposed based on QKSAM, which ingeniously incorporates the Deferred Measurement Principle (DMP) and conditional measurement techniques.
Four QKSAN sub-models are deployed on PennyLane and IBM Qiskit platforms to perform binary classification on MNIST and Fashion MNIST.
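As a hedged PennyLane sketch of the underlying building block: a quantum kernel estimates the overlap of two embedded states as the probability of measuring the all-zeros state after one embedding followed by the adjoint of the other. This illustrates a generic quantum kernel, not the full QKSAN architecture or its deferred-measurement machinery.
```python
# Generic quantum kernel in PennyLane; QKSAN-specific details are omitted.
import pennylane as qml
import numpy as np

n_wires = 2
dev = qml.device("default.qubit", wires=n_wires)

@qml.qnode(dev)
def kernel_circuit(x, y):
    qml.AngleEmbedding(x, wires=range(n_wires))               # embed x
    qml.adjoint(qml.AngleEmbedding)(y, wires=range(n_wires))  # un-embed y
    return qml.probs(wires=range(n_wires))

def kernel(x, y):
    # P(|00>) after the circuit equals |<phi(y)|phi(x)>|^2.
    return kernel_circuit(x, y)[0]

print(kernel(np.array([0.1, 0.5]), np.array([0.1, 0.5])))  # ~1.0 for identical inputs
```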
arXiv Detail & Related papers (2023-08-25T15:08:19Z)
- Atomic and Subgraph-aware Bilateral Aggregation for Molecular Representation Learning [57.670845619155195]
We introduce a new model for molecular representation learning called Atomic and Subgraph-aware Bilateral Aggregation (ASBA).
ASBA addresses the limitations of previous atom-wise and subgraph-wise models by incorporating both types of information.
Our method offers a more comprehensive way to learn representations for molecular property prediction and has broad potential in drug and material discovery applications.
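A minimal sketch of what a bilateral fusion could look like: combine a pooled atom-wise representation with a pooled subgraph-wise one. The concatenate-then-project fusion below is an assumption for illustration, not ASBA's exact aggregation.
```python
# Hypothetical fusion of atom-wise and subgraph-wise branch embeddings.
import torch
import torch.nn as nn

class BilateralFusion(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, atom_emb: torch.Tensor, subgraph_emb: torch.Tensor):
        # atom_emb, subgraph_emb: [batch, dim] pooled branch representations
        return torch.relu(self.fuse(torch.cat([atom_emb, subgraph_emb], dim=-1)))

m = BilateralFusion(dim=8)
print(m(torch.randn(2, 8), torch.randn(2, 8)).shape)  # torch.Size([2, 8])
```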
arXiv Detail & Related papers (2023-05-22T00:56:00Z)
- Pre-training via Denoising for Molecular Property Prediction [53.409242538744444]
We describe a pre-training technique that utilizes large datasets of 3D molecular structures at equilibrium.
Inspired by recent advances in noise regularization, our pre-training objective is based on denoising.
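A minimal sketch of the denoising objective: perturb equilibrium 3D coordinates with Gaussian noise and train the network to predict that noise. The toy per-atom MLP and noise scale are placeholders; the actual work uses equivariant architectures on large structure datasets.
```python
# Coordinate-denoising pretraining loss; the model and sigma are placeholders.
import torch

def denoising_loss(model: torch.nn.Module, pos: torch.Tensor, sigma: float = 0.1):
    noise = sigma * torch.randn_like(pos)  # Gaussian perturbation per atom
    pred = model(pos + noise)              # predict the added noise
    return ((pred - noise) ** 2).mean()    # score-matching style MSE

toy_model = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, 3))
pos = torch.randn(20, 3)  # 20 atoms at "equilibrium" positions
print(denoising_loss(toy_model, pos).item())
```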
arXiv Detail & Related papers (2022-05-31T22:28:34Z)
- Improving Molecular Representation Learning with Metric Learning-enhanced Optimal Transport [49.237577649802034]
We develop a novel optimal transport-based algorithm, termed MROT, to enhance the generalization capability of molecular representations in regression problems.
MROT significantly outperforms state-of-the-art models, showing promising potential in accelerating the discovery of new substances.
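For intuition, an entropic optimal transport plan between two feature distributions can be computed with standard Sinkhorn iterations; this generic sketch illustrates the OT machinery such methods build on, not MROT's specific metric-learning-enhanced algorithm.
```python
# Entropic OT via Sinkhorn iterations (generic sketch, not MROT itself).
import numpy as np

def sinkhorn(cost: np.ndarray, reg: float = 0.1, iters: int = 200) -> np.ndarray:
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])  # uniform source marginal
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])  # uniform target marginal
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan with the given marginals

plan = sinkhorn(np.random.rand(5, 5))
print(plan.sum())  # ~1.0: a valid coupling
```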
arXiv Detail & Related papers (2022-02-13T04:56:18Z)
- Do Large Scale Molecular Language Representations Capture Important Structural Information? [31.76876206167457]
We present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer.
Experiments show that the learned molecular representation performs competitively when compared to graph-based and fingerprint-based supervised learning baselines.
arXiv Detail & Related papers (2021-06-17T14:33:55Z)
- Predicting molecular dipole moments by combining atomic partial charges and atomic dipoles [3.0980025155565376]
"MuML" models are fitted together to reproduce molecular $boldsymbolmu$ computed using high-level coupled-cluster theory.
We demonstrate that the uncertainty in the predictions can be estimated reliably using a calibrated committee model.
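The decomposition behind these models combines atomic point charges and atomic dipoles, roughly $\boldsymbol{\mu} = \sum_i q_i \mathbf{r}_i + \sum_i \boldsymbol{\mu}_i$. A toy NumPy sketch with illustrative values:
```python
# Molecular dipole from partial charges plus atomic dipoles (toy values).
import numpy as np

def molecular_dipole(charges, positions, atomic_dipoles):
    # charges: [n], positions: [n, 3], atomic_dipoles: [n, 3]
    return (charges[:, None] * positions + atomic_dipoles).sum(axis=0)

q = np.array([0.4, -0.4])                         # toy diatomic partial charges
r = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # positions (arbitrary units)
mu_atomic = np.zeros((2, 3))                      # atomic dipole contributions
print(molecular_dipole(q, r, mu_atomic))          # [ 0.  0. -0.4]
```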
arXiv Detail & Related papers (2020-03-27T14:35:37Z)
- Embedded-physics machine learning for coarse-graining and collective variable discovery without data [3.222802562733787]
We present a novel learning framework that consistently embeds underlying physics.
We propose a novel objective based on reverse Kullback-Leibler divergence that fully incorporates the available physics in the form of the atomistic force field.
We demonstrate the algorithmic advances in terms of predictive ability and the physical meaning of the revealed CVs for a bimodal potential energy function and the alanine dipeptide.
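For reference, a reverse Kullback-Leibler objective against the Boltzmann density $p(x) \propto e^{-\beta U(x)}$ of an atomistic force field $U$ expands as follows (a standard identity with $\beta = 1/k_BT$; $\log Z$ is constant in the model parameters $\theta$):
```latex
% Reverse KL between the learned model q_theta and the Boltzmann density of U.
\begin{aligned}
\mathrm{KL}\left(q_\theta \,\middle\|\, p\right)
  &= \mathbb{E}_{x \sim q_\theta}\left[\log q_\theta(x) - \log p(x)\right] \\
  &= \mathbb{E}_{x \sim q_\theta}\left[\log q_\theta(x) + \beta U(x)\right] + \log Z .
\end{aligned}
```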
arXiv Detail & Related papers (2020-02-24T10:28:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.