GEOM: Energy-annotated molecular conformations for property prediction
and molecular generation
- URL: http://arxiv.org/abs/2006.05531v4
- Date: Wed, 9 Feb 2022 23:10:12 GMT
- Title: GEOM: Energy-annotated molecular conformations for property prediction
and molecular generation
- Authors: Simon Axelrod, Rafael Gomez-Bombarelli
- Abstract summary: We use advanced sampling and semi-empirical density functional theory to generate 37 million molecular conformations for over 450,000 molecules.
The dataset contains conformers for 133,000 species from QM9, and 317,000 species with experimental data related to biophysics, physiology, and physical chemistry.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning (ML) outperforms traditional approaches in many molecular
design tasks. ML models usually predict molecular properties from a 2D chemical
graph or a single 3D structure, but neither of these representations accounts
for the ensemble of 3D conformers that are accessible to a molecule. Property
prediction could be improved by using conformer ensembles as input, but there
is no large-scale dataset that contains graphs annotated with accurate
conformers and experimental data. Here we use advanced sampling and
semi-empirical density functional theory (DFT) to generate 37 million molecular
conformations for over 450,000 molecules. The Geometric Ensemble Of Molecules
(GEOM) dataset contains conformers for 133,000 species from QM9, and 317,000
species with experimental data related to biophysics, physiology, and physical
chemistry. Ensembles of 1,511 species with BACE-1 inhibition data are also
labeled with high-quality DFT free energies in an implicit water solvent, and
534 ensembles are further optimized with DFT. GEOM will assist in the
development of models that predict properties from conformer ensembles, and
generative models that sample 3D conformations.
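As an illustration of how an energy-annotated conformer ensemble can feed a property model, the minimal sketch below converts conformer energies into normalized Boltzmann weights and uses them to average a per-conformer quantity. It is not taken from the paper: the function names, the kcal/mol units, the 298.15 K temperature, and the three-conformer example values are all assumptions made for illustration.

```python
import math

# Gas constant in kcal/(mol*K) times an assumed temperature of 298.15 K.
KT_KCAL_PER_MOL = 0.0019872041 * 298.15


def boltzmann_weights(energies, kt=KT_KCAL_PER_MOL):
    """Return normalized Boltzmann weights for conformer energies (kcal/mol)."""
    if not energies:
        raise ValueError("need at least one conformer energy")
    e_min = min(energies)
    # Shift by the lowest energy so the largest exponent is exactly 0.
    factors = [math.exp(-(e - e_min) / kt) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_average(values, weights):
    """Weighted average of a per-conformer quantity over the ensemble."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical ensemble: relative energies in kcal/mol and a made-up
# per-conformer property (illustrative numbers, not GEOM data).
energies = [0.0, 0.5, 2.0]
properties = [1.0, 2.0, 3.0]
weights = boltzmann_weights(energies)
average = ensemble_average(properties, weights)
```

Lower-energy conformers receive larger weights, so the ensemble average is dominated by the most stable geometries while still reflecting the rest of the ensemble.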
Related papers
- QMe14S, A Comprehensive and Efficient Spectral Dataset for Small Organic Molecules [10.076287990554901]
We introduce the QMe14S dataset, comprising 186,102 small organic molecules featuring 14 elements.
We optimized geometries and calculated properties including energy, atomic charge, atomic force, dipole moment, quadrupole moment, polarizability, octupole moment, first hyperpolarizability, and Hessian.
We demonstrate that models trained on QMe14S outperform those trained on the previously developed QM9S dataset in simulating molecular spectra.
arXiv Detail & Related papers (2025-01-31T04:12:53Z)
- M$^{3}$-20M: A Large-Scale Multi-Modal Molecule Dataset for AI-driven Drug Design and Discovery [23.60901496004578]
M$^{3}$-20M contains 71 times more molecules than the largest existing dataset.
This dataset integrates one-dimensional SMILES, two-dimensional molecular graphs, three-dimensional molecular structures, physicochemical properties, and textual descriptions.
arXiv Detail & Related papers (2024-12-08T03:43:07Z)
- BAPULM: Binding Affinity Prediction using Language Models [7.136205674624813]
We introduce BAPULM, an innovative sequence-based framework that leverages the chemical latent representations of proteins via ProtT5-XL-U50 and of ligands through MolFormer.
Our approach was validated extensively on benchmark datasets, achieving sequential scoring power (R) values of 0.925 $\pm$ 0.043, 0.914 $\pm$ 0.004, and 0.8132 $\pm$ 0.001 on benchmark1k2101, Test2016_290, and CSAR-HiQ_36, respectively.
arXiv Detail & Related papers (2024-11-06T04:35:30Z)
- Pre-training of Molecular GNNs via Conditional Boltzmann Generator [0.0]
We propose a pre-training method for molecular GNNs using an existing dataset of molecular conformations.
We show that our model has a better prediction performance for molecular properties than existing pre-training methods.
arXiv Detail & Related papers (2023-12-20T15:30:15Z)
- SE(3)-Invariant Multiparameter Persistent Homology for Chiral-Sensitive Molecular Property Prediction [1.534667887016089]
We present a novel method for generating molecular fingerprints using multiparameter persistent homology (MPPH).
This technique holds considerable significance for drug discovery and materials science, where precise molecular property prediction is vital.
We demonstrate its superior performance over existing state-of-the-art methods in predicting molecular properties through extensive evaluations on the MoleculeNet benchmark.
arXiv Detail & Related papers (2023-12-12T09:33:54Z)
- Automated 3D Pre-Training for Molecular Property Prediction [54.15788181794094]
We propose a novel 3D pre-training framework (dubbed 3D PGT).
It pre-trains a model on 3D molecular graphs, and then fine-tunes it on molecular graphs without 3D structures.
Extensive experiments on 2D molecular graphs are conducted to demonstrate the accuracy, efficiency and generalization ability of the proposed 3D PGT.
arXiv Detail & Related papers (2023-06-13T14:43:13Z)
- Molecule Design by Latent Space Energy-Based Modeling and Gradual Distribution Shifting [53.44684898432997]
Generation of molecules with desired chemical and biological properties is critical for drug discovery.
We propose a probabilistic generative model to capture the joint distribution of molecules and their properties.
Our method achieves very strong performances on various molecule design tasks.
arXiv Detail & Related papers (2023-06-09T03:04:21Z)
- An Equivariant Generative Framework for Molecular Graph-Structure Co-Design [54.92529253182004]
We present MolCode, a machine learning-based generative framework for Molecular graph-structure Co-design.
In MolCode, 3D geometric information empowers the molecular 2D graph generation, which in turn helps guide the prediction of molecular 3D structure.
Our investigation reveals that the 2D topology and 3D geometry contain intrinsically complementary information in molecule design.
arXiv Detail & Related papers (2023-04-12T13:34:22Z)
- Geometry-Complete Diffusion for 3D Molecule Generation and Optimization [3.8366697175402225]
We introduce the Geometry-Complete Diffusion Model (GCDM) for 3D molecule generation.
GCDM outperforms existing 3D molecular diffusion models by significant margins across conditional and unconditional settings.
We also show that GCDM's geometric features can be repurposed to consistently optimize the geometry and chemical composition of existing 3D molecules.
arXiv Detail & Related papers (2023-02-08T20:01:51Z)
- Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z)
- Molecular Geometry-aware Transformer for accurate 3D Atomic System modeling [51.83761266429285]
We propose a novel Transformer architecture that takes nodes (atoms) and edges (bonds and nonbonding atom pairs) as inputs and models the interactions among them.
Moleformer achieves state-of-the-art results on the initial-state-to-relaxed-energy prediction task of OC20 and is very competitive on QM9 for predicting quantum chemical properties.
arXiv Detail & Related papers (2023-02-02T03:49:57Z)
- Augmenting Molecular Deep Generative Models with Topological Data Analysis Representations [21.237758981760784]
We present a SMILES Variational Auto-Encoder (VAE) augmented with topological data analysis (TDA) representations of molecules.
Our experiments show that this TDA augmentation enables a SMILES VAE to capture the complex relation between 3D geometry and electronic properties.
arXiv Detail & Related papers (2021-06-08T15:49:21Z)
- BIGDML: Towards Exact Machine Learning Force Fields for Materials [55.944221055171276]
Machine-learning force fields (MLFF) should be accurate, computationally and data efficient, and applicable to molecules, materials, and interfaces thereof.
Here, we introduce the Bravais-Inspired Gradient-Domain Machine Learning (BIGDML) approach and demonstrate its ability to construct reliable force fields using a training set with just 10-200 geometries.
arXiv Detail & Related papers (2021-06-08T10:14:57Z)