Keeping it Simple: Language Models can learn Complex Molecular
Distributions
- URL: http://arxiv.org/abs/2112.03041v1
- Date: Mon, 6 Dec 2021 13:40:58 GMT
- Title: Keeping it Simple: Language Models can learn Complex Molecular
Distributions
- Authors: Daniel Flam-Shepherd, Kevin Zhu and Alán Aspuru-Guzik
- Abstract summary: We introduce several challenging generative modeling tasks by compiling especially complex distributions of molecules.
The results demonstrate that language models are powerful generative models, capable of adeptly learning complex molecular distributions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep generative models of molecules have grown immensely in popularity;
trained on relevant datasets, these models are used to search through chemical
space. The downstream utility of generative models for the inverse design of
novel functional compounds depends on their ability to learn a training
distribution of molecules. The simplest example is a language model that
takes the form of a recurrent neural network and generates molecules using a
string representation. More sophisticated are graph generative models, which
sequentially construct molecular graphs and typically achieve state-of-the-art
results. However, recent work has shown that language models are more capable
than once thought, particularly in the low-data regime. In this work, we
investigate the capacity of simple language models to learn distributions of
molecules. For this purpose, we introduce several challenging generative
modeling tasks by compiling especially complex distributions of molecules. On
each task, we evaluate the ability of language models compared with two
widely used graph generative models. The results demonstrate that language
models are powerful generative models, capable of adeptly learning complex
molecular distributions, and yield better performance than the graph models.
Language models can accurately generate distributions of the highest-scoring
penalized logP molecules in ZINC15, multi-modal molecular distributions, and
the largest molecules in PubChem.
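As a concrete picture of the simple baseline the abstract describes, the sketch below implements a character-level SMILES language model as a recurrent network trained on next-token prediction and sampled autoregressively. It is a minimal illustration, not the authors' code: the vocabulary construction, the special tokens, the LSTM sizes, and the `train_step` helper are all assumptions made here for readability.

```python
# Minimal character-level SMILES language model (illustrative sketch; the
# paper's exact architecture and hyperparameters may differ).
import torch
import torch.nn as nn

PAD, BOS, EOS = "<pad>", "<bos>", "<eos>"

def build_vocab(smiles_list):
    """Map every character seen in the training SMILES to an integer id."""
    chars = sorted({c for s in smiles_list for c in s})
    itos = [PAD, BOS, EOS] + chars
    return {c: i for i, c in enumerate(itos)}, itos

class SmilesRNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        hidden, state = self.rnn(self.embed(tokens), state)
        return self.out(hidden), state

def train_step(model, batch, optimizer, pad_idx):
    """One next-token prediction step. batch: [B, T] ids of <bos> s <eos>."""
    logits, _ = model(batch[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch[:, 1:].reshape(-1),
        ignore_index=pad_idx,  # padding positions do not contribute to the loss
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def sample(model, stoi, itos, max_len=100, temperature=1.0):
    """Autoregressively sample one SMILES string, character by character."""
    model.eval()
    tok, state, chars = torch.tensor([[stoi[BOS]]]), None, []
    for _ in range(max_len):
        logits, state = model(tok, state)
        weights = torch.softmax(logits[0, -1] / temperature, dim=-1)
        weights[stoi[PAD]] = weights[stoi[BOS]] = 0.0  # never emit specials
        idx = torch.multinomial(weights, 1).item()
        if idx == stoi[EOS]:
            break
        chars.append(itos[idx])
        tok = torch.tensor([[idx]])
    return "".join(chars)
```

Sampled strings can then be checked for chemical validity with RDKit, where `Chem.MolFromSmiles` returns `None` for invalid SMILES; filtering and counting such failures is the usual way models of this kind report validity.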
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms (a toy sketch of this masking idea appears after this list).
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning [11.862370962277938]
We present a novel generative model, BindGPT, which uses a conceptually simple but powerful approach to create 3D molecules within the protein's binding site.
We show how such a conceptually simple approach, combined with pretraining and scaling, can perform on par with or better than the current best specialized diffusion models.
arXiv Detail & Related papers (2024-06-06T02:10:50Z)
- LDMol: Text-to-Molecule Diffusion Model with Structurally Informative Latent Space [55.5427001668863]
We present a novel latent diffusion model dubbed LDMol for text-conditioned molecule generation.
LDMol comprises a molecule autoencoder that produces a learnable and structurally informative feature space.
We show that LDMol can be applied to downstream tasks such as molecule-to-text retrieval and text-guided molecule editing.
arXiv Detail & Related papers (2024-05-28T04:59:13Z)
- GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text [25.979382232281786]
We introduce GIT-Mol, a multi-modal large language model that integrates graph, image, and text information.
We achieve a 5%-10% accuracy increase in property prediction and a 20.2% boost in molecule generation validity.
arXiv Detail & Related papers (2023-08-14T03:12:29Z)
- Molecule Design by Latent Space Energy-Based Modeling and Gradual Distribution Shifting [53.44684898432997]
Generation of molecules with desired chemical and biological properties is critical for drug discovery.
We propose a probabilistic generative model to capture the joint distribution of molecules and their properties.
Our method achieves very strong performance on various molecule design tasks.
arXiv Detail & Related papers (2023-06-09T03:04:21Z)
- Probabilistic Generative Transformer Language models for Generative Design of Molecules [10.412989388092084]
Generative Molecular Transformer (GMTransformer) is a probabilistic neural network model for generative design of molecules.
Our model is built on the blank-filling language model originally developed for text processing.
Our models achieve high novelty and Scaf scores compared to other baselines.
arXiv Detail & Related papers (2022-09-20T01:51:57Z)
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
- Learning Neural Generative Dynamics for Molecular Conformation Generation [89.03173504444415]
We study how to generate molecule conformations (i.e., 3D structures) from a molecular graph.
We propose a novel probabilistic framework to generate valid and diverse conformations given a molecular graph.
arXiv Detail & Related papers (2021-02-20T03:17:58Z)
- Learning Latent Space Energy-Based Prior Model for Molecule Generation [59.875533935578375]
We learn a latent space energy-based prior model with a SMILES representation for molecule modeling.
Our method is able to generate molecules with validity and uniqueness competitive with state-of-the-art models.
arXiv Detail & Related papers (2020-10-19T09:34:20Z)
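Finally, the masking idea referenced in the first related paper above can be pictured with a toy sketch. This is a hypothetical illustration, not the cited paper's method: it replaces random contiguous character spans of a SMILES string with a mask token, whereas the paper masks subsequences corresponding to specific molecular atoms such as functional groups.

```python
# Toy SMILES span masking (hypothetical; the cited paper targets subsequences
# tied to specific molecular atoms rather than arbitrary character spans).
import random

def mask_smiles(smiles, n_spans=2, max_span=4, mask_token="<mask>", seed=None):
    rng = random.Random(seed)
    chars = list(smiles)
    for _ in range(n_spans):
        if not chars:
            break
        start = rng.randrange(len(chars))
        length = rng.randint(1, max_span)
        # Replace the span with a single mask token the model must reconstruct.
        chars[start:start + length] = [mask_token]
    return "".join(chars)

# e.g. mask_smiles("CC(=O)Oc1ccccc1C(=O)O") might yield "CC(=O)O<mask>ccc1C<mask>O"
```

Training on pairs of masked and original strings then provides the reconstruction signal that the summary describes as compelling the model to infer molecular structure.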