Language models can generate molecules, materials, and protein binding
sites directly in three dimensions as XYZ, CIF, and PDB files
- URL: http://arxiv.org/abs/2305.05708v1
- Date: Tue, 9 May 2023 18:35:38 GMT
- Title: Language models can generate molecules, materials, and protein binding
sites directly in three dimensions as XYZ, CIF, and PDB files
- Authors: Daniel Flam-Shepherd and Alán Aspuru-Guzik
- Abstract summary: Language models are powerful tools for molecular design.
We show how language models can generate novel and valid structures in three dimensions.
Despite being trained on chemical file sequences, language models still achieve performance comparable to state-of-the-art models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models are powerful tools for molecular design. Currently, the
dominant paradigm is to parse molecular graphs into linear string
representations that can easily be trained on. This approach has been very
successful; however, it is limited to chemical structures that can be
completely represented by a graph, such as organic molecules, while materials
and biomolecular structures like protein binding sites require a more complete
representation that includes the relative positioning of their atoms in space.
In this work, we show how language models, without any architecture
modifications and trained using next-token prediction, can generate novel and
valid structures in three dimensions from various substantially different
distributions of chemical structures. In particular, we demonstrate that
language models trained on sequences derived directly from chemical
file formats like XYZ files, Crystallographic Information files (CIFs), or
Protein Data Bank files (PDBs) can directly generate molecules, crystals, and
protein binding sites in three dimensions. Furthermore, despite being trained
on chemical file sequences, language models still achieve performance
comparable to state-of-the-art models that use graph and graph-derived string
representations, as well as other domain-specific 3D generative models. In
doing so, we demonstrate that it is not necessary to use simplified molecular
representations to train chemical language models, and that they are powerful
generative models capable of directly exploring chemical space in three
dimensions for very different structures.
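As a rough illustration of what "sequences derived directly from chemical file formats" could look like, the sketch below flattens an XYZ file into a token list suitable for next-token prediction and converts such a list back into XYZ text. The abstract does not specify the tokenization, so the function names (`xyz_to_tokens`, `tokens_to_xyz`), the special tokens, and the choice to round coordinates to two decimals are all assumptions made for this example, not the authors' method.

```python
# Hypothetical sketch: XYZ file <-> token sequence for a language model.
# The token scheme here is an assumption, not taken from the paper.

BOS, EOS, NEWLINE = "<bos>", "<eos>", "\n"


def xyz_to_tokens(xyz_text, decimals=2):
    """Flatten XYZ text into tokens: element symbol, x, y, z, newline per atom."""
    lines = xyz_text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])  # line 1 of an XYZ file is the atom count
    tokens = [BOS]
    # Line 2 is a free-form comment; atom records start on line 3.
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        tokens.append(symbol)
        for value in (x, y, z):
            # Rounding keeps the coordinate vocabulary finite (assumed choice).
            tokens.append(f"{float(value):.{decimals}f}")
        tokens.append(NEWLINE)
    tokens.append(EOS)
    return tokens


def tokens_to_xyz(tokens, comment="generated"):
    """Rebuild XYZ text from tokens, keeping only well-formed 4-field rows."""
    rows, current = [], []
    for tok in tokens:
        if tok in (BOS, EOS):
            continue
        if tok == NEWLINE:
            rows.append(current)
            current = []
        else:
            current.append(tok)
    valid = [row for row in rows if len(row) == 4]
    header = [str(len(valid)), comment]
    return "\n".join(header + [" ".join(row) for row in valid])


def next_token_pairs(tokens):
    """Pair each token with its successor, the target of next-token prediction."""
    return list(zip(tokens[:-1], tokens[1:]))
```

A model trained on such pairs would be sampled token by token until it emits the end token, and the result passed through `tokens_to_xyz` to obtain a structure file. CIF and PDB files would need their own, more involved flattening.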
Related papers
- GraphXForm: Graph transformer for computer-aided molecular design with application to extraction [73.1842164721868]
We present GraphXForm, a decoder-only graph transformer architecture, which is pretrained on existing compounds and then fine-tuned.
We evaluate it on two solvent design tasks for liquid-liquid extraction, showing that it outperforms four state-of-the-art molecular design techniques.
arXiv Detail & Related papers (2024-11-03T19:45:15Z)
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model, which randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- 3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization [41.07090635630771]
3D-MolT5 is a unified framework designed to model both 1D molecular sequence and 3D molecular structure.
The key innovation lies in our methodology for mapping fine-grained 3D substructure representations to a specialized 3D token vocabulary.
Our proposed 3D-MolT5 outperforms existing methods in molecular property prediction, molecule captioning, and text-based molecule generation tasks.
arXiv Detail & Related papers (2024-06-09T14:20:55Z)
- BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning [11.862370962277938]
We present a novel generative model, BindGPT, which uses a conceptually simple but powerful approach to create 3D molecules within the protein's binding site.
We show how such a conceptually simple approach, combined with pretraining and scaling, can perform on par with or better than the current best specialized diffusion models.
arXiv Detail & Related papers (2024-06-06T02:10:50Z)
- GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text [25.979382232281786]
We introduce GIT-Mol, a multi-modal large language model that integrates graph, image, and text information.
We achieve a 5%-10% accuracy increase in property prediction and a 20.2% boost in molecule generation validity.
arXiv Detail & Related papers (2023-08-14T03:12:29Z)
- Generation of 3D Molecules in Pockets via Language Model [0.0]
Generative models for molecules based on sequential line notation (e.g. SMILES) or graph representation have attracted increasing interest in the field of structure-based drug design.
We introduce Lingo3DMol, a pocket-based 3D molecule generation method that combines language models and geometric deep learning technology.
arXiv Detail & Related papers (2023-05-17T11:31:06Z)
- MUDiff: Unified Diffusion for Complete Molecule Generation [104.7021929437504]
We present a new model for generating a comprehensive representation of molecules, including atom features, 2D discrete molecule structures, and 3D continuous molecule coordinates.
We propose a novel graph transformer architecture to denoise the diffusion process.
Our model is a promising approach for designing stable and diverse molecules and can be applied to a wide range of tasks in molecular modeling.
arXiv Detail & Related papers (2023-04-28T04:25:57Z)
- An Equivariant Generative Framework for Molecular Graph-Structure Co-Design [54.92529253182004]
We present MolCode, a machine learning-based generative framework for Molecular graph-structure Co-design.
In MolCode, 3D geometric information empowers the molecular 2D graph generation, which in turn helps guide the prediction of molecular 3D structure.
Our investigation reveals that the 2D topology and 3D geometry contain intrinsically complementary information in molecule design.
arXiv Detail & Related papers (2023-04-12T13:34:22Z)
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
- Scalable Fragment-Based 3D Molecular Design with Reinforcement Learning [68.8204255655161]
We introduce a novel framework for scalable 3D design that uses a hierarchical agent to build molecules.
In a variety of experiments, we show that our agent, guided only by energy considerations, can efficiently learn to produce molecules with over 100 atoms.
arXiv Detail & Related papers (2022-02-01T18:54:24Z)
- Keeping it Simple: Language Models can learn Complex Molecular Distributions [0.0]
We introduce several challenging generative modeling tasks by compiling especially complex distributions of molecules.
The results demonstrate that language models are powerful generative models, capable of adeptly learning complex molecular distributions.
arXiv Detail & Related papers (2021-12-06T13:40:58Z)
- Learning Latent Space Energy-Based Prior Model for Molecule Generation [59.875533935578375]
We learn a latent space energy-based prior model with SMILES representation for molecule modeling.
Our method is able to generate molecules with validity and uniqueness competitive with state-of-the-art models.
arXiv Detail & Related papers (2020-10-19T09:34:20Z)
- Learning a Continuous Representation of 3D Molecular Structures with Deep Generative Models [0.0]
Generative models are an entirely different approach that learns to represent and optimize molecules in a continuous latent space.
We describe deep generative models of three dimensional molecular structures using atomic density grids.
We are also able to sample diverse sets of molecules based on a given input compound to increase the probability of creating valid, drug-like molecules.
arXiv Detail & Related papers (2020-10-17T01:15:47Z)