Atom-by-atom protein generation and beyond with language models
- URL: http://arxiv.org/abs/2308.09482v1
- Date: Wed, 16 Aug 2023 17:56:17 GMT
- Title: Atom-by-atom protein generation and beyond with language models
- Authors: Daniel Flam-Shepherd, Kevin Zhu and Alán Aspuru-Guzik
- Abstract summary: We show that chemical language models can learn atom-level representations of proteins, enabling protein generation unconstrained by the standard genetic code.
We demonstrate that language models can explore beyond protein space, generating proteins with modified sidechains that form unnatural amino acids.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protein language models learn powerful representations directly from
sequences of amino acids. However, they are constrained to generate proteins
with only the set of amino acids represented in their vocabulary. In contrast,
chemical language models learn atom-level representations of smaller molecules
that include every atom, bond, and ring. In this work, we show that chemical
language models can learn atom-level representations of proteins, enabling
protein generation unconstrained by the standard genetic code and far beyond
it. In doing so, we show that language models can generate entire proteins atom
by atom -- effectively learning the multiple hierarchical layers of molecular
information that define proteins from their primary sequence to their
secondary, and tertiary structure. We demonstrate that language models can
explore beyond protein space, generating proteins with modified sidechains
that form unnatural amino acids. Even further, we find that language models can
explore chemical space and protein space simultaneously and generate novel
examples of protein-drug conjugates. The results demonstrate the potential for
biomolecular design at the atom level using language models.
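The atom-level modeling described in the abstract starts from a chemical string representation such as SMILES, in which every atom, bond, and ring closure is an explicit token. As a minimal illustrative sketch (not the paper's actual tokenizer), a regular expression can split a SMILES string into the atom-level tokens a chemical language model would consume:

```python
import re

# Illustrative atom-level SMILES tokenizer: bracket atoms, two-letter
# elements, aromatic atoms, bonds, branches, and ring-closure digits
# each become one vocabulary token.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|[=#\\/\-+]|[()]|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Return the atom/bond/ring tokens of a SMILES string."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reconstruct the input exactly.
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

# Glycine, the simplest amino acid, represented atom by atom:
print(tokenize_smiles("NCC(=O)O"))
```

Tokenizing at this granularity, rather than one token per amino acid, is what lets a model emit modified sidechains and non-standard residues that a residue-level protein vocabulary cannot express.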
Related papers
- An All-Atom Generative Model for Designing Protein Complexes [49.09672038729524]
APM (All-Atom Protein Generative Model) is a model specifically designed for modeling multi-chain proteins.
By integrating atom-level information and leveraging data on multi-chain proteins, APM is capable of precisely modeling inter-chain interactions and designing protein complexes with binding capabilities from scratch.
arXiv Detail & Related papers (2025-04-17T16:37:41Z) - Computational Protein Science in the Era of Large Language Models (LLMs) [54.35488233989787]
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm.
Recently, protein Language Models (pLMs) have emerged as a milestone in AI due to their unprecedented language processing and generalization capability.
arXiv Detail & Related papers (2025-01-17T16:21:18Z) - MolMetaLM: a Physicochemical Knowledge-Guided Molecular Meta Language Model [19.458584012046646]
We propose a novel physicochemical knowledge-guided molecular meta language framework MolMetaLM.
We design a molecule-specialized meta language paradigm, formatted as multiple <S,P,O> knowledge triples sharing the same S (i.e., the molecule).
By introducing different molecular knowledge and noises, the meta language paradigm generates tens of thousands of pretraining tasks.
arXiv Detail & Related papers (2024-11-23T09:27:38Z) - Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM, based on an alternative protein LM architecture, BiMamba-S, built on selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z) - ESM All-Atom: Multi-scale Protein Language Model for Unified Molecular Modeling [32.656601823957345]
ESM-AA (ESM All-Atom) is a novel approach that enables atom-scale and residue-scale unified molecular modeling.
Experimental results indicate that ESM-AA surpasses previous methods in protein-molecule tasks.
arXiv Detail & Related papers (2024-03-05T13:35:41Z) - ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
arXiv Detail & Related papers (2024-02-28T01:29:55Z) - Interactive Molecular Discovery with Natural Language [69.89287960545903]
We propose conversational molecular design, a novel task adopting natural language for describing and editing target molecules.
To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information.
arXiv Detail & Related papers (2023-06-21T02:05:48Z) - Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files [0.0]
Language models are powerful tools for molecular design.
We show how language models can generate novel and valid structures in three dimensions.
Despite being trained on chemical file sequences, language models still achieve performance comparable to state-of-the-art models.
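The "chemical file sequences" that entry refers to are plain-text formats such as XYZ, which a language model can read or emit character by character. A minimal sketch of the serialization (using an illustrative water geometry, not data from the paper):

```python
# Serialize element symbols and 3D coordinates into XYZ file text:
# line 1 is the atom count, line 2 a comment, then one
# "symbol x y z" line per atom.
def to_xyz(symbols, coords, comment=""):
    """Return the XYZ-format text for a list of atoms and coordinates."""
    lines = [str(len(symbols)), comment]
    for sym, (x, y, z) in zip(symbols, coords):
        lines.append(f"{sym} {x:.4f} {y:.4f} {z:.4f}")
    return "\n".join(lines)

# Illustrative water geometry (approximate, for demonstration only):
water = to_xyz(
    ["O", "H", "H"],
    [(0.0, 0.0, 0.0), (0.7572, 0.5865, 0.0), (-0.7572, 0.5865, 0.0)],
    comment="water",
)
print(water)
```

Because the whole file is ordinary text, generating a 3D structure reduces to next-token prediction over strings like this one.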
arXiv Detail & Related papers (2023-05-09T18:35:38Z) - A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling.
We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z) - DiffBP: Generative Diffusion of 3D Molecules for Target Protein Binding [51.970607704953096]
Previous works usually generate atoms in an auto-regressive way, where element types and 3D coordinates of atoms are generated one by one.
In real-world molecular systems, the interactions among atoms in an entire molecule are global, so the energy function is pairwise-coupled across all atoms.
In this work, a generative diffusion model is established that generates full-atom molecular 3D structures conditioned on target proteins in a non-autoregressive way.
arXiv Detail & Related papers (2022-11-21T07:02:15Z) - Molecular dynamics without molecules: searching the conformational space
of proteins with generative neural networks [0.0]
All-atom and coarse-grained molecular dynamics are widely used to study the conformational states of proteins.
Without access to supercomputing resources, however, these simulation methods struggle to reach the time and length scales at which such states become detectable.
One alternative encodes the atomistic molecular dynamics trajectory as a shorthand version of the physical particles, then learns to propagate the encoded trajectory with artificial intelligence.
arXiv Detail & Related papers (2022-06-09T02:06:43Z) - Learning Latent Space Energy-Based Prior Model for Molecule Generation [59.875533935578375]
We learn a latent space energy-based prior model with SMILES representation for molecule modeling.
Our method is able to generate molecules with validity and uniqueness competitive with state-of-the-art models.
arXiv Detail & Related papers (2020-10-19T09:34:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.