ProGen: Language Modeling for Protein Generation
- URL: http://arxiv.org/abs/2004.03497v1
- Date: Sun, 8 Mar 2020 04:27:16 GMT
- Title: ProGen: Language Modeling for Protein Generation
- Authors: Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata
Anand, Raphael R. Eguchi, Po-Ssu Huang, Richard Socher
- Abstract summary: Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and materials science.
We pose protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly, structural annotations.
- Score: 47.32931317203297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative modeling for protein engineering is key to solving fundamental
problems in synthetic biology, medicine, and materials science. We pose protein
engineering as an unsupervised sequence generation problem in order to leverage
the exponentially growing set of proteins that lack costly, structural
annotations. We train a 1.2B-parameter language model, ProGen, on ~280M protein
sequences conditioned on taxonomic and keyword tags such as molecular function
and cellular component. This provides ProGen with an unprecedented range of
evolutionary sequence diversity and allows it to generate with fine-grained
control as demonstrated by metrics based on primary sequence similarity,
secondary structure accuracy, and conformational energy.
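The conditioning mechanism described above — prepending taxonomic and keyword tags to the residue tokens so an autoregressive model learns to generate under their control — can be sketched roughly as follows. The tag names and tokenization here are illustrative assumptions, not ProGen's actual vocabulary.

```python
# Minimal sketch of tag-conditioned sequence encoding, in the spirit of
# ProGen. The control tags below are hypothetical examples.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CONTROL_TAGS = ["<tax:Escherichia_coli>", "<kw:kinase>"]  # illustrative tags

# Joint vocabulary: control tags first, then the 20 standard amino acids.
vocab = {tok: i for i, tok in enumerate(CONTROL_TAGS + list(AMINO_ACIDS))}

def encode(tags, sequence):
    """Prepend conditioning tags to the residue tokens; an autoregressive
    model trained on such inputs learns to generate residues steered by
    the tag prefix."""
    return [vocab[t] for t in tags] + [vocab[aa] for aa in sequence]

tokens = encode(CONTROL_TAGS, "MKTAYIAK")
print(tokens)
```

At generation time, the same tag prefix is fed to the model and residues are sampled one at a time, which is what enables the fine-grained control the abstract describes.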
Related papers
- Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces the diffusion protein language model (DPLM), which demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
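The discrete diffusion framework mentioned above rests on a forward noising process over sequence tokens. A rough sketch of one common variant (absorbing-state masking — an assumption here, not necessarily DPLM's exact formulation):

```python
import random

# Sketch of the forward (noising) step of an absorbing-state discrete
# diffusion over protein sequences. The masking schedule is illustrative.
MASK = "<mask>"

def corrupt(sequence, t, rng=random):
    """Independently replace each residue with MASK with probability t,
    where t in [0, 1] plays the role of the diffusion timestep."""
    return [MASK if rng.random() < t else aa for aa in sequence]

seq = list("MKTAYIAKQR")
half = corrupt(seq, 0.5, random.Random(0))  # partially masked sequence
print(half)
```

The model is then trained to invert this process, predicting the original residues from partially masked sequences; running that reverse process from a fully masked input yields unconditional generation.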
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Multi-level Protein Representation Learning for Blind Mutational Effect Prediction [5.207307163958806]
This paper introduces a novel pre-training framework that cascades sequential and geometric analyzers for protein structures.
It guides mutational directions toward desired traits by simulating natural selection on wild-type proteins.
We assess the proposed approach using a public database and two new databases for a variety of variant effect prediction tasks.
arXiv Detail & Related papers (2023-06-08T03:00:50Z)
- A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling.
We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z)
- Protein Sequence and Structure Co-Design with Equivariant Translation [19.816174223173494]
Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models.
We propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state.
Our model consists of a trigonometry-aware encoder that reasons about geometric constraints and interactions from context features.
All protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process.
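The one-shot update idea above can be caricatured as a simple simultaneous-refinement loop. The per-position proposal below is a toy stand-in for the paper's trigonometry-aware encoder, and TARGET is a hypothetical desired state; only the update pattern (all positions at once, iterated) reflects the method.

```python
# Loose sketch of iterative, non-autoregressive sequence translation:
# every position is revised simultaneously at each step instead of being
# decoded left to right.
AA = "ACDEFGHIKLMNPQRSTVWY"
TARGET = list("MKTAYIAK")  # hypothetical desired state

def propose(i, state):
    """Toy proposal: nudge residue i one alphabet step toward TARGET."""
    cur, tgt = AA.index(state[i]), AA.index(TARGET[i])
    return AA[cur + (1 if tgt > cur else -1 if tgt < cur else 0)]

def translate_step(state):
    # All positions are updated in one shot, which is what makes each
    # iteration cheap compared to token-by-token decoding.
    return [propose(i, state) for i in range(len(state))]

state = list("AAAAAAAA")
for _ in range(len(AA)):  # enough steps for every position to converge
    state = translate_step(state)
print("".join(state))
```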
arXiv Detail & Related papers (2022-10-17T06:00:12Z)
- ProGen2: Exploring the Boundaries of Protein Language Models [15.82416400246896]
We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters.
ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences.
As larger models and ever-greater numbers of raw protein sequences become more widely accessible, our results suggest that growing emphasis should be placed on the data distribution provided to a protein sequence model.
arXiv Detail & Related papers (2022-06-27T17:55:02Z)
- Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models [3.5450828190071646]
An important task in bioengineering is designing proteins with specific 3D structures and chemical properties which enable targeted functions.
We introduce a generative model of both protein structure and sequence that can operate at significantly larger scales than previous molecular generative modeling approaches.
arXiv Detail & Related papers (2022-05-26T16:10:09Z)
- Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
- BERTology Meets Biology: Interpreting Attention in Protein Language Models [124.8966298974842]
We demonstrate methods for analyzing protein Transformer models through the lens of attention.
We show that attention captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure.
We also present a three-dimensional visualization of the interaction between attention and protein structure.
arXiv Detail & Related papers (2020-06-26T21:50:17Z)
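The claim that attention links residues far apart in sequence but close in structure suggests a simple quantitative check: how often do a head's highest-attention residue pairs coincide with 3D contacts? A minimal sketch with synthetic data (the attention matrix and contact map below are made up for illustration):

```python
# Sketch of an attention-vs-contact analysis in the spirit of
# "BERTology Meets Biology": measure the fraction of the top-attention
# residue pairs that are spatial contacts.
def attention_contact_precision(attn, contacts, top_k):
    """Fraction of the top_k highest-attention pairs (i < j) that are
    contacts in the 3D structure."""
    n = len(attn)
    pairs = [(attn[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(reverse=True)  # highest attention first
    hits = sum(1 for _, i, j in pairs[:top_k] if contacts[i][j])
    return hits / top_k

# Toy 4-residue example: attention concentrates on the (0, 3) contact,
# a pair distant in sequence but close in structure.
attn = [[0, .1, .1, .9],
        [.1, 0, .1, .1],
        [.1, .1, 0, .1],
        [.9, .1, .1, 0]]
contacts = [[0, 0, 0, 1],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [1, 0, 0, 0]]
print(attention_contact_precision(attn, contacts, top_k=1))  # → 1.0
```

With real data, attn would come from a protein Transformer's attention weights and contacts from solved structures, aggregated per head.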