ProGen2: Exploring the Boundaries of Protein Language Models
- URL: http://arxiv.org/abs/2206.13517v1
- Date: Mon, 27 Jun 2022 17:55:02 GMT
- Title: ProGen2: Exploring the Boundaries of Protein Language Models
- Authors: Erik Nijkamp, Jeffrey Ruffolo, Eli N. Weinstein, Nikhil Naik, Ali
Madani
- Abstract summary: We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters.
ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences.
As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model.
- Score: 15.82416400246896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention-based models trained on protein sequences have demonstrated
incredible success at classification and generation tasks relevant for
artificial intelligence-driven protein design. However, we lack a sufficient
understanding of how very large-scale models and data play a role in effective
protein model development. We introduce a suite of protein language models,
named ProGen2, that are scaled up to 6.4B parameters and trained on different
sequence datasets drawn from over a billion proteins from genomic, metagenomic,
and immune repertoire databases. ProGen2 models show state-of-the-art
performance in capturing the distribution of observed evolutionary sequences,
generating novel viable sequences, and predicting protein fitness without
additional finetuning. As large model sizes and raw numbers of protein
sequences continue to become more widely accessible, our results suggest that a
growing emphasis needs to be placed on the data distribution provided to a
protein sequence model. We release the ProGen2 models and code at
https://github.com/salesforce/progen.
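The abstract's claim that ProGen2 "predicts protein fitness without additional finetuning" refers to zero-shot scoring: for an autoregressive model, a variant's fitness proxy is its sequence log-likelihood under the model. A minimal sketch of that scoring loop, using a hypothetical uniform `toy_model` as a stand-in rather than the real ProGen2 API:

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_model(prefix: str) -> dict:
    """Stand-in for an autoregressive protein LM: returns a next-token
    distribution given the sequence so far. Here it is uniform; a real
    model would condition on the prefix."""
    p = 1.0 / len(AMINO_ACIDS)
    return {aa: p for aa in AMINO_ACIDS}

def log_likelihood(seq: str, model=toy_model) -> float:
    """Sum of log p(x_t | x_<t) over the sequence: the zero-shot
    fitness score used to rank variants."""
    total = 0.0
    for t, aa in enumerate(seq):
        probs = model(seq[:t])
        total += math.log(probs[aa])
    return total

def rank_variants(variants):
    """Rank candidate sequences by model log-likelihood (higher = fitter)."""
    return sorted(variants, key=log_likelihood, reverse=True)

print(rank_variants(["MKT", "MKTA", "MK"]))
```

In practice a per-residue average log-likelihood is often used instead, since the raw sum systematically penalizes longer sequences, as the uniform toy model makes obvious.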
Related papers
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling.
We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
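A latent diffusion model reduces complexity by running the diffusion process in a compressed latent space rather than on raw atomic coordinates. A toy sketch of that idea (the `encode` function and all dimensions are illustrative stand-ins, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(coords: np.ndarray) -> np.ndarray:
    """Hypothetical encoder: compress per-residue xyz coordinates into a
    1-D latent per residue. A real model learns this mapping."""
    return coords.mean(axis=-1, keepdims=True)

def forward_noise(z: np.ndarray, t: int, T: int = 100) -> np.ndarray:
    """DDPM-style forward step: z_t = sqrt(a_t)*z + sqrt(1-a_t)*eps,
    with a simple linear schedule a_t = 1 - t/T."""
    a_t = 1.0 - t / T
    eps = rng.normal(size=z.shape)
    return np.sqrt(a_t) * z + np.sqrt(1.0 - a_t) * eps

coords = rng.normal(size=(8, 3))   # fake backbone: 8 residues, xyz
z = encode(coords)                 # latent is 3x smaller than the input
z_t = forward_noise(z, t=50)
print(coords.size, z.size)  # 24 8: diffusion runs in the smaller space
```

A trained denoiser would then reverse the noising in latent space, and a decoder would map the clean latent back to backbone coordinates.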
arXiv Detail & Related papers (2023-05-06T19:10:19Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We perform structural surgery on pLMs: a lightweight structural adapter is implanted into a pLM, endowing it with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
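In adapter-style designs generally, such a "lightweight structural adapter" is a small residual bottleneck inserted into a frozen backbone; zero-initializing its output projection makes it a no-op at the start of training. A hypothetical NumPy sketch (dimensions, names, and the feature-fusion choice are illustrative, not LM-Design's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

class BottleneckAdapter:
    """Residual bottleneck adapter: project down, apply a nonlinearity,
    project up, then add back to the frozen backbone's hidden states."""
    def __init__(self, d_model: int, d_bottleneck: int):
        self.w_down = rng.normal(0, 0.02, (d_model, d_bottleneck))
        self.w_up = np.zeros((d_bottleneck, d_model))  # zero-init: starts as identity

    def __call__(self, h: np.ndarray, struct_feats: np.ndarray) -> np.ndarray:
        # Fuse structural features into the hidden states before the bottleneck.
        z = np.maximum(0.0, (h + struct_feats) @ self.w_down)  # ReLU
        return h + z @ self.w_up  # residual connection

d_model, L = 8, 5
adapter = BottleneckAdapter(d_model, d_bottleneck=2)
h = rng.normal(size=(L, d_model))   # frozen pLM hidden states
s = rng.normal(size=(L, d_model))   # structural features, same width (an assumption)
out = adapter(h, s)
print(out.shape)  # (5, 8); with the zero-initialized w_up, out equals h
```

Only the adapter weights would be trained, which is what keeps such approaches lightweight relative to full finetuning.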
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model attains better accuracy and improves data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
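R2DL's reprogramming keeps the English model frozen and learns to express each protein token's embedding as a combination of the frozen English token embeddings. A toy version using plain least squares in place of the paper's sparse dictionary learning (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen dictionary: embeddings of V_en English tokens, dimension d.
V_en, V_prot, d = 50, 20, 16
english_emb = rng.normal(size=(V_en, d))   # stays frozen

# Learnable map: each protein token = linear combination of English
# embeddings. R2DL learns this with sparsity constraints; unconstrained
# least squares here is a simplification for illustration.
target = rng.normal(size=(V_prot, d))      # desired protein embeddings
coeffs, *_ = np.linalg.lstsq(english_emb.T, target.T, rcond=None)
protein_emb = (english_emb.T @ coeffs).T   # (V_prot, d)

print(np.allclose(protein_emb, target, atol=1e-6))  # True: exact fit, V_en > d
```

A protein sequence is then embedded by looking up `protein_emb` for its tokens and fed through the otherwise unmodified English language model.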
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering [6.216757583450049]
We develop SESNet, a supervised deep-learning model to predict the fitness for protein mutants.
We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship.
Our model achieves strikingly high accuracy in predicting the fitness of protein mutants, especially for higher-order variants.
arXiv Detail & Related papers (2022-12-29T01:49:52Z)
- Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models [3.5450828190071646]
An important task in bioengineering is designing proteins with specific 3D structures and chemical properties which enable targeted functions.
We introduce a generative model of both protein structure and sequence that can operate at significantly larger scales than previous molecular generative modeling approaches.
arXiv Detail & Related papers (2022-05-26T16:10:09Z)
- RITA: a Study on Scaling Up Generative Protein Sequence Models [3.6748639131154315]
RITA is a suite of autoregressive generative models for protein sequences with up to 1.2 billion parameters.
We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain.
arXiv Detail & Related papers (2022-05-11T22:06:03Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Modeling Protein Using Large-scale Pretrain Language Model [12.568452480689578]
Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets.
Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences.
Our model can accurately capture evolution information from pretraining on evolutionary-scale individual sequences.
arXiv Detail & Related papers (2021-08-17T04:13:11Z)
- ProGen: Language Modeling for Protein Generation [47.32931317203297]
Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and material science.
We pose protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly, structural annotations.
arXiv Detail & Related papers (2020-03-08T04:27:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.