ProGen2: Exploring the Boundaries of Protein Language Models
- URL: http://arxiv.org/abs/2206.13517v1
- Date: Mon, 27 Jun 2022 17:55:02 GMT
- Title: ProGen2: Exploring the Boundaries of Protein Language Models
- Authors: Erik Nijkamp, Jeffrey Ruffolo, Eli N. Weinstein, Nikhil Naik, Ali
Madani
- Abstract summary: We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters.
ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences.
As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model.
- Score: 15.82416400246896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention-based models trained on protein sequences have demonstrated
incredible success at classification and generation tasks relevant for
artificial intelligence-driven protein design. However, we lack a sufficient
understanding of how very large-scale models and data play a role in effective
protein model development. We introduce a suite of protein language models,
named ProGen2, that are scaled up to 6.4B parameters and trained on different
sequence datasets drawn from over a billion proteins from genomic, metagenomic,
and immune repertoire databases. ProGen2 models show state-of-the-art
performance in capturing the distribution of observed evolutionary sequences,
generating novel viable sequences, and predicting protein fitness without
additional finetuning. As large model sizes and raw numbers of protein
sequences continue to become more widely accessible, our results suggest that a
growing emphasis needs to be placed on the data distribution provided to a
protein sequence model. We release the ProGen2 models and code at
https://github.com/salesforce/progen.
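The zero-shot fitness prediction mentioned above is commonly done by ranking sequence variants by their likelihood under the autoregressive model, with no additional training. Below is a minimal sketch of that scoring procedure, assuming a Hugging Face-style causal language model interface; the checkpoint path, tokenizer, and example sequences are placeholders rather than the released ProGen2 API (see the repository above for the official loading and sampling code).

```python
# Minimal sketch (assumptions, not the ProGen2 repo code): score protein variants
# zero-shot by their log-likelihood under an autoregressive protein language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/progen2-checkpoint"  # placeholder; see github.com/salesforce/progen

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def sequence_log_likelihood(seq: str) -> float:
    """Sum of per-residue log-probabilities of `seq` under the causal LM."""
    ids = tokenizer(seq, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # (1, L, vocab)
    # Position t predicts the token at position t+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()

# Rank variants of a parent protein: higher log-likelihood ~ higher predicted fitness.
variants = ["MKTAYIAKQRQISFVK", "MKTAYIAKQHQISFVK"]  # illustrative sequences only
ranked = sorted(variants, key=sequence_log_likelihood, reverse=True)
```

Generation of novel sequences corresponds to autoregressive sampling from the same model; the exact sampling settings used in the paper are not reproduced here.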
Related papers
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- Design Proteins Using Large Language Models: Enhancements and Comparative Analyses [12.140433802768733]
We adopt a suite of pre-trained LLMs, including Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B, to produce valid protein sequences.
We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures.
Our findings demonstrate that even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models.
arXiv Detail & Related papers (2024-08-12T08:17:27Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling.
We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs, implanting a lightweight structural adapter that endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models [3.5450828190071646]
An important task in bioengineering is designing proteins with specific 3D structures and chemical properties which enable targeted functions.
We introduce a generative model of both protein structure and sequence that can operate at significantly larger scales than previous molecular generative modeling approaches.
arXiv Detail & Related papers (2022-05-26T16:10:09Z)
- RITA: a Study on Scaling Up Generative Protein Sequence Models [3.6748639131154315]
RITA is a suite of autoregressive generative models for protein sequences with up to 1.2 billion parameters.
We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain.
arXiv Detail & Related papers (2022-05-11T22:06:03Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Modeling Protein Using Large-scale Pretrain Language Model [12.568452480689578]
Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets.
Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences.
Our model can accurately capture evolutionary information from pretraining on evolutionary-scale individual sequences.
arXiv Detail & Related papers (2021-08-17T04:13:11Z)
- ProGen: Language Modeling for Protein Generation [47.32931317203297]
Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and material science.
We pose protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly structural annotations.
arXiv Detail & Related papers (2020-03-08T04:27:16Z)