Modeling Protein Using Large-scale Pretrain Language Model
- URL: http://arxiv.org/abs/2108.07435v1
- Date: Tue, 17 Aug 2021 04:13:11 GMT
- Title: Modeling Protein Using Large-scale Pretrain Language Model
- Authors: Yijia Xiao, Jiezhong Qiu, Ziang Li, Chang-Yu Hsieh, Jie Tang
- Abstract summary: Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets.
Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences.
Our model accurately captures evolutionary information through pretraining on evolutionary-scale individual sequences.
- Score: 12.568452480689578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Proteins are involved in almost every life process. Analyzing the
biological structure and properties of protein sequences is therefore critical to
the exploration of life, as well as to disease detection and drug discovery.
Traditional protein analysis methods tend to be labor-intensive and
time-consuming. The emergence of deep learning models has made it possible to
model patterns in large quantities of data. Interdisciplinary researchers
have begun to leverage deep learning methods to model large biological
datasets, e.g., using long short-term memory and convolutional neural networks
for protein sequence classification. After millions of years of evolution,
evolutionary information is encoded in protein sequences. Inspired by the
similarity between natural language and protein sequences, we use large-scale
language models to model evolutionary-scale protein sequences, encoding protein
biological information in the learned representations. Significant improvements
are observed on both token-level and sequence-level tasks, demonstrating that
our large-scale model can accurately capture evolutionary information from
pretraining on evolutionary-scale individual sequences. Our code and model are
available at https://github.com/THUDM/ProteinLM.
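The approach described in the abstract (BERT-style masked-token pretraining on raw protein sequences, whose representations then feed token-level and sequence-level downstream tasks) can be illustrated with a short PyTorch sketch. The snippet below is a minimal, hypothetical approximation, not the ProteinLM implementation: the tiny model size, 20-letter amino-acid vocabulary, simplified 15% masking (no 80/10/10 split), and the made-up sequence fragment are all assumptions made for illustration; see the repository above for the actual model and training setup.

```python
# Minimal masked-language-model pretraining sketch for protein sequences.
# Illustrative only: NOT the ProteinLM implementation; sizes, vocabulary,
# and masking scheme are simplified assumptions.
import random
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # 20 standard residues (extended codes omitted)
PAD, MASK = "<pad>", "<mask>"
VOCAB = [PAD, MASK] + list(AMINO_ACIDS)
STOI = {tok: i for i, tok in enumerate(VOCAB)}

def encode(seq: str) -> torch.Tensor:
    """Map a protein sequence to integer token ids."""
    return torch.tensor([STOI[aa] for aa in seq], dtype=torch.long)

def mask_tokens(ids: torch.Tensor, ratio: float = 0.15):
    """Simplified BERT-style masking: hide ~15% of residues with <mask>."""
    ids = ids.clone()
    labels = torch.full_like(ids, -100)        # -100 is ignored by CrossEntropyLoss
    for i in range(len(ids)):
        if random.random() < ratio:
            labels[i] = ids[i]
            ids[i] = STOI[MASK]
    if (labels == -100).all():                 # ensure at least one masked position
        labels[0], ids[0] = ids[0], STOI[MASK]
    return ids, labels

class ProteinMLM(nn.Module):
    """Tiny Transformer encoder with a masked-token prediction head."""
    def __init__(self, vocab_size: int, d_model: int = 128, n_layers: int = 2, max_len: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=STOI[PAD])
        self.pos = nn.Embedding(max_len, d_model)     # learned positional embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(ids.size(1), device=ids.device)
        x = self.embed(ids) + self.pos(positions)
        return self.head(self.encoder(x))             # (batch, length, vocab)

# One toy training step on a single made-up sequence fragment.
model = ProteinMLM(len(VOCAB))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"             # placeholder fragment
ids, labels = mask_tokens(encode(seq))
logits = model(ids.unsqueeze(0))
loss = loss_fn(logits.view(-1, len(VOCAB)), labels.view(-1))
loss.backward()
optimizer.step()
print(f"masked-LM loss: {loss.item():.3f}")
```

At evolutionary scale this objective is trained on hundreds of millions of sequences with a much larger model distributed across many accelerators; the sketch only shows the shape of the masked-LM objective that the downstream token-level and sequence-level tasks build on.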
Related papers
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling.
We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering [6.216757583450049]
We develop SESNet, a supervised deep-learning model to predict the fitness for protein mutants.
We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship.
Our model achieves strikingly high accuracy in predicting the fitness of protein mutants, especially for higher-order variants.
arXiv Detail & Related papers (2022-12-29T01:49:52Z)
- Integration of Pre-trained Protein Language Models into Geometric Deep Learning Networks [68.90692290665648]
We integrate knowledge learned by protein language models into several state-of-the-art geometric networks.
Our findings show an overall improvement of 20% over baselines.
Strong evidence indicates that the incorporation of protein language models' knowledge enhances geometric networks' capacity by a significant margin.
arXiv Detail & Related papers (2022-12-07T04:04:04Z)
- ProGen2: Exploring the Boundaries of Protein Language Models [15.82416400246896]
We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters.
ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences.
As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model.
arXiv Detail & Related papers (2022-06-27T17:55:02Z)
- Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models [3.5450828190071646]
An important task in bioengineering is designing proteins with specific 3D structures and chemical properties which enable targeted functions.
We introduce a generative model of both protein structure and sequence that can operate at significantly larger scales than previous molecular generative modeling approaches.
arXiv Detail & Related papers (2022-05-26T16:10:09Z)
- Learning multi-scale functional representations of proteins from single-cell microscopy data [77.34726150561087]
We show that simple convolutional networks trained on localization classification can learn protein representations that encapsulate diverse functional information.
We also propose a robust evaluation strategy to assess quality of protein representations across different scales of biological function.
arXiv Detail & Related papers (2022-05-24T00:00:07Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z)
- Align-gram : Rethinking the Skip-gram Model for Protein Sequence Analysis [0.8733639720576208]
We propose a novel embedding scheme, Align-gram, which is capable of mapping similar $k$-mers close to each other in a vector space.
Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram for different types of deep learning applications in protein sequence analysis (see the sketch after this list).
arXiv Detail & Related papers (2020-12-06T17:04:17Z)
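The Align-gram entry above describes embedding overlapping $k$-mers so that similar $k$-mers land close together in a vector space. The sketch below shows only the generic ProtVec-style $k$-mer skip-gram baseline that such schemes build on, not Align-gram's own alignment-aware objective; the sequences, the choice of $k$=3, and the hyperparameters are placeholders, and gensim's Word2Vec is used as a stand-in rather than the authors' implementation.

```python
# Illustrative k-mer skip-gram embedding for protein sequences.
# NOT the Align-gram implementation; a generic ProtVec-style baseline shown
# only to make the "similar k-mers close in vector space" idea concrete.
# Requires gensim >= 4.0.
from gensim.models import Word2Vec

def kmers(seq: str, k: int = 3) -> list[str]:
    """Break a protein sequence into overlapping k-mers ('words')."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Made-up fragments standing in for a real corpus (e.g. Swiss-Prot sequences).
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MKVLAAGIAKQRQISFVKSHFARQLEERLGL",
    "GSHMKTAYIAKQRQISFVKSHF",
]
corpus = [kmers(s) for s in sequences]

# Skip-gram (sg=1) embeddings over 3-mers.
model = Word2Vec(corpus, vector_size=64, window=5, min_count=1, sg=1, epochs=50)

# k-mers that occur in similar contexts end up with similar vectors.
print(model.wv.most_similar("AKQ", topn=3))
```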