ProtTrans: Towards Cracking the Language of Life's Code Through
Self-Supervised Deep Learning and High Performance Computing
- URL: http://arxiv.org/abs/2007.06225v3
- Date: Tue, 4 May 2021 20:18:22 GMT
- Title: ProtTrans: Towards Cracking the Language of Life's Code Through
Self-Supervised Deep Learning and High Performance Computing
- Authors: Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi,
Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin
Steinegger, Debsindhu Bhowmik, Burkhard Rost
- Abstract summary: Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP.
Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids.
For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information.
- Score: 2.747785739760799
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computational biology and bioinformatics provide vast data gold-mines from
protein sequences, ideal for Language Models taken from NLP. These LMs reach
for new prediction frontiers at low inference costs. Here, we trained two
auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models
(BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393
billion amino acids. The LMs were trained on the Summit supercomputer using
5616 GPUs and on a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that
the raw protein LM-embeddings from unlabeled data captured some biophysical
features of protein sequences. We validated the advantage of using the
embeddings as exclusive input for several subsequent tasks. The first was a
per-residue prediction of protein secondary structure (3-state accuracy
Q3=81%-87%); the second were per-protein predictions of protein sub-cellular
localization (ten-state accuracy: Q10=81%) and membrane vs. water-soluble
(2-state accuracy Q2=91%). For the per-residue predictions the transfer of the
most informative embeddings (ProtT5) for the first time outperformed the
state-of-the-art without using evolutionary information thereby bypassing
expensive database searches. Taken together, the results implied that protein
LMs learned some of the grammar of the language of life. To facilitate future
work, we released our models at https://github.com/agemagician/ProtTrans.
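The Q3, Q10 and Q2 figures quoted in the abstract are multi-state accuracies: the fraction of items (residues or proteins) whose predicted class matches the observed one. The sketch below illustrates that metric, together with one common way of turning per-residue embeddings into a single per-protein vector. It is a minimal illustration and not the authors' evaluation code: the function names `q_state_accuracy` and `mean_pool` are hypothetical, and the use of mean pooling is an assumption, since the abstract does not say how per-protein representations were derived.

```python
from typing import List, Sequence


def q_state_accuracy(predicted: Sequence[str], observed: Sequence[str]) -> float:
    """Fraction of positions whose predicted class equals the observed class.

    With three classes (e.g. helix H, strand E, other C) this is Q3;
    the same formula gives Q2 or Q10 for two or ten classes.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if len(predicted) == 0:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


def mean_pool(per_residue: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-size per-protein vector.

    Assumption: mean pooling is used here only as a simple illustration.
    """
    if len(per_residue) == 0:
        raise ValueError("need at least one residue embedding")
    length = len(per_residue)
    # zip(*rows) iterates over embedding dimensions (columns).
    return [sum(column) / length for column in zip(*per_residue)]


if __name__ == "__main__":
    # Toy secondary-structure strings, one class label per residue.
    print(q_state_accuracy("HHHEECCC", "HHHEEECC"))  # 7 of 8 positions agree
    # Two residues with 2-dimensional toy embeddings.
    print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

In the toy example, seven of the eight residue labels agree, so the accuracy is 0.875. Real evaluations would average over all residues (or proteins) in a held-out test set.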
Related papers
- LA4SR: illuminating the dark proteome with generative AI [39.58317527488534]
We re-engineered open-source AI language models (LMs) for microbial sequence classification.
The models achieved F1 scores up to 95 and operated 16,580x faster.
We provide custom AI explainability software tools for attributing amino acid patterns to AI generative processes.
Date: 2024-11-11T08:51:18Z
- Training Compute-Optimal Protein Language Models [48.79416103951816]
Most protein language models are trained with extensive compute resources until performance gains plateau.
Our investigation is grounded in a massive dataset consisting of 939 million protein sequences.
We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens.
Date: 2024-11-04T14:58:37Z
- Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built off selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
Date: 2024-10-29T16:43:28Z
- ProtT3: Protein-to-Text Generation for Text-based Protein Understanding [88.43323947543996]
Language Models (LMs) excel in understanding textual descriptions of proteins.
Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts.
We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
Date: 2024-05-21T08:06:13Z
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
Date: 2024-03-30T05:32:42Z
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
Date: 2024-01-11T15:03:17Z
- Retrieved Sequence Augmentation for Protein Representation Learning [40.13920287967866]
We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction.
Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
Date: 2023-02-24T10:31:45Z
- ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts [22.870765825298268]
We build a ProtST dataset to augment protein sequences with text descriptions of their functions and other important properties.
During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction.
On downstream tasks, ProtST enables both supervised learning and zero-shot prediction.
Date: 2023-01-28T00:58:48Z
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
Date: 2023-01-05T15:55:18Z