ProtTrans: Towards Cracking the Language of Life's Code Through
Self-Supervised Deep Learning and High Performance Computing
- URL: http://arxiv.org/abs/2007.06225v3
- Date: Tue, 4 May 2021 20:18:22 GMT
- Title: ProtTrans: Towards Cracking the Language of Life's Code Through
Self-Supervised Deep Learning and High Performance Computing
- Authors: Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi,
Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin
Steinegger, Debsindhu Bhowmik, Burkhard Rost
- Abstract summary: Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP.
Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids.
For the per-residue predictions, the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information.
- Score: 2.747785739760799
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computational biology and bioinformatics provide vast data gold-mines from
protein sequences, ideal for Language Models taken from NLP. These LMs reach
for new prediction frontiers at low inference costs. Here, we trained two
auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models
(BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393
billion amino acids. The LMs were trained on the Summit supercomputer using
5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that
the raw protein LM-embeddings from unlabeled data captured some biophysical
features of protein sequences. We validated the advantage of using the
embeddings as exclusive input for several subsequent tasks. The first was a
per-residue prediction of protein secondary structure (3-state accuracy
Q3=81%-87%); the second comprised per-protein predictions of sub-cellular
localization (ten-state accuracy: Q10=81%) and of membrane versus water-soluble
proteins (2-state accuracy Q2=91%). For the per-residue predictions, the transfer of the
most informative embeddings (ProtT5) for the first time outperformed the
state-of-the-art without using evolutionary information, thereby bypassing
expensive database searches. Taken together, the results implied that protein
LMs learned some of the grammar of the language of life. To facilitate future
work, we released our models at https://github.com/agemagician/ProtTrans.
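As a concrete illustration of using the released embeddings as the exclusive input for downstream predictions, the sketch below extracts per-residue and mean-pooled per-protein ProtT5 embeddings. It assumes the publicly hosted Rostlab/prot_t5_xl_uniref50 checkpoint and the HuggingFace transformers library; it is a minimal example under those assumptions, not the authors' exact pipeline.
```python
# Minimal sketch: obtain ProtT5 embeddings as sole input features for
# downstream predictors (per-residue: secondary structure; per-protein:
# localization / membrane). Assumes the Rostlab/prot_t5_xl_uniref50
# checkpoint on HuggingFace plus the `transformers` and `sentencepiece`
# packages; illustrative only, not the authors' exact training setup.
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").to(device).eval()

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "SEQWENCE"]  # toy inputs
# ProtT5 expects space-separated residues; rare amino acids (U, Z, O, B) map to X.
prepped = [" ".join(re.sub(r"[UZOB]", "X", s)) for s in sequences]

batch = tokenizer(prepped, add_special_tokens=True, padding="longest",
                  return_tensors="pt").to(device)

with torch.no_grad():
    out = model(input_ids=batch.input_ids, attention_mask=batch.attention_mask)

for i, seq in enumerate(sequences):
    L = len(seq)
    per_residue = out.last_hidden_state[i, :L]   # (L, 1024): input for per-residue tasks (e.g. Q3)
    per_protein = per_residue.mean(dim=0)        # (1024,): pooled input for per-protein tasks (e.g. Q10)
    print(seq[:10], tuple(per_residue.shape), tuple(per_protein.shape))
```
Lightweight supervised networks trained on such embeddings alone are what produced the reported Q3, Q10, and Q2 results; mean-pooling over residues, as above, is one simple way to obtain a fixed-length per-protein representation.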
Related papers
- Training Compute-Optimal Protein Language Models [48.79416103951816]
Most protein language models are trained with extensive compute resources until performance gains plateau.
Our investigation is grounded in a massive dataset consisting of 939 million protein sequences.
We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens.
arXiv Detail & Related papers (2024-11-04T14:58:37Z)
- ProtT3: Protein-to-Text Generation for Text-based Protein Understanding [88.43323947543996]
Language Models (LMs) excel in understanding textual descriptions of proteins.
Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts.
We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
arXiv Detail & Related papers (2024-05-21T08:06:13Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Retrieved Sequence Augmentation for Protein Representation Learning [40.13920287967866]
We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction.
Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
arXiv Detail & Related papers (2023-02-24T10:31:45Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs, implanting a lightweight structural adapter that endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts [22.870765825298268]
We build a ProtST dataset to augment protein sequences with text descriptions of their functions and other important properties.
During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction.
On downstream tasks, ProtST enables both supervised learning and zero-shot prediction.
arXiv Detail & Related papers (2023-01-28T00:58:48Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks [1.452875650827562]
Less than 1% of protein sequences are structurally and functionally annotated.
We present a modification to the RoBERTa model by inputting a mixture of binding and non-binding protein sequences.
We suggest that Transformer's attention mechanism contributes to protein binding site discovery.
arXiv Detail & Related papers (2020-12-05T17:37:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.