PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence
Understanding
- URL: http://arxiv.org/abs/2206.02096v1
- Date: Sun, 5 Jun 2022 05:21:56 GMT
- Title: PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence
Understanding
- Authors: Minghao Xu, Zuobai Zhang, Jiarui Lu, Zhaocheng Zhu, Yangtian Zhang,
Chang Ma, Runcheng Liu, Jian Tang
- Abstract summary: PEER is a comprehensive and multi-task benchmark for Protein sEquence undERstanding.
It provides a set of diverse protein understanding tasks including protein function prediction, protein localization prediction, protein structure prediction, protein-protein interaction prediction, and protein-ligand interaction prediction.
We evaluate different types of sequence-based methods for each task including traditional feature engineering approaches, different sequence encoding methods as well as large-scale pre-trained protein language models.
- Score: 17.770721291090258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We are now witnessing significant progress of deep learning methods on a
variety of protein tasks (or datasets). However, there is a lack of a
standard benchmark to evaluate the performance of different methods, which
hinders the progress of deep learning in this field. In this paper, we propose
such a benchmark called PEER, a comprehensive and multi-task benchmark for
Protein sEquence undERstanding. PEER provides a set of diverse protein
understanding tasks including protein function prediction, protein localization
prediction, protein structure prediction, protein-protein interaction
prediction, and protein-ligand interaction prediction. We evaluate different
types of sequence-based methods for each task including traditional feature
engineering approaches, different sequence encoding methods as well as
large-scale pre-trained protein language models. In addition, we also
investigate the performance of these methods under the multi-task learning
setting. Experimental results show that large-scale pre-trained protein
language models achieve the best performance for most individual tasks, and
jointly training multiple tasks further boosts the performance. The datasets
and source code of this benchmark will be open-sourced soon.
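As a rough illustration of the setting the abstract describes, the sketch below pairs a shared pre-trained protein language model encoder with one lightweight prediction head per task, so that several tasks can be trained jointly. This is a minimal sketch rather than the authors' released code: the ESM-2 checkpoint from HuggingFace, the task names, and the label dimensions are assumptions made only for illustration.

```python
# Minimal sketch of a shared protein-LM encoder with per-task heads.
# NOT the PEER codebase; model choice, task names, and label sizes are illustrative.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # small ESM-2 checkpoint, chosen for illustration


class MultiTaskProteinModel(nn.Module):
    """Shared sequence encoder with one linear head per task (multi-task setting)."""

    def __init__(self, task_dims):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_NAME)  # shared encoder
        hidden = self.encoder.config.hidden_size
        # e.g. task_dims = {"localization": 10, "stability": 1}
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, d) for t, d in task_dims.items()})

    def forward(self, input_ids, attention_mask, task):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # mean-pool residue embeddings into one vector per protein
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
        return self.heads[task](pooled)


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = MultiTaskProteinModel({"localization": 10, "stability": 1})  # hypothetical task set

batch = tokenizer(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"], task="localization")
print(logits.shape)  # torch.Size([1, 10])
```

Single-task evaluation corresponds to training the same kind of architecture with only one head; the multi-task results in the paper come from jointly optimizing several such heads against a shared encoder.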
Related papers
- OneProt: Towards Multi-Modal Protein Foundation Models [5.440531199006399]
We introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, alignment, and binding site data.
It surpasses state-of-the-art methods in various downstream tasks, including metal ion binding classification, gene-ontology annotation, and enzyme function prediction.
This work expands multi-modal capabilities in protein models, paving the way for applications in drug discovery, biocatalytic reaction planning, and protein engineering.
arXiv Detail & Related papers (2024-11-07T16:54:54Z)
- Enhancing Protein Predictive Models via Proteins Data Augmentation: A Benchmark and New Directions [58.819567030843025]
This paper extends data augmentation techniques previously used for images and texts to proteins and then benchmarks these techniques on a variety of protein-related tasks.
We propose two novel semantic-level protein augmentation methods, namely Integrated Gradients Substitution and Back Translation Substitution.
Finally, we integrate the extended and newly proposed augmentations into an augmentation pool and propose a simple but effective framework, namely Automated Protein Augmentation (APA).
arXiv Detail & Related papers (2024-03-01T07:58:29Z)
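As a rough sketch of what sequence-level protein augmentation can look like in practice, the snippet below draws one transform from a small augmentation pool. The transforms shown (conservative residue substitution and local segment shuffling) and the residue groups are generic, assumed choices for illustration, not the paper's Integrated Gradients Substitution or Back Translation Substitution methods.

```python
# Generic illustration of drawing a transform from an augmentation pool.
# The residue groups and transforms are hypothetical examples, not the APA methods.
import random

# roughly interchangeable residue groups (illustrative only)
CONSERVATIVE_GROUPS = ["AVLIM", "FYW", "ST", "DE", "KRH", "NQ"]
GROUP_OF = {aa: group for group in CONSERVATIVE_GROUPS for aa in group}


def conservative_substitution(seq, rate=0.05):
    """Replace a small fraction of residues with residues from the same group."""
    out = []
    for aa in seq:
        if aa in GROUP_OF and random.random() < rate:
            out.append(random.choice(GROUP_OF[aa]))
        else:
            out.append(aa)
    return "".join(out)


def segment_shuffle(seq, window=5):
    """Shuffle the residues inside one randomly chosen short window."""
    if len(seq) <= window:
        return seq
    i = random.randrange(len(seq) - window)
    segment = list(seq[i:i + window])
    random.shuffle(segment)
    return seq[:i] + "".join(segment) + seq[i + window:]


AUGMENTATION_POOL = [conservative_substitution, segment_shuffle]


def augment(seq):
    """Apply one augmentation sampled from the pool (the 'pool' idea, simplified)."""
    return random.choice(AUGMENTATION_POOL)(seq)


print(augment("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```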
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
- ProtIR: Iterative Refinement between Retrievers and Predictors for Protein Function Annotation [38.019425619750265]
We introduce a novel variational pseudo-likelihood framework, ProtIR, designed to improve function predictors by incorporating inter-protein similarity modeling.
ProtIR showcases around 10% improvement over vanilla predictor-based methods.
It achieves performance on par with protein language model-based methods, yet without the need for massive pre-training.
arXiv Detail & Related papers (2024-02-10T17:31:46Z)
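To make the retriever side of this retriever/predictor pairing concrete, the sketch below embeds sequences with a pre-trained protein language model and transfers labels from the most similar labeled proteins. It shows only a generic similarity-based retriever, not ProtIR's variational pseudo-likelihood refinement; the ESM-2 checkpoint and the toy reference set are assumptions.

```python
# Generic similarity-based retriever for function annotation (illustrative only).
# The checkpoint and the toy labeled reference set are assumptions, not ProtIR's data.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def embed(seqs):
    """Mean-pooled per-protein embeddings from the language model."""
    batch = tokenizer(seqs, return_tensors="pt", padding=True)
    hidden = encoder(**batch).last_hidden_state          # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (B, H)


# toy labeled reference set (hypothetical sequences and function labels)
reference = {
    "MKTAYIAKQRQISFVK": "kinase",
    "GSHMLEDPVAGAQSLE": "hydrolase",
    "MALWMRLLPLLALLAL": "hormone",
}
ref_embeddings = torch.nn.functional.normalize(embed(list(reference)), dim=-1)


def annotate(query, k=1):
    """Transfer labels from the k most similar reference proteins."""
    q = torch.nn.functional.normalize(embed([query]), dim=-1)
    sims = (q @ ref_embeddings.T).squeeze(0)              # cosine similarities
    top = sims.topk(k).indices.tolist()
    return [list(reference.values())[i] for i in top]


print(annotate("MKTAYIAKQRQISYVK"))
```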
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers [18.498779242323582]
We propose a novel approach, Prot2Text, which predicts a protein's function in a free text style.
By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, our model effectively integrates diverse data types.
arXiv Detail & Related papers (2023-07-25T09:35:43Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Multi-Scale Representation Learning on Proteins [78.31410227443102]
This paper introduces a multi-scale graph construction of a protein -- HoloProt.
The surface captures coarser details of the protein, while the sequence (as the primary component) and the structure capture finer details.
Our graph encoder then learns a multi-scale representation by allowing each level to integrate the encoding from level(s) below with the graph at that level.
arXiv Detail & Related papers (2022-04-04T08:29:17Z)
- Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z)
- DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction [2.697420611471228]
We present DIPS-Plus, an enhanced, feature-rich dataset of 42,112 complexes for geometric deep learning of protein interfaces.
The previous version of DIPS contains only the Cartesian coordinates and types of the atoms comprising a given protein complex.
DIPS-Plus now includes a plethora of new residue-level features including protrusion indices, half-sphere amino acid compositions, and new profile hidden Markov model (HMM)-based sequence features for each amino acid.
arXiv Detail & Related papers (2021-06-06T23:56:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.