PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding
- URL: http://arxiv.org/abs/2206.02096v1
- Date: Sun, 5 Jun 2022 05:21:56 GMT
- Title: PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding
- Authors: Minghao Xu, Zuobai Zhang, Jiarui Lu, Zhaocheng Zhu, Yangtian Zhang,
Chang Ma, Runcheng Liu, Jian Tang
- Abstract summary: PEER is a comprehensive and multi-task benchmark for Protein sEquence undERstanding.
It provides a set of diverse protein understanding tasks, including protein function prediction, protein localization prediction, protein structure prediction, protein-protein interaction prediction, and protein-ligand interaction prediction.
We evaluate different types of sequence-based methods for each task, including traditional feature engineering approaches, different sequence encoding methods, and large-scale pre-trained protein language models.
- Score: 17.770721291090258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We are now witnessing significant progress by deep learning methods
on a variety of protein tasks and datasets. However, the field lacks a standard
benchmark for evaluating the performance of different methods, which hinders
the progress of deep learning in this area. In this paper, we propose such a
benchmark called PEER, a comprehensive and multi-task benchmark for Protein
sEquence undERstanding. PEER provides a set of diverse protein understanding
tasks including protein function prediction, protein localization prediction,
protein structure prediction, protein-protein interaction prediction, and
protein-ligand interaction prediction. For each task, we evaluate different
types of sequence-based methods, including traditional feature engineering
approaches, different sequence encoding methods, and large-scale pre-trained
protein language models. We also investigate the performance of these methods
under the multi-task learning setting. Experimental results show that
large-scale pre-trained protein language models achieve the best performance
on most individual tasks, and that jointly training multiple tasks further
boosts performance. The datasets and source code of this benchmark will be
open-sourced soon.
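To make the multi-task setting above concrete, here is a minimal PyTorch sketch of one way to train it: a shared sequence encoder feeds one lightweight classification head per task, and the per-task losses are summed for joint optimization. This is an illustration, not the PEER implementation; the encoder architecture, task names, label sizes, and dummy data are all assumptions.

```python
# Minimal multi-task sketch (not the PEER implementation): a shared
# encoder with one head per task, trained by summing per-task losses.
import torch
import torch.nn as nn

VOCAB, EMBED, HIDDEN = 26, 64, 128  # amino-acid alphabet + padding (assumed)

class SharedEncoder(nn.Module):
    """Embeds a tokenized protein sequence and mean-pools it to one vector."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED, padding_idx=0)
        self.rnn = nn.LSTM(EMBED, HIDDEN, batch_first=True)

    def forward(self, tokens):                    # tokens: (batch, length)
        hidden, _ = self.rnn(self.embed(tokens))  # (batch, length, HIDDEN)
        return hidden.mean(dim=1)                 # (batch, HIDDEN)

encoder = SharedEncoder()
heads = nn.ModuleDict({                    # one head per task (sizes assumed)
    "function": nn.Linear(HIDDEN, 2),      # e.g. binary function label
    "localization": nn.Linear(HIDDEN, 10), # e.g. 10 cellular compartments
})
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(heads.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy batch: 8 random length-50 sequences with one label per task.
tokens = torch.randint(1, VOCAB, (8, 50))
labels = {"function": torch.randint(0, 2, (8,)),
          "localization": torch.randint(0, 10, (8,))}

features = encoder(tokens)                 # encode once, share across tasks
loss = sum(criterion(head(features), labels[name])
           for name, head in heads.items())
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"joint loss: {loss.item():.4f}")
```

A single-task baseline is the same sketch with a single head; the abstract's finding is that the jointly trained variant tends to perform better.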
Related papers
- Computational Protein Science in the Era of Large Language Models (LLMs) [54.35488233989787]
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm.
Recently, protein Language Models (pLMs) have emerged as a milestone in AI due to their unprecedented language processing and generalization capability.
arXiv Detail & Related papers (2025-01-17T16:21:18Z)
- ProtCLIP: Function-Informed Protein Multi-Modal Learning [18.61302416993122]
We develop ProtCLIP, a multi-modality foundation model that learns function-aware protein embeddings.
Our ProtCLIP consistently achieves SOTA performance, with remarkable improvements of 75% on average in five cross-modal transformation benchmarks.
The experimental results verify the potential of ProtCLIP to serve as a protein multi-modality foundation model.
arXiv Detail & Related papers (2024-12-28T04:23:47Z)
- OneProt: Towards Multi-Modal Protein Foundation Models [5.440531199006399]
We introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, alignment, and binding site data.
It surpasses state-of-the-art methods in various downstream tasks, including metal ion binding classification, gene-ontology annotation, and enzyme function prediction.
This work expands multi-modal capabilities in protein models, paving the way for applications in drug discovery, biocatalytic reaction planning, and protein engineering.
arXiv Detail & Related papers (2024-11-07T16:54:54Z)
- Enhancing Protein Predictive Models via Proteins Data Augmentation: A Benchmark and New Directions [58.819567030843025]
This paper extends data augmentation techniques previously used for images and text to proteins, and benchmarks these techniques on a variety of protein-related tasks.
We propose two novel semantic-level protein augmentation methods, namely Integrated Gradients Substitution and Back Translation Substitution.
Finally, we integrate the extended and proposed augmentations into an augmentation pool and propose a simple but effective framework, namely Automated Protein Augmentation (APA).
arXiv Detail & Related papers (2024-03-01T07:58:29Z)
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [74.64101864289572]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers [18.498779242323582]
We propose a novel approach, Prot2Text, which predicts a protein's function in a free text style.
By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, our model effectively integrates diverse data types.
arXiv Detail & Related papers (2023-07-25T09:35:43Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model attains better accuracy and significantly improves data efficiency, by up to $10^5$ times over the baselines set by pretrained and standard supervised methods (see the reprogramming sketch after this list).
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z)
- DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction [2.697420611471228]
We present DIPS-Plus, an enhanced, feature-rich dataset of 42,112 complexes for geometric deep learning of protein interfaces.
The previous version of DIPS contains only the Cartesian coordinates and types of the atoms comprising a given protein complex.
DIPS-Plus now includes a plethora of new residue-level features including protrusion indices, half-sphere amino acid compositions, and new profile hidden Markov model (HMM)-based sequence features for each amino acid.
arXiv Detail & Related papers (2021-06-06T23:56:27Z)
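As a rough sketch of the embedding-reprogramming idea from the R2DL entry above (not the paper's actual method or code): protein token embeddings can be expressed as a learned mixture over a frozen, pretrained source-vocabulary embedding table, so that only the mixing coefficients are trained while the backbone model stays fixed. All sizes and names below are assumptions.

```python
# Hedged sketch of cross-modal "reprogramming" in the spirit of R2DL,
# not the paper's code: protein tokens are embedded as learned mixtures
# of a frozen source-language embedding table.
import torch
import torch.nn as nn

SRC_VOCAB, DIM = 30522, 768  # BERT-like source vocabulary (assumption)
AA_VOCAB = 25                # amino-acid token vocabulary (assumption)

# Stand-in for the pretrained English embedding table; kept frozen.
source_embed = nn.Embedding(SRC_VOCAB, DIM)
source_embed.weight.requires_grad_(False)

# Only these mixing coefficients are trained, so the pretrained model
# is "reprogrammed" rather than fine-tuned.
theta = nn.Parameter(0.01 * torch.randn(AA_VOCAB, SRC_VOCAB))

def protein_embeddings(tokens: torch.Tensor) -> torch.Tensor:
    """Map (batch, length) amino-acid ids to (batch, length, DIM) vectors."""
    coeffs = torch.softmax(theta, dim=-1)  # each row mixes the frozen table
    table = coeffs @ source_embed.weight   # (AA_VOCAB, DIM)
    return table[tokens]

x = torch.randint(0, AA_VOCAB, (4, 60))
print(protein_embeddings(x).shape)         # torch.Size([4, 60, 768])
```

In a full pipeline these embeddings would feed the frozen pretrained encoder, with a small task head and theta as the only trainable parameters.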