ProtIR: Iterative Refinement between Retrievers and Predictors for
Protein Function Annotation
- URL: http://arxiv.org/abs/2402.07955v1
- Date: Sat, 10 Feb 2024 17:31:46 GMT
- Authors: Zuobai Zhang, Jiarui Lu, Vijil Chenthamarakshan, Aurélie Lozano,
Payel Das, Jian Tang
- Abstract summary: We introduce a novel variational pseudo-likelihood framework, ProtIR, designed to improve function predictors by incorporating inter-protein similarity modeling.
ProtIR achieves around a 10% improvement over vanilla predictor-based methods.
It achieves performance on par with protein language model-based methods, yet without the need for massive pre-training.
- Score: 38.019425619750265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Protein function annotation is an important yet challenging task in biology.
Recent deep learning advancements show significant potential for accurate
function prediction by learning from protein sequences and structures.
Nevertheless, these predictor-based methods often overlook the modeling of
protein similarity, an idea commonly employed in traditional approaches using
sequence or structure retrieval tools. To fill this gap, we first study the
effect of inter-protein similarity modeling by benchmarking retriever-based
methods against predictors on protein function annotation tasks. Our results
show that retrievers can match or outperform predictors without large-scale
pre-training. Building on these insights, we introduce a novel variational
pseudo-likelihood framework, ProtIR, designed to improve function predictors by
incorporating inter-protein similarity modeling. This framework iteratively
refines knowledge between a function predictor and retriever, thereby combining
the strengths of both predictors and retrievers. ProtIR achieves around a 10%
improvement over vanilla predictor-based methods. Moreover, it performs on par
with protein language model-based methods without the need for massive
pre-training, highlighting the efficacy of our framework. Code will be released
upon acceptance.
Related papers
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- Protein-Mamba: Biological Mamba Models for Protein Function Prediction [18.642511763423048]
Protein-Mamba is a novel two-stage model that leverages both self-supervised learning and fine-tuning to improve protein function prediction.
Our experiments demonstrate that Protein-Mamba achieves competitive performance compared with several state-of-the-art methods.
arXiv Detail & Related papers (2024-09-22T22:51:56Z)
- Structure-Informed Protein Language Model [38.019425619750265]
We introduce the integration of remote homology detection to distill structural information into protein language models.
We evaluate the impact of this structure-informed training on downstream protein function prediction tasks.
arXiv Detail & Related papers (2024-02-07T09:32:35Z)
- FABind: Fast and Accurate Protein-Ligand Binding [127.7790493202716]
FABind is an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding.
Our proposed model demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods.
arXiv Detail & Related papers (2023-10-10T16:39:47Z)
- DeepGATGO: A Hierarchical Pretraining-Based Graph-Attention Model for Automatic Protein Function Prediction [4.608328575930055]
Automatic protein function prediction (AFP) is classified as a large-scale multi-label classification problem.
Currently, popular methods primarily combine protein-related information and Gene Ontology (GO) terms to generate final functional predictions.
We propose a sequence-based hierarchical prediction method, DeepGATGO, which processes protein sequences and GO term labels hierarchically.
arXiv Detail & Related papers (2023-07-24T07:01:32Z)
- Predicting protein variants with equivariant graph neural networks [0.0]
We compare the abilities of equivariant graph neural networks (EGNNs) and sequence-based approaches to identify promising amino-acid mutations.
Our proposed structural approach achieves a competitive performance to sequence-based approaches while being trained on significantly fewer molecules.
arXiv Detail & Related papers (2023-06-21T12:44:52Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model attains higher accuracy and improves data efficiency by up to 105 times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach efficiently produces high-quality decoys compared with traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.