ECRECer: Enzyme Commission Number Recommendation and Benchmarking based
on Multiagent Dual-core Learning
- URL: http://arxiv.org/abs/2202.03632v1
- Date: Tue, 8 Feb 2022 04:00:49 GMT
- Title: ECRECer: Enzyme Commission Number Recommendation and Benchmarking based
on Multiagent Dual-core Learning
- Authors: Zhenkun Shi, Qianqian Yuan, Ruoyu Wang, Haoran Li, Xiaoping Liao,
Hongwu Ma
- Abstract summary: We report ECRECer, a cloud platform for accurately predicting EC numbers based on novel deep learning techniques.
To build ECRECer, we evaluate different protein representation methods and adopt a protein language model for protein sequence embedding.
ECRECer delivers the highest performance, which improves accuracy and F1 score by 70% and 20% over the state-of-the-art, respectively.
- Score: 1.4114970711442507
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Enzyme Commission (EC) numbers, which associate a protein sequence with the
biochemical reactions it catalyzes, are essential for the accurate
understanding of enzyme functions and cellular metabolism. Many ab-initio
computational approaches were proposed to predict EC numbers for given input
sequences directly. However, the prediction performance (accuracy, recall,
precision), usability, and efficiency of existing methods still have much room
for improvement. Here, we report ECRECer, a cloud platform for accurately
predicting EC numbers based on novel deep learning techniques. To build
ECRECer, we evaluate different protein representation methods and adopt a
protein language model for protein sequence embedding. After embedding, we
propose a multi-agent hierarchy deep learning-based framework to learn the
proposed tasks in a multi-task manner. Specifically, we used an extreme
multi-label classifier to perform the EC prediction and employed a greedy
strategy to integrate and fine-tune the final model. Comparative analyses
against four representative methods demonstrate that ECRECer delivers the
highest performance, which improves accuracy and F1 score by 70% and 20% over
the state-of-the-art, respectively. With ECRECer, we can annotate numerous
enzymes in the Swiss-Prot database with incomplete EC numbers to their full
fourth level. Take the UniProt protein "A0A0U5GJ41" (annotated 1.14.-.-) as an
example: ECRECer annotated it with "1.14.11.38", which is supported by further
protein structure analysis based on AlphaFold2. Finally, we established a webserver
(https://ecrecer.biodesign.ac.cn) and provided an offline bundle to improve
usability.
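For readers who want to see the embed-then-classify workflow described above in concrete form, the following Python sketch pairs a pretrained protein language model with a generic multi-label classifier. It is a minimal illustration under stated assumptions: the checkpoint (facebook/esm2_t6_8M_UR50D), the scikit-learn one-vs-rest classifier, and the toy EC labels are chosen for demonstration and are not ECRECer's actual architecture or API.

# Minimal sketch: protein-LM embedding followed by multi-label EC prediction.
# Model name, classifier, and labels are illustrative; this is NOT ECRECer's API.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer

# 1) Embed protein sequences with a pretrained protein language model (assumed: small ESM-2).
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
plm = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D").eval()

def embed(sequences):
    """Mean-pool the last hidden states into one fixed-length vector per sequence."""
    batch = tokenizer(sequences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = plm(**batch).last_hidden_state          # (batch, length, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding positions
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# 2) Multi-label EC prediction on top of the embeddings (toy training data).
train_seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ"]
train_ecs  = [["1.14.11.38"], ["2.7.11.1"]]              # hypothetical labels

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_ecs)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(embed(train_seqs), Y)

# 3) Score EC numbers for a new sequence.
probs = clf.predict_proba(embed(["MSTNPKPQRKTKRNTNRRPQDVKFPGG"]))
for ec, p in zip(mlb.classes_, probs[0]):
    print(f"{ec}\t{p:.3f}")

In practice, an extreme multi-label classifier (as the abstract describes) would replace the simple one-vs-rest model to cope with the thousands of possible EC labels.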
Related papers
- Autoregressive Enzyme Function Prediction with Multi-scale Multi-modality Fusion [11.278610817877578]
We introduce MAPred, a novel multi-modality and multi-scale model designed to autoregressively predict the EC number of proteins.
MAPred integrates both the primary amino acid sequence and the 3D tokens of proteins, employing a dual-pathway approach to capture comprehensive protein characteristics.
Evaluations on benchmark datasets, including New-392, Price, and New-815, demonstrate that our method outperforms existing models.
arXiv Detail & Related papers (2024-08-11T08:28:43Z)
- Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted, efficient approach that integrates protein sequence and structural features to predict thermostability changes in a protein upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs: a lightweight structural adapter is implanted into the pLM, endowing it with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative [61.984700682903096]
HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2.
Our proposed method pre-trains a large-scale protein language model with thousands of millions of primary sequences.
We obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
arXiv Detail & Related papers (2022-07-28T07:30:33Z)
- PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding [17.770721291090258]
PEER is a comprehensive and multi-task benchmark for Protein sEquence undERstanding.
It provides a set of diverse protein understanding tasks including protein function prediction, protein localization prediction, protein structure prediction, and protein-ligand interaction prediction.
We evaluate different types of sequence-based methods for each task including traditional feature engineering approaches, different sequence encoding methods as well as large-scale pre-trained protein language models.
arXiv Detail & Related papers (2022-06-05T05:21:56Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method effectively captures inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
- EEG-Inception: An Accurate and Robust End-to-End Neural Network for EEG-based Motor Imagery Classification [123.93460670568554]
This paper proposes a novel convolutional neural network (CNN) architecture for accurate and robust EEG-based motor imagery (MI) classification.
The proposed CNN model, namely EEG-Inception, is built on the backbone of the Inception-Time network.
The proposed network performs end-to-end classification, taking raw EEG signals as input without requiring complex EEG signal preprocessing.
arXiv Detail & Related papers (2021-01-24T19:03:10Z)
- Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks [1.452875650827562]
Less than 1% of protein sequences are structurally and functionally annotated.
We present a modification to the RoBERTa model by inputting a mixture of binding and non-binding protein sequences.
We suggest that Transformer's attention mechanism contributes to protein binding site discovery.
arXiv Detail & Related papers (2020-12-05T17:37:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.