Leveraging Sequence Embedding and Convolutional Neural Network for
Protein Function Prediction
- URL: http://arxiv.org/abs/2112.00344v1
- Date: Wed, 1 Dec 2021 08:31:01 GMT
- Title: Leveraging Sequence Embedding and Convolutional Neural Network for
Protein Function Prediction
- Authors: Wei-Cheng Tseng, Po-Han Chi, Jia-Hua Wu, Min Sun
- Abstract summary: The main challenges of protein function prediction are the large label space and the lack of labeled training data.
Our method leverages unsupervised sequence embeddings and the success of deep convolutional neural networks to overcome these challenges.
- Score: 27.212743275697825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The capability to accurately predict protein functions and properties is
essential in the biotechnology industry, e.g., for drug development and
artificial protein synthesis. The main challenges of protein function
prediction are the large label space and the lack of labeled training data.
Our method leverages unsupervised sequence embeddings and the success of deep
convolutional neural networks to overcome these challenges. In contrast, most
existing methods discard rare protein functions to reduce the label space.
Furthermore, some existing methods require additional bio-information (e.g.,
the 3-dimensional structure of the proteins), which is difficult to determine
in biochemical experiments. Our proposed method significantly outperforms the
other methods on the publicly available benchmark using only protein sequences
as input, which speeds up the process of identifying protein functions.
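The abstract does not spell out the architecture, but the described pipeline (unsupervised sequence embeddings fed into a deep CNN that scores a large multi-label space) can be sketched roughly as below. The embedding dimension, kernel sizes, and GO-term count are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch: per-residue sequence embeddings -> 1D CNN -> multi-label GO scores.
# All hyperparameters (embed_dim, num_go_terms, kernel sizes) are illustrative, not from the paper.
import torch
import torch.nn as nn

class ProteinFunctionCNN(nn.Module):
    def __init__(self, embed_dim=1024, num_go_terms=5000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 512, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=9, padding=4),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(512, num_go_terms)

    def forward(self, residue_embeddings):
        # residue_embeddings: (batch, seq_len, embed_dim), e.g. from an unsupervised sequence model
        x = residue_embeddings.transpose(1, 2)   # -> (batch, embed_dim, seq_len)
        x = self.conv(x)                         # local sequence patterns
        x = x.max(dim=2).values                  # global max pool over residues
        return self.classifier(x)                # one logit per GO term

model = ProteinFunctionCNN()
logits = model(torch.randn(2, 300, 1024))                     # two proteins of length 300
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(2, 5000))   # multi-label targets
```

A per-label binary cross-entropy objective of this kind is one way to keep rare functions in the label space rather than pruning them, in line with the abstract's claim of handling the full label set.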
Related papers
- ProteinRPN: Towards Accurate Protein Function Prediction with Graph-Based Region Proposals [4.525216077859531]
We introduce the Protein Region Proposal Network (ProteinRPN) for accurate protein function prediction.
ProteinRPN identifies potential functional regions (anchors) which are refined through the hierarchy-aware node drop pooling layer.
The representations of the predicted functional nodes are enriched using attention mechanisms and fed into a Graph Multiset Transformer.
arXiv Detail & Related papers (2024-09-01T04:40:04Z)
- ProtT3: Protein-to-Text Generation for Text-based Protein Understanding [88.43323947543996]
Language Models (LMs) excel in understanding textual descriptions of proteins.
Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts.
We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
arXiv Detail & Related papers (2024-05-21T08:06:13Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks.
Specifically, we leverage molecular biophysical, secondary-structure, chemical-bond, and ionic features of proteins to facilitate classification tasks.
arXiv Detail & Related papers (2024-03-21T13:27:57Z)
- Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model [12.37352652557512]
A signal peptide (SP) is a short peptide located at the N-terminus of a protein.
Here we present Unbiased Organism-agnostic Signal peptide Network (USPNet), a signal peptide classification and cleavage site prediction deep learning method.
We propose applying a label distribution-aware margin loss to handle data imbalance and using evolutionary information of proteins to enrich the representations (a hedged sketch of this margin loss appears after this list).
arXiv Detail & Related papers (2023-12-14T14:32:48Z)
- DeepGATGO: A Hierarchical Pretraining-Based Graph-Attention Model for Automatic Protein Function Prediction [4.608328575930055]
Automatic protein function prediction (AFP) is classified as a large-scale multi-label classification problem.
Currently, popular methods primarily combine protein-related information and Gene Ontology (GO) terms to generate final functional predictions.
We propose a sequence-based hierarchical prediction method, DeepGATGO, which processes protein sequences and GO term labels hierarchically.
arXiv Detail & Related papers (2023-07-24T07:01:32Z)
- Multi-level Protein Representation Learning for Blind Mutational Effect Prediction [5.207307163958806]
This paper introduces a novel pre-training framework that cascades sequential and geometric analyzers for protein structures.
It guides mutational directions toward desired traits by simulating natural selection on wild-type proteins.
We assess the proposed approach using a public database and two new databases for a variety of variant effect prediction tasks.
arXiv Detail & Related papers (2023-06-08T03:00:50Z)
- A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling.
We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z)
- A Text-guided Protein Design Framework [106.79061950107922]
We propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design.
ProteinDT consists of three consecutive steps: ProteinCLAP, which aligns the representations of the two modalities; a facilitator, which generates a protein representation from the text modality; and a decoder, which creates protein sequences from that representation (a minimal sketch of the contrastive alignment step appears after this list).
We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
arXiv Detail & Related papers (2023-02-09T12:59:16Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z)
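The USPNet entry above mentions a label distribution-aware margin (LDAM) loss for class imbalance. The sketch below follows the standard LDAM formulation (per-class margins proportional to n_j^{-1/4}); USPNet's actual implementation and hyperparameters may differ, and the class counts shown are made up.

```python
# Hedged sketch of a label distribution-aware margin (LDAM) loss in the style of
# Cao et al. (2019); USPNet's real code may differ. Class counts are illustrative only.
import torch
import torch.nn.functional as F

class LDAMLoss(torch.nn.Module):
    def __init__(self, class_counts, max_margin=0.5, scale=30.0):
        super().__init__()
        # Per-class margin proportional to n_j^(-1/4): rare classes get larger margins.
        margins = 1.0 / torch.sqrt(torch.sqrt(torch.tensor(class_counts, dtype=torch.float)))
        self.margins = margins * (max_margin / margins.max())
        self.scale = scale

    def forward(self, logits, targets):
        # Subtract each sample's class margin from the logit of its true class only.
        margin = self.margins.to(logits.device)[targets]            # (batch,)
        adjusted = logits.clone()
        adjusted[torch.arange(logits.size(0)), targets] -= margin
        return F.cross_entropy(self.scale * adjusted, targets)

# Example: 3 heavily imbalanced classes (e.g. no-SP vs. rare signal peptide types).
criterion = LDAMLoss(class_counts=[50000, 1200, 300])
loss = criterion(torch.randn(8, 3), torch.randint(0, 3, (8,)))
```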
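The ProteinDT entry describes ProteinCLAP, which aligns text and protein representations. A minimal CLIP-style contrastive (InfoNCE) objective is sketched below as one plausible form of that alignment step; the encoder outputs, embedding size, and temperature are assumptions, not details taken from the paper.

```python
# Hedged sketch of a CLIP-style contrastive alignment step (ProteinCLAP-like);
# the embeddings and temperature are stand-ins, not ProteinDT's actual settings.
import torch
import torch.nn.functional as F

def clap_alignment_loss(protein_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired protein/text embeddings."""
    p = F.normalize(protein_emb, dim=-1)                # (batch, d)
    t = F.normalize(text_emb, dim=-1)                   # (batch, d)
    logits = p @ t.T / temperature                      # pairwise similarities
    labels = torch.arange(p.size(0), device=p.device)   # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Example with random stand-ins for the two encoders' outputs.
loss = clap_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))
```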
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.