Enhancing Protein Predictive Models via Proteins Data Augmentation: A
Benchmark and New Directions
- URL: http://arxiv.org/abs/2403.00875v1
- Date: Fri, 1 Mar 2024 07:58:29 GMT
- Title: Enhancing Protein Predictive Models via Proteins Data Augmentation: A
Benchmark and New Directions
- Authors: Rui Sun, Lirong Wu, Haitao Lin, Yufei Huang, Stan Z. Li
- Abstract summary: This paper extends data augmentation techniques previously used for images and texts to proteins and then benchmarks these techniques on a variety of protein-related tasks.
We propose two novel semantic-level protein augmentation methods, namely Integrated Gradients Substitution and Back Translation Substitution.
Finally, we integrate extended and proposed augmentations into an augmentation pool and propose a simple but effective framework, namely Automated Protein Augmentation (APA)
- Score: 58.819567030843025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Augmentation is an effective way to make the most of the small
amount of labeled protein data. However, most existing work focuses on
designing new architectures or pre-training tasks, and relatively little work
has studied data augmentation for proteins. This paper extends data augmentation techniques
previously used for images and texts to proteins and then benchmarks these
techniques on a variety of protein-related tasks, providing the first
comprehensive evaluation of protein augmentation. Furthermore, we propose two
novel semantic-level protein augmentation methods, namely Integrated Gradients
Substitution and Back Translation Substitution, which enable protein
semantic-aware augmentation through saliency detection and biological
knowledge. Finally, we integrate extended and proposed augmentations into an
augmentation pool and propose a simple but effective framework, namely
Automated Protein Augmentation (APA), which can adaptively select the most
suitable augmentation combinations for different tasks. Extensive experiments
have shown that APA enhances the performance of five protein-related tasks by
an average of 10.55% across three architectures compared to vanilla
implementations without augmentation, highlighting its potential to make a
great impact on the field.
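The abstract's two ideas, a pool of sequence-level augmentations and an APA-style step that picks the most suitable one, can be sketched in a few lines of Python. This is a hypothetical illustration: the function names, the greedy selection, and the uniform saliency scores are assumptions for the sketch, not the paper's actual implementation.

```python
import random

# Alphabet of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_substitution(seq, rate=0.1, rng=None):
    """Token-level augmentation: replace a fraction of residues at random."""
    rng = rng or random.Random(0)
    chars = list(seq)
    for i in range(len(chars)):
        if rng.random() < rate:
            chars[i] = rng.choice(AMINO_ACIDS)
    return "".join(chars)

def random_deletion(seq, rate=0.1, rng=None):
    """Token-level augmentation: drop a fraction of residues."""
    rng = rng or random.Random(0)
    kept = "".join(c for c in seq if rng.random() >= rate)
    return kept or seq  # never return an empty sequence

def saliency_substitution(seq, saliency, k=2, rng=None):
    """Semantic-level augmentation in the spirit of Integrated Gradients
    Substitution: mutate only the k *least* salient residues, so positions
    the model considers task-relevant are preserved. `saliency` is a
    per-residue importance score (here supplied by the caller)."""
    rng = rng or random.Random(0)
    low_importance = sorted(range(len(seq)), key=lambda i: saliency[i])[:k]
    chars = list(seq)
    for i in low_importance:
        chars[i] = rng.choice(AMINO_ACIDS)
    return "".join(chars)

def select_augmentation(pool, score_fn):
    """APA-style selection, reduced to a greedy stand-in: score each
    augmentation in the pool (e.g. by validation performance of a model
    trained with it) and keep the best."""
    return max(pool, key=score_fn)
```

In the paper, saliency would come from integrated gradients over a trained predictor and the selection would be learned over augmentation *combinations*; the greedy `select_augmentation` above only conveys the pool-then-select shape of the framework.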
Related papers
- GOProteinGNN: Leveraging Protein Knowledge Graphs for Protein Representation Learning [27.192150057715835]
GOProteinGNN is a novel architecture that enhances protein language models by integrating protein knowledge graph information.
Our approach allows for the integration of information at both the individual amino acid level and the entire protein level, enabling a comprehensive and effective learning process.
arXiv Detail & Related papers (2024-07-31T17:54:22Z)
- Multi-Modal CLIP-Informed Protein Editing [8.927362207499181]
We propose a novel method called ProtET for efficient CLIP-informed protein editing through multi-modality learning.
Our approach comprises two stages: in the pre-training stage, contrastive learning aligns protein-biotext representations encoded by two large language models (LLMs).
During the protein editing stage, the fused features from editing instruction texts and original protein sequences serve as the final editing condition for generating target protein sequences.
arXiv Detail & Related papers (2024-07-27T16:41:08Z)
- Boosting Protein Language Models with Negative Sample Mining [20.721167029530168]
We introduce a pioneering methodology for boosting large language models in the domain of protein representation learning.
Our primary contribution lies in the refinement process for correlating the over-reliance on co-evolution knowledge.
By capitalizing on this novel approach, our technique steers the training of transformer-based models within the attention score space.
arXiv Detail & Related papers (2024-05-28T07:24:20Z)
- NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks.
Specifically, we leverage molecular biophysical, secondary-structure, chemical-bond, and ionic features of proteins to facilitate classification tasks.
arXiv Detail & Related papers (2024-03-21T13:27:57Z)
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
- Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict thermostability changes in proteins upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
- Improving few-shot learning-based protein engineering with evolutionary sampling [0.0]
We propose a few-shot learning approach to novel protein design that aims to accelerate the expensive wet lab testing cycle.
Our approach is composed of two parts: a semi-supervised transfer learning approach that generates a discrete fitness landscape for a desired protein function, and a novel evolutionary Markov chain Monte Carlo sampling algorithm.
We demonstrate the performance of our approach by experimentally screening predicted high fitness gene activators, resulting in a dramatically improved hit rate compared to existing methods.
arXiv Detail & Related papers (2023-05-23T23:07:53Z)
- A Text-guided Protein Design Framework [106.79061950107922]
We propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design.
ProteinDT consists of three subsequent steps: ProteinCLAP which aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that creates the protein sequences from the representation.
We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
arXiv Detail & Related papers (2023-02-09T12:59:16Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- OntoProtein: Protein Pretraining With Gene Ontology Embedding [36.92674447484136]
We propose OntoProtein, the first general framework that incorporates the structure of GO (Gene Ontology) into protein pre-training models.
We construct a novel large-scale knowledge graph consisting of GO and its related proteins, in which every node is described by gene annotation texts or protein sequences.
arXiv Detail & Related papers (2022-01-23T14:49:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.