Training self-supervised peptide sequence models on artificially chopped proteins
- URL: http://arxiv.org/abs/2211.06428v1
- Date: Wed, 9 Nov 2022 22:22:17 GMT
- Title: Training self-supervised peptide sequence models on artificially chopped proteins
- Authors: Gil Sadeh, Zichen Wang, Jasleen Grewal, Huzefa Rangwala, Layne Price
- Abstract summary: We propose a new peptide data augmentation scheme, where we train peptide language models on "chopped proteins".
We evaluate the representation potential of models trained with chopped proteins versus natural peptides.
We demonstrate improved zero-shot learning performance on a deep mutational scan peptide benchmark.
- Score: 12.715029139379393
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Representation learning for proteins has primarily focused on the global
understanding of protein sequences regardless of their length. However, shorter
proteins (known as peptides) take on distinct structures and functions compared
to their longer counterparts. Unfortunately, there are not as many naturally
occurring peptides available to be sequenced, and therefore there is less
peptide-specific data to train with. In this paper, we propose a new peptide
data augmentation scheme, where we train peptide language models on
artificially constructed peptides that are small contiguous subsets of longer,
wild-type proteins; we refer to the training peptides as "chopped proteins". We
evaluate the representation potential of models trained with chopped proteins
versus natural peptides and find that training language models with chopped
proteins results in more generalized embeddings for short protein sequences.
These peptide-specific models also retain information about the original
protein they were derived from better than language models trained on
full-length proteins. We compare masked language model training objectives to
three novel peptide-specific training objectives: next-peptide prediction,
contrastive peptide selection and evolution-weighted MLM. We demonstrate
improved zero-shot learning performance on a deep mutational scan peptide
benchmark.
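The augmentation itself is straightforward to prototype: sample contiguous subsequences of peptide length from wild-type proteins. The sketch below assumes a uniform sampling strategy and an illustrative length range, neither of which is specified in the abstract.

```python
import random

def chop_protein(sequence, min_len=8, max_len=50, n_chops=5, seed=None):
    """Sample contiguous subsequences ("chopped proteins") from a wild-type protein.

    min_len, max_len, and n_chops are illustrative choices; the abstract does not
    state the paper's actual peptide length range or sampling scheme.
    """
    rng = random.Random(seed)
    chops = []
    for _ in range(n_chops):
        # Pick a peptide length, capped by the protein length, then a start offset.
        length = rng.randint(min_len, min(max_len, len(sequence)))
        start = rng.randint(0, len(sequence) - length)
        chops.append(sequence[start:start + length])
    return chops

# Toy wild-type protein; real training would iterate over a full protein corpus.
wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"
print(chop_protein(wild_type, seed=0))
```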
Related papers
- Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM, based on an alternative protein LM architecture, BiMamba-S, built on selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z)
- ProtT3: Protein-to-Text Generation for Text-based Protein Understanding [88.43323947543996]
Language Models (LMs) excel in understanding textual descriptions of proteins.
Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts.
We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
arXiv Detail & Related papers (2024-05-21T08:06:13Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose ProLLM, a novel framework that is the first to employ an LLM tailored for PPI prediction.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
- Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model [12.37352652557512]
Signal peptide (SP) is a short peptide located in the N-terminus of proteins.
Here we present the Unbiased Organism-agnostic Signal peptide Network (USPNet), a deep learning method for signal peptide classification and cleavage site prediction.
We propose to apply a label distribution-aware margin loss to handle data imbalance and to use evolutionary information of proteins to enrich the representation.
arXiv Detail & Related papers (2023-12-14T14:32:48Z)
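The label distribution-aware margin loss mentioned in the USPNet summary above is a known remedy for class imbalance. Below is a minimal sketch in the common LDAM form (per-class margins proportional to n_j^(-1/4)); the class counts, margin cap, and logit scale are illustrative assumptions and are not taken from USPNet.

```python
import torch
import torch.nn.functional as F

class LDAMLoss(torch.nn.Module):
    """Label-distribution-aware margin loss: rarer classes get larger margins."""

    def __init__(self, class_counts, max_margin=0.5, scale=30.0):
        super().__init__()
        # Margin per class ~ n_j**(-1/4), rescaled so the largest margin equals max_margin.
        margins = 1.0 / torch.tensor(class_counts, dtype=torch.float).pow(0.25)
        self.margins = margins * (max_margin / margins.max())
        self.scale = scale

    def forward(self, logits, target):
        # Subtract the class-dependent margin from the true-class logit only.
        margin = self.margins.to(logits.device)[target]
        adjusted = logits.clone()
        adjusted[torch.arange(logits.size(0)), target] -= margin
        return F.cross_entropy(self.scale * adjusted, target)

# Toy usage: 3 imbalanced classes (counts are made up for illustration).
loss_fn = LDAMLoss(class_counts=[5000, 300, 50])
logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(loss_fn(logits, labels))
```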
- ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models [0.0]
In real-world protein engineering, there are many cases where the amino acids in the middle of a protein sequence are optimized while maintaining other residues.
Protein language models (pLMs) have been a promising tool for protein sequence design.
We show that language models trained via fill-in-middle transformation, called ProtFIM, are more appropriate for protein engineering.
arXiv Detail & Related papers (2023-03-29T04:35:50Z)
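The fill-in-middle idea behind ProtFIM (above) can be illustrated with a generic data transformation: split a sequence into prefix, middle, and suffix, then rearrange it so an autoregressive model conditions on prefix and suffix before generating the middle. The sentinel tokens and split sampling below are assumptions for illustration, not ProtFIM's actual preprocessing.

```python
import random

# Sentinel tokens are illustrative placeholders, not ProtFIM's actual special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(sequence, rng):
    """Rearrange a protein sequence for fill-in-middle training."""
    # Two distinct cut points define prefix | middle | suffix.
    i, j = sorted(rng.sample(range(1, len(sequence)), 2))
    prefix, middle, suffix = sequence[:i], sequence[i:j], sequence[j:]
    # The model sees prefix and suffix first, then learns to generate the middle.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

rng = random.Random(0)
print(fim_transform("MKTAYIAKQRQISFVKSHFSRQ", rng))
```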
- A Text-guided Protein Design Framework [106.79061950107922]
We propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design.
ProteinDT consists of three subsequent steps: ProteinCLAP which aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that creates the protein sequences from the representation.
We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
arXiv Detail & Related papers (2023-02-09T12:59:16Z)
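The ProteinCLAP alignment step in the ProteinDT summary above can be approximated by a generic symmetric contrastive (InfoNCE) objective over paired text and protein embeddings. The encoders, embedding dimension, and temperature below are illustrative assumptions; the summary does not give the exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, protein_emb, temperature=0.07):
    """Symmetric InfoNCE loss that pulls paired text/protein embeddings together."""
    text_emb = F.normalize(text_emb, dim=-1)
    protein_emb = F.normalize(protein_emb, dim=-1)
    logits = text_emb @ protein_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(text_emb.size(0))             # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-ins for encoder outputs of 4 text-protein pairs.
print(contrastive_alignment_loss(torch.randn(4, 128), torch.randn(4, 128)))
```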
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z)
- Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction [7.022012579173686]
Pre-training a protein model to learn effective representations is critical for protein-protein interaction prediction.
Most pre-training models for PPIs are sequence-based, naively applying the language models used in natural language processing to amino acid sequences.
We propose a multimodal protein pre-training model with three modalities: sequence, structure, and function.
arXiv Detail & Related papers (2021-12-09T10:21:52Z)