ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models
- URL: http://arxiv.org/abs/2303.16452v1
- Date: Wed, 29 Mar 2023 04:35:50 GMT
- Title: ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models
- Authors: Youhan Lee, Hasun Yu
- Abstract summary: In real-world protein engineering, there are many cases where the amino acids in the middle of a protein sequence are optimized while maintaining other residues.
Protein language models (pLMs) have been a promising tool for protein sequence design.
We show that language models trained via fill-in-middle transformation, called ProtFIM, are more appropriate for protein engineering.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Protein language models (pLMs), pre-trained via causal language modeling on
protein sequences, have been a promising tool for protein sequence design. In
real-world protein engineering, there are many cases where the amino acids in
the middle of a protein sequence are optimized while maintaining other
residues. Unfortunately, because of the left-to-right nature of pLMs, existing
pLMs modify suffix residues by prompting with prefix residues, which is
insufficient for the infilling task, where the whole surrounding context
matters. To find more effective pLMs for protein engineering, we design a
new benchmark, Secondary structurE InFilling rEcoveRy (SEIFER), which
approximates infilling sequence design scenarios. By evaluating existing
models on the benchmark, we reveal the weakness of existing language
models and show that language models trained via the fill-in-middle
transformation, called ProtFIM, are more appropriate for protein engineering.
We also demonstrate, through extensive experiments and visualizations, that
ProtFIM generates protein sequences with decent protein representations.
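The fill-in-middle idea the abstract describes can be sketched as a simple data transformation: move the middle span of a sequence to the end so that a left-to-right model learns to generate it conditioned on both the prefix and the suffix. The sketch below is illustrative only; the sentinel token names and cut-point sampling are assumptions, not ProtFIM's actual vocabulary or training recipe.

```python
# Minimal sketch of a fill-in-middle (FIM) training transformation.
# Sentinel tokens and span sampling are hypothetical, for illustration.
import random

PRE, MID, SUF = "<PRE>", "<MID>", "<SUF>"

def fim_transform(seq: str, rng: random.Random) -> str:
    """Rearrange a protein sequence so a causal LM learns to infill.

    The middle span is moved to the end, so a left-to-right model can
    generate it conditioned on BOTH the prefix and the suffix.
    """
    # Pick two cut points defining prefix / middle / suffix (all non-empty).
    i, j = sorted(rng.sample(range(1, len(seq)), 2))
    prefix, middle, suffix = seq[:i], seq[i:j], seq[j:]
    # Training string: prefix and suffix first, middle last.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

rng = random.Random(0)
print(fim_transform("MKTAYIAKQRQISFVKSHFSRQ", rng))
```

At inference time, the same layout lets the model fill a masked middle region: the engineer supplies the fixed prefix and suffix residues, and the model generates the span after the `<MID>` sentinel.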
Related papers
- Long-context Protein Language Model
Self-supervised training of language models (LMs) has seen great success on protein sequences, both for learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture and are trained on individual proteins with short context lengths.
We propose LC-PLM, based on an alternative protein LM architecture, BiMamba-S, built on selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z)
- A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding
ProteinLMBench is the first benchmark dataset consisting of 944 manually verified multiple-choice questions for assessing the protein understanding capabilities of LLMs.
ProteinLMDataset is a dataset specifically designed for further self-supervised pretraining and supervised fine-tuning.
InternLM2-7B, pretrained and fine-tuned on the ProteinLMDataset, outperforms GPT-4 on ProteinLMBench, achieving the highest accuracy score.
arXiv Detail & Related papers (2024-06-08T18:11:30Z)
- ProtT3: Protein-to-Text Generation for Text-based Protein Understanding
Language Models (LMs) excel in understanding textual descriptions of proteins.
Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts.
We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
arXiv Detail & Related papers (2024-05-21T08:06:13Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
- Endowing Protein Language Models with Structural Knowledge
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
arXiv Detail & Related papers (2024-01-26T12:47:54Z)
- Retrieved Sequence Augmentation for Protein Representation Learning
We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction.
Our study fills a frequently encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
arXiv Detail & Related papers (2023-02-24T10:31:45Z)
- Structure-informed Language Models Are Protein Designers
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs, implanting a lightweight structural adapter that endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Structure-aware Protein Self-supervised Learning
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.