Multi-Modal CLIP-Informed Protein Editing
- URL: http://arxiv.org/abs/2407.19296v1
- Date: Sat, 27 Jul 2024 16:41:08 GMT
- Title: Multi-Modal CLIP-Informed Protein Editing
- Authors: Mingze Yin, Hanjing Zhou, Yiheng Zhu, Miao Lin, Yixuan Wu, Jialu Wu, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou, Jintai Chen, Jian Wu
- Abstract summary: We propose a novel method called ProtET for efficient CLIP-informed protein editing through multi-modality learning.
Our approach comprises two stages: in the pretraining stage, contrastive learning aligns protein-biotext representations encoded by two large language models (LLMs), respectively.
During the protein editing stage, the fused features from editing instruction texts and original protein sequences serve as the final editing condition for generating target protein sequences.
- Score: 8.927362207499181
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Proteins govern most biological functions essential for life, but achieving controllable protein discovery and optimization remains challenging. Recently, machine learning-assisted protein editing (MLPE) has shown promise in accelerating optimization cycles and reducing experimental workloads. However, current methods struggle with the vast combinatorial space of potential protein edits and cannot explicitly conduct protein editing using biotext instructions, limiting their interactivity with human feedback. To fill these gaps, we propose a novel method called ProtET for efficient CLIP-informed protein editing through multi-modality learning. Our approach comprises two stages: in the pretraining stage, contrastive learning aligns protein-biotext representations encoded by two large language models (LLMs), respectively. Subsequently, during the protein editing stage, the fused features from editing instruction texts and original protein sequences serve as the final editing condition for generating target protein sequences. Comprehensive experiments demonstrate the superiority of ProtET in editing proteins to enhance human-expected functionality across multiple attribute domains, including enzyme catalytic activity, protein stability and antibody specific binding ability. ProtET improves on state-of-the-art results by a large margin, achieving significant stability improvements of 16.67% and 16.90%. This capability positions ProtET to advance real-world artificial protein editing, potentially addressing unmet academic, industrial, and clinical needs.
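The following is a minimal, self-contained sketch of the two stages described in the abstract: CLIP-style contrastive alignment of protein and biotext embeddings, followed by feature fusion into an editing condition. The encoder architecture, embedding sizes, InfoNCE temperature, and concatenation-based fusion are illustrative assumptions; ProtET's actual LLM encoders and generative decoder are not reproduced here.

```python
# Toy sketch of ProtET's two training stages, assuming CLIP-style InfoNCE
# alignment and concatenation-based feature fusion; the real model uses
# large pretrained protein/text LLMs and a different decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanPoolEncoder(nn.Module):
    """Stand-in for a large LLM encoder: embed tokens, mean-pool, project."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):  # tokens: (batch, seq_len)
        return self.proj(self.embed(tokens).mean(dim=1))  # (batch, dim)

def clip_loss(protein_z, text_z, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired protein/biotext embeddings."""
    p = F.normalize(protein_z, dim=-1)
    t = F.normalize(text_z, dim=-1)
    logits = p @ t.T / temperature    # (batch, batch) similarity matrix
    labels = torch.arange(len(p))     # matching pairs sit on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Stage 1: contrastive pretraining aligns the two modalities.
prot_enc, text_enc = MeanPoolEncoder(25), MeanPoolEncoder(1000)
proteins = torch.randint(0, 25, (8, 120))    # 8 toy protein sequences
biotexts = torch.randint(0, 1000, (8, 40))   # 8 paired description texts
loss = clip_loss(prot_enc(proteins), text_enc(biotexts))
loss.backward()                               # would drive alignment training

# Stage 2: fuse instruction-text and original-protein features into one
# editing condition; a real decoder would generate the edited sequence.
fuse = nn.Linear(64 * 2, 64)
condition = fuse(torch.cat([prot_enc(proteins), text_enc(biotexts)], dim=-1))
print(condition.shape)                        # (8, 64) per-sample condition
```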
Related papers
- TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering [21.963312554645924]
TourSynbio-7B is a large model specifically designed for protein engineering tasks without external protein encoders.
TourSynbio-Agent is a framework capable of performing various protein engineering tasks, including mutation analysis, inverse folding, protein folding, and visualization.
arXiv Detail & Related papers (2024-08-27T13:36:00Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose ProLLM, a novel framework that, for the first time, employs an LLM tailored for PPI prediction; a toy chain-of-thought prompt sketch follows this entry.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
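As a rough illustration of the "Protein Chain-of-Thoughts" idea, the hypothetical prompt builder below asks a model to reason through signaling intermediates before predicting an interaction. The template, step format, and example proteins are guesses for illustration, not ProLLM's actual prompt or training scheme.

```python
# Hypothetical "Protein Chain-of-Thoughts" style prompt for PPI prediction;
# the template and intermediate-reasoning format are assumptions.
def ppi_cot_prompt(protein_a: str, protein_b: str, pathway_hops: list[str]) -> str:
    """Build a prompt that walks through signaling intermediates (the
    'chain of thoughts') before asking for an interaction verdict."""
    hops = "\n".join(f"  step {i + 1}: consider {p}"
                     for i, p in enumerate(pathway_hops))
    return (
        f"Proteins: {protein_a} and {protein_b}\n"
        f"Reason through possible signaling intermediates:\n{hops}\n"
        "Conclusion: do these proteins interact (yes/no)?"
    )

# Example usage with a well-known pair and assumed intermediates.
print(ppi_cot_prompt("TP53", "MDM2", ["ATM", "CHK2"]))
```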
- Enhancing Protein Predictive Models via Proteins Data Augmentation: A Benchmark and New Directions [58.819567030843025]
This paper extends data augmentation techniques previously used for images and texts to proteins and then benchmarks these techniques on a variety of protein-related tasks.
We propose two novel semantic-level protein augmentation methods, namely Integrated Gradients Substitution and Back Translation Substitution.
Finally, we integrate the extended and proposed augmentations into an augmentation pool and propose a simple but effective framework, namely Automated Protein Augmentation (APA); a toy substitution-based augmentation sketch follows this entry.
arXiv Detail & Related papers (2024-03-01T07:58:29Z)
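For flavor, here is a baseline protein augmentation via random conservative residue substitution. The paper's Integrated Gradients Substitution and Back Translation Substitution are semantic-level methods with considerably more machinery; this stand-in only shows the general augment-then-train idea, and the substitution groups are an assumption, not the paper's.

```python
# Toy protein augmentation: randomly swap residues for similar ones.
import random

# Assumed groups of biochemically similar residues (illustrative only).
SIMILAR = {"L": "IVM", "I": "LVM", "V": "ILM", "K": "R", "R": "K",
           "D": "E", "E": "D", "S": "T", "T": "S", "F": "YW"}

def augment(seq: str, rate: float = 0.1, seed: int = 0) -> str:
    """Replace a fraction of residues with a biochemically similar residue."""
    rng = random.Random(seed)
    out = []
    for aa in seq:
        if aa in SIMILAR and rng.random() < rate:
            out.append(rng.choice(SIMILAR[aa]))
        else:
            out.append(aa)
    return "".join(out)

print(augment("MKTLLILAVVAAALA", rate=0.3))
```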
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates; a rough sketch of this protein-as-word idea follows this entry.
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
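A minimal sketch of the protein-as-word intuition: treat each candidate protein as one extra vocabulary entry so the language-model head can score proteins alongside ordinary words. The vocabulary sizes and the single linear scoring head below are illustrative assumptions, not ProtLLM's architecture.

```python
# Protein-as-word sketch: extend the output vocabulary with protein slots.
import torch
import torch.nn as nn

text_vocab, protein_pool = 1000, 500                 # toy sizes (assumed)
lm_head = nn.Linear(64, text_vocab + protein_pool)   # joint word+protein logits

hidden = torch.randn(1, 64)                # stand-in for an LLM's final hidden state
logits = lm_head(hidden)
protein_logits = logits[:, text_vocab:]    # scores over the protein candidates
best = protein_logits.argmax(dim=-1).item()
print(f"predicted protein candidate id: {best}")
```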
- Efficiently Predicting Mutational Effect on Homologous Proteins by Evolution Encoding [7.067145619709089]
EvolMPNN is an efficient model for learning evolution-aware protein embeddings.
It performs up to 6.4% better than state-of-the-art methods and attains a 36X inference speedup.
arXiv Detail & Related papers (2024-02-20T23:06:21Z)
- Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict thermostability changes in proteins upon single-point mutations; a hedged sketch of this general recipe follows this entry.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
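The general recipe could look like the sketch below: embed wild-type and mutant sequences with a protein language model such as ESM, combine the embedding difference with structural features, and regress the stability change. Random tensors stand in for real ESM embeddings and structure features, and the feature set and regression head are assumptions, not the paper's.

```python
# Hedged sketch: ESM-style embeddings + structural features -> ddG regressor.
import torch
import torch.nn as nn

emb_wt = torch.randn(1, 1280)      # placeholder for ESM embedding of wild type
emb_mut = torch.randn(1, 1280)     # placeholder for ESM embedding of mutant
struct_feats = torch.randn(1, 16)  # e.g. solvent accessibility, contacts (assumed)

features = torch.cat([emb_mut - emb_wt, struct_feats], dim=-1)
ddg_head = nn.Sequential(nn.Linear(1280 + 16, 128), nn.ReLU(), nn.Linear(128, 1))
print(f"predicted ddG: {ddg_head(features).item():.3f}")  # untrained, illustrative
```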
- Multi-level Protein Representation Learning for Blind Mutational Effect Prediction [5.207307163958806]
This paper introduces a novel pre-training framework that cascades sequential and geometric analyzers for protein structures.
It guides mutational directions toward desired traits by simulating natural selection on wild-type proteins.
We assess the proposed approach using a public database and two new databases for a variety of variant effect prediction tasks.
arXiv Detail & Related papers (2023-06-08T03:00:50Z)
- A Text-guided Protein Design Framework [106.79061950107922]
We propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design.
ProteinDT consists of three consecutive steps: ProteinCLAP, which aligns the representations of the two modalities; a facilitator that generates the protein representation from the text modality; and a decoder that creates protein sequences from that representation (a schematic sketch follows this entry).
We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
arXiv Detail & Related papers (2023-02-09T12:59:16Z)
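Below is a schematic of ProteinDT's three-step pipeline as described in the entry above. All module sizes and architectures are placeholders: the aligned text embedding is faked with random values, and the facilitator and decoder are single linear layers rather than the paper's models.

```python
# Schematic of the described pipeline: (1) ProteinCLAP-style alignment,
# (2) a facilitator mapping text embeddings into protein space,
# (3) a decoder emitting residues. Everything here is a placeholder.
import torch
import torch.nn as nn

dim, n_residues = 64, 20
text_z = torch.randn(1, dim)            # step 1 output: aligned text embedding

facilitator = nn.Linear(dim, dim)       # step 2: text -> protein representation
protein_z = facilitator(text_z)

decoder = nn.Linear(dim, n_residues)    # step 3: toy per-position residue head
alphabet = "ACDEFGHIKLMNPQRSTVWY"
tokens = decoder(protein_z).softmax(-1).multinomial(8)  # sample 8 residues
print("".join(alphabet[i] for i in tokens[0]))
```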
- Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method is both accurate and efficient at generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)