Functional Protein Design with Local Domain Alignment
- URL: http://arxiv.org/abs/2404.16866v2
- Date: Mon, 27 May 2024 07:23:26 GMT
- Title: Functional Protein Design with Local Domain Alignment
- Authors: Chaohao Yuan, Songyou Li, Geyan Ye, Yikun Zhang, Long-Kai Huang, Wenbing Huang, Wei Liu, Jianhua Yao, Yu Rong
- Abstract summary: We propose Protein-Annotation Alignment Generation (PAAG), a multi-modality protein design framework that integrates textual annotations extracted from protein databases for controllable generation in sequence space.
Specifically, within a multi-level alignment module, PAAG can explicitly generate proteins containing specific domains conditioned on the corresponding domain annotations.
Our experimental results underscore the superiority of the aligned protein representations from PAAG across 7 prediction tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The core challenge of de novo protein design lies in creating proteins with specific functions or properties, guided by certain conditions. Current models attempt to generate proteins using structural and evolutionary guidance, which provide only indirect conditions concerning functions and properties. However, textual annotations of proteins, especially the annotations for protein domains, which directly describe the protein's high-level functionalities, properties, and their correlation with target amino acid sequences, remain unexplored in the context of protein design tasks. In this paper, we propose Protein-Annotation Alignment Generation (PAAG), a multi-modality protein design framework that integrates textual annotations extracted from protein databases for controllable generation in sequence space. Specifically, within a multi-level alignment module, PAAG can explicitly generate proteins containing specific domains conditioned on the corresponding domain annotations, and can even design novel proteins with flexible combinations of different kinds of annotations. Our experimental results underscore the superiority of the aligned protein representations from PAAG across 7 prediction tasks. Furthermore, PAAG demonstrates a nearly sixfold increase in generation success rate (24.7% vs 4.7% in zinc finger, and 54.3% vs 8.7% in the immunoglobulin domain) compared to the existing model.
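The interface of annotation-conditioned generation can be pictured with a toy sketch. Everything below (function names, the hash-based "encoder", the conditioning mechanism) is invented for illustration and is not PAAG's actual model or API; it only shows the data flow from a domain annotation to a conditioned sequence sample.

```python
import random

# Hypothetical sketch: a text "encoder" maps a domain annotation to a
# conditioning vector; a "decoder" samples amino acids whose random
# stream is fixed by that vector. Not PAAG's real architecture.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_annotation(text: str, dim: int = 8) -> list[float]:
    """Toy stand-in for a text encoder: hash words into a fixed vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def generate_sequence(annotation: str, length: int, seed: int = 0) -> str:
    """Sample a sequence whose randomness is conditioned on the annotation."""
    cond = encode_annotation(annotation)
    rng = random.Random(f"{seed}-{cond}")  # conditioning fixes the stream
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

seq = generate_sequence("zinc finger domain", length=30)
print(len(seq), seq[:10])
```

The point is only the shape of the interface: the same annotation deterministically conditions generation, and different annotations steer it differently.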
Related papers
- ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding [22.610060675922536]
We introduce ProteinGPT, a state-of-the-art multi-modal protein chat system.
ProteinGPT seamlessly integrates protein sequence and structure encoders with linear projection layers for precise representation adaptation.
We train on a large-scale dataset of 132,092 proteins with annotations, and optimize the instruction-tuning process using GPT-4o.
Experiments show that ProteinGPT can produce promising responses to proteins and their corresponding questions.
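The "linear projection layers for representation adaptation" idea can be sketched minimally. The dimensions and weights below are invented; this only shows a single linear map carrying an encoder embedding into a (hypothetical) LLM embedding space, not ProteinGPT's actual layers.

```python
# Toy sketch of representation adaptation: one linear projection from a
# protein-encoder embedding (dim 3) into an LLM embedding space (dim 4).
# All numbers are illustrative placeholders.

def project(emb, W, b):  # W has shape out x in
    return [sum(w * v for w, v in zip(row, emb)) + bi
            for row, bi in zip(W, b)]

protein_emb = [0.5, -0.25, 1.0]                    # from a protein encoder
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]   # project 3 -> 4
b = [0.0, 0.0, 0.0, 0.125]
print(project(protein_emb, W, b))
```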
arXiv Detail & Related papers (2024-08-21T06:16:22Z)
- A PLMs based protein retrieval framework [4.110243520064533]
We propose a novel protein retrieval framework that mitigates the bias towards sequence similarity.
Our framework harnesses protein language models (PLMs) to embed protein sequences within a high-dimensional feature space.
Extensive experiments demonstrate that our framework can equally retrieve both similar and dissimilar proteins.
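The embed-then-retrieve pattern described above can be sketched with a toy stand-in for the PLM. The 3-mer count "embedding" below is not a protein language model; it is only a placeholder so the cosine-similarity retrieval loop is runnable.

```python
import math
from collections import Counter

# Sketch of embedding-based retrieval; embed() is a toy k-mer counter
# standing in for a real PLM encoder.

def embed(seq: str, k: int = 3) -> Counter:
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, database: list[str], top_k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(database, key=lambda s: cosine(q, embed(s)), reverse=True)[:top_k]

db = ["MKTAYIAKQR", "MKTAYIAKQQ", "GGGGSGGGGS"]
print(retrieve("MKTAYIAKQR", db, top_k=1))  # → ['MKTAYIAKQR']
```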
arXiv Detail & Related papers (2024-07-16T09:52:42Z)
- ProtFAD: Introducing function-aware domains as implicit modality towards protein function perception [0.3928425951824076]
We propose a function-aware domain representation and a domain-joint contrastive learning strategy to distinguish different protein functions.
Our approach significantly and comprehensively outperforms the state-of-the-art methods on various benchmarks.
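A domain-joint contrastive objective of the kind mentioned above is typically InfoNCE-shaped: pull a protein embedding toward its own domain embedding and away from other domains. The sketch below uses hand-made 2-d vectors and is an assumption about the general technique, not ProtFAD's actual loss.

```python
import math

# InfoNCE-style contrastive loss sketch: -log softmax of the positive
# pair's similarity against negatives. Embeddings are toy 2-d lists.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(protein, pos_domain, neg_domains, temperature=0.1):
    logits = [dot(protein, pos_domain) / temperature] + [
        dot(protein, d) / temperature for d in neg_domains
    ]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

# Well-aligned positive -> near-zero loss; mismatched positive -> large loss.
good = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
print(good < bad)  # → True
```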
arXiv Detail & Related papers (2024-05-24T02:26:45Z)
- ProtT3: Protein-to-Text Generation for Text-based Protein Understanding [88.43323947543996]
Language Models (LMs) excel in understanding textual descriptions of proteins.
Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts.
We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
arXiv Detail & Related papers (2024-05-21T08:06:13Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
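The "protein-as-word" idea amounts to giving whole proteins their own entries in the vocabulary so text and protein tokens can be interleaved in one stream. The token names and ids below are made up for illustration; this is not ProtLLM's actual vocabulary or tokenizer.

```python
# Toy sketch of interleaved protein/text tokenization: proteins are
# single "words" in a dedicated vocabulary. All entries are invented.

TEXT_VOCAB = {"binds": 0, "to": 1}
PROTEIN_VOCAB = {"<prot:P53_HUMAN>": 100, "<prot:MDM2_HUMAN>": 101}

def tokenize(mixed):
    """Map an interleaved text/protein token stream to ids."""
    return [PROTEIN_VOCAB.get(t, TEXT_VOCAB.get(t)) for t in mixed]

ids = tokenize(["<prot:P53_HUMAN>", "binds", "to", "<prot:MDM2_HUMAN>"])
print(ids)  # → [100, 0, 1, 101]
```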
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
- A Text-guided Protein Design Framework [106.79061950107922]
We propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design.
ProteinDT consists of three subsequent steps: ProteinCLAP which aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that creates the protein sequences from the representation.
We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
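The three-stage pipeline (ProteinCLAP alignment, facilitator, decoder) can be sketched as a simple function composition. Every function body below is a trivial placeholder invented for illustration; only the data flow (text → text embedding → protein embedding → sequence) reflects the description above.

```python
# Sketch of ProteinDT's stage composition with toy stand-ins; none of
# these bodies are the paper's actual models.

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def clap_text_encoder(text: str) -> list[float]:
    # stand-in: vowel-frequency "embedding" of the prompt
    return [text.count(c) / max(len(text), 1) for c in "aeiou"]

def facilitator(text_emb: list[float]) -> list[float]:
    # stand-in for the text-to-protein representation generator
    return [2.0 * x for x in text_emb]

def decoder(protein_emb: list[float], length: int = 12) -> str:
    # stand-in decoder: index the embedding into the alphabet
    idx = int(sum(protein_emb) * 100) % len(ALPHABET)
    return ALPHABET[idx] * length  # degenerate, but shows the data flow

def text_to_protein(prompt: str) -> str:
    return decoder(facilitator(clap_text_encoder(prompt)))

print(text_to_protein("a soluble zinc finger protein"))
```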
arXiv Detail & Related papers (2023-02-09T12:59:16Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs, implanting a lightweight structural adapter that endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
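A "lightweight adapter" of the kind implanted into a pLM layer is commonly a residual bottleneck: project down, apply a nonlinearity, project up, and add back to the hidden state. The sketch below assumes that generic form with tiny invented dimensions and weights; it is not LM-Design's actual module.

```python
# Minimal residual bottleneck adapter sketch (plain Python lists, no
# frameworks). Dimensions and weights are illustrative only.

def relu(x):
    return [max(0.0, v) for v in x]

def linear(x, W):  # W has shape out x in
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def adapter(hidden, W_down, W_up):
    """Bottleneck adapter: hidden + up(relu(down(hidden)))."""
    delta = linear(relu(linear(hidden, W_down)), W_up)
    return [h + d for h, d in zip(hidden, delta)]

h = [1.0, -1.0, 0.5, 2.0]                    # a pLM hidden state (dim 4)
W_down = [[0.5, 0, 0, 0], [0, 0, 0, 0.5]]    # down-project 4 -> 2
W_up = [[1, 0], [0, 1], [0, 0], [0, 0]]      # up-project 2 -> 4
print(adapter(h, W_down, W_up))  # → [1.5, 0.0, 0.5, 2.0]
```

The residual connection is what makes the surgery "lightweight": with near-zero adapter weights the pLM's original behavior passes through unchanged.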
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Generative De Novo Protein Design with Global Context [36.21545615114117]
The inverse problem of protein structure prediction aims to obtain a novel protein sequence that will fold into a defined structure.
Recent works on computational protein design have studied designing sequences for the desired backbone structure with local positional information.
We propose the Global-Context Aware generative de novo protein design method (GCA), consisting of local and global modules.
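The local/global split can be illustrated with a toy per-residue feature: blend each position's local-window summary with a whole-sequence summary. The real GCA operates on backbone structure features with learned modules; everything below is an invented scalar stand-in showing only the combination pattern.

```python
# Toy sketch of combining local and global context per residue.
# Features are plain scalars; alpha weights local vs global context.

def local_context(feats, i, window=1):
    lo, hi = max(0, i - window), min(len(feats), i + window + 1)
    seg = feats[lo:hi]
    return sum(seg) / len(seg)

def global_context(feats):
    return sum(feats) / len(feats)

def gca_features(feats, alpha=0.5):
    g = global_context(feats)
    return [alpha * local_context(feats, i) + (1 - alpha) * g
            for i in range(len(feats))]

print(gca_features([1.0, 2.0, 3.0, 4.0]))  # → [2.0, 2.25, 2.75, 3.0]
```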
arXiv Detail & Related papers (2022-04-21T02:55:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.