Annotation-guided Protein Design with Multi-Level Domain Alignment
- URL: http://arxiv.org/abs/2404.16866v4
- Date: Thu, 12 Dec 2024 07:05:53 GMT
- Title: Annotation-guided Protein Design with Multi-Level Domain Alignment
- Authors: Chaohao Yuan, Songyou Li, Geyan Ye, Yikun Zhang, Long-Kai Huang, Wenbing Huang, Wei Liu, Jianhua Yao, Yu Rong,
- Abstract summary: We propose Protein- Alignment Generation, PAAG, a multi-modality protein design framework.
It integrates the textual annotations extracted from protein database for controllable generation in sequence space.
Specifically, PAAG can explicitly generate proteins containing specific domains conditioned on the corresponding domain annotations.
- Score: 39.79713846491306
- License:
- Abstract: The core challenge of de novo protein design lies in creating proteins with specific functions or properties, guided by certain conditions. Current models explore to generate protein using structural and evolutionary guidance, which only provide indirect conditions concerning functions and properties. However, textual annotations of proteins, especially the annotations for protein domains, which directly describe the protein's high-level functionalities, properties, and their correlation with target amino acid sequences, remain unexplored in the context of protein design tasks. In this paper, we propose Protein-Annotation Alignment Generation, PAAG, a multi-modality protein design framework that integrates the textual annotations extracted from protein database for controllable generation in sequence space. Specifically, within a multi-level alignment module, PAAG can explicitly generate proteins containing specific domains conditioned on the corresponding domain annotations, and can even design novel proteins with flexible combinations of different kinds of annotations. Our experimental results underscore the superiority of the aligned protein representations from PAAG over 7 prediction tasks. Furthermore, PAAG demonstrates a significant increase in generation success rate (24.7% vs 4.7% in zinc finger, and 54.3% vs 22.0% in the immunoglobulin domain) in comparison to the existing model. We anticipate that PAAG will broaden the horizons of protein design by leveraging the knowledge from between textual annotation and proteins.
Related papers
- EvoLlama: Enhancing LLMs' Understanding of Proteins via Multimodal Structure and Sequence Representations [28.298740080002077]
Current Large Language Models (LLMs) for understanding proteins primarily treats amino acid sequences as a text modality.
EvoLlama is a framework that connects a structure-based encoder, a sequence-based protein encoder and an LLM for protein understanding.
Our experiments show that EvoLlama's protein understanding capabilities have been significantly enhanced.
arXiv Detail & Related papers (2024-12-16T10:01:33Z) - ProtDAT: A Unified Framework for Protein Sequence Design from Any Protein Text Description [7.198238666986253]
We propose a de novo fine-grained framework capable of designing proteins from any descriptive text input.
Prot DAT builds upon the inherent characteristics of protein data to unify sequences and text as a cohesive whole rather than separate entities.
Experimental results demonstrate that Prot DAT achieves the state-of-the-art performance in protein sequence generation, excelling in rationality, functionality, structural similarity, and validity.
arXiv Detail & Related papers (2024-12-05T11:05:46Z) - ProteinWeaver: A Divide-and-Assembly Approach for Protein Backbone Design [61.19456204667385]
We introduce ProteinWeaver, a two-stage framework for protein backbone design.
ProteinWeaver generates high-quality, novel protein backbones through versatile domain assembly.
By introducing a divide-and-assembly' paradigm, ProteinWeaver advances protein engineering and opens new avenues for functional protein design.
arXiv Detail & Related papers (2024-11-08T08:10:49Z) - ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding [22.610060675922536]
We introduce ProteinGPT, a state-of-the-art multi-modal protein chat system.
ProteinGPT seamlessly integrates protein sequence and structure encoders with linear projection layers for precise representation adaptation.
We train a large-scale dataset of 132,092 proteins with annotations, and optimize the instruction-tuning process using GPT-4o.
Experiments show that ProteinGPT can produce promising responses to proteins and their corresponding questions.
arXiv Detail & Related papers (2024-08-21T06:16:22Z) - ProtT3: Protein-to-Text Generation for Text-based Protein Understanding [88.43323947543996]
Language Models (LMs) excel in understanding textual descriptions of proteins.
Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts.
We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
arXiv Detail & Related papers (2024-05-21T08:06:13Z) - ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z) - A Text-guided Protein Design Framework [106.79061950107922]
We propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design.
ProteinDT consists of three subsequent steps: ProteinCLAP which aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that creates the protein sequences from the representation.
We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
arXiv Detail & Related papers (2023-02-09T12:59:16Z) - Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.