Prot2Chat: Protein LLM with Early Fusion of Sequence and Structure
- URL: http://arxiv.org/abs/2502.06846v1
- Date: Fri, 07 Feb 2025 05:23:16 GMT
- Title: Prot2Chat: Protein LLM with Early Fusion of Sequence and Structure
- Authors: Zhicong Wang, Zicheng Ma, Ziqiang Cao, Changlong Zhou, Jun Zhang, Yiqin Gao,
- Abstract summary: Prot2Chat is a novel framework that integrates multimodal protein representations with natural language through a unified module.
Our model incorporates a modified ProteinMPNN encoder, which encodes protein sequence and structural information in a unified manner, and a protein-text adapter with cross-attention mechanisms.
- Score: 7.9473027178525975
- License:
- Abstract: Proteins play a pivotal role in living organisms, yet understanding their functions presents significant challenges, including the limited flexibility of classification-based methods, the inability to effectively leverage spatial structural information, and the lack of systematic evaluation metrics for protein Q&A systems. To address these limitations, we propose Prot2Chat, a novel framework that integrates multimodal protein representations with natural language through a unified module, enabling large language model (LLM)-driven answer generation. Our model incorporates a modified ProteinMPNN encoder, which encodes protein sequence and structural information in a unified manner, a protein-text adapter with cross-attention mechanisms, and a LLaMA3 decoder. To optimize training efficiency, we freeze the encoder and employ LoRA techniques for the decoder. We conducted experiments on two datasets, both automated metrics and expert evaluations demonstrate the superior performance of our model. Furthermore, zero-shot prediction results highlight its strong generalization capabilities. This framework offers a promising solution for bridging protein domain knowledge with natural language understanding, paving the way for transformative advancements in protein-related research.
Related papers
- Computational Protein Science in the Era of Large Language Models (LLMs) [54.35488233989787]
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm.
Recently, Language Models (pLMs) have emerged as a milestone in AI due to their unprecedented language processing & generalization capability.
arXiv Detail & Related papers (2025-01-17T16:21:18Z) - EvoLlama: Enhancing LLMs' Understanding of Proteins via Multimodal Structure and Sequence Representations [28.298740080002077]
Current Large Language Models (LLMs) for understanding proteins primarily treats amino acid sequences as a text modality.
EvoLlama is a framework that connects a structure-based encoder, a sequence-based protein encoder and an LLM for protein understanding.
Our experiments show that EvoLlama's protein understanding capabilities have been significantly enhanced.
arXiv Detail & Related papers (2024-12-16T10:01:33Z) - Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built off selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z) - Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding [43.811432723460534]
We introduce Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap.
Our approach integrates a noval structure-aware module into pLMs to inform them with structural knowledge, and then connects these enhanced pLMs to large language models (LLMs) to generate understanding of proteins.
We construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate the general-purpose protein understanding model.
arXiv Detail & Related papers (2024-10-04T16:02:50Z) - ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z) - ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
arXiv Detail & Related papers (2024-02-28T01:29:55Z) - Endowing Protein Language Models with Structural Knowledge [5.587293092389789]
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
arXiv Detail & Related papers (2024-01-26T12:47:54Z) - xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [74.64101864289572]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z) - Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers [18.498779242323582]
We propose a novel approach, Prot2Text, which predicts a protein's function in a free text style.
By combining Graph Neural Networks(GNNs) and Large Language Models(LLMs), in an encoder-decoder framework, our model effectively integrates diverse data types.
arXiv Detail & Related papers (2023-07-25T09:35:43Z) - Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs)
We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows it with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.