InstructProtein: Aligning Human and Protein Language via Knowledge Instruction
- URL: http://arxiv.org/abs/2310.03269v1
- Date: Thu, 5 Oct 2023 02:45:39 GMT
- Title: InstructProtein: Aligning Human and Protein Language via Knowledge Instruction
- Authors: Zeyuan Wang, Qiang Zhang, Keyan Ding, Ming Qin, Xiang Zhuang, Xiaotong Li, Huajun Chen
- Abstract summary: Large Language Models (LLMs) have revolutionized the field of natural language processing, but they fall short in comprehending biological sequences such as proteins.
We propose InstructProtein, which possesses bidirectional generation capabilities in both human and protein languages.
InstructProtein serves as a pioneering step towards text-based protein function prediction and sequence design.
- Score: 38.46621806898224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have revolutionized the field of natural
language processing, but they fall short in comprehending biological sequences
such as proteins. To address this challenge, we propose InstructProtein, an
innovative LLM that possesses bidirectional generation capabilities in both
human and protein languages: (i) taking a protein sequence as input to predict
its textual function description and (ii) using natural language to prompt
protein sequence generation. To achieve this, we first pre-train an LLM on both
protein and natural language corpora, enabling it to comprehend individual
languages. Then supervised instruction tuning is employed to facilitate the
alignment of these two distinct languages. Herein, we introduce a knowledge
graph-based instruction generation framework to construct a high-quality
instruction dataset, addressing annotation imbalance and instruction deficits
in existing protein-text corpora. In particular, the instructions inherit the
structural relations between proteins and function annotations in knowledge
graphs, which empowers our model to engage in the causal modeling of protein
functions, akin to the chain-of-thought processes in natural languages.
Extensive experiments on bidirectional protein-text generation tasks show that
InstructProtein outperforms state-of-the-art LLMs by large margins. Moreover,
InstructProtein serves as a pioneering step towards text-based protein function
prediction and sequence design, effectively bridging the gap between protein
and human language understanding.
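
As a rough illustration of the knowledge graph-based instruction generation described above, here is a minimal Python sketch (not the authors' code): each (protein, relation, annotation) triple is turned into one instruction per generation direction. The triple schema, prompt templates, and toy data are all assumptions for illustration.

```python
# A minimal sketch (not the authors' code) of knowledge graph-based
# instruction generation: each (protein, relation, annotation) triple
# yields one example per generation direction. Schema, templates, and
# the toy triple are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Triple:
    protein: str      # amino-acid sequence (toy length here)
    relation: str     # e.g. a Gene Ontology relation type
    annotation: str   # textual function annotation

TO_TEXT = ("Instruction: Describe the {rel} of this protein.\n"
           "Input: {seq}\nOutput: {ann}")
TO_PROTEIN = ("Instruction: Design a protein whose {rel} is {ann}.\n"
              "Output: {seq}")

def make_instructions(triples):
    """Emit one protein-to-text and one text-to-protein example per triple."""
    for t in triples:
        yield TO_TEXT.format(rel=t.relation, seq=t.protein, ann=t.annotation)
        yield TO_PROTEIN.format(rel=t.relation, seq=t.protein, ann=t.annotation)

kg = [Triple("MKTAYIAKQR", "molecular function", "ATP binding")]
for example in make_instructions(kg):
    print(example, end="\n\n")
```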
Related papers
- Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding [43.811432723460534]
We introduce Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap.
Our approach integrates a novel structure-aware module into pLMs to inform them with structural knowledge, and then connects these enhanced pLMs to large language models (LLMs) to generate understanding of proteins.
We construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate the general-purpose protein understanding model.
arXiv Detail & Related papers (2024-10-04T16:02:50Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
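
A hypothetical sketch of what a protein chain of thought for PPI could look like: known interactions along a path between two proteins are verbalized as intermediate reasoning steps before the model is asked about the target pair. The toy interaction graph, breadth-first path search, and prompt wording are assumptions, not ProLLM's implementation.

```python
# Hypothetical protein chain-of-thought prompt for PPI (not ProLLM's code):
# verbalize a path of known interactions, then ask about the endpoints.
from collections import deque

def find_path(edges, src, dst):
    """Breadth-first search over an undirected interaction graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def cot_prompt(edges, src, dst):
    path = find_path(edges, src, dst)
    if path is None:
        return f"Do {src} and {dst} interact?"  # no known chain to cite
    steps = [f"{a} interacts with {b}." for a, b in zip(path, path[1:])]
    return " ".join(steps) + f" Therefore, do {src} and {dst} interact?"

known = [("RAF1", "MAP2K1"), ("MAP2K1", "MAPK1")]  # toy pathway fragment
print(cot_prompt(known, "RAF1", "MAPK1"))
```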
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs in which natural language text is interleaved with proteins.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
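
The protein-as-word idea can be pictured as one flat token space shared by ordinary words and candidate proteins, so that emitting a protein reduces to next-token prediction over an extended vocabulary. The sketch below is a toy rendering under that assumption; the token names and candidate pool are invented.

```python
# Toy "protein-as-word" vocabulary (invented IDs, not ProtLLM's):
# words and candidate proteins share one flat token-ID space.
text_vocab = ["the", "kinase", "phosphorylates", "<protein>"]
protein_pool = ["P69905", "P68871", "P00533"]   # e.g. UniProt accessions

token_to_id = {tok: i for i, tok in enumerate(text_vocab)}
for j, acc in enumerate(protein_pool):
    token_to_id[f"<prot:{acc}>"] = len(text_vocab) + j
id_to_token = {i: t for t, i in token_to_id.items()}

# Predicting "a protein from a vast pool of candidates" is then just
# next-token prediction over the extended ID space.
predicted_id = token_to_id["<prot:P00533>"]
print(predicted_id, "->", id_to_token[predicted_id])
```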
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
- Endowing Protein Language Models with Structural Knowledge [5.587293092389789]
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
arXiv Detail & Related papers (2024-01-26T12:47:54Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
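
As context for the unified objective, here is a toy sketch in the style of the GLM backbone that xTrimoPGLM builds on: masking an internal span yields an understanding-style infilling target, while masking the suffix yields a generation-style target. The mask tokens and split points are illustrative, not the model's actual preprocessing.

```python
# Toy GLM-style preprocessing (illustrative assumption): one sequence
# yields both an infilling example and a free-generation example.
def unified_examples(seq, span=(3, 6)):
    i, j = span
    infill = (seq[:i] + "[MASK]" + seq[j:], seq[i:j])   # understanding-style
    generate = (seq[:i] + "[gMASK]", seq[i:])           # generation-style
    return infill, generate

for inp, target in unified_examples("MKTAYIAKQR"):
    print(f"input={inp!r}  target={target!r}")
```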
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs, implanting a lightweight structural adapter that endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
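
A minimal PyTorch sketch of the adapter idea: a small bottleneck module conditioned on per-residue structure features is added on top of a (nominally frozen) pLM layer's hidden states. The dimensions and fusion rule are assumptions; LM-Design's actual adapter differs in detail.

```python
# A minimal sketch, not LM-Design's code: a lightweight structural
# adapter fuses structure features into frozen pLM hidden states.
import torch
import torch.nn as nn

class StructuralAdapter(nn.Module):
    def __init__(self, d_model=320, d_struct=16, d_bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model + d_struct, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, h, s):
        # h: per-residue pLM states (B, L, d_model)
        # s: per-residue structure features (B, L, d_struct)
        return h + self.up(torch.relu(self.down(torch.cat([h, s], dim=-1))))

adapter = StructuralAdapter()
h = torch.randn(1, 10, 320)   # hidden states from a frozen pLM layer
s = torch.randn(1, 10, 16)    # e.g. backbone geometry features
print(adapter(h, s).shape)    # torch.Size([1, 10, 320])
```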
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Linguistically inspired roadmap for building biologically reliable protein language models [0.5412332666265471]
We argue that guidance drawn from linguistics can aid with building more interpretable protein LMs.
We provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation.
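
One roadmap choice, tokenization, is simple to illustrate: the same sequence can be tokenized per residue or as overlapping k-mers, with different consequences for vocabulary size and embeddings. The snippet shows both options; the paper surveys such trade-offs rather than prescribing code.

```python
# Two standard tokenization options for the same toy sequence.
seq = "MKTAYIAKQR"

residue_tokens = list(seq)                                 # character-level
kmer_tokens = [seq[i:i + 3] for i in range(len(seq) - 2)]  # overlapping 3-mers

print(residue_tokens)  # ['M', 'K', 'T', 'A', 'Y', 'I', 'A', 'K', 'Q', 'R']
print(kmer_tokens)     # ['MKT', 'KTA', 'TAY', 'AYI', 'YIA', 'IAK', 'AKQ', 'KQR']
```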
arXiv Detail & Related papers (2022-07-03T08:42:44Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
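
A toy sketch of the structural side of such pretraining: residues whose C-alpha atoms fall within a distance cutoff become edges of a protein graph that a GNN can then encode. The coordinates and cutoff here are illustrative; the paper's GNN and its pseudo bi-level optimization with the sequence model are substantially more involved.

```python
# Toy construction of a residue contact graph (illustrative only):
# C-alpha pairs closer than a cutoff become edges for a GNN to encode.
import math

def contact_edges(coords, cutoff=8.0):
    """Return residue index pairs closer than `cutoff` angstroms."""
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                edges.append((i, j))
    return edges

ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.5, 0.0), (20.0, 0.0, 0.0)]
print(contact_edges(ca))   # [(0, 1), (0, 2), (1, 2)]
```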