ProtST: Multi-Modality Learning of Protein Sequences and Biomedical
Texts
- URL: http://arxiv.org/abs/2301.12040v2
- Date: Wed, 5 Jul 2023 03:17:48 GMT
- Title: ProtST: Multi-Modality Learning of Protein Sequences and Biomedical
Texts
- Authors: Minghao Xu, Xinyu Yuan, Santiago Miret, Jian Tang
- Abstract summary: We build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties.
During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction.
On downstream tasks, ProtST enables both supervised learning and zero-shot prediction.
- Score: 22.870765825298268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current protein language models (PLMs) learn protein representations mainly
based on their sequences, thereby capturing co-evolutionary information well,
but they are unable to explicitly acquire protein functions, which is the end
goal of protein representation learning. Fortunately, for many proteins, their
textual property descriptions are available, where their various functions are
also described. Motivated by this fact, we first build the ProtDescribe dataset
to augment protein sequences with text descriptions of their functions and
other important properties. Based on this dataset, we propose the ProtST
framework to enhance Protein Sequence pre-training and understanding by
biomedical Texts. During pre-training, we design three types of tasks, i.e.,
unimodal mask prediction, multimodal representation alignment, and multimodal
mask prediction, to enhance a PLM with protein property information of
different granularities and, at the same time, preserve the PLM's original
representation power. On downstream tasks, ProtST enables both supervised
learning and zero-shot prediction. We verify the superiority of ProtST-induced
PLMs over previous ones on diverse representation learning benchmarks. Under
the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein
classification, and ProtST also enables functional protein retrieval from a
large-scale database without any function annotation.
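
As a rough illustration of the two multimodal pieces described above, the sketch below shows, in PyTorch-style Python, (i) a symmetric InfoNCE-style contrastive loss for aligning protein-sequence and text embeddings, which is one standard way to realize multimodal representation alignment, and (ii) zero-shot classification by scoring a protein embedding against embeddings of textual class descriptions. The function names, the encoders assumed to produce the embeddings, and the temperature value are illustrative assumptions rather than the actual ProtST implementation.

```python
# Minimal sketch (not the official ProtST code): an InfoNCE-style symmetric
# contrastive loss for aligning protein-sequence and text embeddings, plus
# zero-shot classification by matching a protein embedding against embeddings
# of textual class descriptions. Encoder choices, dimensions, and the
# temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F


def alignment_loss(protein_emb: torch.Tensor,
                   text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """protein_emb, text_emb: (batch, dim) embeddings of paired
    sequence/description examples; matched pairs share a row index."""
    p = F.normalize(protein_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = p @ t.T / temperature                 # (batch, batch) similarities
    targets = torch.arange(p.size(0), device=p.device)
    # Symmetric cross-entropy: each protein must pick out its own description,
    # and each description must pick out its own protein.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


@torch.no_grad()
def zero_shot_classify(protein_emb: torch.Tensor,
                       class_text_embs: torch.Tensor) -> int:
    """Return the index of the class description most similar to the protein.
    protein_emb: (dim,); class_text_embs: (num_classes, dim)."""
    p = F.normalize(protein_emb, dim=-1)
    c = F.normalize(class_text_embs, dim=-1)
    return int((c @ p).argmax())
```

Ranking a database of sequence embeddings against a single text-query embedding with the same similarity score is the analogous mechanism behind annotation-free functional protein retrieval.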
Related papers
- A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding [10.652670673334486]
ProteinLMBench is the first benchmark dataset consisting of 944 manually verified multiple-choice questions for assessing the protein understanding capabilities of LLMs.
ProteinLMDataset is a dataset specifically designed for further self-supervised pretraining and supervised fine-tuning.
InternLM2-7B, pretrained and fine-tuned on the ProteinLMDataset, outperforms GPT-4 on ProteinLMBench, achieving the highest accuracy score.
arXiv Detail & Related papers (2024-06-08T18:11:30Z)
- ProtT3: Protein-to-Text Generation for Text-based Protein Understanding [88.43323947543996]
Language Models (LMs) excel in understanding textual descriptions of proteins.
Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts.
We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
arXiv Detail & Related papers (2024-05-21T08:06:13Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose ProLLM, a novel framework that, for the first time, employs an LLM tailored for PPI prediction.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers [18.498779242323582]
We propose a novel approach, Prot2Text, which predicts a protein's function in a free text style.
By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, our model effectively integrates diverse data types.
arXiv Detail & Related papers (2023-07-25T09:35:43Z)
- Retrieved Sequence Augmentation for Protein Representation Learning [40.13920287967866]
We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction.
Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
arXiv Detail & Related papers (2023-02-24T10:31:45Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method can effectively capture inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
- Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models [11.483725773928382]
Recent deep-learning approaches to protein prediction have shown that pre-training on unlabeled data can yield useful representations for downstream tasks.
We introduce a new pre-training task: directly predicting protein profiles derived from multiple sequence alignments.
Our results suggest that protein sequence models may benefit from leveraging biologically-inspired inductive biases.
arXiv Detail & Related papers (2020-12-01T01:01:34Z)
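
As background for the profile-prediction pre-training task above: a protein profile is commonly the position-wise amino-acid frequency matrix of a multiple sequence alignment. The minimal sketch below shows one simple way such a target could be constructed; it is an illustrative assumption, not the paper's implementation, and it omits sequence weighting, pseudocounts, and explicit gap modeling.

```python
# Illustrative sketch only (names and details are assumptions, not the paper's
# code): build a position-wise amino-acid frequency profile from an aligned
# set of sequences.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def msa_profile(aligned_seqs):
    """aligned_seqs: equal-length strings over the amino-acid alphabet,
    with '-' marking alignment gaps. Returns one dict per column mapping
    each residue type to its relative frequency at that position."""
    length = len(aligned_seqs[0])
    profile = []
    for i in range(length):
        column = [seq[i] for seq in aligned_seqs if seq[i] in AMINO_ACIDS]
        counts = Counter(column)
        total = sum(counts.values()) or 1  # avoid division by zero on all-gap columns
        profile.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return profile
```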