ProtST: Multi-Modality Learning of Protein Sequences and Biomedical
Texts
- URL: http://arxiv.org/abs/2301.12040v2
- Date: Wed, 5 Jul 2023 03:17:48 GMT
- Title: ProtST: Multi-Modality Learning of Protein Sequences and Biomedical
Texts
- Authors: Minghao Xu, Xinyu Yuan, Santiago Miret, Jian Tang
- Abstract summary: We build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties.
During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction.
On downstream tasks, ProtST enables both supervised learning and zero-shot prediction.
- Score: 22.870765825298268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current protein language models (PLMs) learn protein representations mainly
based on their sequences, thereby capturing co-evolutionary information well,
but they are unable to explicitly acquire protein functions, which is the end
goal of protein representation learning. Fortunately, for many proteins, their
textual property descriptions are available, where their various functions are
also described. Motivated by this fact, we first build the ProtDescribe dataset
to augment protein sequences with text descriptions of their functions and
other important properties. Based on this dataset, we propose the ProtST
framework to enhance Protein Sequence pre-training and understanding by
biomedical Texts. During pre-training, we design three types of tasks, i.e.,
unimodal mask prediction, multimodal representation alignment and multimodal
mask prediction, to enhance a PLM with protein property information with
different granularities and, at the same time, preserve the PLM's original
representation power. On downstream tasks, ProtST enables both supervised
learning and zero-shot prediction. We verify the superiority of ProtST-induced
PLMs over previous ones on diverse representation learning benchmarks. Under
the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein
classification, and ProtST also enables functional protein retrieval from a
large-scale database without any function annotation.
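As a concrete illustration of the multimodal representation alignment objective and the zero-shot prediction it enables, the sketch below shows one plausible PyTorch formulation. It is a minimal sketch under assumptions, not the authors' released implementation: the encoder outputs, the temperature value, and all function names are illustrative, and ProtST's full pre-training also includes the unimodal and multimodal mask prediction objectives not shown here.

```python
import torch
import torch.nn.functional as F

def alignment_loss(seq_emb: torch.Tensor, text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over a batch of paired (protein, text) embeddings.

    seq_emb, text_emb: [batch, dim] projections of a protein encoder and a
    biomedical-text encoder into a shared space (hypothetical modules here).
    """
    seq_emb = F.normalize(seq_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = seq_emb @ text_emb.t() / temperature           # [batch, batch] similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2t = F.cross_entropy(logits, targets)             # protein -> matching text
    loss_t2s = F.cross_entropy(logits.t(), targets)         # text -> matching protein
    return 0.5 * (loss_s2t + loss_t2s)

@torch.no_grad()
def zero_shot_classify(seq_emb: torch.Tensor, label_text_emb: torch.Tensor) -> torch.Tensor:
    """Assign each protein to the class whose text description is most similar."""
    seq_emb = F.normalize(seq_emb, dim=-1)
    label_text_emb = F.normalize(label_text_emb, dim=-1)
    return (seq_emb @ label_text_emb.t()).argmax(dim=-1)    # [num_proteins] class indices
```

Under this view, zero-shot protein classification and annotation-free functional retrieval both reduce to nearest-neighbor search between protein embeddings and text embeddings in the shared space.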
Related papers
- Computational Protein Science in the Era of Large Language Models (LLMs) [54.35488233989787]
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm.
Recently, Language Models (LMs) have emerged as a milestone in AI due to their unprecedented language processing & generalization capability.
arXiv Detail & Related papers (2025-01-17T16:21:18Z)
- EvoLlama: Enhancing LLMs' Understanding of Proteins via Multimodal Structure and Sequence Representations [28.298740080002077]
Current Large Language Models (LLMs) for understanding proteins primarily treat amino acid sequences as a text modality.
EvoLlama is a framework that connects a structure-based encoder, a sequence-based protein encoder and an LLM for protein understanding.
Our experiments show that EvoLlama's protein understanding capabilities have been significantly enhanced.
arXiv Detail & Related papers (2024-12-16T10:01:33Z)
- Multi-modal Representation Learning Enables Accurate Protein Function Prediction in Low-Data Setting [0.0]
HOPER (HOlistic ProtEin Representation) is a novel framework designed to enhance protein function prediction (PFP) in low-data settings.
Our results highlight the effectiveness of multimodal representation learning for overcoming data limitations in biological research.
arXiv Detail & Related papers (2024-11-22T20:13:55Z)
- A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding [10.652670673334486]
ProteinLMBench is the first benchmark dataset consisting of 944 manually verified multiple-choice questions for assessing the protein understanding capabilities of LLMs.
ProteinLMDataset is a dataset specifically designed for further self-supervised pretraining and supervised fine-tuning.
InternLM2-7B, pretrained and fine-tuned on the ProteinLMDataset, outperforms GPT-4 on ProteinLMBench, achieving the highest accuracy score.
arXiv Detail & Related papers (2024-06-08T18:11:30Z)
- ProtT3: Protein-to-Text Generation for Text-based Protein Understanding [88.43323947543996]
Language Models (LMs) excel in understanding textual descriptions of proteins.
Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts.
We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
arXiv Detail & Related papers (2024-05-21T08:06:13Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [74.64101864289572]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Retrieved Sequence Augmentation for Protein Representation Learning [40.13920287967866]
We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction.
Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
arXiv Detail & Related papers (2023-02-24T10:31:45Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is to capture the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method can effectively capture inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
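To make the pairwise masking idea more concrete, here is a small illustrative sketch (an assumed formulation, not the paper's code): the hidden states of two masked positions are combined to predict the joint identity of the residue pair as one of 20 × 20 classes, which is what would let a model learn inter-residue co-variation directly rather than only per-position distributions.

```python
import torch
import torch.nn as nn

NUM_AA = 20  # standard amino acids

class PairwiseMaskedHead(nn.Module):
    """Illustrative head for pairwise masked prediction (a sketch of the PMLM idea,
    not the paper's implementation): predict the joint amino-acid identity of two
    masked positions as one of NUM_AA * NUM_AA classes."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * hidden_dim, NUM_AA * NUM_AA)

    def forward(self, h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
        # h_i, h_j: [num_pairs, hidden_dim] encoder states at the two masked positions
        return self.proj(torch.cat([h_i, h_j], dim=-1))      # [num_pairs, NUM_AA * NUM_AA]

def pmlm_loss(logits: torch.Tensor, aa_i: torch.Tensor, aa_j: torch.Tensor) -> torch.Tensor:
    # Joint target index for the residue pair (aa_i, aa_j), each entry in [0, NUM_AA).
    target = aa_i * NUM_AA + aa_j
    return nn.functional.cross_entropy(logits, target)
```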