ProtST: Multi-Modality Learning of Protein Sequences and Biomedical
Texts
- URL: http://arxiv.org/abs/2301.12040v2
- Date: Wed, 5 Jul 2023 03:17:48 GMT
- Title: ProtST: Multi-Modality Learning of Protein Sequences and Biomedical
Texts
- Authors: Minghao Xu, Xinyu Yuan, Santiago Miret, Jian Tang
- Abstract summary: We build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties.
During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction.
On downstream tasks, ProtST enables both supervised learning and zero-shot prediction.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current protein language models (PLMs) learn protein representations mainly
based on their sequences, thereby well capturing co-evolutionary information,
but they are unable to explicitly acquire protein functions, which is the end
goal of protein representation learning. Fortunately, for many proteins, their
textual property descriptions are available, where their various functions are
also described. Motivated by this fact, we first build the ProtDescribe dataset
to augment protein sequences with text descriptions of their functions and
other important properties. Based on this dataset, we propose the ProtST
framework to enhance Protein Sequence pre-training and understanding by
biomedical Texts. During pre-training, we design three types of tasks, i.e.,
unimodal mask prediction, multimodal representation alignment and multimodal
mask prediction, to enhance a PLM with protein property information with
different granularities and, at the same time, preserve the PLM's original
representation power. On downstream tasks, ProtST enables both supervised
learning and zero-shot prediction. We verify the superiority of ProtST-induced
PLMs over previous ones on diverse representation learning benchmarks. Under
the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein
classification, and ProtST also enables functional protein retrieval from a
large-scale database without any function annotation.
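The zero-shot classification setting mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a protein encoder and a text encoder already map their inputs into a shared embedding space, and it scores a protein against a text description of each candidate label by cosine similarity. The function names, the temperature value, and the toy embeddings below are hypothetical and are not taken from the paper or its code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(protein_emb, label_embs, temperature=0.1):
    """Pick the label whose text embedding is closest to the protein embedding.

    protein_emb: embedding of one protein (assumed output of a protein encoder).
    label_embs:  dict mapping label name -> embedding of its text description
                 (assumed output of a text encoder in the same space).
    temperature: hypothetical scaling factor applied before the softmax.
    """
    names = list(label_embs)
    logits = [cosine(protein_emb, label_embs[n]) / temperature for n in names]
    # Numerically stable softmax over the label logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {n: e / total for n, e in zip(names, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy, made-up embeddings used only to exercise the function.
label_embs = {
    "nucleus": [0.9, 0.1, 0.0],
    "membrane": [0.1, 0.9, 0.2],
}
protein_emb = [0.8, 0.2, 0.1]

best, probs = zero_shot_classify(protein_emb, label_embs)
print(best, probs)
```

No labeled examples of the target classes are needed here: each class is represented only by the embedding of its text description, which is the sense in which the prediction is zero-shot.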