Lightweight Contrastive Protein Structure-Sequence Transformation
- URL: http://arxiv.org/abs/2303.11783v1
- Date: Sun, 19 Mar 2023 08:19:10 GMT
- Title: Lightweight Contrastive Protein Structure-Sequence Transformation
- Authors: Jiangbin Zheng, Ge Wang, Yufei Huang, Bozhen Hu, Siyuan Li, Cheng Tan,
Xinwen Fan, Stan Z. Li
- Abstract summary: We introduce a novel unsupervised protein structure representation pretraining framework guided by a robust protein language model.
In particular, we first propose to leverage an existing pretrained language model to guide structure model learning.
With only a small amount of training data, the pretrained structure model obtains better generalization ability.
- Score: 40.983513907321615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Protein structure models pretrained without labels are crucial foundations
for the majority of protein downstream applications. Conventional structure
pretraining methods follow mature natural language pretraining paradigms such as
denoising reconstruction and masked language modeling, but these usually destroy
the faithful representation of spatial structures. Other common pretraining methods
predict a fixed set of predetermined object categories; this restricted supervised
formulation limits their generality and usability, since additional labeled data is
required to specify any other protein concept. In this work, we introduce a novel
unsupervised protein structure representation pretraining framework guided by a
robust protein language model. In particular, we first propose to leverage an
existing pretrained language model to guide structure model learning through an
unsupervised contrastive alignment. In addition, a self-supervised structure
constraint is proposed to further learn the intrinsic information of the structures.
With only a small amount of training data, the pretrained structure model obtains
better generalization ability. To quantitatively evaluate the proposed structure
models, we design a series of rational evaluation methods, including internal tasks
(e.g., contact map prediction and distribution alignment quality) and
external/downstream tasks (e.g., protein design). Extensive experimental results on
multiple tasks and specific datasets demonstrate the superiority of the proposed
sequence-structure transformation framework.
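The core mechanism, aligning a trainable structure encoder to a frozen pretrained protein language model through an unsupervised contrastive objective, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, embedding dimensions, and temperature below are illustrative assumptions, and only a symmetric InfoNCE-style loss over matched structure/sequence pairs within a batch is shown.

# Minimal sketch (illustrative, not the paper's code): symmetric InfoNCE-style
# contrastive alignment between structure and sequence embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAligner(nn.Module):
    def __init__(self, struct_dim=128, seq_dim=1280, proj_dim=256, temperature=0.07):
        super().__init__()
        # Project both modalities into a shared embedding space (dimensions assumed).
        self.struct_proj = nn.Linear(struct_dim, proj_dim)
        self.seq_proj = nn.Linear(seq_dim, proj_dim)
        self.temperature = temperature

    def forward(self, struct_emb, seq_emb):
        # struct_emb: [B, struct_dim] embeddings from the trainable structure encoder
        # seq_emb:    [B, seq_dim]    embeddings from the (ideally frozen) language model
        z_s = F.normalize(self.struct_proj(struct_emb), dim=-1)
        z_q = F.normalize(self.seq_proj(seq_emb), dim=-1)
        logits = z_s @ z_q.t() / self.temperature  # [B, B] pairwise similarities
        targets = torch.arange(z_s.size(0), device=z_s.device)
        # Matched structure/sequence pairs are positives; all other pairs in the
        # batch act as negatives. The loss is symmetrized over both directions.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for encoder outputs.
aligner = ContrastiveAligner()
loss = aligner(torch.randn(8, 128), torch.randn(8, 1280))
loss.backward()

In the actual framework, the language-model branch would typically be kept frozen so that gradients update only the structure encoder and projection heads; the paper's additional self-supervised structure constraint is not sketched here.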
Related papers
- Protein Representation Learning with Sequence Information Embedding: Does it Always Lead to a Better Performance? [4.7077642423577775]
We propose ProtLOCA, a local geometry alignment method based solely on amino acid structure representation.
Our method outperforms existing sequence- and structure-based representation learning methods by more quickly and accurately matching structurally consistent protein domains.
arXiv Detail & Related papers (2024-06-28T08:54:37Z)
- Endowing Protein Language Models with Structural Knowledge [5.587293092389789]
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
arXiv Detail & Related papers (2024-01-26T12:47:54Z)
- A Systematic Study of Joint Representation Learning on Protein Sequences and Structures [38.94729758958265]
Learning effective protein representations is critical for a variety of tasks in biology, such as predicting protein function.
Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge.
Our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM with distinct structure encoders.
arXiv Detail & Related papers (2023-03-11T01:24:10Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs, implanting a lightweight structural adapter that endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve data efficiency, by up to $10^5$ times, over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs.
Our approach achieves the new state-of-the-art on all the structured prediction tasks we looked at.
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
- Contrastive Representation Learning for 3D Protein Structures [13.581113136149469]
We introduce a new representation learning framework for 3D protein structures.
Our framework uses unsupervised contrastive learning to learn meaningful representations of protein structures.
We show how these representations can be used to solve a large variety of tasks, such as protein function prediction, protein fold classification, structural similarity prediction, and protein-ligand binding affinity prediction.
arXiv Detail & Related papers (2022-05-31T10:33:06Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Transfer Learning for Protein Structure Classification at Low Resolution [124.5573289131546]
We show that it is possible to make accurate ($\geq$80%) predictions of protein class and architecture from structures determined at low ($\leq$3Å) resolution.
We provide proof of concept for high-speed, low-cost protein structure classification at low resolution, and a basis for extension to prediction of function.
arXiv Detail & Related papers (2020-08-11T15:01:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.