Endowing Protein Language Models with Structural Knowledge
- URL: http://arxiv.org/abs/2401.14819v1
- Date: Fri, 26 Jan 2024 12:47:54 GMT
- Title: Endowing Protein Language Models with Structural Knowledge
- Authors: Dexiong Chen, Philip Hartout, Paolo Pellizzoni, Carlos Oliver, Karsten
Borgwardt
- Abstract summary: We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
- Score: 5.587293092389789
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the relationships between protein sequence, structure and
function is a long-standing biological challenge with manifold implications
from drug design to our understanding of evolution. Recently, protein language
models have emerged as the preferred method for this challenge, thanks to their
ability to harness large sequence databases. Yet, their reliance on expansive
sequence data and parameter sets limits their flexibility and practicality in
real-world scenarios. Concurrently, the recent surge in computationally
predicted protein structures unlocks new opportunities in protein
representation learning. While promising, the computational burden carried by
such complex data still hinders widely-adopted practical applications. To
address these limitations, we introduce a novel framework that enhances protein
language models by integrating protein structural data. Drawing from recent
advances in graph transformers, our approach refines the self-attention
mechanisms of pretrained language transformers by integrating structural
information with structure extractor modules. This refined model, termed
Protein Structure Transformer (PST), is further pretrained on a small protein
structure database, using the same masked language modeling objective as
traditional protein language models. Empirical evaluations of PST demonstrate
its superior parameter efficiency relative to protein language models, despite
being pretrained on a dataset comprising only 542K structures. Notably, PST
consistently outperforms the state-of-the-art foundation model for protein
sequences, ESM-2, setting a new benchmark in protein function prediction. Our
findings underscore the potential of integrating structural information into
protein language models, paving the way for more effective and efficient
protein modeling. Code and pretrained models are available at
https://github.com/BorgwardtLab/PST.
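To make the architectural idea concrete, below is a minimal PyTorch sketch of structure-conditioned self-attention in the spirit of the abstract: a small structure extractor summarizes a residue contact graph and its output is injected into the query/key inputs of a pretrained transformer's attention layer. All names here (StructureExtractor, StructureBiasedAttention, the random contact-map stand-in) are illustrative assumptions, not the authors' actual PST implementation; refer to the linked repository for the real code.

```python
# Hypothetical sketch of structure-conditioned self-attention (not the official PST code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class StructureExtractor(nn.Module):
    """Tiny message-passing module that summarizes a residue contact graph."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (L, d) residue embeddings; adj: (L, L) contact-map adjacency.
        neighbor_sum = adj @ node_feats            # aggregate features of contacting residues
        return F.relu(self.proj(neighbor_sum))     # per-residue structural summary


class StructureBiasedAttention(nn.Module):
    """Self-attention whose queries/keys are augmented with structural summaries."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.extractor = StructureExtractor(dim)

    def forward(self, seq_emb: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # seq_emb: (1, L, d) token embeddings from a pretrained protein language model.
        struct = self.extractor(seq_emb.squeeze(0), adj).unsqueeze(0)
        qk = seq_emb + struct                      # inject structural information into Q and K
        out, _ = self.attn(qk, qk, seq_emb)
        return out


if __name__ == "__main__":
    L, d = 16, 32                                  # toy protein length and embedding width
    seq_emb = torch.randn(1, L, d)                 # stand-in for pretrained LM embeddings
    adj = (torch.rand(L, L) < 0.2).float()         # random contact map, for illustration only
    layer = StructureBiasedAttention(d)
    print(layer(seq_emb, adj).shape)               # torch.Size([1, 16, 32])
```

In this reading, the structure extractor plays the role of an add-on module to an otherwise frozen or lightly fine-tuned sequence transformer, which is consistent with the abstract's emphasis on parameter efficiency and continued masked language modeling pretraining.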
Related papers
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- CPE-Pro: A Structure-Sensitive Deep Learning Method for Protein Representation and Origin Evaluation [7.161099050722313]
We develop a structure-sensitive supervised deep learning model, Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro).
CPE-Pro learns the structural information of proteins and captures inter-structural differences to achieve accurate traceability on four data classes.
We utilize Foldseek to encode protein structures into "structure-sequences" and train a protein Structural Sequence Language Model (SSLM).
arXiv Detail & Related papers (2024-10-21T02:21:56Z)
- Structure-Informed Protein Language Model [38.019425619750265]
We introduce the integration of remote homology detection to distill structural information into protein language models.
We evaluate the impact of this structure-informed training on downstream protein function prediction tasks.
arXiv Detail & Related papers (2024-02-07T09:32:35Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- CCPL: Cross-modal Contrastive Protein Learning [47.095862120116976]
We introduce a novel unsupervised protein structure representation pretraining method, cross-modal contrastive protein learning (CCPL).
CCPL leverages a robust protein language model and uses unsupervised contrastive alignment to enhance structure learning.
We evaluated our model across various benchmarks, demonstrating the framework's superiority.
arXiv Detail & Related papers (2023-03-19T08:19:10Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs: a lightweight structural adapter is implanted into them, endowing them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- PSP: Million-level Protein Sequence Dataset for Protein Structure Prediction [34.11168458572554]
We present PSP, the first million-level protein structure prediction dataset with high coverage and diversity.
The dataset consists of 570k true structure sequences (10TB) and 745k complementary distillation sequences (15TB).
We provide in addition the benchmark training procedure for SOTA protein structure prediction model on this dataset.
arXiv Detail & Related papers (2022-06-24T14:08:44Z)
- Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.