Structure-informed Language Models Are Protein Designers
- URL: http://arxiv.org/abs/2302.01649v1
- Date: Fri, 3 Feb 2023 10:49:52 GMT
- Title: Structure-informed Language Models Are Protein Designers
- Authors: Zaixiang Zheng, Yifan Deng, Dongyu Xue, Yi Zhou, Fei YE, and Quanquan
Gu
- Abstract summary: We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into the pLM and endows it with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
- Score: 69.70134899296912
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper demonstrates that language models are strong structure-based
protein designers. We present LM-Design, a generic approach to reprogramming
sequence-based protein language models (pLMs), which have learned massive
sequential evolutionary knowledge from the universe of natural protein
sequences, to acquire an immediate capability to design preferable protein
sequences for given folds. We conduct a structural surgery on pLMs, where a
lightweight structural adapter is implanted into the pLM and endows it with
structural awareness. During inference, iterative refinement is performed to
effectively optimize the generated protein sequences. Experiments show that our
approach outperforms the state-of-the-art methods by a large margin, leading to
up to 4% to 12% accuracy gains in sequence recovery (e.g., 55.65% and 56.63% on
CATH 4.2 and 4.3 single-chain benchmarks, and >60% when designing protein
complexes). We provide extensive and in-depth analyses, which verify that
LM-Design can (1) indeed leverage both structural and sequential knowledge to
accurately handle structurally non-deterministic regions, (2) benefit from
scaling data and model size, and (3) generalize to other proteins (e.g.,
antibodies and de novo proteins).
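To make the recipe concrete, here is a minimal sketch in PyTorch of the general idea: a frozen pLM produces per-residue hidden states, a small structure-conditioned adapter (the only trained component) injects structural features via cross-attention, and inference repeatedly re-masks low-confidence positions. The module shapes, the cross-attention adapter layout, and the re-masking schedule shown here are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (an assumption, not the authors' code) of the LM-Design idea:
# a frozen sequence pLM gains a lightweight structural adapter, and decoding
# iteratively refines the predicted sequence for a fixed backbone.
import torch
import torch.nn as nn

class StructuralAdapter(nn.Module):
    """Cross-attends sequence hidden states to per-residue structure features."""
    def __init__(self, d_model=512, d_struct=128, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(d_struct, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, seq_states, struct_feats):
        s = self.proj(struct_feats)                  # [B, L, d_model]
        h, _ = self.attn(seq_states, s, s)           # sequence queries structure
        x = self.norm1(seq_states + h)
        return self.norm2(x + self.ffn(x))

class LMDesignSketch(nn.Module):
    def __init__(self, plm, struct_encoder, d_model=512, vocab=33):
        super().__init__()
        # plm: any module mapping token ids [B, L] to hidden states [B, L, d_model]
        # (e.g. a pretrained pLM trunk); struct_encoder: maps backbone coordinates
        # [B, L, 3, 3] to per-residue features [B, L, d_struct].
        self.plm = plm
        for p in self.plm.parameters():              # keep the pLM frozen
            p.requires_grad = False
        self.struct_encoder = struct_encoder
        self.adapter = StructuralAdapter(d_model)    # the only trained part
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens, coords):
        seq_states = self.plm(tokens)
        struct_feats = self.struct_encoder(coords)
        return self.head(self.adapter(seq_states, struct_feats))

@torch.no_grad()
def iterative_refine(model, tokens, coords, mask_id, n_iters=5, remask_ratio=0.5):
    """Predict all positions, then re-mask the least confident ones and repeat."""
    for _ in range(n_iters):
        probs = model(tokens, coords).softmax(-1)
        conf, pred = probs.max(-1)
        tokens = pred
        k = int(remask_ratio * tokens.shape[1])
        if k == 0:
            break
        _, worst = conf.topk(k, dim=1, largest=False)   # lowest-confidence positions
        tokens = tokens.scatter(1, worst, mask_id)
        remask_ratio *= 0.5                              # anneal the re-masking rate
    return model(tokens, coords).argmax(-1)

# Usage sketch with stand-in encoders (a real run would plug in a pretrained pLM):
plm = nn.Sequential(nn.Embedding(33, 512), nn.TransformerEncoder(
    nn.TransformerEncoderLayer(512, nhead=8, batch_first=True), num_layers=2))
struct_encoder = nn.Sequential(nn.Flatten(-2), nn.Linear(9, 128))  # [B,L,3,3] -> [B,L,128]
model = LMDesignSketch(plm, struct_encoder)
tokens = torch.full((1, 64), 32)                                   # start from all-mask
designed = iterative_refine(model, tokens, torch.randn(1, 64, 3, 3), mask_id=32)
```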
Related papers
- Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM, based on an alternative protein LM architecture, BiMamba-S, built on selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z)
- CPE-Pro: A Structure-Sensitive Deep Learning Method for Protein Representation and Origin Evaluation [7.161099050722313]
We develop a structure-sensitive supervised deep learning model, Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro).
CPE-Pro learns the structural information of proteins and captures inter-structural differences to achieve accurate traceability on four data classes.
We utilize Foldseek to encode protein structures into "structure-sequences" and train a protein Structural Sequence Language Model (SSLM).
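As a rough illustration of the structure-sequence idea (an assumed recipe, not CPE-Pro's released code), each residue's Foldseek 3Di state can be mapped to an integer token and a small Transformer encoder trained on the resulting sequences:

```python
# Hypothetical sketch: classify proteins from Foldseek "structure-sequences".
import torch
import torch.nn as nn

class StructureSequenceClassifier(nn.Module):
    def __init__(self, n_struct_tokens=21, d_model=256, n_layers=4, n_classes=4):
        super().__init__()
        self.embed = nn.Embedding(n_struct_tokens, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)  # e.g. crystal vs. predicted sources

    def forward(self, struct_tokens, pad_mask=None):
        h = self.encoder(self.embed(struct_tokens), src_key_padding_mask=pad_mask)
        return self.classifier(h.mean(dim=1))            # mean-pool over residues

# Usage: tokens would come from running Foldseek on a structure file and mapping
# its 3Di alphabet to integer ids (id 0 reserved for padding here).
model = StructureSequenceClassifier()
logits = model(torch.randint(1, 21, (2, 128)))           # two proteins, 128 residues each
```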
arXiv Detail & Related papers (2024-10-21T02:21:56Z)
- Learning the Language of Protein Structure [8.364087723533537]
We introduce an approach using a vector-quantized autoencoder that effectively tokenizes protein structures into discrete representations.
To demonstrate the efficacy of our learned representations, we show that a simple GPT model trained on our codebooks can generate novel, diverse, and designable protein structures.
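The key step, turning continuous per-residue structure embeddings into discrete codebook tokens, can be sketched as below; the codebook size and loss weighting are hypothetical choices rather than the paper's reported configuration.

```python
# Minimal sketch (hypothetical, not the paper's code) of the vector-quantization
# step that turns continuous structure embeddings into discrete tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=1024, d_code=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, d_code)
        self.beta = beta

    def forward(self, z):                                 # z: [B, L, d_code] encoder output
        # squared distance from each embedding to every code vector
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        idx = d.argmin(-1)                                # discrete "structure tokens" [B, L]
        z_q = self.codebook(idx)
        # codebook + commitment losses, straight-through gradient to the encoder
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss

vq = VectorQuantizer()
z = torch.randn(2, 100, 64)           # e.g. per-residue embeddings from a structure encoder
z_q, tokens, vq_loss = vq(z)          # tokens: [2, 100] ids, ready for a GPT-style model
```

The resulting token ids can then be modeled autoregressively and decoded back to 3D coordinates by the autoencoder's decoder.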
arXiv Detail & Related papers (2024-05-24T16:03:47Z)
- Endowing Protein Language Models with Structural Knowledge [5.587293092389789]
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
arXiv Detail & Related papers (2024-01-26T12:47:54Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Functional Geometry Guided Protein Sequence and Backbone Structure Co-Design [12.585697288315846]
We propose a model to jointly design protein sequence and structure based on automatically detected functional sites.
NAEPro is powered by an interleaving network of attention and equivariant layers, which can capture global correlation in a whole sequence.
Experimental results show that our model consistently achieves the highest amino acid recovery rate, TM-score, and the lowest RMSD among all competitors.
arXiv Detail & Related papers (2023-10-06T16:08:41Z)
- CCPL: Cross-modal Contrastive Protein Learning [47.095862120116976]
We introduce a novel unsupervised protein structure representation pretraining method, cross-modal contrastive protein learning (CCPL).
CCPL leverages a robust protein language model and uses unsupervised contrastive alignment to enhance structure learning.
We evaluated our model across various benchmarks, demonstrating the framework's superiority.
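The unsupervised contrastive alignment can be illustrated with a standard symmetric InfoNCE objective over paired structure and sequence embeddings; this is an assumed formulation for illustration, not CCPL's exact loss.

```python
# Hedged sketch (assumed form, not CCPL's released code) of cross-modal contrastive
# alignment: pooled structure and sequence embeddings of the same protein are pulled
# together, while other proteins in the batch act as negatives (InfoNCE).
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(struct_emb, seq_emb, temperature=0.07):
    """struct_emb, seq_emb: [B, D] per-protein embeddings from the two encoders."""
    s = F.normalize(struct_emb, dim=-1)
    q = F.normalize(seq_emb, dim=-1)
    logits = s @ q.t() / temperature                 # [B, B] similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    # symmetric loss: structure-to-sequence and sequence-to-structure directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```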
arXiv Detail & Related papers (2023-03-19T08:19:10Z)
- State-specific protein-ligand complex structure prediction with a multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures.
Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z)
- Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.