FoldToken2: Learning compact, invariant and generative protein structure language
- URL: http://arxiv.org/abs/2407.00050v1
- Date: Tue, 11 Jun 2024 09:24:51 GMT
- Title: FoldToken2: Learning compact, invariant and generative protein structure language
- Authors: Zhangyang Gao, Cheng Tan, Stan Z. Li
- Abstract summary: We propose FoldToken2 to transform equivariant structures into discrete tokens while maintaining the recoverability of the original structures.
We evaluate FoldToken2 on the protein structure reconstruction task and show that it outperforms the earlier FoldToken1 by 20% in TMScore and 81% in RMSD.
We believe that FoldToken2 will inspire further improvement in protein structure representation learning, structure alignment, and structure generation tasks.
- Score: 48.1647245005672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The equivariant nature of 3D coordinates has posed long-term challenges in protein structure representation learning, alignment, and generation. Can we create a compact and invariant language that equivalently represents protein structures? Towards this goal, we propose FoldToken2 to transform equivariant structures into discrete tokens while maintaining the recoverability of the original structures. From FoldToken1 to FoldToken2, we improve three key components: (1) an invariant structure encoder, (2) a vector-quantized compressor, and (3) an equivariant structure decoder. We evaluate FoldToken2 on the protein structure reconstruction task and show that it outperforms the previous FoldToken1 by 20% in TMScore and 81% in RMSD. FoldToken2 is probably the first method that works well on quantizing both single-chain and multi-chain protein structures. We believe that FoldToken2 will inspire further improvements in protein structure representation learning, structure alignment, and structure generation tasks.
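The vector-quantized compressor is the component that turns continuous structure embeddings into the discrete tokens of the "structure language." Below is a minimal PyTorch sketch of such a quantizer (VQ-VAE-style nearest-codebook lookup with a straight-through gradient); the class name, dimensions, and loss weighting are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbor codebook lookup with a straight-through gradient.

    Illustrative sketch only: FoldToken2's actual quantizer may differ.
    """
    def __init__(self, num_codes: int = 1024, dim: int = 128, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z):                # z: (batch, length, dim) residue embeddings
        dists = torch.cdist(z, self.codebook.weight[None].expand(z.size(0), -1, -1))
        tokens = dists.argmin(dim=-1)    # (batch, length) discrete structure tokens
        z_q = self.codebook(tokens)      # quantized embeddings
        # codebook loss pulls codes to the encoder output; commitment loss does the reverse
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()     # straight-through estimator for backprop
        return z_q, tokens, loss

# usage: quantize 200 residue embeddings into a token sequence
vq = VectorQuantizer()
z_q, tokens, loss = vq(torch.randn(1, 200, 128))
print(tokens.shape)  # torch.Size([1, 200])
```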
Related papers
- A Protein Structure Prediction Approach Leveraging Transformer and CNN Integration [4.909112037834705]
This paper adopts a two-dimensional fusion deep neural network model, DstruCCN, which uses Convolutional Neural Networks (CNN) and a supervised Transformer protein language model for single-sequence protein structure prediction.
The training features of the two are combined to predict the protein binding-site matrix, and the three-dimensional structure is then reconstructed using energy minimization.
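As a rough illustration of the two-branch fusion described above, the sketch below combines per-residue CNN features with Transformer language-model embeddings to score residue pairs in a 2D matrix; all module names and shapes are assumptions, not DstruCCN's published architecture.

```python
import torch
import torch.nn as nn

class PairwiseFusionHead(nn.Module):
    """Fuses per-residue CNN and Transformer features into an L x L pair matrix.

    A schematic stand-in for a two-branch fusion model; the real one differs.
    """
    def __init__(self, cnn_dim=64, plm_dim=128, hidden=64):
        super().__init__()
        self.proj = nn.Linear(cnn_dim + plm_dim, hidden)
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, cnn_feats, plm_feats):  # (L, cnn_dim), (L, plm_dim)
        h = self.proj(torch.cat([cnn_feats, plm_feats], dim=-1))  # (L, hidden)
        L = h.size(0)
        # outer concatenation: one feature vector per residue pair (i, j)
        pair = torch.cat([h[:, None].expand(L, L, -1),
                          h[None, :].expand(L, L, -1)], dim=-1)
        return self.pair_mlp(pair).squeeze(-1)  # (L, L) pair scores

head = PairwiseFusionHead()
matrix = head(torch.randn(100, 64), torch.randn(100, 128))
print(matrix.shape)  # torch.Size([100, 100])
```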
arXiv Detail & Related papers (2024-02-29T12:24:20Z)
- FoldToken: Learning Protein Language via Vector Quantization and Beyond [56.19308144551836]
We introduce FoldTokenizer to represent protein sequence-structure as discrete symbols.
We refer to the learned symbols as FoldTokens, and the sequence of FoldTokens serves as a new protein language.
arXiv Detail & Related papers (2024-02-04T12:18:51Z)
- Promptly Predicting Structures: The Return of Inference [31.442123334313035]
We present a framework for constructing zero- and few-shot linguistic structure predictors.
Our results show that enforcing consistency not only constructs structurally valid outputs but also improves performance.
arXiv Detail & Related papers (2024-01-12T20:08:39Z)
- StructRe: Rewriting for Structured Shape Modeling [63.792684115318906]
We present StructRe, a structure rewriting system, as a novel approach to structured shape modeling.
Given a 3D object represented by points and components, StructRe can rewrite it upward into more concise structures, or downward into more detailed structures.
arXiv Detail & Related papers (2023-11-29T10:35:00Z)
- FFF: Fragments-Guided Flexible Fitting for Building Complete Protein Structures [10.682516227941592]
We propose a new method named FFF that bridges protein structure prediction and protein structure recognition with flexible fitting.
First, a multi-level recognition network is used to capture various structural features from the input 3D cryo-EM map.
Next, protein structural fragments are generated using pseudo peptide vectors and a protein sequence alignment method based on these extracted features.
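To make the first step concrete, here is a minimal 3D-convolutional feature extractor over a voxelized density map, in the spirit of a multi-level recognition network; the architecture shown is an assumption for illustration, not FFF's published network.

```python
import torch
import torch.nn as nn

class MapFeatureNet(nn.Module):
    """Extracts progressively coarser features from a cryo-EM density volume.

    A schematic sketch; FFF's actual recognition network is more elaborate.
    """
    def __init__(self, channels=(8, 16, 32)):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in channels:
            layers += [nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool3d(2)]  # halve resolution at each level
            in_ch = out_ch
        self.backbone = nn.Sequential(*layers)

    def forward(self, density):          # (batch, 1, D, H, W) voxel grid
        return self.backbone(density)    # coarse structural feature volume

net = MapFeatureNet()
feats = net(torch.randn(1, 1, 64, 64, 64))
print(feats.shape)  # torch.Size([1, 32, 8, 8, 8])
```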
arXiv Detail & Related papers (2023-08-07T15:10:21Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs, implanting a lightweight structural adapter that endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
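The "structural surgery" amounts to inserting a small adapter between frozen pLM layers so structure features can steer the sequence model. A minimal sketch of such a bottleneck adapter follows; the module name and wiring are illustrative assumptions (LM-Design's published adapter is structure-attention-based).

```python
import torch
import torch.nn as nn

class StructuralAdapter(nn.Module):
    """Lightweight bottleneck adapter injecting structure features into a pLM.

    Schematic only; not LM-Design's exact adapter design.
    """
    def __init__(self, hidden=768, struct_dim=128, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden + struct_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, h, s):  # h: pLM hidden states, s: structure features
        # residual update keeps the frozen pLM's representation intact
        return h + self.up(self.act(self.down(torch.cat([h, s], dim=-1))))

adapter = StructuralAdapter()
h = torch.randn(1, 200, 768)   # per-residue pLM hidden states
s = torch.randn(1, 200, 128)   # per-residue structure encoder features
print(adapter(h, s).shape)     # torch.Size([1, 200, 768])
```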
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model for Protein Design [70.27706384570723]
We propose Fold2Seq, a novel framework for designing protein sequences conditioned on a specific target fold.
We show improved or comparable performance of Fold2Seq in terms of speed, coverage, and reliability for sequence design.
The unique advantages of fold-based Fold2Seq, in comparison to a structure-based deep model and RosettaDesign, become more evident on three additional real-world challenges.
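A fold-conditioned sequence generator can be sketched as an autoregressive decoder whose every step sees a fixed fold embedding. The toy sketch below uses simple concatenation with a GRU and is an assumption for illustration only; Fold2Seq's actual model is a transformer over a joint sequence-fold embedding.

```python
import torch
import torch.nn as nn

class FoldConditionedDecoder(nn.Module):
    """Autoregressively decodes amino-acid logits conditioned on a fold embedding.

    A toy sketch, not Fold2Seq's published architecture.
    """
    def __init__(self, vocab=20, fold_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden + fold_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, prev_tokens, fold_emb):  # (B, L), (B, fold_dim)
        x = self.embed(prev_tokens)            # (B, L, hidden)
        # broadcast the fold embedding to every decoding step
        cond = fold_emb[:, None].expand(-1, x.size(1), -1)
        h, _ = self.rnn(torch.cat([x, cond], dim=-1))
        return self.out(h)                     # (B, L, vocab) next-token logits

dec = FoldConditionedDecoder()
logits = dec(torch.randint(0, 20, (1, 50)), torch.randn(1, 128))
print(logits.shape)  # torch.Size([1, 50, 20])
```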
arXiv Detail & Related papers (2021-06-24T14:34:24Z)
- BERTology Meets Biology: Interpreting Attention in Protein Language Models [124.8966298974842]
We demonstrate methods for analyzing protein Transformer models through the lens of attention.
We show that attention captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure.
We also present a three-dimensional visualization of the interaction between attention and protein structure.
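The core analysis can be reproduced schematically: take an attention matrix over residues and measure how often strongly attended pairs that are far apart in sequence are close in 3D. The snippet below is a self-contained NumPy sketch with random stand-in data; the function name, thresholds, and separation cutoff are assumptions.

```python
import numpy as np

def attention_contact_precision(attn, coords, seq_sep=6, dist_cut=8.0, top_k=100):
    """Fraction of top-attended long-range residue pairs that are 3D contacts.

    attn:   (L, L) attention weights from one head of a protein Transformer
    coords: (L, 3) residue coordinates (e.g., C-alpha atoms)
    """
    L = attn.shape[0]
    i, j = np.triu_indices(L, k=seq_sep)           # pairs far apart in sequence
    order = np.argsort(attn[i, j])[::-1][:top_k]   # strongest attention first
    d = np.linalg.norm(coords[i[order]] - coords[j[order]], axis=-1)
    return float((d < dist_cut).mean())            # precision vs. spatial contacts

# demo with random stand-in data (a real analysis would use model attentions
# and experimentally determined coordinates)
rng = np.random.default_rng(0)
attn = rng.random((150, 150))
coords = rng.normal(size=(150, 3)) * 10.0
print(attention_contact_precision(attn, coords))
```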
arXiv Detail & Related papers (2020-06-26T21:50:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The quality of this automatically generated information is not guaranteed, and this site is not responsible for any consequences of its use.