Pre-Training Protein Encoder via Siamese Sequence-Structure Diffusion
Trajectory Prediction
- URL: http://arxiv.org/abs/2301.12068v2
- Date: Sat, 8 Jul 2023 14:46:58 GMT
- Title: Pre-Training Protein Encoder via Siamese Sequence-Structure Diffusion
Trajectory Prediction
- Authors: Zuobai Zhang, Minghao Xu, Aurélie Lozano, Vijil Chenthamarakshan,
Payel Das, Jian Tang
- Abstract summary: Self-supervised pre-training methods on proteins have recently gained attention, with most approaches focusing on either protein sequences or structures.
We propose the DiffPreT approach to pre-train a protein encoder by sequence-structure joint diffusion modeling.
We enhance DiffPreT by a method called Siamese Diffusion Trajectory Prediction (SiamDiff) to capture the correlation between different conformers of a protein.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised pre-training methods on proteins have recently gained
attention, with most approaches focusing on either protein sequences or
structures, neglecting the exploration of their joint distribution, which is
crucial for a comprehensive understanding of protein functions by integrating
co-evolutionary information and structural characteristics. In this work,
inspired by the success of denoising diffusion models in generative tasks, we
propose the DiffPreT approach to pre-train a protein encoder by
sequence-structure joint diffusion modeling. DiffPreT guides the encoder to
recover the native protein sequences and structures from the perturbed ones
along the joint diffusion trajectory, thereby capturing the joint distribution of
sequences and structures. Considering the essential protein conformational
variations, we enhance DiffPreT by a method called Siamese Diffusion Trajectory
Prediction (SiamDiff) to capture the correlation between different conformers
of a protein. SiamDiff attains this goal by maximizing the mutual information
between representations of diffusion trajectories of structurally-correlated
conformers. We study the effectiveness of DiffPreT and SiamDiff on both atom-
and residue-level structure-based protein understanding tasks. Experimental
results show that the performance of DiffPreT is consistently competitive on
all tasks, and SiamDiff achieves new state-of-the-art performance, considering
the mean ranks on all tasks. Our implementation is available at
https://github.com/DeepGraphLearning/SiamDiff.
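The two objectives described in the abstract can be illustrated with a toy sketch: a denoising loss that recovers native sequences and structures from a point on the joint diffusion trajectory (DiffPreT), plus an InfoNCE-style contrastive bound on the mutual information between trajectory representations of two correlated conformers (SiamDiff). Everything below is a minimal illustration, not the paper's implementation: the encoder, the noise schedule, and all dimensions are made up (the actual model uses structure-aware networks; see the linked repository).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N, NUM_AA, DIM = 8, 16, 20, 32  # batch, residues, amino-acid types, hidden size

def perturb(seq, coords, t):
    """Toy forward diffusion at noise level t in [0, 1]: one-hot sequences drift
    toward a uniform distribution, coordinates receive Gaussian noise."""
    noisy_seq = (1 - t) * seq + t / NUM_AA
    noisy_coords = coords + t * torch.randn_like(coords)
    return noisy_seq, noisy_coords

class ToyEncoder(torch.nn.Module):
    """Stand-in for the protein encoder (a placeholder, not the paper's model)."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(NUM_AA + 3, DIM)
        self.seq_head = torch.nn.Linear(DIM, NUM_AA)  # predicts native residue types
        self.xyz_head = torch.nn.Linear(DIM, 3)       # predicts native coordinates

    def forward(self, seq, coords):
        h = torch.relu(self.proj(torch.cat([seq, coords], dim=-1)))  # (B, N, DIM)
        return h, self.seq_head(h), self.xyz_head(h)

def diffpret_loss(enc, seq, coords, t=0.5):
    """Recover both modalities from a perturbed point on the joint trajectory."""
    noisy_seq, noisy_xyz = perturb(seq, coords, t)
    _, seq_logits, xyz_pred = enc(noisy_seq, noisy_xyz)
    seq_loss = F.cross_entropy(seq_logits.reshape(-1, NUM_AA),
                               seq.argmax(-1).reshape(-1))
    struct_loss = F.mse_loss(xyz_pred, coords)
    return seq_loss + struct_loss

def siamdiff_loss(enc, seq, conf_a, conf_b, t=0.5, tau=0.1):
    """InfoNCE bound on the mutual information between trajectory representations
    of two structurally correlated conformers of the same protein."""
    za = F.normalize(enc(*perturb(seq, conf_a, t))[0].mean(1), dim=-1)  # (B, DIM)
    zb = F.normalize(enc(*perturb(seq, conf_b, t))[0].mean(1), dim=-1)
    logits = za @ zb.t() / tau  # matching conformers are positives, rest negatives
    return F.cross_entropy(logits, torch.arange(B))

enc = ToyEncoder()
seq = F.one_hot(torch.randint(0, NUM_AA, (B, N)), NUM_AA).float()
conf_a = torch.randn(B, N, 3)
conf_b = conf_a + 0.1 * torch.randn_like(conf_a)  # a correlated conformer
loss = diffpret_loss(enc, seq, conf_a) + siamdiff_loss(enc, seq, conf_a, conf_b)
print(float(loss))
```

In a real pre-training run the two losses would be combined over many noise levels sampled along the diffusion trajectory, and the pooled representations would come from a geometry-aware encoder rather than a pointwise MLP.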
Related papers
- Mask prior-guided denoising diffusion improves inverse protein folding [3.1373465343833704]
Inverse protein folding generates valid amino acid sequences that can fold into a desired protein structure.
We propose a framework that captures both structural and residue interactions for inverse protein folding.
MapDiff is a discrete diffusion probabilistic model that iteratively generates amino acid sequences with reduced noise.
arXiv Detail & Related papers (2024-12-10T09:10:28Z)
- Multi-Scale Representation Learning for Protein Fitness Prediction [31.735234482320283]
Previous methods have primarily relied on self-supervised models trained on vast, unlabeled protein sequence or structure datasets.
We introduce the Sequence-Structure-Surface Fitness (S3F) model - a novel multimodal representation learning framework that integrates protein features across several scales.
Our approach combines sequence representations from a protein language model with Geometric Vector Perceptron networks encoding protein backbone and detailed surface topology.
arXiv Detail & Related papers (2024-12-02T04:28:10Z)
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- ProtFAD: Introducing function-aware domains as implicit modality towards protein function prediction [4.299777426056576]
We propose a function-aware domain representation and a domain-joint contrastive learning strategy to distinguish different protein functions.
Our approach significantly and comprehensively outperforms the state-of-the-art methods on various benchmarks.
arXiv Detail & Related papers (2024-05-24T02:26:45Z)
- PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction [63.50967073653953]
Compound-Protein Interaction prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery.
Existing deep learning-based methods utilize only the single modality of protein sequences or structures.
We propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction.
arXiv Detail & Related papers (2024-02-13T03:51:10Z)
- Predicting mutational effects on protein-protein binding via a side-chain diffusion probabilistic model [14.949807579474781]
We propose SidechainDiff, a representation learning-based approach that leverages unlabelled experimental protein structures.
SidechainDiff is the first diffusion-based generative model for side-chains, distinguishing it from prior efforts that have predominantly focused on generating protein backbone structures.
arXiv Detail & Related papers (2023-10-30T15:23:42Z)
- DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models [47.73386438748902]
DiffDock-PP is a diffusion generative model that learns to translate and rotate unbound protein structures into their bound conformations.
We achieve state-of-the-art performance on DIPS with a median C-RMSD of 4.85, outperforming all considered baselines.
arXiv Detail & Related papers (2023-04-08T02:10:44Z)
- State-specific protein-ligand complex structure prediction with a multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures.
Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.