PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for
Efficient and Generalizable Compound-Protein Interaction Prediction
- URL: http://arxiv.org/abs/2402.08198v1
- Date: Tue, 13 Feb 2024 03:51:10 GMT
- Title: PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for
Efficient and Generalizable Compound-Protein Interaction Prediction
- Authors: Lirong Wu, Yufei Huang, Cheng Tan, Zhangyang Gao, Bozhen Hu, Haitao
Lin, Zicheng Liu, Stan Z. Li
- Abstract summary: Compound-Protein Interaction prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery.
Existing deep learning-based methods utilize only the single modality of protein sequences or structures.
We propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compound-Protein Interaction (CPI) prediction aims to predict the pattern and
strength of compound-protein interactions for rational drug discovery. Existing
deep learning-based methods utilize only the single modality of protein
sequences or structures and lack the co-modeling of the joint distribution of
the two modalities, which may lead to significant performance drops in complex
real-world scenarios due to various factors, e.g., modality missing and domain
shifting. More importantly, these methods only model protein sequences and
structures at a single fixed scale, neglecting more fine-grained multi-scale
information, such as those embedded in key protein fragments. In this paper, we
propose a novel multi-scale Protein Sequence-structure Contrasting framework
for CPI prediction (PSC-CPI), which captures the dependencies between protein
sequences and structures through both intra-modality and cross-modality
contrasting. We further apply length-variable protein augmentation to allow
contrasting to be performed at different scales, from the amino acid level to
the sequence level. Finally, in order to more fairly evaluate the model
generalizability, we split the test data into four settings based on whether
compounds and proteins have been observed during the training stage. Extensive
experiments have shown that PSC-CPI generalizes well in all four settings,
particularly in the more challenging "Unseen-Both" setting, where neither
compounds nor proteins have been observed during training. Furthermore, even
when encountering a situation of modality missing, i.e., inference with only
single-modality protein data, PSC-CPI still exhibits comparable or even better
performance than previous approaches.
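The cross-modality contrasting described in the abstract can be illustrated with a minimal InfoNCE-style objective. This is a generic sketch, not the exact loss used by PSC-CPI: the function names, embedding shapes, and temperature value below are assumptions, and `seq_emb` / `struct_emb` stand in for hypothetical outputs of a sequence encoder and a structure encoder.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modality_infonce(seq_emb, struct_emb, temperature=0.07):
    """InfoNCE-style loss: pull each protein's sequence embedding toward its own
    structure embedding, push it away from other proteins' structure embeddings."""
    z_s = l2_normalize(seq_emb)       # (N, d) sequence-modality views
    z_t = l2_normalize(struct_emb)    # (N, d) structure-modality views
    logits = z_s @ z_t.T / temperature            # (N, N) similarity matrix
    # Diagonal entries are the positive (matched) sequence-structure pairs.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
# Perfectly aligned modalities should yield a lower loss than mismatched pairs.
aligned = cross_modality_infonce(emb, emb)
shuffled = cross_modality_infonce(emb, np.roll(emb, 1, axis=0))
```

Intra-modality contrasting and the paper's length-variable augmentation (contrasting crops of different scales) would reuse the same loss with differently augmented views in place of the two modalities.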
Related papers
- Efficiently Predicting Protein Stability Changes Upon Single-point
Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict thermostability changes in proteins upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
- Pairing interacting protein sequences using masked language modeling [0.3222802562733787]
We develop a method to pair interacting protein sequences using protein language models trained on sequence alignments.
We exploit the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context.
We show that it captures inter-chain coevolution even though it was trained only on single-chain data, meaning it can be used out of distribution.
arXiv Detail & Related papers (2023-08-14T13:42:09Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs: a lightweight structural adapter is implanted into them, endowing them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Protein Sequence and Structure Co-Design with Equivariant Translation [19.816174223173494]
Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models.
We propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state.
Our model consists of a trigonometry-aware encoder that reasons about geometric constraints and interactions from context features.
All protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process.
arXiv Detail & Related papers (2022-10-17T06:00:12Z)
- State-specific protein-ligand complex structure prediction with a multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures.
Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z)
- Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
- Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction [7.022012579173686]
Pre-training a protein model to learn effective representation is critical for protein-protein interactions.
Most pre-training models for PPIs are sequence-based and naively apply language models from natural language processing to amino acid sequences.
We propose a multimodal protein pre-training model with three modalities: sequence, structure, and function.
arXiv Detail & Related papers (2021-12-09T10:21:52Z)
- Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking [57.2037357017652]
We tackle rigid body protein-protein docking, i.e., computationally predicting the 3D structure of a protein-protein complex from the individual unbound structures.
We design a novel pairwise-independent SE(3)-equivariant graph matching network to predict the rotation and translation to place one of the proteins at the right docked position.
Our model, named EquiDock, approximates the binding pockets and predicts the docking poses using keypoint matching and alignment.
arXiv Detail & Related papers (2021-11-15T18:46:37Z)
- Cross-Modality Protein Embedding for Compound-Protein Affinity and Contact Prediction [15.955668586941472]
We consider proteins as multi-modal data including 1D amino-acid sequences and (sequence-predicted) 2D residue-pair contact maps.
We empirically evaluate the embeddings of the two single modalities in their accuracy and generalizability of CPAC prediction.
arXiv Detail & Related papers (2020-11-14T04:42:25Z)
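The 2D residue-pair contact maps mentioned in the last entry can be sketched as follows. This is a generic illustration, not the paper's method (which predicts contact maps from sequence): it computes a map from known C-alpha coordinates, and the 8 Å threshold is one common convention, assumed here.

```python
import numpy as np

def contact_map(coords, threshold=8.0):
    """Binary residue-pair contact map from C-alpha coordinates (N, 3):
    residues i and j are 'in contact' if their distance is below threshold (Å)."""
    diff = coords[:, None, :] - coords[None, :, :]   # (N, N, 3) pairwise offsets
    dist = np.sqrt((diff ** 2).sum(-1))              # (N, N) pairwise distances
    return (dist < threshold).astype(np.int8)

# Toy chain: 5 residues spaced 4 Å apart along the x-axis,
# so only sequence-adjacent residues fall within the 8 Å threshold.
coords = np.stack([np.arange(5) * 4.0, np.zeros(5), np.zeros(5)], axis=1)
cmap = contact_map(coords)
```

The resulting map is symmetric with a trivially contacting diagonal; models that consume this modality typically treat it as an adjacency matrix for a graph neural network.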
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.