PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for
Efficient and Generalizable Compound-Protein Interaction Prediction
- URL: http://arxiv.org/abs/2402.08198v1
- Date: Tue, 13 Feb 2024 03:51:10 GMT
- Title: PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for
Efficient and Generalizable Compound-Protein Interaction Prediction
- Authors: Lirong Wu, Yufei Huang, Cheng Tan, Zhangyang Gao, Bozhen Hu, Haitao
Lin, Zicheng Liu, Stan Z. Li
- Abstract summary: Compound-Protein Interaction prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery.
Existing deep learning-based methods utilize only the single modality of protein sequences or structures.
We propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction.
- Score: 63.50967073653953
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compound-Protein Interaction (CPI) prediction aims to predict the pattern and
strength of compound-protein interactions for rational drug discovery. Existing
deep learning-based methods utilize only the single modality of protein
sequences or structures and lack the co-modeling of the joint distribution of
the two modalities, which may lead to significant performance drops in complex
real-world scenarios due to various factors, e.g., modality missing and domain
shifting. More importantly, these methods only model protein sequences and
structures at a single fixed scale, neglecting more fine-grained multi-scale
information, such as that embedded in key protein fragments. In this paper, we
propose a novel multi-scale Protein Sequence-structure Contrasting framework
for CPI prediction (PSC-CPI), which captures the dependencies between protein
sequences and structures through both intra-modality and cross-modality
contrasting. We further apply length-variable protein augmentation to allow
contrasting to be performed at different scales, from the amino acid level to
the sequence level. Finally, in order to more fairly evaluate the model
generalizability, we split the test data into four settings based on whether
compounds and proteins have been observed during the training stage. Extensive
experiments have shown that PSC-CPI generalizes well in all four settings,
particularly in the more challenging "Unseen-Both" setting, where neither
compounds nor proteins have been observed during training. Furthermore, even
when encountering a situation of modality missing, i.e., inference with only
single-modality protein data, PSC-CPI still exhibits comparable or even better
performance than previous approaches.
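The four-way evaluation split described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: only the "Unseen-Both" name comes from the abstract, while the other three setting names, the `Pair` type, and the `split_test_settings` helper are hypothetical and introduced here for clarity.

```python
from typing import Dict, Iterable, List, NamedTuple


class Pair(NamedTuple):
    """One compound-protein test or training example (hypothetical container)."""
    compound: str  # any hashable compound identifier, e.g. a SMILES string
    protein: str   # any hashable protein identifier, e.g. an ID or sequence


def split_test_settings(train: Iterable[Pair], test: Iterable[Pair]) -> Dict[str, List[Pair]]:
    """Bucket test pairs by whether their compound and protein appear in training.

    Setting names other than "Unseen-Both" are assumptions, not taken from the source.
    """
    train = list(train)
    seen_compounds = {p.compound for p in train}
    seen_proteins = {p.protein for p in train}

    settings: Dict[str, List[Pair]] = {
        "Seen-Both": [],
        "Unseen-Compound": [],
        "Unseen-Protein": [],
        "Unseen-Both": [],
    }
    for pair in test:
        compound_seen = pair.compound in seen_compounds
        protein_seen = pair.protein in seen_proteins
        if compound_seen and protein_seen:
            key = "Seen-Both"
        elif protein_seen:
            key = "Unseen-Compound"  # protein was trained on, compound is new
        elif compound_seen:
            key = "Unseen-Protein"   # compound was trained on, protein is new
        else:
            key = "Unseen-Both"      # neither was observed during training
        settings[key].append(pair)
    return settings
```

Reporting metrics per bucket, rather than over the pooled test set, is what lets the hardest case (neither side observed in training) be assessed on its own.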
Related papers
- MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction [65.33218256339151]
Post-translational modifications (PTMs) profoundly expand the complexity and functionality of the proteome.
Existing computational approaches predominantly focus on protein sequences to predict PTM sites, driven by the recognition of sequence-dependent motifs.
We introduce the MeToken model, which tokenizes the micro-environment of each amino acid, integrating both sequence and structural information into unified discrete tokens.
arXiv Detail & Related papers (2024-11-04T07:14:28Z)
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- CoPRA: Bridging Cross-domain Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity Prediction [23.1499716310298]
We build the largest protein-RNA binding affinity dataset PRA310 for performance evaluation.
We provide extensive analyses and verify that CoPRA can (1) accurately predict the protein-RNA binding affinity; (2) understand the binding affinity change caused by mutations; and (3) benefit from scaling data and model size.
arXiv Detail & Related papers (2024-08-21T09:48:22Z)
- Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in protein upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows it with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Protein Sequence and Structure Co-Design with Equivariant Translation [19.816174223173494]
Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models.
We propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state.
Our model consists of a trigonometry-aware encoder that reasons about geometrical constraints and interactions from context features.
All protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process.
arXiv Detail & Related papers (2022-10-17T06:00:12Z)
- State-specific protein-ligand complex structure prediction with a multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures.
Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z)
- Cross-Modality Protein Embedding for Compound-Protein Affinity and Contact Prediction [15.955668586941472]
We consider proteins as multi-modal data including 1D amino-acid sequences and (sequence-predicted) 2D residue-pair contact maps.
We empirically evaluate the embeddings of the two single modalities for accuracy and generalizability in compound-protein affinity and contact (CPAC) prediction.
arXiv Detail & Related papers (2020-11-14T04:42:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.