Related papers: Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction

URL: http://arxiv.org/abs/2505.20036v1
Date: Mon, 26 May 2025 14:23:08 GMT
Title: Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction
Authors: Hazem Alsamkary, Mohamed Elshaffei, Mohamed Soudy, Sara Ossman, Abdallah Amr, Nehal Adel Abdelsalam, Mohamed Elkerdawy, Ahmed Elnaggar,
Abstract summary: Protein-protein interactions (PPIs) are fundamental to numerous cellular processes. PLMs have demonstrated remarkable success in predicting protein structure and function. Their application to sequence-based PPI binding affinity prediction remains relatively underexplored.
Score: 0.2509487459755192
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Protein-protein interactions (PPIs) are fundamental to numerous cellular processes, and their characterization is vital for understanding disease mechanisms and guiding drug discovery. While protein language models (PLMs) have demonstrated remarkable success in predicting protein structure and function, their application to sequence-based PPI binding affinity prediction remains relatively underexplored. This gap is often attributed to the scarcity of high-quality, rigorously refined datasets and the reliance on simple strategies for concatenating protein representations. In this work, we address these limitations. First, we introduce a meticulously curated version of the PPB-Affinity dataset of a total of 8,207 unique protein-protein interaction entries, by resolving annotation inconsistencies and duplicate entries for multi-chain protein interactions. This dataset incorporates a stringent, less than or equal to 30%, sequence identity threshold to ensure robust splitting into training, validation, and test sets, minimizing data leakage. Second, we propose and systematically evaluate four architectures for adapting PLMs to PPI binding affinity prediction: embeddings concatenation (EC), sequences concatenation (SC), hierarchical pooling (HP), and pooled attention addition (PAD). These architectures were assessed using two training methods: full fine-tuning and a lightweight approach employing ConvBERT heads over frozen PLM features. Our comprehensive experiments across multiple leading PLMs (ProtT5, ESM2, Ankh, Ankh2, and ESM3) demonstrated that the HP and PAD architectures consistently outperform conventional concatenation methods, achieving up to 12% increase in terms of Spearman correlation. These results highlight the necessity of sophisticated architectural designs to fully exploit the capabilities of PLMs for nuanced PPI binding affinity prediction.
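
The abstract names the four architectures without defining them, so the following is only a minimal sketch of one plausible reading of hierarchical pooling (HP) for multi-chain partners: residues are mean-pooled within each chain, and the chain vectors are then mean-pooled per interaction partner. It is contrasted with a flat pooling over all residues. The function names (`mean_pool`, `hierarchical_pool`, `flat_pool`, `pair_features`), the use of mean pooling, and the toy two-dimensional "embeddings" are assumptions made for illustration and are not taken from the paper.

```python
from statistics import fmean
from typing import List

Vector = List[float]
ChainEmbedding = List[Vector]  # one vector per residue (stand-in for PLM output)


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length vectors element-wise."""
    return [fmean(column) for column in zip(*vectors)]


def hierarchical_pool(chains: List[ChainEmbedding]) -> Vector:
    """Assumed HP reading: pool residues per chain, then pool the chain vectors."""
    return mean_pool([mean_pool(chain) for chain in chains])


def flat_pool(chains: List[ChainEmbedding]) -> Vector:
    """Baseline: pool all residues of all chains as if they were one sequence."""
    return mean_pool([residue for chain in chains for residue in chain])


def pair_features(partner_a: List[ChainEmbedding],
                  partner_b: List[ChainEmbedding]) -> Vector:
    """Concatenate the two pooled partner vectors as input to a regression head."""
    return hierarchical_pool(partner_a) + hierarchical_pool(partner_b)


# Toy partner with a 3-residue chain and a 1-residue chain (hypothetical values).
partner = [
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0]],
]
hp_vector = hierarchical_pool(partner)
flat_vector = flat_pool(partner)
features = pair_features(partner, partner)
```

Under these assumptions, the hierarchical version gives each chain equal weight regardless of its length, whereas flat pooling lets the longer chain dominate the partner representation. Any real implementation would replace the toy vectors with per-residue PLM embeddings and feed `features` to a trained prediction head.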

Related papers

PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs [80.08310253195144]
PRING is the first benchmark that evaluates protein-protein interaction prediction from a graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions.
arXiv Detail & Related papers (2025-07-07T15:21:05Z)
Hierarchical Multi-Label Contrastive Learning for Protein-Protein Interaction Prediction Across Organisms [2.399426243085768]
We propose HIPPO, a hierarchical contrastive framework for protein-protein interaction prediction. The proposed approach incorporates hierarchical contrastive loss functions that emulate the structured relationship among functional classes of proteins. Experiments on benchmark datasets demonstrate that HIPPO achieves state-of-the-art performance, outperforming existing methods and showing robustness in low-data regimes.
arXiv Detail & Related papers (2025-07-03T15:41:04Z)
KEPLA: A Knowledge-Enhanced Deep Learning Framework for Accurate Protein-Ligand Binding Affinity Prediction [60.23701115249195]
KEPLA is a novel deep learning framework that integrates prior knowledge from Gene Ontology and ligand properties to enhance prediction performance. Experiments on two benchmark datasets demonstrate that KEPLA consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-16T08:02:42Z)
Structure-Aligned Protein Language Model [42.03167740260325]
Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but lack structural knowledge essential for many biological applications. We integrate structural insights from pre-trained protein graph neural networks (pGNNs) into pLMs through a latent-level contrastive learning task. This task aligns residue representations from pLMs with those from pGNNs across multiple proteins, enriching pLMs with inter-protein structural knowledge.
arXiv Detail & Related papers (2025-05-22T16:56:12Z)
Bidirectional Hierarchical Protein Multi-Modal Representation Learning [4.682021474006426]
Protein language models (pLMs) pretrained on large-scale protein sequences have demonstrated significant success in sequence-based tasks. Graph neural networks (GNNs) designed to leverage 3D structural information have shown promising generalization in protein-related prediction tasks. Our framework employs attention and gating mechanisms to enable effective interaction between pLM-generated sequential representations and GNN-extracted structural features.
arXiv Detail & Related papers (2025-04-07T06:47:49Z)
Joint Masked Reconstruction and Contrastive Learning for Mining Interactions Between Proteins [4.254824555546419]
Protein-protein interaction (PPI) prediction is instrumental in elucidating the mechanisms underlying cellular operations. This paper introduces JmcPPI, a novel PPI prediction method that combines masked reconstruction and contrastive learning. Extensive experiments conducted on three widely used PPI datasets demonstrate that JmcPPI surpasses the best existing baseline models.
arXiv Detail & Related papers (2025-03-06T17:39:12Z)
MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction [65.33218256339151]
Post-translational modifications (PTMs) profoundly expand the complexity and functionality of the proteome. Existing computational approaches predominantly focus on protein sequences to predict PTM sites, driven by the recognition of sequence-dependent motifs. We introduce the MeToken model, which tokenizes the micro-environment of each amino acid, integrating both sequence and structural information into unified discrete tokens.
arXiv Detail & Related papers (2024-11-04T07:14:28Z)
ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases. Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions. We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding [82.31506767274841]
Protein-Protein Interactions (PPIs) are fundamental in various biological processes and play a key role in life activities. MAPE-PPI encodes microenvironments into chemically meaningful discrete codes via a sufficiently large microenvironment "vocabulary". MAPE-PPI can scale to PPI prediction with millions of PPIs with superior trade-offs between effectiveness and computational efficiency.
arXiv Detail & Related papers (2024-02-22T09:04:41Z)
PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction [63.50967073653953]
Compound-Protein Interaction prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery. Existing deep learning-based methods utilize only the single modality of protein sequences or structures. We propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction.
arXiv Detail & Related papers (2024-02-13T03:51:10Z)
State-specific protein-ligand complex structure prediction with a multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures. Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z)
A Supervised Machine Learning Approach for Sequence Based Protein-protein Interaction (PPI) Prediction [4.916874464940376]
Computational protein-protein interaction (PPI) prediction techniques can contribute greatly to reducing time, cost and false-positive interactions. We describe our submitted solution and its results in the SeqPIP competition.
arXiv Detail & Related papers (2022-03-23T18:27:25Z)
Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is to capture the co-evolutionary information reflected by the inter-residue co-variation in the sequences. We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., Pairwise Masked Language Model (PMLM). Our results show that the proposed method can effectively capture the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.