DecoyDB: A Dataset for Graph Contrastive Learning in Protein-Ligand Binding Affinity Prediction
- URL: http://arxiv.org/abs/2507.06366v1
- Date: Tue, 08 Jul 2025 20:02:53 GMT
- Authors: Yupu Zhang, Zelin Xu, Tingsong Xiao, Gustavo Seabra, Yanjun Li, Chenglong Li, Zhe Jiang
- Abstract summary: Predicting the binding affinity of protein-ligand complexes plays a vital role in drug discovery. The widely used PDBbind dataset has fewer than 20K labeled complexes. We propose DecoyDB, a large-scale, structure-aware dataset for self-supervised graph contrastive learning.
- Score: 10.248499818896693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting the binding affinity of protein-ligand complexes plays a vital role in drug discovery. Unfortunately, progress has been hindered by the lack of large-scale and high-quality binding affinity labels. The widely used PDBbind dataset has fewer than 20K labeled complexes. Self-supervised learning, especially graph contrastive learning (GCL), provides a unique opportunity to break the barrier by pre-training graph neural network models based on vast unlabeled complexes and fine-tuning the models on much fewer labeled complexes. However, the problem faces unique challenges, including a lack of a comprehensive unlabeled dataset with well-defined positive/negative complex pairs and the need to design GCL algorithms that incorporate the unique characteristics of such data. To fill the gap, we propose DecoyDB, a large-scale, structure-aware dataset specifically designed for self-supervised GCL on protein-ligand complexes. DecoyDB consists of high-resolution ground truth complexes (less than 2.5 Angstrom) and diverse decoy structures with computationally generated binding poses that range from realistic to suboptimal (negative pairs). Each decoy is annotated with a Root Mean Squared Deviation (RMSD) from the native pose. We further design a customized GCL framework to pre-train graph neural networks based on DecoyDB and fine-tune the models with labels from PDBbind. Extensive experiments confirm that models pre-trained with DecoyDB achieve superior accuracy, label efficiency, and generalizability.
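The abstract notes that each decoy is annotated with its RMSD from the native pose. As a minimal sketch of what such an annotation measures, the Python snippet below computes a plain RMSD between two ligand poses. It assumes both poses list the same atoms in the same order and already share a coordinate frame (no superposition or symmetry correction); the function name `pose_rmsd` and the toy coordinates are hypothetical and are not taken from DecoyDB.

```python
import math


def pose_rmsd(native, decoy):
    """Root mean squared deviation between two poses.

    Each pose is a list of (x, y, z) atom coordinates in Angstrom.
    Assumption: atoms are matched by index and no alignment is performed.
    """
    if not native or len(native) != len(decoy):
        raise ValueError("poses must be non-empty and have the same number of atoms")
    # Sum of squared coordinate differences over all matched atoms.
    squared = sum(
        (a - b) ** 2
        for atom_native, atom_decoy in zip(native, decoy)
        for a, b in zip(atom_native, atom_decoy)
    )
    return math.sqrt(squared / len(native))


# Toy two-atom ligand: the decoy is the native pose shifted 2 Angstrom along z.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
decoy = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
print(pose_rmsd(native, decoy))  # 2.0
```

A low value indicates a decoy close to the native pose, and a high value indicates a suboptimal pose. How the authors turn these values into positive and negative pairs is not described in this summary.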
Related papers
- ResCap-DBP: A Lightweight Residual-Capsule Network for Accurate DNA-Binding Protein Prediction Using Global ProteinBERT Embeddings [9.626183317998143]
We propose a novel deep learning framework, ResCap-DBP, that combines a residual learning-based encoder with a one-dimensional Capsule Network. ProteinBERT embeddings substantially outperform other representations on large datasets. Our model consistently outperforms current state-of-the-art methods.
(2025-07-27T21:54:32Z)
- ZEUS: Zero-shot Embeddings for Unsupervised Separation of Tabular Data [7.121259735505479]
ZEUS is a self-contained model capable of clustering new datasets without any additional training or fine-tuning. It operates by decomposing complex datasets into meaningful components that can then be clustered effectively.
(2025-05-15T20:52:26Z)
- RelGNN: Composite Message Passing for Relational Deep Learning [56.48834369525997]
We introduce RelGNN, a novel GNN framework specifically designed to leverage the unique structural characteristics of the graphs built from relational databases. RelGNN is evaluated on 30 diverse real-world tasks from Relbench (Fey et al., 2024), and achieves state-of-the-art performance on the vast majority of tasks, with improvements of up to 25%.
(2025-02-10T18:58:40Z)
- Enhancing Missing Data Imputation through Combined Bipartite Graph and Complete Directed Graph [18.06658040186476]
We introduce a novel framework named the Bipartite and Complete Directed Graph Neural Network (BCGNN).
Within BCGNN, observations and features are differentiated as two distinct node types, and the values of observed features are converted into attributed edges linking them.
In parallel, the complete directed graph segment adeptly outlines and communicates the complex interdependencies among features.
(2024-11-07T17:48:37Z)
- CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph [66.11279161533619]
CBGBench is a benchmark for structure-based drug design (SBDD).
By categorizing existing methods based on their attributes, CBGBench implements various cutting-edge methods.
We have adapted these models to a range of tasks essential in drug design, which are considered sub-tasks within the graph fill-in-the-blank tasks.
(2024-06-16T08:20:24Z)
- Graph-level Protein Representation Learning by Structure Knowledge Refinement [50.775264276189695]
This paper focuses on learning representation on the whole graph level in an unsupervised manner.
We propose a novel framework called Structure Knowledge Refinement (SKR) which uses data structure to determine the probability of whether a pair is positive or negative.
(2024-01-05T09:05:33Z)
- End-to-End Supervised Multilabel Contrastive Learning [38.26579519598804]
Multilabel representation learning is recognized as a challenging problem that can be associated with either label dependencies between object categories or data-related issues.
Recent advances address these challenges from model- and data-centric viewpoints.
We propose a new end-to-end training framework -- dubbed KMCL -- to address the shortcomings of both model- and data-centric designs.
(2023-07-08T12:46:57Z)
- HAC-Net: A Hybrid Attention-Based Convolutional Neural Network for Highly Accurate Protein-Ligand Binding Affinity Prediction [0.0]
We present a novel deep learning architecture consisting of a 3-dimensional convolutional neural network and two graph convolutional networks.
HAC-Net obtains state-of-the-art results on the PDBbind v.2016 core set.
We envision that this model can be extended to a broad range of supervised learning problems related to structure-based biomolecular property prediction.
(2022-12-23T16:14:53Z)
- Few-Shot Non-Parametric Learning with Deep Latent Variable Model [50.746273235463754]
We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV).
NPC-LV is a learning framework for any dataset with abundant unlabeled data but very few labeled ones.
We show that NPC-LV outperforms supervised methods on all three image classification datasets in the low-data regime.
(2022-06-23T09:35:03Z)
- Generalization of Neural Combinatorial Solvers Through the Lens of Adversarial Robustness [68.97830259849086]
Most datasets only capture a simpler subproblem and likely suffer from spurious features.
We study adversarial robustness - a local generalization property - to reveal hard, model-specific instances and spurious features.
Unlike in other applications, where perturbation models are designed around subjective notions of imperceptibility, our perturbation models are efficient and sound.
Surprisingly, with such perturbations, a sufficiently expressive neural solver does not suffer from the limitations of the accuracy-robustness trade-off common in supervised learning.
(2021-10-21T07:28:11Z)
- Structure-aware Interactive Graph Neural Networks for the Prediction of Protein-Ligand Binding Affinity [52.67037774136973]
Drug discovery often relies on the successful prediction of protein-ligand binding affinity.
Recent advances have shown great promise in applying graph neural networks (GNNs) for better affinity prediction by learning the representations of protein-ligand complexes.
We propose a structure-aware interactive graph neural network (SIGN) which consists of two components: polar-inspired graph attention layers (PGAL) and pairwise interactive pooling (PiPool).
(2021-07-21T03:34:09Z)
- Learning complex dependency structure of gene regulatory networks from high dimensional micro-array data with Gaussian Bayesian networks [0.0]
Gene expression datasets consist of thousands of genes with relatively small sample sizes.
The Glasso algorithm has been proposed to deal with high dimensional micro-array datasets by forcing sparsity.
Modifications of the default Glasso algorithm are developed to overcome the problem of complex interaction structure.
(2021-06-28T15:04:35Z)
- Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias of the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
(2021-06-14T05:39:09Z)