Scaffold-Aware Generative Augmentation and Reranking for Enhanced Virtual Screening
- URL: http://arxiv.org/abs/2510.16306v1
- Date: Sat, 18 Oct 2025 02:26:08 GMT
- Title: Scaffold-Aware Generative Augmentation and Reranking for Enhanced Virtual Screening
- Authors: Xin Wang, Yu Wang, Yunchao Liu, Jens Meiler, Tyler Derr,
- Abstract summary: Ligand-based virtual screening (VS) is an essential step in drug discovery.<n>VS faces three major challenges: class imbalance due to the low active rate, structural imbalance among active molecules where certain scaffolds dominate, and the need to identify structurally diverse active compounds.<n>SaffAug is a scaffold-aware VS framework that addresses these challenges through three modules.
- Score: 16.751838234028515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ligand-based virtual screening (VS) is an essential step in drug discovery that evaluates large chemical libraries to identify compounds that potentially bind to a therapeutic target. However, VS faces three major challenges: class imbalance due to the low active rate, structural imbalance among active molecules where certain scaffolds dominate, and the need to identify structurally diverse active compounds for novel drug development. We introduce ScaffAug, a scaffold-aware VS framework that addresses these challenges through three modules. The augmentation module first generates synthetic data conditioned on scaffolds of actual hits using generative AI, specifically a graph diffusion model. This helps mitigate the class imbalance and furthermore the structural imbalance, due to our proposed scaffold-aware sampling algorithm, designed to produce more samples for active molecules with underrepresented scaffolds. A model-agnostic self-training module is then used to safely integrate the generated synthetic data from our augmentation module with the original labeled data. Lastly, we introduce a reranking module that improves VS by enhancing scaffold diversity in the top recommended set of molecules, while still maintaining and even enhancing the overall general performance of identifying novel, active compounds. We conduct comprehensive computational experiments across five target classes, comparing ScaffAug against existing baseline methods by reporting the performance of multiple evaluation metrics and performing ablation studies on ScaffAug. Overall, this work introduces novel perspectives on effectively enhancing VS by leveraging generative augmentations, reranking, and general scaffold-awareness.
Related papers
- MSCoD: An Enhanced Bayesian Updating Framework with Multi-Scale Information Bottleneck and Cooperative Attention for Structure-Based Drug Design [2.4060611914774763]
We propose MSCoD, a novel Bayesian updating-based generative framework for structure-based drug design.<n>In our MSCoD, Multi-Scale Information Bottleneck (MSIB) was developed, which enables semantic compression at multiple abstraction levels.<n>MSCoD outperforms state-of-the-art methods on the benchmark dataset.
arXiv Detail & Related papers (2025-09-24T10:51:07Z) - CAME-AB: Cross-Modality Attention with Mixture-of-Experts for Antibody Binding Site Prediction [9.316793780511917]
bfCAME-AB is a novel Cross-modality Attention framework for antibody binding site prediction.<n>It integrates raw acid encodings, BLOSUM substitution profiles, pretrained language model embeddings, structure-aware features, and biochemical graphs.<n>It consistently outperforms strong baselines on multiple metrics, including Precision, Recall, F1-score, AUC-ROC, and MCC.
arXiv Detail & Related papers (2025-09-08T09:24:09Z) - Active Learning on Synthons for Molecular Design [0.0]
We introduce Scalable Active Learning via Synthon Acquisition (SALSA), a simple algorithm applicable to multi-vector expansion.<n>SALSA extends pool-based active learning to non-enumerable spaces by factoring modeling and acquisition over synthon or fragment choices.<n>We show that SALSA-generated molecules have comparable chemical property profiles to known bioactives, and exhibit greater diversity and higher scores over an industry-leading generative approach.
arXiv Detail & Related papers (2025-05-19T09:48:02Z) - Optimizing OOD Detection in Molecular Graphs: A Novel Approach with Diffusion Models [71.39421638547164]
We propose to detect OOD molecules by adopting an auxiliary diffusion model-based framework, which compares similarities between input molecules and reconstructed graphs.
Due to the generative bias towards reconstructing ID training samples, the similarity scores of OOD molecules will be much lower to facilitate detection.
Our research pioneers an approach of Prototypical Graph Reconstruction for Molecular OOD Detection, dubbed as PGR-MOOD and hinges on three innovations.
arXiv Detail & Related papers (2024-04-24T03:25:53Z) - DecompDiff: Diffusion Models with Decomposed Priors for Structure-Based Drug Design [62.68420322996345]
Existing structured-based drug design methods treat all ligand atoms equally.
We propose a new diffusion model, DecompDiff, with decomposed priors over arms and scaffold.
Our approach achieves state-of-the-art performance in generating high-affinity molecules.
arXiv Detail & Related papers (2024-02-26T05:21:21Z) - Objective-Agnostic Enhancement of Molecule Properties via Multi-Stage
VAE [1.3597551064547502]
Variational autoencoder (VAE) is a popular method for drug discovery and various architectures and pipelines have been proposed to improve its performance.
VAE approaches are known to suffer from poor manifold recovery when the data lie on a low-dimensional manifold embedded in a higher dimensional ambient space.
In this paper, we explore applying a multi-stage VAE approach, that can improve manifold recovery on a synthetic dataset, to the field of drug discovery.
arXiv Detail & Related papers (2023-08-24T20:22:22Z) - Augment and Criticize: Exploring Informative Samples for Semi-Supervised
Monocular 3D Object Detection [64.65563422852568]
We improve the challenging monocular 3D object detection problem with a general semi-supervised framework.
We introduce a novel, simple, yet effective Augment and Criticize' framework that explores abundant informative samples from unlabeled data.
The two new detectors, dubbed 3DSeMo_DLE and 3DSeMo_FLEX, achieve state-of-the-art results with remarkable improvements for over 3.5% AP_3D/BEV (Easy) on KITTI.
arXiv Detail & Related papers (2023-03-20T16:28:15Z) - Drug Synergistic Combinations Predictions via Large-Scale Pre-Training
and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation.
Deep learning models have emerged as an efficient way to discover synergistic combinations.
Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z) - A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z) - Structure-based Drug Design with Equivariant Diffusion Models [40.73626627266543]
We present DiffSBDD, an SE(3)-equivariant diffusion model that generates novel conditioned on protein pockets.
Our in silico experiments demonstrate that DiffSBDD captures the statistics of the ground truth data effectively.
These results support the assumption that diffusion models represent the complex distribution of structural data more accurately than previous methods.
arXiv Detail & Related papers (2022-10-24T15:51:21Z) - Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization.
We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise.
We demonstrate the usefulness of our approach first on fMRI data, where our model demonstrates improved sensitivity in identifying common sources among subjects.
arXiv Detail & Related papers (2020-06-11T17:29:53Z) - Adversarial Feature Hallucination Networks for Few-Shot Learning [84.31660118264514]
Adversarial Feature Hallucination Networks (AFHN) is based on conditional Wasserstein Generative Adversarial networks (cWGAN)
Two novel regularizers are incorporated into AFHN to encourage discriminability and diversity of the synthesized features.
arXiv Detail & Related papers (2020-03-30T02:43:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.