Related papers: Deep Learning for Virtual Screening: Five Reasons to Use ROC Cost Functions

Deep Learning for Virtual Screening: Five Reasons to Use ROC Cost Functions

URL: http://arxiv.org/abs/2007.07029v1
Date: Thu, 25 Jun 2020 08:46:37 GMT
Title: Deep Learning for Virtual Screening: Five Reasons to Use ROC Cost Functions
Authors: Vladimir Golkov, Alexander Becker, Daniel T. Plop, Daniel \v{C}uturilo, Neda Davoudi, Jeffrey Mendenhall, Rocco Moretti, Jens Meiler, Daniel Cremers
Abstract summary: deep learning has become an important tool for rapid screening of billions of molecules in silico for potential hits containing desired chemical features. Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets. We argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance.
Score: 80.12620331438052
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Computer-aided drug discovery is an essential component of modern drug development. Therein, deep learning has become an important tool for rapid screening of billions of molecules in silico for potential hits containing desired chemical features. Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets. In this work we argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance, its ability to compromise over different decision thresholds, certain freedom to influence the relative weights in this compromise, fidelity to typical benchmarking measures, and equivalence to positive/unlabeled learning. We also propose new training schemes (coherent mini-batch arrangement, and usage of out-of-batch samples) for cost functions based on the ROC, as well as a cost function based on the logAUC metric that facilitates early enrichment (i.e. improves performance at high decision thresholds, as often desired when synthesizing predicted hit compounds). We demonstrate that these approaches outperform standard deep learning approaches on a series of PubChem high-throughput screening datasets that represent realistic and diverse drug discovery campaigns on major drug target families.

Related papers

A Semi-supervised Molecular Learning Framework for Activity Cliff Estimation [10.640733919289643]
We propose a novel semi-supervised learning (SSL) method dubbed SemiMol.<n>SemiMol employs predictions on numerous unannotated data as pseudo-signals for subsequent training.<n>We show that SemiMol significantly enhances graph-based ML architectures and outpasses state-of-the-art pretraining and SSL baselines.
arXiv Detail & Related papers (2026-01-08T02:20:25Z)
A Hybrid Computational Intelligence Framework with Metaheuristic Optimization for Drug-Drug Interaction Prediction [0.8602553195689512]
Drug-drug interactions (DDIs) are a leading cause of preventable adverse events, often complicating treatment and increasing healthcare costs.<n>We propose an interpretable and efficient framework that blends modern machine learning with domain knowledge to improve DDI prediction.<n>Our approach combines two complementary embeddings - Mol2Vec, which captures fragment-level structural patterns, and SMILES-BERT, which learns contextual chemical features.
arXiv Detail & Related papers (2025-10-08T09:55:18Z)
Robust Molecular Property Prediction via Densifying Scarce Labeled Data [51.55434084913129]
In drug discovery, compounds most critical for advancing research often lie beyond the training set.<n>We propose a novel meta-learning-based approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data.<n>We demonstrate significant performance gains on challenging real-world datasets.
arXiv Detail & Related papers (2025-06-13T15:27:40Z)
Efficient Biological Data Acquisition through Inference Set Design [3.9633147697178996]
In this work, we aim to select the smallest set of candidates in order to achieve some desired level of accuracy for the system as a whole. We call this mechanism inference set design, and propose the use of a confidence-based active learning solution to prune out challenging examples.
arXiv Detail & Related papers (2024-10-25T15:34:03Z)
Electroencephalogram Emotion Recognition via AUC Maximization [0.0]
Imbalanced datasets pose significant challenges in areas including neuroscience, cognitive science, and medical diagnostics. This study addresses the issue class imbalance, using the Liking' label in the DEAP dataset as an example.
arXiv Detail & Related papers (2024-08-16T19:08:27Z)
SMILES-Mamba: Chemical Mamba Foundation Models for Drug ADMET Prediction [16.189335444981353]
Predicting the absorption, distribution, metabolism, excretion, and toxicity of small-molecule drugs is critical for ensuring safety and efficacy. We propose a two-stage model that leverages both unlabeled and labeled data through a combination of self-supervised pretraining and fine-tuning strategies. Our results demonstrate that SMILES-Mamba exhibits competitive performance across 22 ADMET datasets, achieving the highest score in 14 tasks.
arXiv Detail & Related papers (2024-08-11T04:53:12Z)
YZS-model: A Predictive Model for Organic Drug Solubility Based on Graph Convolutional Networks and Transformer-Attention [9.018408514318631]
Traditional methods often miss complex molecular structures, leading to inaccuracies. We introduce the YZS-Model, a deep learning framework integrating Graph Convolutional Networks (GCN), Transformer architectures, and Long Short-Term Memory (LSTM) networks. YZS-Model achieved an $R2$ of 0.59 and an RMSE of 0.57, outperforming benchmark models.
arXiv Detail & Related papers (2024-06-27T12:40:29Z)
Physical formula enhanced multi-task learning for pharmacokinetics prediction [54.13787789006417]
A major challenge for AI-driven drug discovery is the scarcity of high-quality data. We develop a formula enhanced mul-ti-task learning (PEMAL) method that predicts four key parameters of pharmacokinetics simultaneously. Our experiments reveal that PEMAL significantly lowers the data demand, compared to typical Graph Neural Networks.
arXiv Detail & Related papers (2024-04-16T07:42:55Z)
Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data. Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets. We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
Unpaired Deep Learning for Pharmacokinetic Parameter Estimation from Dynamic Contrast-Enhanced MRI [37.358265461543716]
We present a novel unpaired deep learning method for estimating both pharmacokinetic parameters and the AIF. Our proposed CycleGAN framework is designed based on the underlying physics model, resulting in a simpler architecture with a single generator and discriminator pair. Our experimental results indicate that our method, which does not necessitate separate AIF measurements, produces more reliable pharmacokinetic parameters than other techniques.
arXiv Detail & Related papers (2023-06-07T11:10:10Z)
Accurate, reliable and interpretable solubility prediction of druglike molecules with attention pooling and Bayesian learning [1.8275108630751844]
In silico prediction of solubility has been studied for its utility in virtual screening and lead optimization. Recently, machine learning (ML) methods using experimental data has been popular because physics-based methods are not suitable for high- throughput tasks. In this paper, we develop graph neural networks (GNNs) with the self-attention readout layer to improve prediction performance.
arXiv Detail & Related papers (2022-09-29T07:48:10Z)
MetaRF: Differentiable Random Forest for Reaction Yield Prediction with a Few Trails [58.47364143304643]
In this paper, we focus on the reaction yield prediction problem. We first put forth MetaRF, an attention-based differentiable random forest model specially designed for the few-shot yield prediction. To improve the few-shot learning performance, we further introduce a dimension-reduction based sampling method.
arXiv Detail & Related papers (2022-08-22T06:40:13Z)
SSM-DTA: Breaking the Barriers of Data Scarcity in Drug-Target Affinity Prediction [127.43571146741984]
Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery. wet experiments remain the most reliable method, but they are time-consuming and resource-intensive. Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue. We present the SSM-DTA framework, which incorporates three simple yet highly effective strategies.
arXiv Detail & Related papers (2022-06-20T14:53:25Z)
Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method. We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.