WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking
- URL: http://arxiv.org/abs/2411.09820v1
- Date: Thu, 14 Nov 2024 21:49:41 GMT
- Title: WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking
- Authors: Yunchao, Liu, Ha Dong, Xin Wang, Rocco Moretti, Yu Wang, Zhaoqian Su, Jiawei Gu, Bobby Bodenheimer, Charles David Weaver, Jens Meiler, Tyler Derr,
- Abstract summary: deep learning has revolutionized computer-aided drug discovery.
While deep learning has revolutionized computer-aided drug discovery, the AI community has predominantly focused on model innovation.
We seek to establish a new gold standard for small molecule drug discovery benchmarking, WelQrate.
- Score: 13.880278087741482
- License:
- Abstract: While deep learning has revolutionized computer-aided drug discovery, the AI community has predominantly focused on model innovation and placed less emphasis on establishing best benchmarking practices. We posit that without a sound model evaluation framework, the AI community's efforts cannot reach their full potential, thereby slowing the progress and transfer of innovation into real-world drug discovery. Thus, in this paper, we seek to establish a new gold standard for small molecule drug discovery benchmarking, WelQrate. Specifically, our contributions are threefold: WelQrate Dataset Collection - we introduce a meticulously curated collection of 9 datasets spanning 5 therapeutic target classes. Our hierarchical curation pipelines, designed by drug discovery experts, go beyond the primary high-throughput screen by leveraging additional confirmatory and counter screens along with rigorous domain-driven preprocessing, such as Pan-Assay Interference Compounds (PAINS) filtering, to ensure the high-quality data in the datasets; WelQrate Evaluation Framework - we propose a standardized model evaluation framework considering high-quality datasets, featurization, 3D conformation generation, evaluation metrics, and data splits, which provides a reliable benchmarking for drug discovery experts conducting real-world virtual screening; Benchmarking - we evaluate model performance through various research questions using the WelQrate dataset collection, exploring the effects of different models, dataset quality, featurization methods, and data splitting strategies on the results. In summary, we recommend adopting our proposed WelQrate as the gold standard in small molecule drug discovery benchmarking. The WelQrate dataset collection, along with the curation codes, and experimental scripts are all publicly available at WelQrate.org.
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z) - EBES: Easy Benchmarking for Event Sequences [17.277513178760348]
Event sequences are common data structures in various real-world domains such as healthcare, finance, and user interaction logs.
Despite advances in temporal data modeling techniques, there is no standardized benchmarks for evaluating their performance on event sequences.
We introduce EBES, a comprehensive benchmarking tool with standardized evaluation scenarios and protocols.
arXiv Detail & Related papers (2024-10-04T13:03:43Z) - Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - Rethinking Model Prototyping through the MedMNIST+ Dataset Collection [0.11999555634662634]
This work presents a benchmark for the MedMNIST+ database to diversify the evaluation landscape.
We conduct a thorough analysis of common convolutional neural networks (CNNs) and Transformer-based architectures, for medical image classification.
Our findings suggest that computationally efficient training schemes and modern foundation models hold promise in bridging the gap between expensive end-to-end training and more resource-refined approaches.
arXiv Detail & Related papers (2024-04-24T10:19:25Z) - Towards a more inductive world for drug repurposing approaches [0.545520830707066]
Drug-target interaction (DTI) prediction is a challenging, albeit essential task in drug repurposing.
We show that DTI prediction methods based on transductive models lack generalization and lead to inflated performance.
We propose a novel biologically-driven strategy for negative edge subsampling and show through in vitro validation that newly discovered interactions are indeed true.
arXiv Detail & Related papers (2023-11-21T15:28:44Z) - On the Evaluation and Refinement of Vision-Language Instruction Tuning
Datasets [71.54954966652286]
We try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION can achieve the performance comparable to simply adding all VLIT datasets up.
arXiv Detail & Related papers (2023-10-10T13:01:38Z) - Drug Discovery under Covariate Shift with Domain-Informed Prior
Distributions over Functions [30.305418761024143]
Real-world drug discovery tasks are often characterized by a scarcity of labeled data and a significant range of data.
We present a principled way to encode explicit prior knowledge of the data-generating process into a prior distribution.
We demonstrate that using integrate Q-SAVI to contextualize prior knowledgelike chemical space into the modeling process affords substantial accuracy and calibration.
arXiv Detail & Related papers (2023-07-14T05:01:10Z) - DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for
AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise
Annotations [90.27736364704108]
We present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery.
DrugOOD comes with an open-source Python package that fully automates benchmarking processes.
We focus on one of the most crucial problems in AIDD: drug target binding affinity prediction.
arXiv Detail & Related papers (2022-01-24T12:32:48Z) - How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating
and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
arXiv Detail & Related papers (2021-02-17T18:25:30Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.