Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis
- URL: http://arxiv.org/abs/2503.14356v1
- Date: Tue, 18 Mar 2025 15:40:18 GMT
- Title: Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis
- Authors: Alexander Partin, Priyanka Vasanthakumari, Oleksandr Narykov, Andreas Wilke, Natasha Koussa, Sara E. Jones, Yitan Zhu, Jamie C. Overbeek, Rajeev Jain, Gayara Demini Fernando, Cesar Sanchez-Villalobos, Cristina Garcia-Cardona, Jamaludin Mohd-Yusof, Nicholas Chia, Justin M. Wozniak, Souparno Ghosh, Ranadip Pal, Thomas S. Brettin, M. Ryan Weil, Rick L. Stevens
- Abstract summary: We introduce a benchmarking framework for evaluating cross-dataset prediction generalization in deep learning (DL) and machine learning (ML) models. We quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., performance drop compared to within-dataset results). Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments.
- Score: 36.689210473887904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning (DL) and machine learning (ML) models have shown promise in drug response prediction (DRP), yet their ability to generalize across datasets remains an open question, raising concerns about their real-world applicability. Due to the lack of standardized benchmarking approaches, model evaluations and comparisons often rely on inconsistent datasets and evaluation criteria, making it difficult to assess true predictive capabilities. In this work, we introduce a benchmarking framework for evaluating cross-dataset prediction generalization in DRP models. Our framework incorporates five publicly available drug screening datasets, six standardized DRP models, and a scalable workflow for systematic evaluation. To assess model generalization, we introduce a set of evaluation metrics that quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., performance drop compared to within-dataset results), enabling a more comprehensive assessment of model transferability. Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments. While several models demonstrate relatively strong cross-dataset generalization, no single model consistently outperforms across all datasets. Furthermore, we identify CTRPv2 as the most effective source dataset for training, yielding higher generalization scores across target datasets. By sharing this standardized evaluation framework with the community, our study aims to establish a rigorous foundation for model comparison, and accelerate the development of robust DRP models for real-world applications.
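The abstract does not spell out how the absolute and relative metrics are computed, so the following is a minimal Python sketch of one plausible formulation: train on a source dataset, score on held-out source data (within-dataset) and on each unseen target dataset (cross-dataset), and report the fractional drop. The dictionary layout, the `fit`/`predict` model interface, and the use of Pearson correlation as the absolute score are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation): absolute cross-dataset
# performance and a relative "performance drop" metric for DRP models.
from scipy.stats import pearsonr

def absolute_score(y_true, y_pred):
    """Absolute performance: Pearson correlation between measured and predicted drug response."""
    r, _ = pearsonr(y_true, y_pred)
    return r

def relative_drop(within_score, cross_score):
    """Relative performance: fractional drop of the cross-dataset score
    relative to the within-dataset (source-only) score."""
    return (within_score - cross_score) / abs(within_score)

def evaluate_generalization(model, source, targets):
    """Train on the source dataset, then score on held-out source data
    (within-dataset) and on each unseen target dataset (cross-dataset)."""
    model.fit(source["X_train"], source["y_train"])
    within = absolute_score(source["y_test"], model.predict(source["X_test"]))
    results = {}
    for name, tgt in targets.items():
        cross = absolute_score(tgt["y"], model.predict(tgt["X"]))
        results[name] = {"cross_score": cross, "relative_drop": relative_drop(within, cross)}
    return within, results
```

Under a formulation like this, the paper's finding that CTRPv2 is the most effective source dataset would correspond to models trained on it showing the smallest relative drops across the target datasets.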
Related papers
- On Large-scale Evaluation of Embedding Models for Knowledge Graph Completion [1.2703808802607108]
Knowledge graph embedding (KGE) models are extensively studied for knowledge graph completion, yet their evaluation remains constrained by unrealistic benchmarks.
Standard evaluation metrics rely on the closed-world assumption, which penalizes models for correctly predicting missing triples.
This paper conducts a comprehensive evaluation of four representative KGE models on large-scale datasets FB-CVT-REV and FB+CVT-REV.
arXiv Detail & Related papers (2025-04-11T20:49:02Z) - Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations.
They generate only a limited range of perturbations for a single Information Extraction (IE) task.
Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench.
We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z) - Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark [53.876493664396506]
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions.
This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context.
We propose a debiased relation extraction benchmark DREB that breaks the pseudo-correlation between entity mentions and relation types through entity replacement.
To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques.
arXiv Detail & Related papers (2025-01-02T17:01:06Z) - Optimizing Sequential Recommendation Models with Scaling Laws and Approximate Entropy [104.48511402784763]
Performance Law for SR models aims to theoretically investigate and model the relationship between model performance and data quality. We propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data quantity metrics.
arXiv Detail & Related papers (2024-11-30T10:56:30Z) - On Evaluation of Vision Datasets and Models using Human Competency Frameworks [20.802372291783488]
Item Response Theory (IRT) is a framework that infers interpretable latent parameters for an ensemble of models and each dataset item.
We assess model calibration, select informative data subsets, and demonstrate the usefulness of its latent parameters for analyzing and comparing models and datasets in computer vision.
arXiv Detail & Related papers (2024-09-06T06:20:11Z) - Ranking and Combining Latent Structured Predictive Scores without Labeled Data [2.5064967708371553]
This paper introduces a novel structured unsupervised ensemble learning model (SUEL).
It exploits the dependency between a set of predictors with continuous predictive scores, ranks the predictors without labeled data, and combines them into a weighted ensemble score.
The efficacy of the proposed methods is rigorously assessed through both simulation studies and real-world application of risk genes discovery.
arXiv Detail & Related papers (2024-08-14T20:14:42Z) - A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation [17.351089059392674]
We propose a framework for model evaluation that includes stratification, sampling, and estimation components.
We show that stratification via k-means clustering based on accurate predictions of model performance yields efficient estimators.
We also find that model-assisted estimators, which leverage predictions of model accuracy on the unlabeled portion of the dataset, are generally more efficient than traditional estimators (a generic sketch of such an estimator appears after this list).
arXiv Detail & Related papers (2024-06-11T14:49:04Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models(LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks [27.83907050770602]
We examine the performance of six modeling approaches across four datasets, split according to eight compositional splitting strategies.
Our results demonstrate that much work remains to be done when it comes to assessing whether popular evaluation datasets measure what they intend to measure.
arXiv Detail & Related papers (2023-10-26T16:11:04Z) - On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets [71.54954966652286]
We evaluate Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher sample quality (SQ) from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION can achieve the performance comparable to simply adding all VLIT datasets up.
arXiv Detail & Related papers (2023-10-10T13:01:38Z) - Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development.
Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z) - Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models [3.9052860539161918]
We propose a simple method for measuring the scale of a model's reliance on any identified spurious feature.
We assess robustness to a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA).
We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods cannot be explained by mitigated reliance on biased features.
arXiv Detail & Related papers (2023-05-11T14:35:00Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show that GREAT Score correlates highly with attack-based model rankings on RobustBench while being significantly cheaper to compute.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
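The entry above on efficient model evaluation mentions model-assisted estimators that combine predicted accuracy on unlabeled data with a small labeled sample; the sketch below illustrates the general idea with a plain difference estimator. It is a generic illustration on assumed synthetic data, not the specific estimator or stratification scheme proposed in that paper.

```python
# Generic model-assisted (difference) estimator for mean model accuracy:
# predicted per-item accuracies are available for every item, while true
# accuracies are known only for a small randomly drawn labeled sample.
import numpy as np

def model_assisted_mean(pred_all, true_sample, pred_sample):
    """Predicted mean over the full set plus a bias correction
    estimated from the labeled sample."""
    return np.mean(pred_all) + np.mean(true_sample - pred_sample)

# Synthetic example (assumed data, for illustration only).
rng = np.random.default_rng(0)
pred_all = rng.uniform(0.6, 0.9, size=10_000)               # predicted accuracy per item
true_all = np.clip(pred_all + rng.normal(0, 0.05, 10_000), 0.0, 1.0)
labeled = rng.choice(10_000, size=200, replace=False)        # small labeled subset
estimate = model_assisted_mean(pred_all, true_all[labeled], pred_all[labeled])
print(f"estimate={estimate:.4f}  true mean={true_all.mean():.4f}")
```

Stratifying the labeled sample, for example by clustering the predicted accuracies with k-means as that paper suggests, would typically reduce the variance of this estimate further.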