The curious case of the test set AUROC
- URL: http://arxiv.org/abs/2312.16188v1
- Date: Tue, 19 Dec 2023 17:40:58 GMT
- Title: The curious case of the test set AUROC
- Authors: Michael Roberts, Alon Hazan, Sören Dittmer, James H.F. Rudd, and Carola-Bibiane Schönlieb
- Abstract summary: We argue that considering scores derived from the test ROC curve alone gives only a narrow insight into how a model performs and its ability to generalise.
- Score: 0.5242869847419834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Whilst the size and complexity of ML models have rapidly and significantly
increased over the past decade, the methods for assessing their performance
have not kept pace. In particular, among the many potential performance
metrics, the ML community stubbornly continues to use (a) the area under the
receiver operating characteristic curve (AUROC) for a validation and test
cohort (distinct from training data) or (b) the sensitivity and specificity for
the test data at an optimal threshold determined from the validation ROC.
However, we argue that considering scores derived from the test ROC curve alone
gives only a narrow insight into how a model performs and its ability to
generalise.
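As a concrete illustration of the evaluation recipe the abstract criticises, the sketch below computes (a) the test AUROC and (b) test sensitivity and specificity at a threshold chosen from the validation ROC. The synthetic data, logistic-regression model, and Youden's J threshold rule are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Illustrative data split into train / validation / test cohorts.
X, y = make_classification(n_samples=3000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# (a) AUROC on the held-out test cohort.
test_scores = model.predict_proba(X_test)[:, 1]
print("test AUROC:", roc_auc_score(y_test, test_scores))

# (b) Pick an "optimal" threshold from the validation ROC (here: Youden's J),
# then report test sensitivity and specificity at that single operating point.
val_scores = model.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, val_scores)
threshold = thresholds[np.argmax(tpr - fpr)]

pred = (test_scores >= threshold).astype(int)
tp = np.sum((pred == 1) & (y_test == 1))
tn = np.sum((pred == 0) & (y_test == 0))
fp = np.sum((pred == 1) & (y_test == 0))
fn = np.sum((pred == 0) & (y_test == 1))
print("test sensitivity:", tp / (tp + fn))
print("test specificity:", tn / (tn + fp))
```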
Related papers
- Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
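For illustration only, the sketch below shows one common hard-sample identification heuristic (ranking samples by per-sample loss); it is an assumed baseline, not necessarily one of the methods compared in the paper.

```python
import numpy as np

def hardest_by_loss(per_sample_losses: np.ndarray, k: int) -> np.ndarray:
    """Rank samples by the model's per-sample loss and flag the k
    highest-loss samples as 'hard'."""
    return np.argsort(per_sample_losses)[::-1][:k]

losses = np.array([0.02, 1.35, 0.10, 0.87])  # illustrative per-sample losses
print(hardest_by_loss(losses, k=2))          # -> indices of the two hardest samples
```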
arXiv Detail & Related papers (2024-09-22T11:38:14Z)
- Multi-stream deep learning framework to predict mild cognitive impairment with Rey Complex Figure Test [10.324611550865926]
We developed a multi-stream deep learning framework that integrates two distinct processing streams.
The proposed multi-stream model demonstrated superior performance over baseline models in external validation.
Our model has practical implications for clinical settings, where it could serve as a cost-effective tool for early screening.
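A minimal sketch of a generic two-stream architecture of the kind described above; the encoder shapes and layer sizes are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    """Two independent encoders whose features are concatenated
    before a shared classification head (illustrative sketch)."""
    def __init__(self, dim_a: int, dim_b: int, hidden: int = 64, n_classes: int = 2):
        super().__init__()
        self.stream_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())
        self.stream_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([self.stream_a(x_a), self.stream_b(x_b)], dim=-1))

model = TwoStreamNet(dim_a=32, dim_b=16)
logits = model(torch.randn(4, 32), torch.randn(4, 16))
print(logits.shape)  # torch.Size([4, 2])
```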
arXiv Detail & Related papers (2024-09-04T17:08:04Z)
- Federated Nonparametric Hypothesis Testing with Differential Privacy Constraints: Optimal Rates and Adaptive Tests [5.3595271893779906]
Federated learning has attracted significant recent attention due to its applicability across a wide range of settings where data is collected and analyzed across disparate locations.
We study federated nonparametric goodness-of-fit testing in the white-noise-with-drift model under distributed differential privacy (DP) constraints.
arXiv Detail & Related papers (2024-06-10T19:25:19Z)
- Investigating the Robustness of Counterfactual Learning to Rank Models: A Reproducibility Study [61.64685376882383]
Counterfactual learning to rank (CLTR) has attracted extensive attention in the IR community for its ability to leverage massive logged user interaction data to train ranking models.
This paper investigates the robustness of existing CLTR models in complex and diverse situations.
We find that the DLA models and IPS-DCM show better robustness under various simulation settings than IPS-PBM and PRS with offline propensity estimation.
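For context, the sketch below illustrates the inverse propensity scoring (IPS) idea underlying several of the compared CLTR models: clicked documents are reweighted by the inverse of their estimated examination propensity to correct for position bias. The toy click log and the pointwise loss are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy click log for one query: ranker scores, observed clicks, and estimated
# examination propensities per rank (all values illustrative).
ranker_scores = np.array([2.1, 1.4, 0.3, -0.5])
clicks = np.array([1, 0, 1, 0])
propensities = np.array([0.9, 0.6, 0.35, 0.2])

def ips_weighted_loss(scores, clicks, propensities):
    """Pointwise logistic loss where each clicked document is reweighted by
    1/propensity, so that position bias in the click data is corrected."""
    probs = 1.0 / (1.0 + np.exp(-scores))
    per_doc = -np.log(probs)            # loss if the document is treated as relevant
    weights = clicks / propensities     # IPS weights: only clicked docs contribute
    return float(np.sum(weights * per_doc) / len(scores))

print("IPS-weighted loss:", ips_weighted_loss(ranker_scores, clicks, propensities))
```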
arXiv Detail & Related papers (2024-04-04T10:54:38Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
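A minimal sketch of selective generation driven by self-evaluation scores: answer only when the best self-assigned score clears a threshold, otherwise abstain. The candidate answers, scores, and threshold are hypothetical; in practice the scores would come from prompting the LLM to grade its own outputs.

```python
from typing import Optional

def selective_answer(candidates: list[str], scores: list[float],
                     threshold: float = 0.8) -> Optional[str]:
    """Return the candidate with the highest self-evaluation score, or
    abstain (None) if even the best candidate falls below the threshold."""
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx] if scores[best_idx] >= threshold else None

# Hypothetical example: scores would come from the LLM grading its own answers.
answers = ["Paris", "Lyon"]
self_eval_scores = [0.95, 0.30]
print(selective_answer(answers, self_eval_scores))  # -> "Paris"
```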
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Low-Cost High-Power Membership Inference Attacks [15.240271537329534]
Membership inference attacks aim to detect if a particular data point was used in training a model.
We design a novel statistical test to perform robust membership inference attacks with low computational overhead.
RMIA lays the groundwork for practical yet accurate data privacy risk assessment in machine learning.
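Not the RMIA test itself, but a sketch of the basic loss-thresholding baseline that membership inference attacks build on: examples the model fits unusually well are flagged as likely training members. The losses and threshold are illustrative assumptions.

```python
import numpy as np

def loss_threshold_mia(per_example_losses: np.ndarray, threshold: float) -> np.ndarray:
    """Predict 'member' (1) when the model's loss on an example is below a
    threshold, exploiting that models fit training data more tightly than unseen data."""
    return (per_example_losses < threshold).astype(int)

# Illustrative losses: members tend to have lower loss than non-members.
member_losses = np.array([0.05, 0.10, 0.30])
nonmember_losses = np.array([0.40, 0.90, 1.20])
losses = np.concatenate([member_losses, nonmember_losses])
print(loss_threshold_mia(losses, threshold=0.35))  # -> [1 1 1 0 0 0]
```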
arXiv Detail & Related papers (2023-12-06T03:18:49Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
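A minimal sketch of the psychometric machinery behind adaptive testing, assuming a two-parameter logistic (2PL) item response model: each benchmark item has a discrimination and difficulty, and the next item is chosen to maximise Fisher information at the current ability estimate. The item bank and ability value are illustrative.

```python
import numpy as np

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response model: probability that an examinee/model with ability
    theta answers an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta; adaptive testing
    administers the item with the highest information next."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# Illustrative item bank: (discrimination, difficulty) per benchmark item.
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0)]
theta_hat = 0.4  # current ability estimate for the model under evaluation
best = max(range(len(items)), key=lambda i: item_information(theta_hat, *items[i]))
print("next item to administer:", best)
```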
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Quality In / Quality Out: Data quality more relevant than model choice in anomaly detection with the UGR'16 [0.29998889086656577]
We show that relatively minor modifications to a benchmark dataset have significantly more impact on model performance than the choice of ML technique.
We also show that the measured model performance is uncertain, as a result of labelling inaccuracies.
arXiv Detail & Related papers (2023-05-31T12:03:12Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Under this elaborated robustness metric, a model is judged to be robust if its performance is consistently accurate across all cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- A Targeted Accuracy Diagnostic for Variational Approximations [8.969208467611896]
Variational Inference (VI) is an attractive alternative to Markov Chain Monte Carlo (MCMC).
Existing methods characterize the quality of the whole variational distribution.
We propose the TArgeted Diagnostic for Distribution Approximation Accuracy (TADDAA).
arXiv Detail & Related papers (2023-02-24T02:50:18Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
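A minimal sketch of the ATC recipe as summarised above: calibrate a confidence threshold on labelled source data so the fraction of examples above it matches source accuracy, then predict target accuracy as the fraction of unlabelled target examples above that threshold. The confidence values and the quantile-based calibration are illustrative assumptions.

```python
import numpy as np

def atc_predict_accuracy(source_conf: np.ndarray, source_correct: np.ndarray,
                         target_conf: np.ndarray) -> float:
    """Pick a confidence threshold on labelled source data so that the fraction
    of source examples above it matches source accuracy, then predict target
    accuracy as the fraction of unlabelled target examples above that threshold."""
    source_acc = source_correct.mean()
    threshold = np.quantile(source_conf, 1.0 - source_acc)
    return float((target_conf > threshold).mean())

# Illustrative confidences (e.g. max softmax probability) and correctness flags.
src_conf = np.array([0.95, 0.90, 0.80, 0.60, 0.55])
src_correct = np.array([1, 1, 1, 0, 0])
tgt_conf = np.array([0.92, 0.70, 0.65, 0.50])
print("predicted target accuracy:", atc_predict_accuracy(src_conf, src_correct, tgt_conf))
```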
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood-based model selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z)