Constructing Confidence Intervals for 'the' Generalization Error -- a Comprehensive Benchmark Study
- URL: http://arxiv.org/abs/2409.18836v2
- Date: Wed, 15 Jan 2025 10:02:02 GMT
- Title: Constructing Confidence Intervals for 'the' Generalization Error -- a Comprehensive Benchmark Study
- Authors: Hannah Schulz-Kümpel, Sebastian Fischer, Roman Hornung, Anne-Laure Boulesteix, Thomas Nagler, Bernd Bischl
- Abstract summary: In machine learning, confidence intervals (CIs) for the generalization error are a crucial tool. We evaluate 13 different CI methods on a total of 19 regression and classification problems, using seven different inducers and a total of eight loss functions. We give an overview of the methodological foundations and inherent challenges of constructing CIs for the generalization error and provide a concise review of all 13 methods in a unified framework.
- Score: 7.094603504956301
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When assessing the quality of prediction models in machine learning, confidence intervals (CIs) for the generalization error, which measures predictive performance, are a crucial tool. Luckily, there exist many methods for computing such CIs and new promising approaches are continuously being proposed. Typically, these methods combine various resampling procedures, most popular among them cross-validation and bootstrapping, with different variance estimation techniques. Unfortunately, however, there is currently no consensus on when any of these combinations may be most reliably employed and how they generally compare. In this work, we conduct a large-scale study comparing CIs for the generalization error, the first one of such size, where we empirically evaluate 13 different CI methods on a total of 19 tabular regression and classification problems, using seven different inducers and a total of eight loss functions. We give an overview of the methodological foundations and inherent challenges of constructing CIs for the generalization error and provide a concise review of all 13 methods in a unified framework. Finally, the CI methods are evaluated in terms of their relative coverage frequency, width, and runtime. Based on these findings, we can identify a subset of methods that we would recommend. We also publish the datasets as a benchmarking suite on OpenML and our code on GitHub to serve as a basis for further studies.
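To ground the setting, the following is a minimal sketch of the simplest kind of construction the paper's framework covers: a K-fold cross-validation point estimate of the generalization error combined with a naive t-interval over fold-level losses. The function name, the Ridge inducer, and the synthetic data are illustrative assumptions; none of the 13 benchmarked methods is reproduced here.

```python
# Hedged sketch: naive CV-based CI for the generalization error (illustrative,
# not the paper's code). Point estimate = mean fold loss; interval = t-interval
# over the fold losses.
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def naive_cv_ci(X, y, inducer, k=10, alpha=0.05):
    losses = []
    for tr, te in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = inducer().fit(X[tr], y[tr])
        losses.append(mean_squared_error(y[te], model.predict(X[te])))
    losses = np.asarray(losses)
    est = losses.mean()
    se = losses.std(ddof=1) / np.sqrt(k)            # naive standard error
    half = stats.t.ppf(1 - alpha / 2, df=k - 1) * se
    return est, (est - half, est + half)

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
print(naive_cv_ci(X, y, Ridge))
```

A known caveat, and one of the inherent challenges the abstract alludes to: the fold losses come from models that share training data, so they are positively correlated, and this naive interval tends to undercover.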
Related papers
- POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation [76.67608003501479]
We introduce and specify an evaluation protocol defining a range of domain-related metrics computed on the basis of the primary evaluation indicators.
The results of such a comparison, which involves a variety of state-of-the-art MARL, search-based, and hybrid methods, are presented.
arXiv Detail & Related papers (2024-07-20T16:37:21Z)
- Robust CATE Estimation Using Novel Ensemble Methods [0.8246494848934447]
Estimation of Conditional Average Treatment Effects (CATE) is crucial for understanding the heterogeneity of treatment effects in clinical trials.
We evaluate the performance of common methods, including causal forests and various meta-learners, across a diverse set of scenarios.
We propose two new ensemble methods that integrate multiple estimators to enhance prediction stability and performance.
arXiv Detail & Related papers (2024-07-04T07:23:02Z)
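As a pointer to what the meta-learners mentioned in the CATE entry above do, here is a hedged sketch of a plain T-learner; the learner choice and function name are illustrative assumptions, and the paper's proposed ensembles are not reproduced.

```python
# Hedged sketch of a T-learner (illustrative, not the paper's ensembles):
# fit separate outcome models on treated and control units, then estimate
# CATE(x) as the difference of their predictions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def t_learner_cate(X, y, treated):
    """treated: boolean mask over rows of X; returns per-unit CATE estimates."""
    mu1 = GradientBoostingRegressor(random_state=0).fit(X[treated], y[treated])
    mu0 = GradientBoostingRegressor(random_state=0).fit(X[~treated], y[~treated])
    return mu1.predict(X) - mu0.predict(X)
```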
- Combine and Conquer: A Meta-Analysis on Data Shift and Out-of-Distribution Detection [30.377446496559635]
This paper introduces a universal approach to seamlessly combine out-of-distribution (OOD) detection scores.
Our framework is easily extensible to future developments in detection scores and stands as the first to combine decision boundaries in this context.
arXiv Detail & Related papers (2024-06-23T08:16:44Z)
- Evaluating machine learning models in non-standard settings: An overview and new findings [7.834267158484847]
Estimating the generalization error (GE) of machine learning models is fundamental.
In non-standard settings, particularly those where observations are not independently and identically distributed, resampling may lead to biased GE estimates.
This paper presents well-grounded guidelines for GE estimation in various such non-standard settings.
arXiv Detail & Related papers (2023-10-23T17:15:11Z)
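The non-standard-settings entry above turns on a concrete pitfall: with clustered (non-iid) observations, naive resampling splits clusters across train and test folds and biases the GE estimate. A hedged sketch on synthetic data follows; all names and the data-generating process are illustrative assumptions.

```python
# Hedged sketch: naive KFold vs. GroupKFold on clustered data. Naive CV leaks
# within-cluster signal into the test folds, making the GE estimate optimistic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(40), 5)            # 40 clusters of 5 observations
cluster_effect = rng.normal(size=40)[groups]    # signal shared within a cluster
X = rng.normal(size=(200, 3)) + cluster_effect[:, None]
y = cluster_effect + rng.normal(scale=0.5, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
naive = cross_val_score(model, X, y, scoring="neg_mean_squared_error",
                        cv=KFold(n_splits=5, shuffle=True, random_state=0))
grouped = cross_val_score(model, X, y, groups=groups,
                          scoring="neg_mean_squared_error",
                          cv=GroupKFold(n_splits=5))
print(-naive.mean(), -grouped.mean())  # naive MSE is typically the smaller one
```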
- Exploiting Structure for Optimal Multi-Agent Bayesian Decentralized Estimation [4.320393382724066]
A key challenge in Bayesian decentralized data fusion is the 'rumor propagation' or 'double counting' phenomenon.
We show that by exploiting the probabilistic independence structure in multi-agent decentralized fusion problems, a tighter bound can be found.
We then test our new non-monolithic CI algorithm on a large-scale target tracking simulation and show that it achieves a tighter bound and a more accurate estimate.
arXiv Detail & Related papers (2023-07-20T05:16:33Z)
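In the fusion entry above, 'CI' stands for covariance intersection rather than confidence interval. Here is a hedged sketch of the standard (monolithic) covariance-intersection rule, which avoids double counting without tracking cross-correlations; the paper's non-monolithic variant, which exploits independence structure for a tighter bound, is not reproduced.

```python
# Hedged sketch of standard covariance intersection: fuse two estimates
# (xa, Pa) and (xb, Pb) whose cross-correlation is unknown.
import numpy as np

def covariance_intersection(xa, Pa, xb, Pb, omega):
    """omega in [0, 1]; in practice chosen to minimize trace(Pf) or det(Pf)."""
    Pa_inv, Pb_inv = np.linalg.inv(Pa), np.linalg.inv(Pb)
    Pf = np.linalg.inv(omega * Pa_inv + (1 - omega) * Pb_inv)
    xf = Pf @ (omega * Pa_inv @ xa + (1 - omega) * Pb_inv @ xb)
    return xf, Pf
```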
- Rethinking Clustering-Based Pseudo-Labeling for Unsupervised Meta-Learning [146.11600461034746]
CACTUs, a method for unsupervised meta-learning, is a clustering-based approach with pseudo-labeling.
This approach is model-agnostic and can be combined with supervised algorithms to learn from unlabeled data.
We prove that the core reason for this is the lack of a clustering-friendly property in the embedding space.
arXiv Detail & Related papers (2022-09-27T19:04:36Z)
- Assaying Out-Of-Distribution Generalization in Transfer Learning [103.57862972967273]
We take a unified view of previous work, highlighting message discrepancies that we address empirically.
We fine-tune over 31k networks, from nine different architectures in the many- and few-shot setting.
arXiv Detail & Related papers (2022-07-19T12:52:33Z)
- Posterior and Computational Uncertainty in Gaussian Processes [52.26904059556759]
Gaussian processes scale prohibitively with the size of the dataset.
Many approximation methods have been developed, which inevitably introduce approximation error.
This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior.
We develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended.
arXiv Detail & Related papers (2022-05-30T22:16:25Z)
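To illustrate the point of the Gaussian-process entry above: the standard posterior-variance formula accounts only for the finite data, so any approximation in the linear solve changes the reported uncertainty without acknowledging it. The crude subset-of-data approximation below is an illustrative assumption, not the paper's method.

```python
# Hedged sketch: exact vs. approximated GP posterior variance at a test point.
# The discrepancy is computational error the standard formula simply ignores.
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
xs = np.array([[0.5]])
noise = 0.1

K = rbf(X, X) + noise * np.eye(len(X))
kxs = rbf(X, xs)
exact_var = rbf(xs, xs) - kxs.T @ np.linalg.solve(K, kxs)

idx = rng.choice(len(X), size=20, replace=False)   # crude subset-of-data solve
Ks = rbf(X[idx], X[idx]) + noise * np.eye(len(idx))
kxs_s = rbf(X[idx], xs)
approx_var = rbf(xs, xs) - kxs_s.T @ np.linalg.solve(Ks, kxs_s)

print(exact_var.item(), approx_var.item())  # gap = unacknowledged comp. error
```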
- TsmoBN: Interventional Generalization for Unseen Clients in Federated Learning [23.519212374186232]
We form a training structural causal model (SCM) to explain the challenges of model generalization in a distributed learning paradigm.
We present a simple yet effective method using test-specific and momentum tracked batch normalization (TsmoBN) to generalize FL models to testing clients.
arXiv Detail & Related papers (2021-10-19T13:46:37Z)
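A hedged sketch of the core mechanism in the federated-learning entry above: batch-normalization statistics tracked with momentum over test batches. The paper's exact procedure and momentum schedule may differ, and the function name is an illustrative assumption.

```python
# Hedged sketch in the spirit of TsmoBN: instead of frozen training statistics,
# track running mean/var of a BN layer over the unseen client's test batches.
import torch

@torch.no_grad()
def tsmobn_update(bn: torch.nn.BatchNorm2d, x: torch.Tensor, momentum=0.9):
    """Update bn's running statistics from a test batch x of shape (N, C, H, W)."""
    mean = x.mean(dim=(0, 2, 3))
    var = x.var(dim=(0, 2, 3), unbiased=False)
    bn.running_mean.mul_(momentum).add_((1 - momentum) * mean)
    bn.running_var.mul_(momentum).add_((1 - momentum) * var)
```

With the module kept in eval mode, the forward pass then normalizes with these test-specific running statistics.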
- A generalized framework for active learning reliability: survey and benchmark [0.0]
We propose a modular framework to build on-the-fly efficient active learning strategies.
We devise 39 strategies for the solution of 20 reliability benchmark problems.
arXiv Detail & Related papers (2021-06-03T09:33:59Z)
- Re-Assessing the "Classify and Count" Quantification Method [88.60021378715636]
"Classify and Count" (CC) is often a biased estimator.
Previous works have failed to use properly optimised versions of CC.
We argue that, while still inferior to some cutting-edge methods, they deliver near-state-of-the-art accuracy.
arXiv Detail & Related papers (2020-11-04T21:47:39Z)
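For the quantification entry above, the baseline is simple enough to state directly; the adjusted variant (ACC) below is the classic bias correction, shown as a hedged sketch rather than the paper's optimised versions.

```python
# Hedged sketch: "Classify and Count" (CC) and Adjusted CC (ACC). ACC inverts
# E[CC] = tpr * p + fpr * (1 - p) to correct CC's bias, using tpr/fpr
# estimated on held-out validation data.
import numpy as np

def cc(preds):
    """preds: binary predictions; estimated prevalence = fraction positive."""
    return float(np.mean(preds))

def acc(preds, tpr, fpr):
    """Adjusted CC; tpr/fpr are the classifier's validation-set rates."""
    return float(np.clip((np.mean(preds) - fpr) / (tpr - fpr), 0.0, 1.0))
```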
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
- An Empirical Evaluation on Robustness and Uncertainty of Regularization Methods [43.25086015530892]
Deep neural networks (DNNs) behave fundamentally differently from humans.
They can easily change predictions when small corruptions such as blur are applied on the input.
They produce confident predictions on out-of-distribution samples (improper uncertainty measure).
arXiv Detail & Related papers (2020-03-09T01:15:22Z)
- A General Method for Robust Learning from Batches [56.59844655107251]
We consider a general framework of robust learning from batches, and determine the limits of both classification and distribution estimation over arbitrary, including continuous, domains.
We derive the first robust computationally-efficient learning algorithms for piecewise-interval classification, and for piecewise-polynomial, monotone, log-concave, and Gaussian-mixture distribution estimation.
arXiv Detail & Related papers (2020-02-25T18:53:25Z)