Accounting for multiplicity in machine learning benchmark performance
- URL: http://arxiv.org/abs/2303.07272v5
- Date: Wed, 11 Sep 2024 18:38:10 GMT
- Title: Accounting for multiplicity in machine learning benchmark performance
- Authors: Kajsa Møllersen, Einar Holsbø,
- Abstract summary: Using the highest-ranked performance as an estimate for state-of-the-art (SOTA) performance is a biased estimator, giving overly optimistic results.
In this article, we provide a probability distribution for the case of multiple classifiers so that known analyses methods can be engaged and a better SOTA estimate can be provided.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Machine learning methods are commonly evaluated and compared by their performance on data sets from public repositories. This allows for multiple methods, oftentimes several thousands, to be evaluated under identical conditions and across time. The highest ranked performance on a problem is referred to as state-of-the-art (SOTA) performance, and is used, among other things, as a reference point for publication of new methods. Using the highest-ranked performance as an estimate for SOTA is a biased estimator, giving overly optimistic results. The mechanisms at play are those of multiplicity, a topic that is well-studied in the context of multiple comparisons and multiple testing, but has, as far as the authors are aware of, been nearly absent from the discussion regarding SOTA estimates. The optimistic state-of-the-art estimate is used as a standard for evaluating new methods, and methods with substantial inferior results are easily overlooked. In this article, we provide a probability distribution for the case of multiple classifiers so that known analyses methods can be engaged and a better SOTA estimate can be provided. We demonstrate the impact of multiplicity through a simulated example with independent classifiers. We show how classifier dependency impacts the variance, but also that the impact is limited when the accuracy is high. Finally, we discuss three real-world examples; Kaggle competitions that demonstrate various aspects.
Related papers
- Quantifying Variance in Evaluation Benchmarks [34.12254884944099]
We measure variance in evaluation benchmarks, including seed variance across initialisations, and monotonicity during training.
We find that simple changes, such as framing choice tasks as completion tasks, can often reduce variance for smaller scale.
More involved methods inspired from human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance.
arXiv Detail & Related papers (2024-06-14T17:59:54Z) - Unsupervised Estimation of Ensemble Accuracy [0.0]
We present a method for estimating the joint power of several classifiers.
It differs from existing approaches which focus on "diversity" measures by not relying on labels.
We demonstrate the method on popular large-scale face recognition datasets.
arXiv Detail & Related papers (2023-11-18T02:31:36Z) - Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z) - An Additive Instance-Wise Approach to Multi-class Model Interpretation [53.87578024052922]
Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system.
Existing methods mainly focus on selecting explanatory input features, which follow either locally additive or instance-wise approaches.
This work exploits the strengths of both methods and proposes a global framework for learning local explanations simultaneously for multiple target classes.
arXiv Detail & Related papers (2022-07-07T06:50:27Z) - Realistic Evaluation of Transductive Few-Shot Learning [41.06192162435249]
Transductive inference is widely used in few-shot learning.
We study the effect of arbitrary class distributions within the query sets of few-shot tasks at inference.
We evaluate experimentally state-of-the-art transductive methods over 3 widely used data sets.
arXiv Detail & Related papers (2022-04-24T03:35:06Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - What are the best systems? New perspectives on NLP Benchmarking [10.27421161397197]
We propose a new procedure to rank systems based on their performance across different tasks.
Motivated by the social choice theory, the final system ordering is obtained through aggregating the rankings induced by each task.
We show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure.
arXiv Detail & Related papers (2022-02-08T11:44:20Z) - An Investigation of Replay-based Approaches for Continual Learning [79.0660895390689]
Continual learning (CL) is a major challenge of machine learning (ML) and describes the ability to learn several tasks sequentially without catastrophic forgetting (CF)
Several solution classes have been proposed, of which so-called replay-based approaches seem very promising due to their simplicity and robustness.
We empirically investigate replay-based approaches of continual learning and assess their potential for applications.
arXiv Detail & Related papers (2021-08-15T15:05:02Z) - Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the difficult nature of the one-class problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z) - Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z) - Minority Class Oversampling for Tabular Data with Deep Generative Models [4.976007156860967]
We study the ability of deep generative models to provide realistic samples that improve performance on imbalanced classification tasks via oversampling.
Our experiments show that the way the method of sampling does not affect quality, but runtime varies widely.
We also observe that the improvements in terms of performance metric, while shown to be significant, often are minor in absolute terms.
arXiv Detail & Related papers (2020-05-07T21:35:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.