Accounting for multiplicity in machine learning benchmark performance
- URL: http://arxiv.org/abs/2303.07272v6
- Date: Mon, 14 Jul 2025 20:05:42 GMT
- Title: Accounting for multiplicity in machine learning benchmark performance
- Authors: Kajsa Møllersen, Einar Holsbø
- Abstract summary: State-of-the-art (SOTA) performance refers to the highest performance achieved by some model on a test sample. We argue that SOTA should instead be estimated by the expected performance of the best classifier.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: State-of-the-art (SOTA) performance refers to the highest performance achieved by some model on a test sample, preferably under controlled conditions such as public data (reproducibility) or public challenges (independent sample). Thousands of classifiers are applied, and the highest performance becomes the new reference point for a particular problem. In effect, this set-up is an estimate of the expected best performance among all classifiers applied to a random sample; a sample maximum estimate. In this paper, we argue that SOTA should instead be estimated by the expected performance of the best classifier, which can be done without knowing which classifier it is. Our contribution is the formal distinction between the two, and an investigation into the practical consequences of using the former to estimate the latter. This is done by presenting sample maximum estimator distributions for non-identical and dependent classifiers. We illustrate the impact on real world examples from public challenges.
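The gap between the two estimands is easy to see in simulation. Below is a minimal sketch, assuming independent classifiers with binomially distributed test scores (the paper treats the harder case of non-identical, dependent classifiers); all sample sizes and accuracy ranges are illustrative assumptions, not values from the paper.

```python
# Sketch: sample-maximum "SOTA" vs. expected performance of the best classifier.
# All parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_classifiers, n_test, n_trials = 1000, 2000, 500
true_acc = rng.uniform(0.78, 0.82, size=n_classifiers)  # latent accuracies
best_expected = true_acc.max()  # expected performance of the best classifier

# Observed accuracies: each classifier's score on a finite test sample.
obs = rng.binomial(n_test, true_acc[None, :],
                   size=(n_trials, n_classifiers)) / n_test
sample_max = obs.max(axis=1)  # the "SOTA" reported in each trial

print(f"expected performance of the best classifier: {best_expected:.4f}")
print(f"mean of the sample maximum (reported SOTA):  {sample_max.mean():.4f}")
```

With these settings the mean sample maximum comes out noticeably above the best classifier's expected accuracy, which is the optimistic bias the paper formalizes.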
Related papers
- Quantifying Variance in Evaluation Benchmarks [34.12254884944099]
We measure variance in evaluation benchmarks, including seed variance across initialisations and monotonicity during training.
We find that simple changes, such as framing choice tasks as completion tasks, can often reduce variance at smaller scales.
More involved methods inspired by the human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance.
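As a rough illustration of seed-variance measurement, the following sketch assumes a user-supplied `train_and_evaluate(seed=...)` routine (a hypothetical placeholder, not an API from the paper) and reports the spread of the benchmark score across seeds.

```python
import statistics

def seed_variance(train_and_evaluate, seeds=range(5)):
    """Run the same benchmark under several seeds; report mean and spread."""
    scores = [train_and_evaluate(seed=s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)
```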
arXiv Detail & Related papers (2024-06-14T17:59:54Z) - Unsupervised Estimation of Ensemble Accuracy [0.0]
We present a method for estimating the joint power of several classifiers.
It differs from existing approaches which focus on "diversity" measures by not relying on labels.
We demonstrate the method on popular large-scale face recognition datasets.
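One label-free signal in this spirit is the pairwise agreement rate between classifiers on unlabelled data; the sketch below illustrates that general idea and is not the estimator proposed in the paper.

```python
import numpy as np

def pairwise_agreement(preds):
    """preds: (n_classifiers, n_samples) array of predicted labels."""
    k = preds.shape[0]
    agree = np.ones((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            # Fraction of samples on which classifiers i and j agree.
            agree[i, j] = agree[j, i] = np.mean(preds[i] == preds[j])
    return agree
```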
arXiv Detail & Related papers (2023-11-18T02:31:36Z) - Statistical Inference for Fairness Auditing [4.318555434063274]
We frame this task as "fairness auditing," in terms of multiple hypothesis testing.
We show how the bootstrap can be used to simultaneously bound performance disparities over a collection of groups.
Our methods can be used to flag subpopulations affected by model underperformance, and certify subpopulations for which the model performs adequately.
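A standard way to obtain simultaneous bounds of this kind is to bootstrap the maximum deviation across groups; the sketch below uses that max-statistic construction as an illustration and is not necessarily the paper's exact procedure.

```python
import numpy as np

def group_error_bounds(y_true, y_pred, groups, n_boot=2000, alpha=0.05, seed=0):
    """Simultaneous (1 - alpha) bounds on per-group error rates."""
    rng = np.random.default_rng(seed)
    errs = (y_true != y_pred).astype(float)
    uniq = np.unique(groups)
    point = np.array([errs[groups == g].mean() for g in uniq])
    max_dev = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(errs), len(errs))   # resample with replacement
        e, g = errs[idx], groups[idx]
        boot = np.array([e[g == u].mean() if (g == u).any() else p
                         for u, p in zip(uniq, point)])
        max_dev[b] = np.abs(boot - point).max()       # worst deviation over groups
    half = np.quantile(max_dev, 1 - alpha)
    return {u: (p - half, p + half) for u, p in zip(uniq, point)}
```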
arXiv Detail & Related papers (2023-05-05T17:54:22Z) - Synergies between Disentanglement and Sparsity: Generalization and Identifiability in Multi-Task Learning [79.83792914684985]
We prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations.
Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem.
arXiv Detail & Related papers (2022-11-26T21:02:09Z) - Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z) - An Additive Instance-Wise Approach to Multi-class Model Interpretation [53.87578024052922]
Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system.
Existing methods mainly focus on selecting explanatory input features, which follow either locally additive or instance-wise approaches.
This work exploits the strengths of both methods and proposes a global framework for learning local explanations simultaneously for multiple target classes.
arXiv Detail & Related papers (2022-07-07T06:50:27Z) - Beyond Adult and COMPAS: Fairness in Multi-Class Prediction [8.405162568925405]
We formulate this problem in terms of "projecting" a pre-trained (and potentially unfair) classifier onto the set of models that satisfy target group-fairness requirements.
We provide a parallelizable iterative algorithm for computing the projected classifier and derive both sample complexity and convergence guarantees.
We also evaluate our method at scale on an open dataset with multiple classes, multiple intersectional protected groups, and over 1M samples.
arXiv Detail & Related papers (2022-06-15T20:29:33Z) - Realistic Evaluation of Transductive Few-Shot Learning [41.06192162435249]
Transductive inference is widely used in few-shot learning.
We study the effect of arbitrary class distributions within the query sets of few-shot tasks at inference.
We evaluate experimentally state-of-the-art transductive methods over 3 widely used data sets.
arXiv Detail & Related papers (2022-04-24T03:35:06Z) - Boost Test-Time Performance with Closed-Loop Inference [85.43516360332646]
We propose to predict hard-classified test samples in a looped manner to boost model performance.
We first devise a filtering criterion to identify those hard-classified test samples that need additional inference loops.
For each hard sample, we construct an additional auxiliary learning task based on its original top-$K$ predictions to calibrate the model.
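A simple stand-in for such a filtering criterion is the top-1/top-2 probability margin; the threshold rule below is an illustrative assumption, not the paper's criterion.

```python
import numpy as np

def is_hard(probs, margin_threshold=0.1):
    """probs: (n_samples, n_classes) softmax outputs.
    Flags samples whose top-1/top-2 margin is small."""
    top2 = np.sort(probs, axis=1)[:, -2:]   # two largest probabilities per row
    margin = top2[:, 1] - top2[:, 0]        # top-1 minus top-2
    return margin < margin_threshold        # True -> run an extra inference loop
```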
arXiv Detail & Related papers (2022-03-21T10:20:21Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - What are the best systems? New perspectives on NLP Benchmarking [10.27421161397197]
We propose a new procedure to rank systems based on their performance across different tasks.
Motivated by the social choice theory, the final system ordering is obtained through aggregating the rankings induced by each task.
We show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure.
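A classical social-choice aggregation rule of this kind is the Borda count; the sketch below shows the mechanics, though the paper's exact rule may differ.

```python
def borda_aggregate(rankings):
    """rankings: list of task-wise rankings, each a list of system names
    ordered best-first. Returns systems ordered by total Borda score."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, system in enumerate(ranking):
            scores[system] = scores.get(system, 0) + (n - 1 - pos)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. borda_aggregate([["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]])
# -> ["A", "B", "C"]
```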
arXiv Detail & Related papers (2022-02-08T11:44:20Z) - When in Doubt: Improving Classification Performance with Alternating Normalization [57.39356691967766]
We introduce Classification with Alternating Normalization (CAN), a non-parametric post-processing step for classification.
CAN improves classification accuracy for challenging examples by re-adjusting their predicted class probability distribution.
We empirically demonstrate its effectiveness across a diverse set of classification tasks.
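The re-adjustment can be pictured as Sinkhorn-style alternating normalization of the matrix of predicted class distributions; the sketch below, including the implicit uniform class prior in the column step, is a simplified illustration rather than the paper's exact algorithm.

```python
import numpy as np

def alternating_normalize(P, n_iters=3):
    """P: (n_samples, n_classes) predicted class distributions.
    Alternately normalize columns (pushing the class marginal toward
    uniform, an assumption here) and rows (back to valid distributions)."""
    P = np.array(P, dtype=float)
    for _ in range(n_iters):
        P /= P.sum(axis=0, keepdims=True)   # column step: balance classes
        P /= P.sum(axis=1, keepdims=True)   # row step: renormalize samples
    return P
```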
arXiv Detail & Related papers (2021-09-28T02:55:42Z) - An Investigation of Replay-based Approaches for Continual Learning [79.0660895390689]
Continual learning (CL) is a major challenge of machine learning (ML) and describes the ability to learn several tasks sequentially without catastrophic forgetting (CF).
Several solution classes have been proposed, of which so-called replay-based approaches seem very promising due to their simplicity and robustness.
We empirically investigate replay-based approaches of continual learning and assess their potential for applications.
arXiv Detail & Related papers (2021-08-15T15:05:02Z) - Unsupervised Learning of Debiased Representations with Pseudo-Attributes [85.5691102676175]
We propose a simple but effective debiasing technique in an unsupervised manner.
We perform clustering on the feature embedding space and identify pseudo-attributes by taking advantage of the clustering results.
We then employ a novel cluster-based reweighting scheme for learning debiased representation.
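The overall recipe can be sketched as: cluster the embeddings, treat cluster assignments as pseudo-attributes, and upweight rare clusters. The k-means choice and inverse-cluster-size weighting below are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_attribute_weights(embeddings, n_clusters=8, seed=0):
    """Cluster embeddings into pseudo-attributes; upweight small clusters."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(embeddings)
    counts = np.bincount(labels, minlength=n_clusters)
    weights = 1.0 / counts[labels]           # rare pseudo-attributes count more
    return labels, weights / weights.mean()  # normalize to mean weight 1
```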
arXiv Detail & Related papers (2021-08-06T05:20:46Z) - Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the one-class nature of the problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z) - Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z) - Identifying Wrongly Predicted Samples: A Method for Active Learning [6.976600214375139]
We propose a simple sample selection criterion that moves beyond uncertainty.
We show state-of-the-art results and better rates at identifying wrongly predicted samples.
arXiv Detail & Related papers (2020-10-14T09:00:42Z) - Achieving Equalized Odds by Resampling Sensitive Attributes [13.114114427206678]
We present a flexible framework for learning predictive models that approximately satisfy the equalized odds notion of fairness.
This differentiable functional is used as a penalty driving the model parameters towards equalized odds.
We develop a formal hypothesis test to detect whether a prediction rule violates this property, the first such test in the literature.
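The quantity that both the penalty and the hypothesis test revolve around is the equalized-odds gap: the largest difference in group-conditional positive-prediction rates given the true label. Below is a minimal sketch of that gap (assuming binary labels and that every group contains both labels), not the paper's differentiable functional or test statistic.

```python
import numpy as np

def equalized_odds_gap(y_true, y_pred, groups):
    """y_true, y_pred in {0, 1}; returns the worst TPR/FPR gap across groups."""
    gaps = []
    for y in (0, 1):                          # condition on the true label
        mask = y_true == y
        rates = [np.mean(y_pred[mask & (groups == g)])
                 for g in np.unique(groups)]  # per-group TPR (y=1) / FPR (y=0)
        gaps.append(max(rates) - min(rates))
    return max(gaps)                          # 0 means equalized odds holds
```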
arXiv Detail & Related papers (2020-06-08T00:18:34Z) - Minority Class Oversampling for Tabular Data with Deep Generative Models [4.976007156860967]
We study the ability of deep generative models to provide realistic samples that improve performance on imbalanced classification tasks via oversampling.
Our experiments show that the choice of sampling method does not affect sample quality, but runtime varies widely.
We also observe that the improvements in terms of performance metric, while shown to be significant, often are minor in absolute terms.
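The pipeline can be sketched with a Gaussian mixture standing in for the deep generative models actually studied; the model choice and component count below are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def oversample_minority(X_min, n_new, n_components=5, seed=0):
    """Fit a generative model to minority-class samples and draw new ones."""
    gm = GaussianMixture(n_components=n_components,
                         random_state=seed).fit(X_min)
    X_new, _ = gm.sample(n_new)              # synthetic minority samples
    return np.vstack([X_min, X_new])         # augmented minority class
```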
arXiv Detail & Related papers (2020-05-07T21:35:57Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z) - Dynamic Decision Boundary for One-class Classifiers applied to non-uniformly Sampled Data [0.9569316316728905]
A typical issue in pattern recognition is non-uniformly sampled data.
In this paper, we propose a one-class classifier based on the minimum spanning tree with a dynamic decision boundary.
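A hedged sketch of the overall idea: fit a minimum spanning tree on the training data and accept a test point if its distance to the nearest training point falls below a boundary derived from the MST edge lengths. The quantile threshold here is an assumption; the paper's dynamic boundary is more elaborate.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def fit_mst_threshold(X_train, q=0.95):
    """Boundary from MST edge lengths (assumes no duplicate points,
    since zero distances read as missing edges in the dense graph)."""
    dense = minimum_spanning_tree(cdist(X_train, X_train)).toarray()
    edges = dense[dense > 0]                  # MST edge lengths
    return np.quantile(edges, q)              # assumed boundary rule

def predict_inlier(X_test, X_train, threshold):
    d = cdist(X_test, X_train).min(axis=1)    # nearest training neighbour
    return d <= threshold                     # True -> accepted as inlier
```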
arXiv Detail & Related papers (2020-04-05T18:29:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.