mlscorecheck: Testing the consistency of reported performance scores and experiments in machine learning
- URL: http://arxiv.org/abs/2311.07541v1
- Date: Mon, 13 Nov 2023 18:31:48 GMT
- Title: mlscorecheck: Testing the consistency of reported performance scores and experiments in machine learning
- Authors: György Kovács and Attila Fazekas
- Abstract summary: We have developed numerical techniques capable of identifying inconsistencies between reported performance scores and various experimental setups in machine learning problems.
These consistency tests are integrated into the open-source package mlscorecheck.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Addressing the reproducibility crisis in artificial intelligence through the
validation of reported experimental results is a challenging task. It
necessitates either the reimplementation of techniques or a meticulous
assessment of papers for deviations from the scientific method and best
statistical practices. To facilitate the validation of reported results, we
have developed numerical techniques capable of identifying inconsistencies
between reported performance scores and various experimental setups in machine
learning problems, including binary/multiclass classification and regression.
These consistency tests are integrated into the open-source package
mlscorecheck, which also provides specific test bundles designed to detect
systematically recurring flaws in various fields, such as retina image
processing and synthetic minority oversampling.
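For the simplest setting such tests cover (a single binary test set of known size, with scores reported to a fixed number of decimals), the underlying idea can be illustrated with a brute-force sketch: enumerate every confusion matrix the test set admits and check whether any of them reproduces all reported scores within rounding tolerance. This is a minimal illustration only, not the mlscorecheck API, and the set sizes and scores below are hypothetical; the package itself implements far more general tests, including aggregated k-fold setups, multiclass classification, and regression.

```python
# Minimal sketch (not the mlscorecheck API): check whether reported accuracy,
# sensitivity and specificity, rounded to `digits` decimals, are simultaneously
# achievable on a binary test set with p positives and n negatives.

def consistent(p, n, acc, sens, spec, digits=4):
    tol = 0.5 * 10 ** (-digits)  # rounding tolerance of the reported scores
    for tp in range(p + 1):
        if abs(tp / p - sens) > tol:   # sensitivity depends only on tp: prune early
            continue
        for tn in range(n + 1):
            if abs(tn / n - spec) > tol:
                continue
            if abs((tp + tn) / (p + n) - acc) <= tol:
                return True            # a feasible confusion matrix exists
    return False

# Hypothetical example: prints False, because no confusion matrix over 100
# positives can produce a sensitivity of 0.8545 at four decimals.
print(consistent(p=100, n=200, acc=0.9567, sens=0.8545, spec=0.9876))
```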
Related papers
- Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z)
- Provably Neural Active Learning Succeeds via Prioritizing Perplexing Samples [53.95282502030541]
Neural Network-based active learning (NAL) is a cost-effective data selection technique that utilizes neural networks to select and train on a small subset of samples.
We take one step forward by offering a unified explanation, from a feature learning view, for the success of both types of query-criteria-based NAL.
arXiv Detail & Related papers (2024-06-06T10:38:01Z)
- FlaKat: A Machine Learning-Based Categorization Framework for Flaky Tests [3.0846824529023382]
Flaky tests can pass or fail non-deterministically, without alterations to a software system.
State-of-the-art research incorporates machine learning solutions into flaky test detection and achieves reasonably good accuracy.
arXiv Detail & Related papers (2024-03-01T22:00:44Z)
- Deep anytime-valid hypothesis testing [29.273915933729057]
We propose a general framework for constructing powerful, sequential hypothesis tests for nonparametric testing problems.
We develop a principled approach of leveraging the representation capability of machine learning models within the testing-by-betting framework.
Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines.
arXiv Detail & Related papers (2023-10-30T09:46:19Z)
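As a toy illustration of the testing-by-betting framework mentioned in the entry above (a sketch under strong simplifying assumptions, not the paper's learned, nonparametric construction): bet a fixed fraction of a unit wealth against the null at every observation and reject as soon as the wealth reaches 1/alpha; Ville's inequality then makes the test valid at any data-dependent stopping time.

```python
import random

# Toy testing-by-betting sketch: anytime-valid test of H0: P(X = 1) = 0.5 for a
# stream of Bernoulli observations. Under H0 the wealth is a nonnegative
# martingale with mean 1, so P(wealth ever >= 1/alpha) <= alpha (Ville).

def betting_test(stream, alpha=0.05, bet=0.3):
    wealth = 1.0
    for t, x in enumerate(stream, start=1):
        wealth *= 1.0 + bet * (2 * x - 1)  # bet a fixed fraction on "x = 1"
        if wealth >= 1.0 / alpha:
            return t                       # reject H0; valid at any stopping time
    return None                            # never rejected

random.seed(0)
biased = (1 if random.random() < 0.8 else 0 for _ in range(10_000))
print(betting_test(biased))  # rejects quickly when the true bias is 0.8
```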
- Testing the Consistency of Performance Scores Reported for Binary Classification Problems [0.0]
We introduce numerical techniques to assess the consistency of reported performance scores and the assumed experimental setup.
We demonstrate how the proposed techniques can effectively detect inconsistencies, thereby safeguarding the integrity of research fields.
To benefit the scientific community, we have made the consistency tests available in an open-source Python package.
arXiv Detail & Related papers (2023-10-19T07:04:29Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
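A rough sketch of the mixing idea from the entry above, under illustrative assumptions (a mixup-style convex combination with the weight biased toward the minority point); the paper's actual iterative procedure and weighting scheme differ.

```python
import numpy as np

# Illustrative sketch only: synthesize minority samples as convex combinations
# of a random minority point and a random majority point, keeping the mixing
# weight close to the minority side so the new points stay in minority regions.

def mix_minority(X_min, X_maj, n_new, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    synthetic = []
    for _ in range(n_new):
        a = X_min[rng.integers(len(X_min))]
        b = X_maj[rng.integers(len(X_maj))]
        lam = rng.uniform(0.7, 1.0)          # stay closer to the minority sample
        synthetic.append(lam * a + (1.0 - lam) * b)
    return np.vstack(synthetic)

rng = np.random.default_rng(0)
X_min = rng.normal(3.0, 1.0, size=(20, 2))   # toy minority cluster
X_maj = rng.normal(0.0, 1.0, size=(200, 2))  # toy majority cluster
print(mix_minority(X_min, X_maj, n_new=80).shape)  # (80, 2)
```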
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control [67.52000805944924]
Learn then Test (LTT) is a framework for calibrating machine learning models.
Our main insight is to reframe the risk-control problem as multiple hypothesis testing.
We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision.
arXiv Detail & Related papers (2021-10-03T17:42:03Z)
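A minimal sketch of the multiple-hypothesis-testing view described in the Learn then Test entry above, under simplifying assumptions (bounded 0/1 losses, a Hoeffding-bound p-value per candidate threshold, and Bonferroni correction; the candidate thresholds and loss rates below are made up for illustration). The paper itself supports more general risks and more powerful testing procedures.

```python
import numpy as np

# Simplified Learn-then-Test sketch: for each candidate parameter lam, test
# H0: risk(lam) > alpha with a Hoeffding p-value on held-out losses in [0, 1];
# parameters whose null is rejected after Bonferroni correction carry a
# finite-sample risk-control guarantee at level alpha.

def ltt_select(losses_per_lambda, alpha=0.1, delta=0.05):
    m = len(losses_per_lambda)
    selected = []
    for lam, losses in losses_per_lambda.items():
        n, gap = len(losses), alpha - np.mean(losses)
        # Hoeffding: if the true risk exceeds alpha, observing a mean loss this
        # far below alpha has probability at most exp(-2 n gap^2).
        p_value = 1.0 if gap <= 0 else float(np.exp(-2 * n * gap ** 2))
        if p_value <= delta / m:              # Bonferroni-corrected rejection
            selected.append(lam)
    return selected

rng = np.random.default_rng(0)
# Hypothetical held-out 0/1 losses for three candidate thresholds.
losses = {0.3: rng.binomial(1, 0.20, 2000),
          0.5: rng.binomial(1, 0.05, 2000),
          0.7: rng.binomial(1, 0.02, 2000)}
print(ltt_select(losses))  # thresholds certified to have risk <= 0.1, e.g. [0.5, 0.7]
```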
- Efficient and accurate group testing via Belief Propagation: an empirical study [5.706360286474043]
The group testing problem asks for efficient pooling schemes and algorithms.
The goal is to accurately identify the infected samples while conducting the least possible number of tests.
We suggest a new test design that significantly increases the accuracy of the results.
arXiv Detail & Related papers (2021-05-13T10:52:46Z)
- Cross-validation Confidence Intervals for Test Error [83.67415139421448]
This work develops central limit theorems for cross-validation and consistent estimators of its variance under weak stability conditions on the learning algorithm.
The results are the first of their kind for the popular choice of leave-one-out cross-validation.
arXiv Detail & Related papers (2020-07-24T17:40:06Z)
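A naive version of the kind of interval the entry above puts on firmer footing, under illustrative assumptions (a synthetic dataset, logistic regression, and a plain normal approximation over pooled per-example errors); the paper's central limit theorems and variance estimators are what make such intervals rigorous.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Naive normal-approximation CI for the cross-validation error (illustration
# only; the cited paper develops the theory that justifies such intervals).

X, y = make_classification(n_samples=1000, random_state=0)
errors = []  # per-example 0/1 errors collected across folds
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    errors.extend(model.predict(X[test]) != y[test])

errors = np.asarray(errors, dtype=float)
mean = errors.mean()
half_width = 1.96 * errors.std(ddof=1) / np.sqrt(len(errors))
print(f"CV error {mean:.3f} +/- {half_width:.3f} (naive 95% CI)")
```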
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)