mlscorecheck: Testing the consistency of reported performance scores and experiments in machine learning
- URL: http://arxiv.org/abs/2311.07541v1
- Date: Mon, 13 Nov 2023 18:31:48 GMT
- Title: mlscorecheck: Testing the consistency of reported performance scores and experiments in machine learning
- Authors: György Kovács and Attila Fazekas
- Abstract summary: We have developed numerical techniques capable of identifying inconsistencies between reported performance scores and various experimental setups in machine learning problems.
These consistency tests are integrated into the open-source package mlscorecheck.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Addressing the reproducibility crisis in artificial intelligence through the
validation of reported experimental results is a challenging task. It
necessitates either the reimplementation of techniques or a meticulous
assessment of papers for deviations from the scientific method and best
statistical practices. To facilitate the validation of reported results, we
have developed numerical techniques capable of identifying inconsistencies
between reported performance scores and various experimental setups in machine
learning problems, including binary/multiclass classification and regression.
These consistency tests are integrated into the open-source package
mlscorecheck, which also provides specific test bundles designed to detect
systematically recurring flaws in various fields, such as retina image
processing and synthetic minority oversampling.
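For the simplest setting such tests cover (a single binary test set of known size, with scores reported to a fixed number of decimals), the underlying idea can be illustrated with a brute-force sketch: enumerate every confusion matrix the test set admits and check whether any of them reproduces all reported scores within rounding tolerance. This is a minimal illustration only, not the mlscorecheck API, and the set sizes and scores below are hypothetical; the package itself implements far more general tests, including aggregated k-fold setups, multiclass classification, and regression.

```python
# Minimal sketch (not the mlscorecheck API): check whether reported accuracy,
# sensitivity and specificity, rounded to `digits` decimals, are simultaneously
# achievable on a binary test set with p positives and n negatives.

def consistent(p, n, acc, sens, spec, digits=4):
    tol = 0.5 * 10 ** (-digits)  # rounding tolerance of the reported scores
    for tp in range(p + 1):
        if abs(tp / p - sens) > tol:   # sensitivity depends only on tp: prune early
            continue
        for tn in range(n + 1):
            if abs(tn / n - spec) > tol:
                continue
            if abs((tp + tn) / (p + n) - acc) <= tol:
                return True            # a feasible confusion matrix exists
    return False

# Hypothetical example: prints False, because no confusion matrix over 100
# positives can produce a sensitivity of 0.8545 at four decimals.
print(consistent(p=100, n=200, acc=0.9567, sens=0.8545, spec=0.9876))
```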
Related papers
- Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z)
- Provably Neural Active Learning Succeeds via Prioritizing Perplexing Samples [53.95282502030541]
Neural Network-based active learning (NAL) is a cost-effective data selection technique that utilizes neural networks to select and train on a small subset of samples.
We take one step forward by offering a unified explanation, from a feature learning view, for the success of both types of query-criteria-based NAL.
arXiv Detail & Related papers (2024-06-06T10:38:01Z)
- FlaKat: A Machine Learning-Based Categorization Framework for Flaky Tests [3.0846824529023382]
Flaky tests can pass or fail non-deterministically, without alterations to a software system.
State-of-the-art research incorporates machine learning solutions into flaky test detection and achieves reasonably good accuracy.
arXiv Detail & Related papers (2024-03-01T22:00:44Z)
- Deep anytime-valid hypothesis testing [29.273915933729057]
We propose a general framework for constructing powerful, sequential hypothesis tests for nonparametric testing problems.
We develop a principled approach of leveraging the representation capability of machine learning models within the testing-by-betting framework.
Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines.
arXiv Detail & Related papers (2023-10-30T09:46:19Z)
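As a toy illustration of the testing-by-betting framework mentioned in the entry above (a sketch under strong simplifying assumptions, not the paper's learned, nonparametric construction): bet a fixed fraction of a unit wealth against the null at every observation and reject as soon as the wealth reaches 1/alpha; Ville's inequality then makes the test valid at any data-dependent stopping time.

```python
import random

# Toy testing-by-betting sketch: anytime-valid test of H0: P(X = 1) = 0.5 for a
# stream of Bernoulli observations. Under H0 the wealth is a nonnegative
# martingale with mean 1, so P(wealth ever >= 1/alpha) <= alpha (Ville).

def betting_test(stream, alpha=0.05, bet=0.3):
    wealth = 1.0
    for t, x in enumerate(stream, start=1):
        wealth *= 1.0 + bet * (2 * x - 1)  # bet a fixed fraction on "x = 1"
        if wealth >= 1.0 / alpha:
            return t                       # reject H0; valid at any stopping time
    return None                            # never rejected

random.seed(0)
biased = (1 if random.random() < 0.8 else 0 for _ in range(10_000))
print(betting_test(biased))  # rejects quickly when the true bias is 0.8
```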
- Testing the Consistency of Performance Scores Reported for Binary Classification Problems [0.0]
We introduce numerical techniques to assess the consistency of reported performance scores and the assumed experimental setup.
We demonstrate how the proposed techniques can effectively detect inconsistencies, thereby safeguarding the integrity of research fields.
To benefit the scientific community, we have made the consistency tests available in an open-source Python package.
arXiv Detail & Related papers (2023-10-19T07:04:29Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
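A rough sketch of the mixing idea from the entry above, under illustrative assumptions (a mixup-style convex combination with the weight biased toward the minority point); the paper's actual iterative procedure and weighting scheme differ.

```python
import numpy as np

# Illustrative sketch only: synthesize minority samples as convex combinations
# of a random minority point and a random majority point, keeping the mixing
# weight close to the minority side so the new points stay in minority regions.

def mix_minority(X_min, X_maj, n_new, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    synthetic = []
    for _ in range(n_new):
        a = X_min[rng.integers(len(X_min))]
        b = X_maj[rng.integers(len(X_maj))]
        lam = rng.uniform(0.7, 1.0)          # stay closer to the minority sample
        synthetic.append(lam * a + (1.0 - lam) * b)
    return np.vstack(synthetic)

rng = np.random.default_rng(0)
X_min = rng.normal(3.0, 1.0, size=(20, 2))   # toy minority cluster
X_maj = rng.normal(0.0, 1.0, size=(200, 2))  # toy majority cluster
print(mix_minority(X_min, X_maj, n_new=80).shape)  # (80, 2)
```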
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control [67.52000805944924]
Learn then Test (LTT) is a framework for calibrating machine learning models.
Our main insight is to reframe the risk-control problem as multiple hypothesis testing.
We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision.
arXiv Detail & Related papers (2021-10-03T17:42:03Z)
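A minimal sketch of the multiple-hypothesis-testing view described in the Learn then Test entry above, under simplifying assumptions (bounded 0/1 losses, a Hoeffding-bound p-value per candidate threshold, and Bonferroni correction; the candidate thresholds and loss rates below are made up for illustration). The paper itself supports more general risks and more powerful testing procedures.

```python
import numpy as np

# Simplified Learn-then-Test sketch: for each candidate parameter lam, test
# H0: risk(lam) > alpha with a Hoeffding p-value on held-out losses in [0, 1];
# parameters whose null is rejected after Bonferroni correction carry a
# finite-sample risk-control guarantee at level alpha.

def ltt_select(losses_per_lambda, alpha=0.1, delta=0.05):
    m = len(losses_per_lambda)
    selected = []
    for lam, losses in losses_per_lambda.items():
        n, gap = len(losses), alpha - np.mean(losses)
        # Hoeffding: if the true risk exceeds alpha, observing a mean loss this
        # far below alpha has probability at most exp(-2 n gap^2).
        p_value = 1.0 if gap <= 0 else float(np.exp(-2 * n * gap ** 2))
        if p_value <= delta / m:              # Bonferroni-corrected rejection
            selected.append(lam)
    return selected

rng = np.random.default_rng(0)
# Hypothetical held-out 0/1 losses for three candidate thresholds.
losses = {0.3: rng.binomial(1, 0.20, 2000),
          0.5: rng.binomial(1, 0.05, 2000),
          0.7: rng.binomial(1, 0.02, 2000)}
print(ltt_select(losses))  # thresholds certified to have risk <= 0.1, e.g. [0.5, 0.7]
```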
- Efficient and accurate group testing via Belief Propagation: an empirical study [5.706360286474043]
The group testing problem asks for efficient pooling schemes and algorithms.
The goal is to accurately identify the infected samples while conducting the least possible number of tests.
We suggest a new test design that significantly increases the accuracy of the results.
arXiv Detail & Related papers (2021-05-13T10:52:46Z)
- Cross-validation Confidence Intervals for Test Error [83.67415139421448]
This work develops central limit theorems for cross-validation and consistent estimators of its variance under weak stability conditions on the learning algorithm.
The results are the first of their kind for the popular choice of leave-one-out cross-validation.
arXiv Detail & Related papers (2020-07-24T17:40:06Z)
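A naive version of the kind of interval the entry above puts on firmer footing, under illustrative assumptions (a synthetic dataset, logistic regression, and a plain normal approximation over pooled per-example errors); the paper's central limit theorems and variance estimators are what make such intervals rigorous.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Naive normal-approximation CI for the cross-validation error (illustration
# only; the cited paper develops the theory that justifies such intervals).

X, y = make_classification(n_samples=1000, random_state=0)
errors = []  # per-example 0/1 errors collected across folds
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    errors.extend(model.predict(X[test]) != y[test])

errors = np.asarray(errors, dtype=float)
mean = errors.mean()
half_width = 1.96 * errors.std(ddof=1) / np.sqrt(len(errors))
print(f"CV error {mean:.3f} +/- {half_width:.3f} (naive 95% CI)")
```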
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)