Decoding machine learning benchmarks
- URL: http://arxiv.org/abs/2007.14870v2
- Date: Wed, 19 Aug 2020 20:08:48 GMT
- Title: Decoding machine learning benchmarks
- Authors: Lucas F. F. Cardoso, Vitor C. A. Santos, Regiane S. K. Francês, Ricardo B. C. Prudêncio and Ronnie C. O. Alves
- Abstract summary: Item Response Theory (IRT) has emerged as a new approach to elucidate what should be a good machine learning benchmark.
IRT was applied to explore the well-known OpenML-CC18 benchmark to identify how suitable it is for the evaluation of classifiers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the availability of benchmark machine learning (ML) repositories
(e.g., UCI, OpenML), there is still no standard evaluation strategy capable of
pointing out which is the best set of datasets to serve as a gold standard for
testing different ML algorithms. In recent studies, Item Response Theory (IRT) has
emerged as a new approach to elucidate what should be a good ML benchmark. This
work applied IRT to explore the well-known OpenML-CC18 benchmark to identify
how suitable it is for the evaluation of classifiers. Several classifiers
ranging from classical to ensemble ones were evaluated using IRT models, which
could simultaneously estimate dataset difficulty and classifiers' ability. The
Glicko-2 rating system was applied on top of IRT to summarize the innate
ability and aptitude of classifiers. It was observed that not all datasets from
OpenML-CC18 are really useful for evaluating classifiers. Most of the datasets
evaluated in this work (84%) contain mostly easy instances (e.g., only around
10% of difficult instances). Also, 80% of the instances in half of this
benchmark are highly discriminating, which can be of great use for pairwise
algorithm comparison but not for pushing classifiers' abilities. This paper presents
this new evaluation methodology based on IRT as well as the tool decodIRT,
developed to guide IRT estimation over ML benchmarks.
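To make the IRT step concrete, below is a minimal sketch, assuming a pre-computed binary response matrix of classifiers versus instances (1 = instance classified correctly). It fits a simple Rasch (1PL) model by gradient ascent; it is not the authors' decodIRT tool, it omits the discrimination parameter used in the paper's IRT models and the Glicko-2 rating step, and all data below are hypothetical.

```python
# Minimal sketch (not decodIRT): Rasch (1PL) IRT estimation over a binary
# response matrix X, where X[i, j] = 1 if classifier i labels instance j correctly.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_rasch(X, lr=0.05, n_iter=3000):
    """Joint maximum-likelihood fit of P(correct) = sigmoid(theta_i - b_j),
    where theta_i is classifier ability and b_j is instance difficulty."""
    n_clf, n_items = X.shape
    theta = np.zeros(n_clf)   # classifier abilities
    b = np.zeros(n_items)     # instance difficulties
    for _ in range(n_iter):
        p = sigmoid(theta[:, None] - b[None, :])  # predicted success probabilities
        theta += lr * (X - p).sum(axis=1)         # gradient ascent on the log-likelihood
        b += lr * (p - X).sum(axis=0)
        b -= b.mean()                             # identification: mean difficulty = 0
    return theta, b

if __name__ == "__main__":
    # Hypothetical toy data: 5 classifiers x 40 instances.
    rng = np.random.default_rng(0)
    true_theta = rng.normal(size=5)
    true_b = rng.normal(size=40)
    X = (rng.random((5, 40)) < sigmoid(true_theta[:, None] - true_b[None, :])).astype(float)

    theta_hat, b_hat = fit_rasch(X)
    print("estimated classifier abilities:", np.round(theta_hat, 2))
    print("five hardest instances:", np.argsort(-b_hat)[:5])
```

A higher estimated b_j flags a harder instance; the proportions of difficult and discriminating instances reported in the abstract come from summarizing such item parameters per dataset, with Glicko-2 applied afterwards to rank the classifiers.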
Related papers
- Rethinking Few-shot Class-incremental Learning: Learning from Yourself [31.268559330366404]
Few-shot class-incremental learning (FSCIL) aims to learn sequential classes with limited samples in a few-shot fashion.
Inherited from the classical class-incremental learning setting, the popular benchmark of FSCIL uses averaged accuracy (aAcc) and last-task averaged accuracy (lAcc) as the evaluation metrics.
We offer a new metric called generalized average accuracy (gAcc) which is designed to provide an extra equitable evaluation.
arXiv Detail & Related papers (2024-07-10T08:52:56Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation [121.0693322732454]
Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity.
Recent research has focused on developing efficient fine-tuning methods to enhance CLIP's performance in downstream tasks.
We revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP.
arXiv Detail & Related papers (2024-02-06T15:45:27Z)
- Machine Learning Capability: A standardized metric using case difficulty with applications to individualized deployment of supervised machine learning [2.2060666847121864]
Model evaluation is a critical component in supervised machine learning classification analyses.
Item Response Theory (IRT) and Computer Adaptive Testing (CAT) with machine learning can benchmark datasets independently of the end-classification results.
arXiv Detail & Related papers (2023-02-09T00:38:42Z)
- Decision Making for Hierarchical Multi-label Classification with Multidimensional Local Precision Rate [4.812468844362369]
We introduce a new statistic called the multidimensional local precision rate (mLPR) for each object in each class.
We show that classification decisions made by simply sorting objects across classes in descending order of their mLPRs can, in theory, ensure the class hierarchy.
In response, we introduce HierRank, a new algorithm that maximizes an empirical version of CATCH using estimated mLPRs while respecting the hierarchy.
arXiv Detail & Related papers (2022-05-16T17:43:35Z)
- Open-Set Recognition: A Good Closed-Set Classifier is All You Need [146.6814176602689]
We show that the ability of a classifier to make the 'none-of-above' decision is highly correlated with its accuracy on the closed-set classes.
We use this correlation to boost the performance of the cross-entropy OSR 'baseline' by improving its closed-set accuracy.
We also construct new benchmarks which better respect the task of detecting semantic novelty.
arXiv Detail & Related papers (2021-10-12T17:58:59Z)
- Data vs classifiers, who wins? [0.0]
The classification experiments covered by machine learning (ML) are composed of two important parts: the data and the algorithm.
Data complexity is commonly not considered along with the model during a performance evaluation.
Recent studies employ Item Response Theory (IRT) as a new approach to evaluating datasets and algorithms.
arXiv Detail & Related papers (2021-07-15T16:55:15Z)
- The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
arXiv Detail & Related papers (2021-07-14T21:08:30Z)
- No Fear of Heterogeneity: Classifier Calibration for Federated Learning with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Classifier Calibration with Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated Gaussian mixture model.
Experimental results demonstrate that CCVR achieves state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z)
- Minimum Variance Embedded Auto-associative Kernel Extreme Learning Machine for One-class Classification [1.4146420810689422]
VAAKELM is a novel extension of an auto-associative kernel extreme learning machine.
It embeds minimum variance information within its architecture and reduces the intra-class variance.
It follows a reconstruction-based approach to one-class classification and minimizes the reconstruction error.
arXiv Detail & Related papers (2020-11-24T17:00:30Z)
- Fast Few-Shot Classification by Few-Iteration Meta-Learning [173.32497326674775]
We introduce a fast optimization-based meta-learning method for few-shot classification.
Our strategy enables important aspects of the base learner objective to be learned during meta-training.
We perform a comprehensive experimental analysis, demonstrating the speed and effectiveness of our approach.
arXiv Detail & Related papers (2020-10-01T15:59:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.