Decoding machine learning benchmarks
- URL: http://arxiv.org/abs/2007.14870v2
- Date: Wed, 19 Aug 2020 20:08:48 GMT
- Title: Decoding machine learning benchmarks
- Authors: Lucas F. F. Cardoso, Vitor C. A. Santos, Regiane S. K. Francês, Ricardo B. C. Prudêncio and Ronnie C. O. Alves
- Abstract summary: Item Response Theory (IRT) has emerged as a new approach to elucidate what should be a good machine learning benchmark.
IRT was applied to explore the well-known OpenML-CC18 benchmark to identify how suitable it is for the evaluation of classifiers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the availability of benchmark machine learning (ML) repositories
(e.g., UCI, OpenML), there is still no standard evaluation strategy capable of
pointing out which is the best set of datasets to serve as a gold standard for
testing different ML algorithms. In recent studies, Item Response Theory (IRT) has
emerged as a new approach to elucidate what should be a good ML benchmark. This
work applied IRT to explore the well-known OpenML-CC18 benchmark to identify
how suitable it is for the evaluation of classifiers. Several classifiers
ranging from classical to ensemble ones were evaluated using IRT models, which
could simultaneously estimate dataset difficulty and classifiers' ability. The
Glicko-2 rating system was applied on top of IRT to summarize the innate
ability and aptitude of classifiers. It was observed that not all datasets from
OpenML-CC18 are really useful for evaluating classifiers. Most of the datasets
evaluated in this work (84%) contain mostly easy instances (e.g., only around
10% of difficult instances). Also, 80% of the instances in half of this
benchmark are highly discriminating, which can be of great use for pairwise
algorithm comparison but not for pushing classifiers' abilities. This paper presents
this new evaluation methodology based on IRT as well as the tool decodIRT,
developed to guide IRT estimation over ML benchmarks.
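To make the IRT step concrete, below is a minimal sketch, assuming a pre-computed binary response matrix of classifiers versus instances (1 = instance classified correctly). It fits a simple Rasch (1PL) model by gradient ascent; it is not the authors' decodIRT tool, it omits the discrimination parameter used in the paper's IRT models and the Glicko-2 rating step, and all data below are hypothetical.

```python
# Minimal sketch (not decodIRT): Rasch (1PL) IRT estimation over a binary
# response matrix X, where X[i, j] = 1 if classifier i labels instance j correctly.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_rasch(X, lr=0.05, n_iter=3000):
    """Joint maximum-likelihood fit of P(correct) = sigmoid(theta_i - b_j),
    where theta_i is classifier ability and b_j is instance difficulty."""
    n_clf, n_items = X.shape
    theta = np.zeros(n_clf)   # classifier abilities
    b = np.zeros(n_items)     # instance difficulties
    for _ in range(n_iter):
        p = sigmoid(theta[:, None] - b[None, :])  # predicted success probabilities
        theta += lr * (X - p).sum(axis=1)         # gradient ascent on the log-likelihood
        b += lr * (p - X).sum(axis=0)
        b -= b.mean()                             # identification: mean difficulty = 0
    return theta, b

if __name__ == "__main__":
    # Hypothetical toy data: 5 classifiers x 40 instances.
    rng = np.random.default_rng(0)
    true_theta = rng.normal(size=5)
    true_b = rng.normal(size=40)
    X = (rng.random((5, 40)) < sigmoid(true_theta[:, None] - true_b[None, :])).astype(float)

    theta_hat, b_hat = fit_rasch(X)
    print("estimated classifier abilities:", np.round(theta_hat, 2))
    print("five hardest instances:", np.argsort(-b_hat)[:5])
```

A higher estimated b_j flags a harder instance; the proportions of difficult and discriminating instances reported in the abstract come from summarizing such item parameters per dataset, with Glicko-2 applied afterwards to rank the classifiers.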
Related papers
- Rethinking Few-shot Class-incremental Learning: Learning from Yourself [31.268559330366404]
Few-shot class-incremental learning (FSCIL) aims to learn sequential classes with limited samples in a few-shot fashion.
Inherited from the classical class-incremental learning setting, the popular benchmark of FSCIL uses averaged accuracy (aAcc) and last-task averaged accuracy (lAcc) as the evaluation metrics.
We offer a new metric called generalized average accuracy (gAcc) which is designed to provide an extra equitable evaluation.
arXiv Detail & Related papers (2024-07-10T08:52:56Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation [121.0693322732454]
Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity.
Recent research has focused on developing efficient fine-tuning methods to enhance CLIP's performance in downstream tasks.
We revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP.
arXiv Detail & Related papers (2024-02-06T15:45:27Z)
- Machine Learning Capability: A standardized metric using case difficulty with applications to individualized deployment of supervised machine learning [2.2060666847121864]
Model evaluation is a critical component in supervised machine learning classification analyses.
Item Response Theory (IRT) and Computer Adaptive Testing (CAT) with machine learning can benchmark datasets independently of the end-classification results.
arXiv Detail & Related papers (2023-02-09T00:38:42Z)
- Decision Making for Hierarchical Multi-label Classification with Multidimensional Local Precision Rate [4.812468844362369]
We introduce a new statistic called the multidimensional local precision rate (mLPR) for each object in each class.
We show that classification decisions made by simply sorting objects across classes in descending order of their mLPRs can, in theory, ensure the class hierarchy.
In response, we introduce HierRank, a new algorithm that maximizes an empirical version of CATCH using estimated mLPRs while respecting the hierarchy.
arXiv Detail & Related papers (2022-05-16T17:43:35Z)
- Open-Set Recognition: A Good Closed-Set Classifier is All You Need [146.6814176602689]
We show that the ability of a classifier to make the 'none-of-above' decision is highly correlated with its accuracy on the closed-set classes.
We use this correlation to boost the performance of the cross-entropy OSR 'baseline' by improving its closed-set accuracy.
We also construct new benchmarks which better respect the task of detecting semantic novelty.
arXiv Detail & Related papers (2021-10-12T17:58:59Z)
- Data vs classifiers, who wins? [0.0]
The classification experiments covered by machine learning (ML) are composed of two important parts: the data and the algorithm.
Data complexity is commonly not considered along with the model during a performance evaluation.
Recent studies employ Item Response Theory (IRT) as a new approach to evaluating datasets and algorithms.
arXiv Detail & Related papers (2021-07-15T16:55:15Z)
- The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
arXiv Detail & Related papers (2021-07-14T21:08:30Z)
- No Fear of Heterogeneity: Classifier Calibration for Federated Learning with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Classifier Calibration with Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated Gaussian mixture model.
Experimental results demonstrate that CCVR achieves state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z)
- Minimum Variance Embedded Auto-associative Kernel Extreme Learning Machine for One-class Classification [1.4146420810689422]
VAAKELM is a novel extension of an auto-associative kernel extreme learning machine.
It embeds minimum variance information within its architecture and reduces the intra-class variance.
It follows a reconstruction-based approach to one-class classification and minimizes the reconstruction error.
arXiv Detail & Related papers (2020-11-24T17:00:30Z)
- Fast Few-Shot Classification by Few-Iteration Meta-Learning [173.32497326674775]
We introduce a fast optimization-based meta-learning method for few-shot classification.
Our strategy enables important aspects of the base learner objective to be learned during meta-training.
We perform a comprehensive experimental analysis, demonstrating the speed and effectiveness of our approach.
arXiv Detail & Related papers (2020-10-01T15:59:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.