Data vs classifiers, who wins?
- URL: http://arxiv.org/abs/2107.07451v2
- Date: Fri, 16 Jul 2021 15:19:51 GMT
- Title: Data vs classifiers, who wins?
- Authors: Lucas F. F. Cardoso, Vitor C. A. Santos, Regiane S. Kawasaki
Francês, Ricardo B. C. Prudêncio and Ronnie C. O. Alves
- Abstract summary: The classification experiments covered by machine learning (ML) are composed of two important parts: the data and the algorithm.
Data complexity is commonly not considered along with the model during a performance evaluation.
Recent studies employ Item Response Theory (IRT) as a new approach to evaluating datasets and algorithms.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The classification experiments covered by machine learning (ML) are
composed of two important parts: the data and the algorithm. As they are a
fundamental part of the problem, both must be considered when evaluating a
model's performance against a benchmark. The best classifiers need robust
benchmarks to be properly evaluated. For this, gold-standard benchmarks such as
OpenML-CC18 are used. However, data complexity is commonly not considered along
with the model during a performance evaluation. Recent studies employ Item
Response Theory (IRT) as a new approach to evaluating datasets and algorithms,
capable of evaluating both simultaneously. This work presents a new evaluation
methodology based on IRT and Glicko-2, jointly with the decodIRT tool developed
to guide the estimation of IRT in ML. It explores IRT as a tool to evaluate the
OpenML-CC18 benchmark for its algorithmic evaluation capability and checks
whether a subset of datasets is more efficient than the original benchmark.
Several classifiers, from classic to ensemble methods, are also evaluated using
the IRT models. The Glicko-2 rating system was applied together with IRT to
summarize the innate ability and performance of the classifiers. Not all
OpenML-CC18 datasets proved genuinely useful for evaluating algorithms: only
10% were rated as really difficult. Furthermore, a more efficient subset
containing only 50% of the original benchmark was identified, and Random Forest
was singled out as the algorithm with the best innate ability.
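
To make the IRT side of the methodology concrete, the sketch below fits a two-parameter logistic (2PL) IRT model to a binary classifier-by-item response matrix using plain gradient ascent. This is a minimal, self-contained illustration under assumed conventions (rows are classifiers, columns are test instances, entries are 1 when the classifier gets the instance right); it is not the decodIRT tool or the paper's actual estimation procedure, and the toy data and function names are hypothetical. The Glicko-2 rating step described in the abstract would be run on top of outcomes like these and is not shown.

```python
# Minimal 2PL Item Response Theory sketch (NOT the decodIRT tool).
# Rows of `responses` are respondents (classifiers), columns are items
# (test instances); entries are 1 if the classifier got the instance right.
# Ability (theta), discrimination (a) and difficulty (b) are fit jointly
# by gradient ascent on the Bernoulli log-likelihood.
import numpy as np

rng = np.random.default_rng(0)

def irt_2pl(theta, a, b):
    """P(correct) under the two-parameter logistic model."""
    return 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))

def fit_2pl(responses, n_iter=2000, lr=0.05):
    n_resp, n_items = responses.shape
    theta = np.zeros(n_resp)   # classifier ability
    a = np.ones(n_items)       # item discrimination
    b = np.zeros(n_items)      # item difficulty
    for _ in range(n_iter):
        p = irt_2pl(theta, a, b)
        err = responses - p    # gradient of the log-likelihood w.r.t. the logits
        theta += lr * (err * a[None, :]).mean(axis=1)
        a     += lr * (err * (theta[:, None] - b[None, :])).mean(axis=0)
        b     += lr * (-err * a[None, :]).mean(axis=0)
        theta -= theta.mean()  # anchor the ability scale (location indeterminacy)
    return theta, a, b

# Toy example: 5 hypothetical classifiers answering 40 test instances.
true_theta = np.linspace(-1.5, 1.5, 5)
true_b = rng.normal(size=40)
responses = (rng.random((5, 40))
             < irt_2pl(true_theta, np.ones(40), true_b)).astype(float)

theta_hat, a_hat, b_hat = fit_2pl(responses)
print("estimated abilities:", np.round(theta_hat, 2))
print("hardest items:", np.argsort(b_hat)[-5:])
```

Under this simplified reading, items (or datasets) with the largest estimated difficulty b would correspond to the "really difficult" portion of the benchmark that the abstract refers to.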
Related papers
- Evaluating Retrieval Quality in Retrieval-Augmented Generation [21.115495457454365]
Traditional end-to-end evaluation methods are computationally expensive.
We propose eRAG, where each document in the retrieval list is individually utilized by the large language model within the RAG system.
eRAG offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation.
arXiv Detail & Related papers (2024-04-21T21:22:28Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Machine Learning Capability: A standardized metric using case difficulty
with applications to individualized deployment of supervised machine learning [2.2060666847121864]
Model evaluation is a critical component in supervised machine learning classification analyses.
Item Response Theory (IRT) and Computer Adaptive Testing (CAT) with machine learning can benchmark datasets independent of the end-classification results.
arXiv Detail & Related papers (2023-02-09T00:38:42Z) - Meta Clustering Learning for Large-scale Unsupervised Person
Re-identification [124.54749810371986]
We propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL)
MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training.
Our method significantly saves computational cost while achieving a comparable or even better performance compared to prior works.
arXiv Detail & Related papers (2021-11-19T04:10:18Z) - Open-Set Recognition: A Good Closed-Set Classifier is All You Need [146.6814176602689]
We show that the ability of a classifier to make the 'none-of-above' decision is highly correlated with its accuracy on the closed-set classes.
We use this correlation to boost the performance of the cross-entropy OSR 'baseline' by improving its closed-set accuracy.
We also construct new benchmarks which better respect the task of detecting semantic novelty.
arXiv Detail & Related papers (2021-10-12T17:58:59Z) - Learning to Hash Robustly, with Guarantees [79.68057056103014]
In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching that of theoretical algorithms.
We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically.
Our algorithm has a 1.8x and 2.1x better recall on the worst-performing queries to the MNIST and ImageNet datasets.
arXiv Detail & Related papers (2021-08-11T20:21:30Z) - A Critical Assessment of State-of-the-Art in Entity Alignment [1.7725414095035827]
We investigate two state-of-the-art (SotA) methods for the task of Entity Alignment in Knowledge Graphs.
We first carefully examine the benchmarking process and identify several shortcomings, which make the results reported in the original works not always comparable.
arXiv Detail & Related papers (2020-10-30T15:09:19Z) - Decoding machine learning benchmarks [0.0]
Item Response Theory (IRT) has emerged as a new approach to elucidate what should be a good machine learning benchmark.
IRT was applied to explore the well-known OpenML-CC18 benchmark and identify how suitable it is for evaluating classifiers.
arXiv Detail & Related papers (2020-07-29T14:39:41Z) - Pairwise Similarity Knowledge Transfer for Weakly Supervised Object
Localization [53.99850033746663]
We study the problem of learning localization model on target classes with weakly supervised image labels.
In this work, we argue that learning only an objectness function is a weak form of knowledge transfer.
Experiments on the COCO and ILSVRC 2013 detection datasets show that the performance of the localization model improves significantly with the inclusion of pairwise similarity function.
arXiv Detail & Related papers (2020-03-18T17:53:33Z) - Fase-AL -- Adaptation of Fast Adaptive Stacking of Ensembles for
Supporting Active Learning [0.0]
This work presents the FASE-AL algorithm which induces classification models with non-labeled instances using Active Learning.
The algorithm achieves promising results in terms of the percentage of correctly classified instances.
arXiv Detail & Related papers (2020-01-30T17:25:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.