Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness
- URL: http://arxiv.org/abs/2504.09759v1
- Date: Sun, 13 Apr 2025 23:54:08 GMT
- Title: Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness
- Authors: Lucas Cardoso, Vitor Santos, José Ribeiro, Regiane Kawasaki, Ricardo Prudêncio, Ronnie Alves
- Abstract summary: This study introduces a novel evaluation methodology that combines Item Response Theory (IRT) with the Glicko-2 rating system. IRT assesses classifier ability based on performance over difficult instances, while Glicko-2 updates performance metrics such as rating, deviation, and volatility via simulated tournaments between classifiers. A case study using the OpenML-CC18 benchmark showed that only 15% of the datasets are truly challenging.
- Score: 0.4749981032986242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Benchmarking is a fundamental practice in machine learning (ML) for comparing the performance of classification algorithms. However, traditional evaluation methods often overlook a critical aspect: the joint consideration of dataset complexity and an algorithm's ability to generalize. Without this dual perspective, assessments may favor models that perform well on easy instances while failing to capture their true robustness. To address this limitation, this study introduces a novel evaluation methodology that combines Item Response Theory (IRT) with the Glicko-2 rating system, originally developed to measure player strength in competitive games. IRT assesses classifier ability based on performance over difficult instances, while Glicko-2 updates performance metrics - such as rating, deviation, and volatility - via simulated tournaments between classifiers. This combined approach provides a fairer and more nuanced measure of algorithm capability. A case study using the OpenML-CC18 benchmark showed that only 15% of the datasets are truly challenging and that a reduced subset with 50% of the original datasets offers comparable evaluation power. Among the algorithms tested, Random Forest achieved the highest ability score. The results highlight the importance of improving benchmark design by focusing on dataset quality and adopting evaluation strategies that reflect both difficulty and classifier proficiency.
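To make the abstract's pipeline concrete, here is a minimal, hedged sketch in Python: a 2PL IRT response model gives the probability that a classifier of a given ability solves an instance of a given difficulty, and a simplified Elo update stands in for the Glicko-2 tournament (Glicko-2 additionally tracks rating deviation and volatility, which this sketch omits). The classifier names, ability values, and instance difficulties are illustrative placeholders, not values fitted in the paper.

```python
import math
import random

def irt_2pl_probability(ability, difficulty, discrimination=1.0):
    """2PL IRT model: probability that a classifier with the given ability
    correctly handles an instance with the given difficulty."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

def simulate_match(prob_a_correct, prob_b_correct, rng):
    """One 'match' on a single instance: each classifier independently succeeds
    with its IRT probability; return 1 if A wins, 0 if B wins, 0.5 for a tie."""
    a_correct = rng.random() < prob_a_correct
    b_correct = rng.random() < prob_b_correct
    if a_correct == b_correct:
        return 0.5
    return 1.0 if a_correct else 0.0

def elo_update(rating_a, rating_b, score_a, k=16.0):
    """Simplified Elo update (a stand-in for the Glicko-2 update used in the
    paper, which also maintains rating deviation and volatility)."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Hypothetical set-up: classifier abilities and instance difficulties such as
# an IRT fit on a benchmark might produce (placeholder values only).
rng = random.Random(0)
abilities = {"random_forest": 1.2, "svm": 0.8, "naive_bayes": 0.1}
difficulties = [rng.gauss(0.0, 1.0) for _ in range(200)]

ratings = {name: 1500.0 for name in abilities}
names = list(abilities)
for difficulty in difficulties:                  # one round per benchmark instance
    for i in range(len(names)):
        for j in range(i + 1, len(names)):       # round-robin tournament
            a, b = names[i], names[j]
            score_a = simulate_match(
                irt_2pl_probability(abilities[a], difficulty),
                irt_2pl_probability(abilities[b], difficulty),
                rng,
            )
            ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name:>14}: {rating:7.1f}")
```

With enough simulated rounds the ratings separate according to the underlying abilities, which is the intuition behind using tournament-based ratings rather than raw accuracy for ranking classifiers.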
Related papers
- Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models [18.309464845180237]
We propose an efficient evaluation protocol for large vision-language models (VLMs). We construct a subset that yields results comparable to full benchmark evaluations. Applying FPS to an existing benchmark improves correlation with overall evaluation results.
arXiv Detail & Related papers (2025-04-14T08:43:00Z)
- SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection.
Our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains.
We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
arXiv Detail & Related papers (2025-03-05T09:37:05Z)
- Using tournaments to calculate AUROC for zero-shot classification with LLMs [4.270472870948892]
Large language models perform surprisingly well on many zero-shot classification tasks.
We propose and evaluate a method that converts binary classification tasks into pairwise comparison tasks.
Repeated pairwise comparisons can be used to score instances using the Elo rating system (a minimal sketch of this idea appears after the list below).
arXiv Detail & Related papers (2025-02-20T20:13:20Z)
- Language Model Preference Evaluation with Multiple Weak Evaluators [78.53743237977677]
GED (Preference Graph Ensemble and Denoise) is a novel approach that leverages multiple model-based evaluators to construct preference graphs. In particular, our method consists of two primary stages: aggregating evaluations into a unified graph and applying a denoising process. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure.
arXiv Detail & Related papers (2024-10-14T01:57:25Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- Adaptive Hierarchical Similarity Metric Learning with Noisy Labels [138.41576366096137]
We propose an Adaptive Hierarchical Similarity Metric Learning method.
It considers two types of noise-insensitive information, i.e., class-wise divergence and sample-wise consistency.
Our method achieves state-of-the-art performance compared with current deep metric learning approaches.
arXiv Detail & Related papers (2021-10-29T02:12:18Z)
- Open-Set Recognition: A Good Closed-Set Classifier is All You Need [146.6814176602689]
We show that the ability of a classifier to make the 'none-of-above' decision is highly correlated with its accuracy on the closed-set classes.
We use this correlation to boost the performance of the cross-entropy OSR 'baseline' by improving its closed-set accuracy.
We also construct new benchmarks which better respect the task of detecting semantic novelty.
arXiv Detail & Related papers (2021-10-12T17:58:59Z)
- Data vs classifiers, who wins? [0.0]
Classification experiments in machine learning (ML) are composed of two important parts: the data and the algorithm.
Data complexity is commonly not considered together with the model during performance evaluation.
Recent studies employ Item Response Theory (IRT) as a new approach to evaluating datasets and algorithms.
arXiv Detail & Related papers (2021-07-15T16:55:15Z)
- Decoding machine learning benchmarks [0.0]
Item Response Theory (IRT) has emerged as a new approach to elucidate what constitutes a good machine learning benchmark.
IRT was applied to explore the well-known OpenML-CC18 benchmark to identify how suitable it is for the evaluation of classifiers.
arXiv Detail & Related papers (2020-07-29T14:39:41Z)
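As referenced in the tournament-based AUROC entry above, the following minimal sketch shows how repeated pairwise comparisons can produce Elo scores per instance, from which AUROC is then computed. A synthetic noisy judge stands in for the zero-shot LLM, and the helper names, accuracy level, and data are hypothetical illustrations rather than the paper's actual setup.

```python
import random
from sklearn.metrics import roc_auc_score

def elo_expected(r_a, r_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, outcome_a, k=32.0):
    """Update both ratings after one comparison; outcome_a is 1, 0, or 0.5."""
    delta = k * (outcome_a - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta

def noisy_pairwise_judge(label_a, label_b, rng, accuracy=0.8):
    """Placeholder for the zero-shot LLM judge: asked which of two instances
    is more likely positive, it answers correctly with the given probability."""
    truth = 1.0 if label_a > label_b else (0.0 if label_b > label_a else 0.5)
    if truth == 0.5:
        return rng.choice([0.0, 1.0])
    return truth if rng.random() < accuracy else 1.0 - truth

rng = random.Random(0)
labels = [rng.random() < 0.4 for _ in range(100)]   # synthetic ground truth
ratings = [1500.0] * len(labels)                    # one Elo rating per instance

for _ in range(20 * len(labels)):                   # repeated random pairings
    i, j = rng.sample(range(len(labels)), 2)
    outcome_i = noisy_pairwise_judge(labels[i], labels[j], rng)
    ratings[i], ratings[j] = elo_update(ratings[i], ratings[j], outcome_i)

# The Elo ratings act as per-instance scores, so AUROC can be computed directly.
print("AUROC from Elo scores:", round(roc_auc_score(labels, ratings), 3))
```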
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.