A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice
- URL: http://arxiv.org/abs/2404.16958v2
- Date: Tue, 2 Jul 2024 08:53:09 GMT
- Title: A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice
- Authors: Juri Opitz
- Abstract summary: Classification systems are evaluated in a countless number of papers.
However, we find that evaluation practice is often nebulous.
Many works use so-called 'macro' metrics to rank systems but do not clearly specify what they would expect from such a metric.
- Score: 6.091702876917282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classification systems are evaluated in a countless number of papers. However, we find that evaluation practice is often nebulous. Frequently, metrics are selected without arguments, and blurry terminology invites misconceptions. For instance, many works use so-called 'macro' metrics to rank systems (e.g., 'macro F1') but do not clearly specify what they would expect from such a 'macro' metric. This is problematic, since picking a metric can affect research findings, and thus any clarity in the process should be maximized. Starting from the intuitive concepts of bias and prevalence, we perform an analysis of common evaluation metrics. The analysis helps us understand the metrics' underlying properties, and how they align with expectations as found expressed in papers. Then we reflect on the practical situation in the field, and survey evaluation practice in recent shared tasks. We find that metric selection is often not supported with convincing arguments, an issue that can make a system ranking seem arbitrary. Our work aims at providing an overview and guidance for more informed and transparent metric selection, fostering meaningful evaluation.
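As a minimal sketch of why the averaging scheme matters (not taken from the paper; it assumes scikit-learn and a toy imbalanced label set), the snippet below contrasts 'macro' and 'micro' F1 on the same predictions:

```python
# Minimal sketch (not from the paper): contrasting 'macro' and 'micro'
# averaging on a deliberately imbalanced toy label set, using scikit-learn.
from sklearn.metrics import f1_score

# 10 examples: class "a" is prevalent, class "b" is rare.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]  # one rare-class example is misclassified

print(f1_score(y_true, y_pred, average="micro"))  # ~0.90, dominated by the frequent class
print(f1_score(y_true, y_pred, average="macro"))  # ~0.80, the rare class weighs equally
```

Under macro averaging the rare class contributes as much as the frequent one, so the same system can rank differently depending on the scheme; this is the kind of metric property the paper argues should be made explicit when a metric is chosen to rank systems.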
Related papers
- Ranking evaluation metrics from a group-theoretic perspective [5.333192842860574]
We show instances that result in inconsistent evaluations, a potential source of mistrust in commonly used metrics.
Our analysis sheds light on ranking evaluation metrics, highlighting that inconsistent evaluations should not be seen as a source of mistrust.
arXiv Detail & Related papers (2024-08-14T09:06:58Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z) - Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z) - Evaluation of FEM and MLFEM AI-explainers in Image Classification tasks with reference-based and no-reference metrics [0.0]
We revisit the recently proposed post-hoc explainers FEM and MLFEM, which were designed to explain CNNs in image and video classification tasks.
We propose their evaluation with reference-based and no-reference metrics.
As a no-reference metric, we use the "stability" metric proposed by Alvarez-Melis and Jaakkola.
arXiv Detail & Related papers (2022-12-02T14:55:31Z) - Classification Performance Metric Elicitation and its Applications [5.5637552942511155]
Despite its practical interest, there is limited formal guidance on how to select metrics for machine learning applications.
This thesis outlines metric elicitation as a principled framework for selecting the performance metric that best reflects implicit user preferences.
arXiv Detail & Related papers (2022-08-19T03:57:17Z) - On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations [74.70957445600936]
Multiple metrics have been introduced to measure fairness in various natural language processing tasks.
These metrics can be roughly divided into two categories: 1) extrinsic metrics for evaluating fairness in downstream applications and 2) intrinsic metrics for estimating fairness in upstream language representation models.
arXiv Detail & Related papers (2022-03-25T22:17:43Z) - On Quantitative Evaluations of Counterfactuals [88.42660013773647]
This paper consolidates work on evaluating visual counterfactual examples through an analysis and experiments.
We find that while most metrics behave as intended for sufficiently simple datasets, some fail to tell the difference between good and bad counterfactuals when the complexity increases.
We propose two new metrics, the Label Variation Score and the Oracle score, which are both less vulnerable to these shortcomings.
arXiv Detail & Related papers (2021-10-30T05:00:36Z) - Estimation of Fair Ranking Metrics with Incomplete Judgments [70.37717864975387]
We propose a sampling strategy and estimation technique for four fair ranking metrics.
We formulate a robust and unbiased estimator which can operate even with a very limited number of labeled items.
arXiv Detail & Related papers (2021-08-11T10:57:00Z) - Quantitative Evaluations on Saliency Methods: An Experimental Study [6.290238942982972]
We briefly summarize the status quo of the metrics, including faithfulness, localization, false-positives, sensitivity check, and stability.
We conclude that among all the methods we compare, no single explanation method dominates others in all metrics.
arXiv Detail & Related papers (2020-12-31T14:13:30Z)