Analysis and Comparison of Classification Metrics
- URL: http://arxiv.org/abs/2209.05355v4
- Date: Wed, 20 Sep 2023 20:20:45 GMT
- Title: Analysis and Comparison of Classification Metrics
- Authors: Luciana Ferrer
- Abstract summary: Metrics for measuring the quality of system scores include the area under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC or Bayes risk.
We show how to use these metrics to compute a system's calibration loss and compare this metric with the widely-used expected calibration error (ECE).
- Score: 12.092755413404245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A variety of different performance metrics are commonly used in the machine
learning literature for the evaluation of classification systems. Some of the
most common ones for measuring quality of hard decisions are standard and
balanced accuracy, standard and balanced error rate, F-beta score, and Matthews
correlation coefficient (MCC). In this document, we review the definition of
these and other metrics and compare them with the expected cost (EC), a metric
introduced in every statistical learning course but rarely used in the machine
learning literature. We show that both the standard and balanced error rates
are special cases of the EC. Further, we show its relation with F-beta score
and MCC and argue that EC is superior to these traditional metrics for being
based on first principles from statistics, and for being more general,
interpretable, and adaptable to any application scenario. The metrics mentioned
above measure the quality of hard decisions. Yet, most modern classification
systems output continuous scores for the classes which we may want to evaluate
directly. Metrics for measuring the quality of system scores include the area
under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC
or Bayes risk, among others. The last three metrics are special cases of a
family of metrics given by the expected value of proper scoring rules (PSRs).
We review the theory behind these metrics, showing that they are a principled
way to measure the quality of the posterior probabilities produced by a system.
Finally, we show how to use these metrics to compute a system's calibration
loss and compare this metric with the widely-used expected calibration error
(ECE), arguing that calibration loss based on PSRs is superior to the ECE for
being more interpretable, more general, and directly applicable to the
multi-class case, among other reasons.
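The claim that the standard and balanced error rates are special cases of the EC can be illustrated with a short sketch. This is a minimal illustration, not code from the paper; the function name, the 0-1 cost matrix, and the toy labels are all assumptions for the example.

```python
import numpy as np

def expected_cost(y_true, y_pred, cost_matrix, priors=None):
    """Expected cost: EC = sum_{i,j} P(i) * c[i, j] * P(decide j | class i),
    where cost_matrix[i, j] is the cost of deciding class j when the true
    class is i. If priors is None, class priors are estimated empirically."""
    classes = np.arange(cost_matrix.shape[0])
    if priors is None:
        priors = np.array([(y_true == k).mean() for k in classes])
    ec = 0.0
    for i in classes:
        mask = y_true == i
        if not mask.any():
            continue
        for j in classes:
            # Empirical estimate of P(decide j | class i)
            p_j_given_i = (y_pred[mask] == j).mean()
            ec += priors[i] * cost_matrix[i, j] * p_j_given_i
    return ec

# With 0-1 costs and empirical priors, EC reduces to the standard error
# rate; with 0-1 costs and uniform priors, to the balanced error rate.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0])
c01 = 1 - np.eye(2)
standard_er = expected_cost(y_true, y_pred, c01)
balanced_er = expected_cost(y_true, y_pred, c01, priors=np.array([0.5, 0.5]))
```

Changing `cost_matrix` away from 0-1 costs is what makes EC adaptable to application scenarios where errors have asymmetric consequences.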
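Two of the PSR-based score metrics named above, cross-entropy and the Brier score, admit equally short definitions. A minimal sketch, assuming integer labels and an (n_samples, n_classes) array of posterior probabilities; function names are illustrative:

```python
import numpy as np

def cross_entropy(y_true, probs, eps=1e-12):
    """Expected logarithmic scoring rule: average negative log posterior
    assigned to the true class. A proper scoring rule."""
    p_true = np.clip(probs[np.arange(len(y_true)), y_true], eps, 1.0)
    return -np.mean(np.log(p_true))

def brier_score(y_true, probs):
    """Multi-class Brier score: mean squared distance between the posterior
    vector and the one-hot encoding of the true class. Also a proper
    scoring rule."""
    onehot = np.eye(probs.shape[1])[y_true]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

# Toy example: two samples, two classes.
probs = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
labels = np.array([0, 1])
ce = cross_entropy(labels, probs)
bs = brier_score(labels, probs)
```

Both metrics are minimized, in expectation, when `probs` equals the true posterior distribution, which is what makes expected PSRs a principled way to evaluate posterior probabilities.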
Related papers
- $F_β$-plot -- a visual tool for evaluating imbalanced data classifiers [0.0]
The paper proposes a simple approach to analyzing the popular parametric metric $F_\beta$.
The tool makes it possible to indicate, for a given pool of analyzed classifiers, when a given model should be preferred depending on user requirements.
arXiv Detail & Related papers (2024-04-11T18:07:57Z) - Revisiting Meta-evaluation for Grammatical Error Correction [14.822205658480813]
SEEDA is a new dataset for GEC meta-evaluation.
It consists of corrections with human ratings along two different granularities.
The results suggest that edit-based metrics may have been underestimated in existing studies.
arXiv Detail & Related papers (2024-03-05T05:53:09Z) - Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z) - What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability [6.510061176722249]
We argue that more expressive metrics must be developed that accurately measure calibration error.
We use a generalization of Expected Calibration Error (ECE) that measures calibration error under different definitions of reliability.
We find that: 1) definitions of ECE that focus solely on the predicted class fail to accurately measure calibration error under a selection of practically useful definitions of reliability, and 2) many common calibration techniques fail to improve calibration performance uniformly across ECE metrics.
arXiv Detail & Related papers (2022-05-23T16:45:02Z) - Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-02-03T14:41:39Z) - Measuring Disparate Outcomes of Content Recommendation Algorithms with Distributional Inequality Metrics [5.74271110290378]
We evaluate a set of metrics originating from economics, distributional inequality metrics, and their ability to measure disparities in content exposure in the Twitter algorithmic timeline.
We show that we can use these metrics to identify content suggestion algorithms that contribute more strongly to skewed outcomes between users.
arXiv Detail & Related papers (2022-02-03T14:41:39Z) - Evaluating Metrics for Bias in Word Embeddings [64.55554083622258]
We formalize a bias definition based on the ideas from previous works and derive conditions for bias metrics.
We propose a new metric, SAME, to address the shortcomings of existing metrics and mathematically prove that SAME behaves appropriately.
arXiv Detail & Related papers (2021-11-15T16:07:15Z) - The statistical advantage of automatic NLG metrics at the system level [10.540821585237222]
Statistically, humans are unbiased, high variance estimators, while metrics are biased, low variance estimators.
We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap.
Our analysis compares the adjusted error of metrics to humans and a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected.
arXiv Detail & Related papers (2021-05-26T09:53:57Z) - Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that improves the localized calibration error (LCE) more than existing recalibration methods.
arXiv Detail & Related papers (2021-02-22T07:22:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.