Analysis and Comparison of Classification Metrics
- URL: http://arxiv.org/abs/2209.05355v4
- Date: Wed, 20 Sep 2023 20:20:45 GMT
- Title: Analysis and Comparison of Classification Metrics
- Authors: Luciana Ferrer
- Abstract summary: Metrics for measuring the quality of system scores include the area under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC or Bayes risk.
We show how to use these metrics to compute a system's calibration loss and compare this metric with the widely-used expected calibration error (ECE).
- Score: 12.092755413404245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A variety of different performance metrics are commonly used in the machine
learning literature for the evaluation of classification systems. Some of the
most common ones for measuring quality of hard decisions are standard and
balanced accuracy, standard and balanced error rate, F-beta score, and Matthews
correlation coefficient (MCC). In this document, we review the definition of
these and other metrics and compare them with the expected cost (EC), a metric
introduced in every statistical learning course but rarely used in the machine
learning literature. We show that both the standard and balanced error rates
are special cases of the EC. Further, we show its relation with F-beta score
and MCC and argue that EC is superior to these traditional metrics for being
based on first principles from statistics, and for being more general,
interpretable, and adaptable to any application scenario. The metrics mentioned
above measure the quality of hard decisions. Yet, most modern classification
systems output continuous scores for the classes which we may want to evaluate
directly. Metrics for measuring the quality of system scores include the area
under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC
or Bayes risk, among others. The last three metrics are special cases of a
family of metrics given by the expected value of proper scoring rules (PSRs).
We review the theory behind these metrics, showing that they are a principled
way to measure the quality of the posterior probabilities produced by a system.
Finally, we show how to use these metrics to compute a system's calibration
loss and compare this metric with the widely-used expected calibration error
(ECE), arguing that calibration loss based on PSRs is superior to the ECE for
being more interpretable, more general, and directly applicable to the
multi-class case, among other reasons.
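
As a concrete illustration of the relationship stated in the abstract, the expected cost (EC) for a K-class problem reduces to the standard error rate when every error costs 1, and to the balanced error rate when each error cost is scaled inversely to the class prior. The sketch below is a minimal empirical version of that idea; the function name, cost matrices, and toy data are illustrative choices, not taken from the paper.

```python
import numpy as np

def expected_cost(y_true, y_pred, costs, priors=None):
    """Empirical expected cost: sum_i P(i) * sum_j costs[i, j] * P(decide j | class i).

    costs[i, j] is the cost of deciding class j when the true class is i
    (the diagonal is typically 0). If priors is None, empirical class
    frequencies are used.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_classes = costs.shape[0]
    if priors is None:
        priors = np.bincount(y_true, minlength=n_classes) / len(y_true)
    ec = 0.0
    for i in range(n_classes):
        in_class = y_true == i
        if not in_class.any():
            continue
        # conditional decision probabilities P(decide j | class i)
        p_decide = np.bincount(y_pred[in_class], minlength=n_classes) / in_class.sum()
        ec += priors[i] * (costs[i] @ p_decide)
    return ec

# Toy 3-class example
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([1, 0, 1, 1, 1, 2, 2, 2, 2, 0])
K = 3
priors = np.bincount(y_true, minlength=K) / len(y_true)

# 0-1 costs: EC equals the standard error rate
zero_one = 1.0 - np.eye(K)
print(expected_cost(y_true, y_pred, zero_one))           # 0.2 == np.mean(y_true != y_pred)

# Costs inversely proportional to the priors: EC equals the balanced error rate
balanced = (1.0 - np.eye(K)) / (K * priors[:, None])
print(expected_cost(y_true, y_pred, balanced, priors))   # mean of per-class error rates, ~0.233
```

Changing the cost matrix (for example, penalizing one type of error more heavily than another) adapts the same metric to a specific application, which is the flexibility the abstract refers to.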
Related papers
- Comprehensive Equity Index (CEI): Definition and Application to Bias Evaluation in Biometrics [47.762333925222926]
We present a novel metric to quantify biased behaviors of machine learning models.
We focus on and apply it to the operational evaluation of face recognition systems.
arXiv Detail & Related papers (2024-09-03T14:19:38Z)
- Evaluating Posterior Probabilities: Decision Theory, Proper Scoring Rules, and Calibration [10.604555099281173]
We argue that calibration metrics should play no role in the assessment of posterior quality.
We discuss a simple and practical calibration metric, called calibration loss, derived from a decomposition of expected PSRs.
arXiv Detail & Related papers (2024-08-05T21:35:51Z)
- $F_β$-plot -- a visual tool for evaluating imbalanced data classifiers [0.0]
The paper proposes a simple approach to analyzing the popular parametric metric $F_\beta$.
It is possible to indicate for a given pool of analyzed classifiers when a given model should be preferred depending on user requirements.
arXiv Detail & Related papers (2024-04-11T18:07:57Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability [6.510061176722249]
We argue that more expressive metrics must be developed that accurately measure calibration error.
We use a generalization of the Expected Calibration Error (ECE) that measures calibration error under different definitions of reliability.
We find that: 1) definitions of ECE that focus solely on the predicted class fail to accurately measure calibration error under a selection of practically useful definitions of reliability, and 2) many common calibration techniques fail to improve calibration performance uniformly across ECE metrics.
arXiv Detail & Related papers (2022-05-23T16:45:02Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- The statistical advantage of automatic NLG metrics at the system level [10.540821585237222]
Statistically, humans are unbiased, high variance estimators, while metrics are biased, low variance estimators.
We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap.
Our analysis compares the adjusted error of metrics to humans and a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected.
arXiv Detail & Related papers (2021-05-26T09:53:57Z)
- Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that improves this localized calibration error (LCE) more than existing recalibration methods.
arXiv Detail & Related papers (2021-02-22T07:22:12Z)
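
Several entries above, like the main abstract, revolve around expected proper scoring rules (PSRs) and the expected calibration error. As a minimal sketch of the quantities being compared, the code below computes two expected PSRs (the logarithmic score, i.e. cross-entropy, and the quadratic score, i.e. Brier score) together with a standard equal-width-binned, top-label ECE. The PSR-based calibration loss discussed in the abstract would additionally require subtracting the expected PSR obtained after recalibrating the scores, which is not implemented here; the function names and binning scheme are illustrative assumptions, not taken from any of the papers.

```python
import numpy as np

def expected_log_score(probs, labels, eps=1e-12):
    """Expected logarithmic PSR (multi-class cross-entropy)."""
    probs = np.clip(probs, eps, 1.0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def expected_brier_score(probs, labels):
    """Expected quadratic PSR (multi-class Brier score)."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def binned_ece(probs, labels, n_bins=10):
    """Top-label ECE with equal-width confidence bins."""
    conf = probs.max(axis=1)                    # confidence of the predicted class
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # bin weight times |accuracy - average confidence| in the bin
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

# Toy posteriors for a 3-class problem
rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 3, size=500)

print(expected_log_score(probs, labels))    # cross-entropy
print(expected_brier_score(probs, labels))  # Brier score
print(binned_ece(probs, labels))            # binned top-label ECE
```

With labels drawn independently of the scores, as in this toy example, the PSRs sit near their chance-level values; the point is only to show the form of each metric and how the binned ECE differs from an expected PSR.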