Analysis and Comparison of Classification Metrics
- URL: http://arxiv.org/abs/2209.05355v4
- Date: Wed, 20 Sep 2023 20:20:45 GMT
- Title: Analysis and Comparison of Classification Metrics
- Authors: Luciana Ferrer
- Abstract summary: Metrics for measuring the quality of system scores include the area under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC or Bayes risk.
We show how to use these metrics to compute a system's calibration loss and compare this metric with the widely-used expected calibration error (ECE).
- Score: 12.092755413404245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A variety of different performance metrics are commonly used in the machine
learning literature for the evaluation of classification systems. Some of the
most common ones for measuring quality of hard decisions are standard and
balanced accuracy, standard and balanced error rate, F-beta score, and Matthews
correlation coefficient (MCC). In this document, we review the definition of
these and other metrics and compare them with the expected cost (EC), a metric
introduced in every statistical learning course but rarely used in the machine
learning literature. We show that both the standard and balanced error rates
are special cases of the EC. Further, we show its relation with F-beta score
and MCC and argue that EC is superior to these traditional metrics for being
based on first principles from statistics, and for being more general,
interpretable, and adaptable to any application scenario. The metrics mentioned
above measure the quality of hard decisions. Yet, most modern classification
systems output continuous scores for the classes which we may want to evaluate
directly. Metrics for measuring the quality of system scores include the area
under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC
or Bayes risk, among others. The last three metrics are special cases of a
family of metrics given by the expected value of proper scoring rules (PSRs).
We review the theory behind these metrics, showing that they are a principled
way to measure the quality of the posterior probabilities produced by a system.
Finally, we show how to use these metrics to compute a system's calibration
loss and compare this metric with the widely-used expected calibration error
(ECE), arguing that calibration loss based on PSRs is superior to the ECE for
being more interpretable, more general, and directly applicable to the
multi-class case, among other reasons.
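The claim that the standard and balanced error rates are special cases of the EC can be illustrated with a short sketch. This is a minimal illustration, not code from the paper; the function name, the 0-1 cost matrix, and the toy labels are all assumptions for the example.

```python
import numpy as np

def expected_cost(y_true, y_pred, cost_matrix, priors=None):
    """Expected cost: EC = sum_{i,j} P(i) * c[i, j] * P(decide j | class i),
    where cost_matrix[i, j] is the cost of deciding class j when the true
    class is i. If priors is None, class priors are estimated empirically."""
    classes = np.arange(cost_matrix.shape[0])
    if priors is None:
        priors = np.array([(y_true == k).mean() for k in classes])
    ec = 0.0
    for i in classes:
        mask = y_true == i
        if not mask.any():
            continue
        for j in classes:
            # Empirical estimate of P(decide j | class i)
            p_j_given_i = (y_pred[mask] == j).mean()
            ec += priors[i] * cost_matrix[i, j] * p_j_given_i
    return ec

# With 0-1 costs and empirical priors, EC reduces to the standard error
# rate; with 0-1 costs and uniform priors, to the balanced error rate.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0])
c01 = 1 - np.eye(2)
standard_er = expected_cost(y_true, y_pred, c01)
balanced_er = expected_cost(y_true, y_pred, c01, priors=np.array([0.5, 0.5]))
```

Changing `cost_matrix` away from 0-1 costs is what makes EC adaptable to application scenarios where errors have asymmetric consequences.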
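Two of the PSR-based score metrics named above, cross-entropy and the Brier score, admit equally short definitions. A minimal sketch, assuming integer labels and an (n_samples, n_classes) array of posterior probabilities; function names are illustrative:

```python
import numpy as np

def cross_entropy(y_true, probs, eps=1e-12):
    """Expected logarithmic scoring rule: average negative log posterior
    assigned to the true class. A proper scoring rule."""
    p_true = np.clip(probs[np.arange(len(y_true)), y_true], eps, 1.0)
    return -np.mean(np.log(p_true))

def brier_score(y_true, probs):
    """Multi-class Brier score: mean squared distance between the posterior
    vector and the one-hot encoding of the true class. Also a proper
    scoring rule."""
    onehot = np.eye(probs.shape[1])[y_true]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

# Toy example: two samples, two classes.
probs = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
labels = np.array([0, 1])
ce = cross_entropy(labels, probs)
bs = brier_score(labels, probs)
```

Both metrics are minimized, in expectation, when `probs` equals the true posterior distribution, which is what makes expected PSRs a principled way to evaluate posterior probabilities.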
Related papers
- $F_β$-plot -- a visual tool for evaluating imbalanced data classifiers [0.0]
The paper proposes a simple approach to analyzing the popular parametric metric $F_\beta$.
The tool makes it possible to indicate, for a given pool of analyzed classifiers, when a given model should be preferred depending on user requirements.
arXiv Detail & Related papers (2024-04-11T18:07:57Z) - Revisiting Meta-evaluation for Grammatical Error Correction [14.822205658480813]
SEEDA is a new dataset for GEC meta-evaluation.
It consists of corrections with human ratings along two different granularities.
The results suggest that edit-based metrics may have been underestimated in existing studies.
arXiv Detail & Related papers (2024-03-05T05:53:09Z) - Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z) - What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability [6.510061176722249]
We argue that more expressive metrics must be developed that accurately measure calibration error.
We use a generalization of Expected Calibration Error (ECE) that measures calibration error under different definitions of reliability.
We find that: 1) definitions of ECE that focus solely on the predicted class fail to accurately measure calibration error under a selection of practically useful definitions of reliability, and 2) many common calibration techniques fail to improve calibration performance uniformly across ECE metrics.
arXiv Detail & Related papers (2022-05-23T16:45:02Z) - Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-02-03T14:41:39Z) - Measuring Disparate Outcomes of Content Recommendation Algorithms with Distributional Inequality Metrics [5.74271110290378]
We evaluate a set of metrics originating from economics, distributional inequality metrics, and their ability to measure disparities in content exposure in the Twitter algorithmic timeline.
We show that we can use these metrics to identify content suggestion algorithms that contribute more strongly to skewed outcomes between users.
arXiv Detail & Related papers (2022-02-03T14:41:39Z) - Evaluating Metrics for Bias in Word Embeddings [64.55554083622258]
We formalize a bias definition based on the ideas from previous works and derive conditions for bias metrics.
We propose a new metric, SAME, to address the shortcomings of existing metrics and mathematically prove that SAME behaves appropriately.
arXiv Detail & Related papers (2021-11-15T16:07:15Z) - The statistical advantage of automatic NLG metrics at the system level [10.540821585237222]
Statistically, humans are unbiased, high variance estimators, while metrics are biased, low variance estimators.
We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap.
Our analysis compares the adjusted error of metrics to humans and a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected.
arXiv Detail & Related papers (2021-05-26T09:53:57Z) - Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that improves the localized calibration error (LCE) more than existing recalibration methods.
arXiv Detail & Related papers (2021-02-22T07:22:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.