We Need to Talk About Classification Evaluation Metrics in NLP
- URL: http://arxiv.org/abs/2401.03831v1
- Date: Mon, 8 Jan 2024 11:40:48 GMT
- Title: We Need to Talk About Classification Evaluation Metrics in NLP
- Authors: Peter Vickers, Loïc Barrault, Emilio Monti, Nikolaos Aletras
- Abstract summary: In Natural Language Processing (NLP), model generalizability is generally measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC.
The diversity of metrics and the arbitrariness of their application suggest that there is no agreement within NLP on a single best metric to use.
We demonstrate that a random-guess normalised Informedness metric is a parsimonious baseline for task performance.
- Score: 34.73017509294468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Natural Language Processing (NLP) classification tasks such as topic categorisation and sentiment analysis, model generalizability is generally measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC. The diversity of metrics and the arbitrariness of their application suggest that there is no agreement within NLP on a single best metric to use. This lack of agreement suggests there has not been sufficient examination of the underlying heuristics which each metric encodes. To address this we compare several standard classification metrics with more 'exotic' metrics and demonstrate that a random-guess normalised Informedness metric is a parsimonious baseline for task performance. To show how important the choice of metric is, we perform extensive experiments on a wide range of NLP tasks including a synthetic scenario, natural language understanding, question answering and machine translation. Across these tasks we use a superset of metrics to rank models and find that Informedness best captures the ideal model characteristics. Finally, we release a Python implementation of Informedness following the scikit-learn classifier format.
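As a rough illustration of the metric the abstract centres on, the sketch below computes a chance-corrected Informedness score in the `metric(y_true, y_pred)` style of scikit-learn metrics. This is not the authors' released implementation: the function name `informedness_score` and the multiclass handling (a prevalence-weighted average of per-class one-vs-rest Youden's J) are assumptions of this sketch and may differ in detail from the paper's definition.

```python
import numpy as np


def informedness_score(y_true, y_pred):
    """Chance-corrected Informedness (generalised Youden's J).

    Binary case: TPR + TNR - 1, so random guessing scores ~0 and a
    perfect classifier scores 1.  The multiclass handling below is an
    assumption of this sketch: a prevalence-weighted average of
    per-class one-vs-rest informedness.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))

    scores, weights = [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fn = np.sum((y_pred != c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        tn = np.sum((y_pred != c) & (y_true != c))
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity / recall
        tnr = tn / (tn + fp) if (tn + fp) else 0.0  # specificity
        scores.append(tpr + tnr - 1.0)              # per-class informedness
        weights.append(tp + fn)                     # class prevalence in y_true
    return float(np.average(scores, weights=weights))


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    print(informedness_score(y_true, y_pred))  # 0.5 on this toy example
```

The random-guess normalisation is what separates this from Accuracy: on a heavily imbalanced test set, a majority-class baseline can achieve high Accuracy while its Informedness remains at 0.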
Related papers
- Breeding Machine Translations: Evolutionary approach to survive and thrive in the world of automated evaluation [1.90365714903665]
We propose a genetic algorithm (GA) based method for modifying n-best lists produced by a machine translation (MT) system.
Our method offers an innovative approach to improving MT quality and identifying weaknesses in evaluation metrics.
arXiv Detail & Related papers (2023-05-30T18:00:25Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- Self-Adaptive Label Augmentation for Semi-supervised Few-shot Classification [121.63992191386502]
Few-shot classification aims to learn a model that can generalize well to new tasks when only a few labeled samples are available.
We propose a semi-supervised few-shot classification method, SALA (Self-Adaptive Label Augmentation), that assigns an appropriate label to each unlabeled sample.
A major novelty of SALA is the task-adaptive metric, which can learn the metric adaptively for different tasks in an end-to-end fashion, rather than relying on a manually defined metric.
arXiv Detail & Related papers (2022-06-16T13:14:03Z)
- A global analysis of metrics used for measuring performance in natural language processing [9.433496814327086]
We provide the first large-scale cross-sectional analysis of metrics used for measuring performance in natural language processing.
Results suggest that the large majority of natural language processing metrics currently used have properties that may result in an inadequate reflection of a model's performance.
arXiv Detail & Related papers (2022-04-25T11:41:50Z)
- Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data [66.11139091362078]
We provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
arXiv Detail & Related papers (2022-02-06T20:07:35Z)
- Meta-Generating Deep Attentive Metric for Few-shot Classification [53.07108067253006]
We present a novel deep metric meta-generation method to generate a specific metric for a new few-shot learning task.
In this study, we structure the metric using a three-layer deep attentive network that is flexible enough to produce a discriminative metric for each task.
We obtain surprisingly clear performance improvements over state-of-the-art competitors, especially in the challenging cases.
arXiv Detail & Related papers (2020-12-03T02:07:43Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.