We Need to Talk About Classification Evaluation Metrics in NLP
- URL: http://arxiv.org/abs/2401.03831v1
- Date: Mon, 8 Jan 2024 11:40:48 GMT
- Title: We Need to Talk About Classification Evaluation Metrics in NLP
- Authors: Peter Vickers, Loïc Barrault, Emilio Monti, Nikolaos Aletras
- Abstract summary: In Natural Language Processing (NLP), model generalizability is generally measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC.
The diversity of metrics and the arbitrariness of their application suggest that there is no agreement within NLP on a single best metric to use.
We demonstrate that a random-guess normalised Informedness metric is a parsimonious baseline for task performance.
- Score: 34.73017509294468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Natural Language Processing (NLP) classification tasks such as topic categorisation and sentiment analysis, model generalizability is generally measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC. The diversity of metrics and the arbitrariness of their application suggest that there is no agreement within NLP on a single best metric to use. This lack of agreement suggests there has not been sufficient examination of the underlying heuristics which each metric encodes. To address this we compare several standard classification metrics with more 'exotic' metrics and demonstrate that a random-guess normalised Informedness metric is a parsimonious baseline for task performance. To show how important the choice of metric is, we perform extensive experiments on a wide range of NLP tasks including a synthetic scenario, natural language understanding, question answering and machine translation. Across these tasks we use a superset of metrics to rank models and find that Informedness best captures the ideal model characteristics. Finally, we release a Python implementation of Informedness following the scikit-learn classifier format.
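As a rough illustration of the metric the abstract centres on, the sketch below computes a chance-corrected Informedness score in the `metric(y_true, y_pred)` style of scikit-learn metrics. This is not the authors' released implementation: the function name `informedness_score` and the multiclass handling (a prevalence-weighted average of per-class one-vs-rest Youden's J) are assumptions of this sketch and may differ in detail from the paper's definition.

```python
import numpy as np


def informedness_score(y_true, y_pred):
    """Chance-corrected Informedness (generalised Youden's J).

    Binary case: TPR + TNR - 1, so random guessing scores ~0 and a
    perfect classifier scores 1.  The multiclass handling below is an
    assumption of this sketch: a prevalence-weighted average of
    per-class one-vs-rest informedness.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))

    scores, weights = [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fn = np.sum((y_pred != c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        tn = np.sum((y_pred != c) & (y_true != c))
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity / recall
        tnr = tn / (tn + fp) if (tn + fp) else 0.0  # specificity
        scores.append(tpr + tnr - 1.0)              # per-class informedness
        weights.append(tp + fn)                     # class prevalence in y_true
    return float(np.average(scores, weights=weights))


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    print(informedness_score(y_true, y_pred))  # 0.5 on this toy example
```

The random-guess normalisation is what separates this from Accuracy: on a heavily imbalanced test set, a majority-class baseline can achieve high Accuracy while its Informedness remains at 0.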
Related papers
- Breeding Machine Translations: Evolutionary approach to survive and thrive in the world of automated evaluation [1.90365714903665]
We propose a genetic algorithm (GA) based method for modifying n-best lists produced by a machine translation (MT) system.
Our method offers an innovative approach to improving MT quality and identifying weaknesses in evaluation metrics.
arXiv Detail & Related papers (2023-05-30T18:00:25Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- Self-Adaptive Label Augmentation for Semi-supervised Few-shot Classification [121.63992191386502]
Few-shot classification aims to learn a model that can generalize well to new tasks when only a few labeled samples are available.
We propose a semi-supervised few-shot classification method, SALA (Self-Adaptive Label Augmentation), that assigns an appropriate label to each unlabeled sample.
A major novelty of SALA is the task-adaptive metric, which can learn the metric adaptively for different tasks in an end-to-end fashion, rather than relying on a manually defined metric.
arXiv Detail & Related papers (2022-06-16T13:14:03Z)
- A global analysis of metrics used for measuring performance in natural language processing [9.433496814327086]
We provide the first large-scale cross-sectional analysis of metrics used for measuring performance in natural language processing.
Results suggest that the large majority of natural language processing metrics currently used have properties that may result in an inadequate reflection of a model's performance.
arXiv Detail & Related papers (2022-04-25T11:41:50Z)
- Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data [66.11139091362078]
We provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
arXiv Detail & Related papers (2022-02-06T20:07:35Z)
- Meta-Generating Deep Attentive Metric for Few-shot Classification [53.07108067253006]
We present a novel deep metric meta-generation method to generate a specific metric for a new few-shot learning task.
In this study, we structure the metric using a three-layer deep attentive network that is flexible enough to produce a discriminative metric for each task.
We obtain surprisingly clear performance improvements over state-of-the-art competitors, especially in the challenging cases.
arXiv Detail & Related papers (2020-12-03T02:07:43Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.