A global analysis of metrics used for measuring performance in natural
language processing
- URL: http://arxiv.org/abs/2204.11574v1
- Date: Mon, 25 Apr 2022 11:41:50 GMT
- Title: A global analysis of metrics used for measuring performance in natural
language processing
- Authors: Kathrin Blagec and Georg Dorffner and Milad Moradi and Simon Ott and
Matthias Samwald
- Abstract summary: We provide the first large-scale cross-sectional analysis of metrics used for measuring performance in natural language processing.
Results suggest that the large majority of natural language processing metrics currently used have properties that may result in an inadequate reflection of a model's performance.
- Score: 9.433496814327086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Measuring the performance of natural language processing models is
challenging. Traditionally used metrics, such as BLEU and ROUGE, originally
devised for machine translation and summarization, have been shown to suffer
from low correlation with human judgment and a lack of transferability to other
tasks and languages. In the past 15 years, a wide range of alternative metrics
have been proposed. However, it is unclear to what extent this has had an
impact on NLP benchmarking efforts. Here we provide the first large-scale
cross-sectional analysis of metrics used for measuring performance in natural
language processing. We curated, mapped and systematized more than 3500 machine
learning model performance results from the open repository 'Papers with Code'
to enable a global and comprehensive analysis. Our results suggest that the
large majority of natural language processing metrics currently used have
properties that may result in an inadequate reflection of a model's
performance. Furthermore, we found that ambiguities and inconsistencies in the
reporting of metrics may lead to difficulties in interpreting and comparing
model performances, impairing transparency and reproducibility in NLP research.
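To make the critique of surface-overlap metrics concrete, the sketch below (illustrative only, not code or data from the paper) implements a simplified BLEU-style score: clipped n-gram precisions combined as a geometric mean with a brevity penalty, without the smoothing or multi-reference handling of a full implementation such as sacrebleu. The toy sentences show how a meaning-preserving paraphrase can score far below a near-verbatim output that reverses the meaning, which is one way such metrics can misrepresent a model's performance.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU-style score: clipped n-gram precisions (n = 1..max_n)
    combined as a geometric mean, times a brevity penalty.
    Single reference, no smoothing -- for illustration only."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:  # no overlap at some n-gram order -> score collapses to 0
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * geo_mean

reference = "the treatment reduced mortality in the study group"
paraphrase = "the treatment lowered deaths among treated patients"  # faithful, low overlap
reversal = "the treatment increased mortality in the study group"   # wrong, high overlap
print(simple_bleu(paraphrase, reference), simple_bleu(reversal, reference))
```

With these toy inputs the faithful paraphrase scores 0.0 (no 3-gram or 4-gram overlap), while the meaning-reversing near-copy scores approximately 0.6.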
Related papers
- Generalization Measures for Zero-Shot Cross-Lingual Transfer [40.35113593153817]
A model's capacity to generalize its knowledge is crucial to build robust and reliable machine learning systems.
Existing language model evaluation tasks, however, lack metrics that capture model generalization.
We propose a novel and stable algorithm that reliably computes the sharpness of a model optimum, which correlates with generalization.
arXiv Detail & Related papers (2024-04-24T15:38:22Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
The dataset is designed to discover whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Lost in the Source Language: How Large Language Models Evaluate the Quality of Machine Translation [64.5862977630713]
This study investigates how Large Language Models (LLMs) leverage source and reference data in the machine translation evaluation task.
We find that reference information significantly enhances evaluation accuracy, while, surprisingly, source information is sometimes counterproductive.
arXiv Detail & Related papers (2024-01-12T13:23:21Z)
- Benchmarking Causal Study to Interpret Large Language Models for Source Code [6.301373791541809]
This paper introduces a benchmarking strategy named Galeras, comprising curated testbeds for three software engineering (SE) tasks.
We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods.
arXiv Detail & Related papers (2023-08-23T20:32:12Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations (a minimal multi-reference scoring sketch appears after this list).
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data [66.11139091362078]
We provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
arXiv Detail & Related papers (2022-02-06T20:07:35Z)
- Evaluating the Robustness of Neural Language Models to Input Perturbations [7.064032374579076]
In this study, we design and implement various types of character-level and word-level perturbation methods to simulate noisy input texts.
We investigate the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo to handle different types of input perturbations.
The results suggest that language models are sensitive to input perturbations and their performance can decrease even when small changes are introduced.
arXiv Detail & Related papers (2021-08-27T12:31:17Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community to consider more carefully how their models are automatically evaluated.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- A critical analysis of metrics used for measuring progress in artificial intelligence [9.387811897655016]
We analyse the current landscape of performance metrics based on data covering 3867 machine learning model performance results.
Results suggest that the large majority of metrics currently used have properties that may result in an inadequate reflection of a model's performance.
We describe ambiguities in reported metrics, which may lead to difficulties in interpreting and comparing model performances.
arXiv Detail & Related papers (2020-08-06T11:14:37Z)
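As a companion to the "Towards Multiple References Era" entry above, the snippet below sketches how multi-reference scoring looks in practice with the sacrebleu package; the package choice and the sentences are illustrative assumptions, not taken from any of the papers listed.

```python
# Illustrative only: scoring two hypotheses against two reference sets with
# sacrebleu (pip install sacrebleu); all sentences here are invented.
import sacrebleu

hypotheses = [
    "the cat sat on the mat",
    "there is a dog in the park",
]
# One inner list per reference set; each holds one reference per hypothesis.
references = [
    ["the cat is sitting on the mat", "a dog is in the park"],        # set 1
    ["a cat sat on a mat",            "there is a dog at the park"],  # set 2
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```

Each additional reference set gives an n-gram metric more legitimate surface forms to match, which is the mechanism behind the improved agreement with human judgments that the entry describes.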
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.