Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale
- URL: http://arxiv.org/abs/2010.13588v1
- Date: Mon, 26 Oct 2020 13:57:20 GMT
- Title: Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale
- Authors: Ozan Caglayan, Pranava Madhyastha, Lucia Specia
- Abstract summary: A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
- Score: 52.663117551150954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic evaluation of language generation systems is a well-studied problem
in Natural Language Processing. While novel metrics are proposed every year, a
few popular metrics remain as the de facto metrics to evaluate tasks such as
image captioning and machine translation, despite their known limitations. This
is partly due to ease of use, and partly because researchers expect to see them
and know how to interpret them. In this paper, we urge the community for more
careful consideration of how they automatically evaluate their models by
demonstrating important failure cases on multiple datasets, language pairs and
tasks. Our experiments show that metrics (i) usually prefer system outputs to
human-authored texts, (ii) can be insensitive to correct translations of rare
words, (iii) can yield surprisingly high scores when given a single sentence as
system output for the entire test set.
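As an illustration of failure case (iii), the corpus-level check can be sketched with an off-the-shelf scorer. The snippet below is a minimal sketch, assuming the sacrebleu package is installed; the reference file path and the filler sentence are hypothetical placeholders, and the resulting scores depend on the metric and the test set rather than matching any number reported in the paper.

```python
# Minimal sketch of the "single sentence for the whole test set" probe.
# Assumes sacrebleu is installed; file path and filler sentence are hypothetical.
import sacrebleu

# Hypothetical references: one target-language sentence per test segment.
with open("refs.de", encoding="utf-8") as f:  # path is an assumption
    references = [line.strip() for line in f]

# Degenerate "system": the same sentence is output for every segment.
filler = "Es gibt keine Übersetzung."  # illustrative filler sentence
hypotheses = [filler] * len(references)

# Corpus-level scores for the degenerate output; the paper reports that such
# outputs can still receive surprisingly high scores on some metrics.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```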
Related papers
- DISCERN: Decoding Systematic Errors in Natural Language for Text Classifiers [18.279429202248632]
We introduce DISCERN, a framework for interpreting systematic biases in text classifiers using language explanations.
DISCERN iteratively generates precise natural language descriptions of systematic errors by employing an interactive loop between two large language models.
We show that users can interpret systematic biases more effectively (by over 25% relative) and efficiently when described through language explanations as opposed to cluster exemplars.
arXiv Detail & Related papers (2024-10-29T17:04:55Z)
- What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations [0.0]
We investigate the text normalization routine employed by leading ASR models, including OpenAI Whisper, Meta's MMS, Seamless, and Assembly AI's Conformer.
Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison, are fundamentally flawed when applied to Indic scripts.
We propose a shift towards developing text normalization routines that leverage native linguistic expertise.
arXiv Detail & Related papers (2024-09-04T05:08:23Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Are Large Language Models Robust Coreference Resolvers? [17.60248310475889]
We show that prompting for coreference can outperform current unsupervised coreference systems.
Further investigations reveal that instruction-tuned LMs generalize surprisingly well across domains, languages, and time periods.
arXiv Detail & Related papers (2023-05-23T19:38:28Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- GRUEN for Evaluating Linguistic Quality of Generated Text [17.234442722611803]
We propose GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text.
GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output.
arXiv Detail & Related papers (2020-10-06T05:59:25Z)
- The Impact of Indirect Machine Translation on Sentiment Classification [6.719549885077474]
We propose employing a machine translation (MT) system to translate customer feedback into another language.
As performing a direct translation is not always possible, we explore the performance of automatic classifiers on sentences that have been translated.
We conduct several experiments to analyse the performance of our proposed sentiment classification system and discuss the advantages and drawbacks of classifying translated sentences.
arXiv Detail & Related papers (2020-08-25T20:30:21Z)
- How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation [82.96358326053115]
We investigate sensitivity of probing task results to structural design choices.
We probe embeddings in a multilingual setup with design choices that lie in a 'stable region', as we identify for English.
We find that results on English do not transfer to other languages.
arXiv Detail & Related papers (2020-06-16T12:37:50Z)