Towards Explainable Evaluation Metrics for Natural Language Generation
- URL: http://arxiv.org/abs/2203.11131v1
- Date: Mon, 21 Mar 2022 17:05:54 GMT
- Title: Towards Explainable Evaluation Metrics for Natural Language Generation
- Authors: Christoph Leiter and Piyawat Lertvittayakumjorn and Marina Fomicheva
and Wei Zhao and Yang Gao and Steffen Eger
- Abstract summary: We identify key properties and propose key goals of explainable machine translation evaluation metrics.
We conduct our own novel experiments, which find that current adversarial NLP techniques are unsuitable for automatically identifying limitations of high-quality black-box evaluation metrics.
- Score: 36.594817754285984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unlike classical lexical overlap metrics such as BLEU, most current
evaluation metrics (such as BERTScore or MoverScore) are based on black-box
language models such as BERT or XLM-R. They often achieve strong correlations
with human judgments, but recent research indicates that the lower-quality
classical metrics remain dominant, one of the potential reasons being that
their decision processes are transparent. To foster more widespread acceptance
of the novel high-quality metrics, explainability thus becomes crucial. In this
concept paper, we identify key properties and propose key goals of explainable
machine translation evaluation metrics. We also provide a synthesizing overview
of recent approaches for explainable machine translation metrics and discuss
how they relate to those goals and properties. Further, we conduct our own novel
experiments, which (among others) find that current adversarial NLP techniques
are unsuitable for automatically identifying limitations of high-quality
black-box evaluation metrics, as they are not meaning-preserving. Finally, we
provide a vision of future approaches to explainable evaluation metrics and
their evaluation. We hope that our work can help catalyze and guide future
research on explainable evaluation metrics and, indirectly, also contribute to
better and more transparent text generation systems.
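To make the contrast between transparent lexical-overlap metrics and black-box embedding-based metrics concrete, the following sketch scores a hypothesis against a reference with BLEU (via the sacrebleu package) and BERTScore (via the bert-score package), then applies a small adversarial-style word swap that is not meaning-preserving. The sentences, the perturbation, and the choice of packages are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative sketch, not the paper's experimental setup: contrast a transparent
# lexical-overlap metric (BLEU, via sacrebleu) with a black-box embedding-based
# metric (BERTScore, via bert-score), and probe both with an adversarial-style
# edit that is *not* meaning-preserving.
import sacrebleu
from bert_score import score as bert_score

reference = ["The committee approved the new budget on Monday."]
hypothesis = ["The committee approved the new budget on Monday."]
# A small word swap of the kind adversarial attacks produce: the surface form
# barely changes, but the meaning flips ("approved" -> "rejected").
perturbed = ["The committee rejected the new budget on Monday."]

for name, hyp in [("original", hypothesis), ("perturbed", perturbed)]:
    bleu = sacrebleu.corpus_bleu(hyp, [reference])    # n-gram overlap, fully inspectable
    p, r, f1 = bert_score(hyp, reference, lang="en")  # contextual embeddings, opaque
    print(f"{name}: BLEU={bleu.score:.1f}  BERTScore-F1={f1.mean().item():.3f}")
```

Because the perturbation changes the meaning, a metric that scores the perturbed hypothesis lower is behaving correctly; this is why such non-meaning-preserving attacks cannot serve as automatic probes of genuine metric weaknesses, in line with the paper's negative finding.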
Related papers
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Is Context Helpful for Chat Translation Evaluation? [23.440392979857247]
We conduct a meta-evaluation of existing sentence-level automatic metrics to assess the quality of machine-translated chats.
We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings.
We propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model.
arXiv Detail & Related papers (2024-03-13T07:49:50Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study whether there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Towards Explainable Evaluation Metrics for Machine Translation [32.69015745456696]
We identify key properties as well as key goals of explainable machine translation metrics.
We discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-06-22T17:07:57Z)
- OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics [53.779709191191685]
We propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics.
OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics.
We observe that existing metrics have poor correlation with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge.
arXiv Detail & Related papers (2021-05-19T04:45:07Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for back-translation and APE-augmented MT output.
arXiv Detail & Related papers (2020-04-13T16:49:09Z)