BMX: Boosting Natural Language Generation Metrics with Explainability
- URL: http://arxiv.org/abs/2212.10469v2
- Date: Sat, 17 Feb 2024 16:55:27 GMT
- Title: BMX: Boosting Natural Language Generation Metrics with Explainability
- Authors: Christoph Leiter, Hoa Nguyen, Steffen Eger
- Abstract summary: BMX (Boosting Natural Language Generation Metrics with Explainability) explicitly leverages explanations to boost the metrics' performance.
Our tests show improvements for multiple metrics across MT and summarization datasets.
- Score: 23.8476163398993
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art natural language generation evaluation metrics are based on
black-box language models. Hence, recent works consider their explainability
with the goals of better understandability for humans and better metric
analysis, including failure cases. In contrast, our proposed method, BMX
(Boosting Natural Language Generation Metrics with Explainability),
explicitly leverages explanations to boost the metrics' performance. In
particular, we
perceive feature importance explanations as word-level scores, which we
convert, via power means, into a segment-level score. We then combine this
segment-level score with the original metric to obtain a better metric. Our
tests show improvements for multiple metrics across MT and summarization
datasets. While improvements in machine translation are small, they are strong
for summarization. Notably, BMX with the LIME explainer and preselected
parameters achieves an average improvement of 0.087 points in Spearman
correlation on the system-level evaluation of SummEval.
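The aggregation step described in the abstract, word-level explanation scores combined via a power mean and blended with the original metric, can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation: the word importances, the exponent p, and the mixing weight are all hypothetical placeholders, not the paper's preselected parameters.

```python
import math

def power_mean(scores, p):
    """Power (generalized) mean of positive word-level scores."""
    if p == 0:  # limit case p -> 0 is the geometric mean
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

def boosted_metric(original_score, word_scores, p=2, weight=0.5):
    """Blend the original segment-level metric score with the power mean
    of word-level explanation scores (weight is a hypothetical mixing factor)."""
    segment_score = power_mean(word_scores, p)
    return weight * original_score + (1 - weight) * segment_score

# Illustrative word-level importances, e.g. as a LIME explainer might produce.
word_importances = [0.9, 0.4, 0.7, 0.8]
score = boosted_metric(0.75, word_importances, p=2, weight=0.5)
```

Varying p interpolates between emphasizing low word scores (p < 1) and high ones (p > 1), which is presumably why the paper treats it as a tunable parameter.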
Related papers
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences; the results show that this approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z) - Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
The dataset tests whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z) - The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics [8.432864879027724]
We develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics.
Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors.
arXiv Detail & Related papers (2023-05-19T16:42:17Z) - Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z) - DEMETR: Diagnosing Evaluation Metrics for Translation [21.25704103403547]
We release DEMETR, a diagnostic dataset with 31K English examples.
We find that learned metrics perform substantially better than string-based metrics on DEMETR.
arXiv Detail & Related papers (2022-10-25T03:25:44Z) - MENLI: Robust Evaluation Metrics from Natural Language Inference [26.53850343633923]
Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks.
We develop evaluation metrics based on Natural Language Inference (NLI).
We show that our NLI-based metrics are much more robust to these attacks than recent BERT-based metrics.
arXiv Detail & Related papers (2022-08-15T16:30:14Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.