Robustness Tests for Automatic Machine Translation Metrics with
Adversarial Attacks
- URL: http://arxiv.org/abs/2311.00508v1
- Date: Wed, 1 Nov 2023 13:14:23 GMT
- Title: Robustness Tests for Automatic Machine Translation Metrics with
Adversarial Attacks
- Authors: Yichen Huang, Timothy Baldwin
- Abstract summary: We experiment with word- and character-level attacks on three popular machine translation metrics: BERTScore, BLEURT, and COMET.
Our human experiments validate that automatic metrics tend to overpenalize adversarially-degraded translations.
We identify patterns of brittleness that motivate more robust metric development.
- Score: 39.86206454559138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate MT evaluation metric performance on adversarially-synthesized
texts, to shed light on metric robustness. We experiment with word- and
character-level attacks on three popular machine translation metrics:
BERTScore, BLEURT, and COMET. Our human experiments validate that automatic
metrics tend to overpenalize adversarially-degraded translations. We also
identify inconsistencies in BERTScore ratings, where it judges the original
sentence and the adversarially-degraded one as similar, while judging the
degraded translation as notably worse than the original with respect to the
reference. We identify patterns of brittleness that motivate more robust metric
development.
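As a rough illustration of the kind of probe the abstract describes (a sketch, not the authors' code), one can score an original translation and a character-level perturbed copy against the same reference with BERTScore, and additionally score the perturbed copy against the original. The example sentences and the use of the bert-score Python package below are assumptions made for illustration only.

    # Minimal sketch of a character-level robustness probe, assuming the
    # `bert-score` package (pip install bert-score) and its default English model.
    from bert_score import score

    reference = ["The committee approved the new budget on Tuesday."]
    original  = ["The committee approved the new budget on Tuesday."]
    perturbed = ["The commitee aproved the new budgct on Tuesday."]  # illustrative character noise

    # F1 of each candidate against the reference
    _, _, f1_orig = score(original, reference, lang="en", verbose=False)
    _, _, f1_pert = score(perturbed, reference, lang="en", verbose=False)

    # Similarity of the perturbed candidate to the original candidate itself
    _, _, f1_self = score(perturbed, original, lang="en", verbose=False)

    print(f"original  vs. reference: {f1_orig.item():.4f}")
    print(f"perturbed vs. reference: {f1_pert.item():.4f}")
    print(f"perturbed vs. original:  {f1_self.item():.4f}")

A large gap between the second and third numbers would reflect the inconsistency reported in the abstract: the metric treats the perturbed sentence as close to the original, yet penalizes it heavily against the reference.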
Related papers
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by
Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, where expert translators directly annotate poorly translated words.
We propose two tag correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
- BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text [1.4213973379473654]
Machine Translation (MT) of online content is commonly used to process posts written in several languages.
In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors.
We conclude that there is a need for fine-tuning of automatic metrics to make them more robust in detecting sentiment critical errors.
arXiv Detail & Related papers (2021-09-29T07:51:17Z)
- BlonD: An Automatic Evaluation Metric for Document-level Machine Translation [47.691277066346665]
We propose an automatic metric BlonD for document-level machine translation evaluation.
BlonD takes discourse coherence into consideration by calculating the recall and distance of check-pointing phrases and tags.
arXiv Detail & Related papers (2021-03-22T14:14:58Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.