Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation
- URL: http://arxiv.org/abs/2109.10859v1
- Date: Wed, 22 Sep 2021 17:32:18 GMT
- Title: Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation
- Authors: Diptesh Kanojia, Marina Fomicheva, Tharindu Ranasinghe, Frédéric Blain, Constantin Orăsan and Lucia Specia
- Abstract summary: We propose a general methodology for adversarial testing of Quality Estimation for Machine Translation (MT) systems.
We show that despite a high correlation with human judgements achieved by the recent SOTA, certain types of meaning errors are still problematic for QE to detect.
Second, we show that on average, the ability of a given model to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance.
- Score: 25.325624543852086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current Machine Translation (MT) systems achieve very good results on a
growing variety of language pairs and datasets. However, they are known to
produce fluent translation outputs that can contain important meaning errors,
thus undermining their reliability in practice. Quality Estimation (QE) is the
task of automatically assessing the performance of MT systems at test time.
Thus, in order to be useful, QE systems should be able to detect such errors.
However, this ability is yet to be tested in the current evaluation practices,
where QE systems are assessed only in terms of their correlation with human
judgements. In this work, we bridge this gap by proposing a general methodology
for adversarial testing of QE for MT. First, we show that despite a high
correlation with human judgements achieved by the recent SOTA, certain types of
meaning errors are still problematic for QE to detect. Second, we show that on
average, the ability of a given model to discriminate between
meaning-preserving and meaning-altering perturbations is predictive of its
overall performance, thus potentially allowing for comparing QE systems without
relying on manual quality annotation.
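To make the adversarial testing idea more concrete, here is a minimal sketch, assuming a hypothetical `qe_score(source, translation)` function that returns higher scores for better translations; the toy perturbation functions stand in for the richer meaning-preserving and meaning-altering perturbations used in the paper.

```python
import random
from typing import Callable, List, Tuple

def preserve_meaning(translation: str) -> str:
    """Toy meaning-preserving perturbation: a harmless synonym swap."""
    return translation.replace("very good", "excellent")

def alter_meaning(translation: str) -> str:
    """Toy meaning-altering perturbation: insert a crude negation."""
    words = translation.split()
    return " ".join(words[:1] + ["not"] + words[1:]) if words else translation

def discrimination_accuracy(
    qe_score: Callable[[str, str], float],
    pairs: List[Tuple[str, str]],
) -> float:
    """Fraction of segments where the QE model scores the meaning-altering
    perturbation strictly below the meaning-preserving one."""
    hits = 0
    for source, mt_output in pairs:
        ok = qe_score(source, preserve_meaning(mt_output))
        bad = qe_score(source, alter_meaning(mt_output))
        hits += int(bad < ok)
    return hits / len(pairs) if pairs else 0.0

if __name__ == "__main__":
    # Stand-in QE model: random scores, so accuracy should hover around 0.5.
    dummy_qe = lambda src, hyp: random.random()
    data = [("Das ist sehr gut.", "This is very good.")] * 100
    print(f"discrimination accuracy: {discrimination_accuracy(dummy_qe, data):.2f}")
```

If the paper's second finding holds, this discrimination accuracy should track overall QE performance, offering a way to compare QE systems without manual quality annotation.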
Related papers
- Can Automatic Metrics Assess High-Quality Translations? [28.407966066693334]
We show that current metrics are insensitive to nuanced differences in translation quality.
This effect is most pronounced when the quality is high and the variance among alternatives is low.
Using the MQM framework as the gold standard, we systematically stress-test the ability of current metrics to identify translations with no errors as marked by humans.
arXiv Detail & Related papers (2024-05-28T16:44:02Z)
- Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores (see the sketch after this entry).
arXiv Detail & Related papers (2024-01-23T16:07:43Z)
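The following is a minimal sketch of that rule-plus-penalty idea, assuming a hypothetical QE reward value and two illustrative rules (empty output, extreme length ratio); it is not the paper's exact recipe.

```python
def penalized_reward(source: str, hypothesis: str, qe_score: float,
                     penalty: float = 1.0) -> float:
    """Return the QE-based reward, minus a penalty when simple rules
    flag the hypothesis as an incorrect translation (illustrative rules only)."""
    length_ratio = len(hypothesis.split()) / max(len(source.split()), 1)
    looks_incorrect = (
        not hypothesis.strip()      # empty output
        or length_ratio < 0.3       # suspiciously short
        or length_ratio > 3.0       # suspiciously long
    )
    return qe_score - penalty if looks_incorrect else qe_score

# Example: an empty hypothesis keeps an over-optimistic QE score from
# dominating the reward signal during feedback training.
print(penalized_reward("Guten Morgen, wie geht es dir?", "", qe_score=0.8))
```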
Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations (an illustrative prompt appears after this entry).
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
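As a rough illustration of AutoMQM-style prompting, the snippet below only builds an MQM-flavoured error-annotation prompt; the wording, error categories, and severity labels are assumptions for illustration, not the prompt used in the paper, and calling an actual LLM is left to the reader's client of choice.

```python
AUTOMQM_STYLE_PROMPT = """\
You are annotating a machine translation for quality.
Source ({src_lang}): {source}
Translation ({tgt_lang}): {translation}

List each error as: <span> | category (accuracy/fluency/terminology/other) | severity (minor/major).
If there are no errors, answer "no-error"."""

def build_prompt(source: str, translation: str,
                 src_lang: str = "German", tgt_lang: str = "English") -> str:
    """Fill the template; the result would be sent to an LLM for annotation."""
    return AUTOMQM_STYLE_PROMPT.format(
        src_lang=src_lang, tgt_lang=tgt_lang,
        source=source, translation=translation,
    )

print(build_prompt("Der Vertrag wurde gestern unterschrieben.",
                   "The contract was signed tomorrow."))
```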
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Perturbation-based QE: An Explainable, Unsupervised Word-level Quality Estimation Method for Blackbox Machine Translation [12.376309678270275]
Perturbation-based QE works simply by analyzing MT system output on perturbed input source sentences.
Our approach is better at detecting gender bias and word sense disambiguation errors in translation than supervised QE (a simplified sketch follows this entry).
arXiv Detail & Related papers (2023-05-12T13:10:57Z)
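Below is a minimal sketch of the perturbation-based idea from the entry above, assuming a black-box `translate(source)` callable; the word-dropping perturbation and the token-survival stability measure are deliberate simplifications rather than the paper's actual procedure.

```python
from typing import Callable, List

def perturb_source(source: str) -> List[str]:
    """Create perturbed sources by dropping one word at a time (toy scheme)."""
    words = source.split()
    return [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]

def word_stability(translate: Callable[[str], str], source: str) -> dict:
    """For each word in the original MT output, measure how often it survives
    when the source is perturbed; unstable words hint at quality problems."""
    baseline = translate(source).split()
    perturbed_outputs = [set(translate(p).split()) for p in perturb_source(source)]
    if not perturbed_outputs:
        return {}
    return {
        w: sum(w in out for out in perturbed_outputs) / len(perturbed_outputs)
        for w in baseline
    }

# Stand-in "MT system": the identity function, just to make the example runnable.
print(word_stability(lambda s: s, "the bank by the river"))
```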
- Proficiency Matters Quality Estimation in Grammatical Error Correction [30.31557952622774]
This study investigates how supervised quality estimation (QE) models of grammatical error correction (GEC) are affected by the proficiency of the learners who produced the data.
arXiv Detail & Related papers (2022-01-17T03:47:19Z)
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work estimates confidence intervals (Brown et al., 2001) for translation quality evaluation depending on the sample size of the translated text.
The methodology applied is based on Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA); a sketch follows this entry.
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
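As a rough illustration of sample-size-dependent confidence intervals for quality judgements, the sketch below treats each evaluated segment as a Bernoulli trial (error / no error) and bootstraps a 95% interval by Monte Carlo resampling; the data and interval level are assumptions, not the paper's setup.

```python
import random

def bootstrap_error_rate_ci(errors, n_resamples=10_000, alpha=0.05, seed=0):
    """Monte Carlo (bootstrap) confidence interval for a Bernoulli error rate.

    `errors` is a list of 0/1 outcomes, one per evaluated segment."""
    rng = random.Random(seed)
    n = len(errors)
    means = sorted(
        sum(rng.choices(errors, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# A small sample gives a wide interval; more segments narrow it down.
sample = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]       # 3 errors in 10 segments
print(bootstrap_error_rate_ci(sample))         # wide, roughly (0.0, 0.6)
print(bootstrap_error_rate_ci(sample * 10))    # much tighter around 0.3
```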
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- Unsupervised Quality Estimation for Neural Machine Translation [63.38918378182266]
Existing approaches require large amounts of expert annotated data, computation and time for training.
We devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required.
We achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models; a simplified sketch of this idea follows the entry.
arXiv Detail & Related papers (2020-05-21T12:38:06Z)
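To illustrate the kind of signal such an unsupervised approach can rely on, here is a minimal sketch that scores a translation by the MT system's own token log-probabilities; the numbers are invented for the example, and the paper combines several indicators of this sort (including Monte Carlo dropout) rather than this single one.

```python
import math
from typing import List

def mean_log_prob(token_log_probs: List[float]) -> float:
    """Length-normalized log-probability of the MT output under the MT model;
    higher values suggest the system was more confident in its own translation."""
    return sum(token_log_probs) / len(token_log_probs)

# Made-up decoder log-probabilities for two hypothetical outputs.
confident_output = [-0.1, -0.3, -0.2, -0.15]
uncertain_output = [-1.2, -2.5, -0.9, -3.1]

for name, lps in [("confident", confident_output), ("uncertain", uncertain_output)]:
    print(f"{name}: mean log-prob = {mean_log_prob(lps):.2f}, "
          f"perplexity = {math.exp(-mean_log_prob(lps)):.2f}")
```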