Poor Man's Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference
- URL: http://arxiv.org/abs/2301.09008v3
- Date: Tue, 25 Apr 2023 13:10:59 GMT
- Title: Poor Man's Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference
- Authors: Vilém Zouhar, Shehzaad Dhuliawala, Wangchunshu Zhou, Nico Daheim, Tom Kocmi, Yuchen Eleanor Jiang, Mrinmaya Sachan
- Abstract summary: State-of-the-art QE systems based on pretrained language models have been achieving remarkable correlations with human judgements.
We show that even without access to the reference, our model can estimate automated metrics at the sentence-level.
Because automated metrics correlate with human judgements, we can leverage the ME task for pre-training a QE model.
- Score: 27.051818618331428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine translation quality estimation (QE) predicts human judgements of a
translation hypothesis without seeing the reference. State-of-the-art QE
systems based on pretrained language models have been achieving remarkable
correlations with human judgements yet they are computationally heavy and
require human annotations, which are slow and expensive to create. To address
these limitations, we define the problem of metric estimation (ME) where one
predicts the automated metric scores also without the reference. We show that
even without access to the reference, our model can estimate automated metrics
($\rho$=60% for BLEU, $\rho$=51% for other metrics) at the sentence-level.
Because automated metrics correlate with human judgements, we can leverage the
ME task for pre-training a QE model. For the QE task, we find that pre-training
on TER is better ($\rho$=23%) than training from scratch ($\rho$=20%).
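A minimal sketch of the metric-estimation (ME) setup described in the abstract, under simplifying assumptions: sentence-level BLEU computed against references provides the training targets, while the estimator only ever sees the source and the hypothesis. The surface features and Ridge regressor below are placeholder stand-ins for the pretrained-LM model used in the paper; Spearman $\rho$ is the evaluation measure.

```python
# Sketch of metric estimation (ME): predict a sentence-level reference-based metric
# (BLEU here) from the source and hypothesis alone. The feature extractor and Ridge
# regressor are deliberately simple stand-ins for the pretrained-LM model in the paper.
import sacrebleu
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge


def surface_features(src: str, hyp: str) -> list:
    """Toy features; a real ME model would encode (src, hyp) with a pretrained LM."""
    return [len(hyp) / max(len(src), 1), len(hyp.split()), len(src.split())]


def bleu_targets(hyps, refs):
    """Sentence-level BLEU against references -- used only to build training targets."""
    return [sacrebleu.sentence_bleu(h, [r]).score for h, r in zip(hyps, refs)]


def train_me(srcs, hyps, refs):
    X = [surface_features(s, h) for s, h in zip(srcs, hyps)]
    return Ridge().fit(X, bleu_targets(hyps, refs))


def evaluate_me(model, srcs, hyps, refs):
    """Spearman correlation between estimated and true sentence-level BLEU."""
    X = [surface_features(s, h) for s, h in zip(srcs, hyps)]
    rho, _ = spearmanr(model.predict(X), bleu_targets(hyps, refs))
    return rho
```

Swapping sacrebleu.sentence_bleu for sacrebleu.sentence_ter would yield targets for the TER variant that the abstract reports as the better pre-training signal for QE.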
Related papers
- Quality Estimation with $k$-nearest Neighbors and Automatic Evaluation for Model-specific Quality Estimation [14.405862891194344]
We propose a model-specific, unsupervised QE approach, termed $k$NN-QE, that extracts information from the MT model's training data using $k$-nearest neighbors.
Measuring the performance of model-specific QE is not straightforward, since they provide quality scores on their own MT output.
We propose an automatic evaluation method that uses quality scores from reference-based metrics as gold standard instead of human-generated ones.
arXiv Detail & Related papers (2024-04-27T23:52:51Z)
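A short sketch of the automatic evaluation proposed in the $k$NN-QE entry above, assuming sentence-level chrF (via sacrebleu) as the reference-based stand-in for human gold scores; the QE scores themselves would come from the $k$NN-based method, which is not reimplemented here.

```python
# Sketch of the evaluation idea: reference-based metric scores replace human gold labels
# when measuring how well a model-specific QE method ranks its own MT output.
import sacrebleu
from scipy.stats import spearmanr


def proxy_gold(hyps, refs):
    """Reference-based metric (chrF here) used as a stand-in gold standard."""
    return [sacrebleu.sentence_chrf(h, [r]).score for h, r in zip(hyps, refs)]


def evaluate_qe(qe_scores, hyps, refs):
    """Correlation between QE scores and the metric-based proxy gold."""
    rho, _ = spearmanr(qe_scores, proxy_gold(hyps, refs))
    return rho
```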
- Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores, as sketched below.
arXiv Detail & Related papers (2024-01-23T16:07:43Z)
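A minimal sketch of the rule-based penalty described in the entry above. The specific rules (empty output, source copy, extreme length ratio) and the penalty magnitude are illustrative assumptions, not the paper's exact criteria.

```python
# Sketch: rule-based checks flag translations that the QE reward model tends to
# over-score, and a penalty is subtracted from the reward used for feedback training.

def looks_degenerate(src: str, hyp: str) -> bool:
    """Hypothetical rules: empty output, near-copy of the source, or extreme length ratio."""
    if not hyp.strip():
        return True
    if hyp.strip() == src.strip():
        return True
    ratio = len(hyp.split()) / max(len(src.split()), 1)
    return ratio < 0.3 or ratio > 3.0


def penalized_reward(qe_score: float, src: str, hyp: str, penalty: float = 1.0) -> float:
    """Reward for feedback training: raw QE score minus a penalty for rule-flagged outputs."""
    return qe_score - penalty if looks_degenerate(src, hyp) else qe_score
```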
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
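A hedged sketch of an AutoMQM-style prompt and scoring loop. The prompt wording, the MQM-style severity weights, and the call_llm callable are assumptions for illustration; the paper evaluates with PaLM-2 models and its own prompt designs.

```python
# Sketch: ask an LLM to identify and categorize translation errors, then derive a score
# from MQM-like severity weights. Prompt text and weights are illustrative assumptions.

SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}  # assumed MQM-style weights

PROMPT = (
    "You are an expert translator. Identify all errors in the translation below.\n"
    "For each error, output one line in the format: span; category "
    "(accuracy/fluency/terminology/other); severity (minor/major/critical).\n\n"
    "Source: {src}\nTranslation: {hyp}\nErrors:"
)


def automqm_style_score(src: str, hyp: str, call_llm) -> float:
    """Higher is worse: sum of severity weights over the errors the LLM reports."""
    response = call_llm(PROMPT.format(src=src, hyp=hyp))
    penalty = 0.0
    for line in response.splitlines():
        if ";" not in line:
            continue  # skip lines that do not follow the span; category; severity format
        severity = line.rsplit(";", 1)[-1].strip().lower()
        penalty += SEVERITY_WEIGHTS.get(severity, SEVERITY_WEIGHTS["minor"])
    return penalty
```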
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
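A rough sketch of the "universal translation" probe implied above: look for a single candidate output that a learned metric scores highly regardless of the input. The metric callable is a placeholder (e.g., a BLEURT wrapper), and the brute-force search stands in for the paper's minimum risk training.

```python
# Sketch: a universal adversarial translation is a candidate that a learned metric
# rewards across many (source, reference) pairs. `metric` is a placeholder callable.
from statistics import mean


def find_universal_candidate(candidates, eval_pairs, metric):
    """Return the candidate with the highest average metric score over all (src, ref) pairs."""
    def avg_score(cand):
        return mean(metric(src=s, ref=r, hyp=cand) for s, r in eval_pairs)
    return max(candidates, key=avg_score)
```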
- Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation [25.325624543852086]
We propose a general methodology for adversarial testing of Quality Estimation for Machine Translation (MT) systems.
We show that despite a high correlation with human judgements achieved by the recent SOTA, certain types of meaning errors are still problematic for QE to detect.
We also show that, on average, the ability of a given model to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance.
arXiv Detail & Related papers (2021-09-22T17:32:18Z)
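A small sketch of the adversarial test described above, assuming one meaning-altering perturbation (changing digits) and a qe_score(src, hyp) callable where higher means better quality; the paper covers a broader set of perturbation and error types.

```python
# Sketch: perturb each hypothesis so its meaning changes, then check whether the QE
# system lowers its score. The digit-flipping perturbation is one illustrative choice.
import re


def perturb_numbers(hyp: str) -> str:
    """Meaning-altering perturbation: replace every digit with a different one."""
    return re.sub(r"\d", lambda m: str((int(m.group()) + 1) % 10), hyp)


def discrimination_rate(qe_score, srcs, hyps) -> float:
    """Fraction of examples where QE scores the perturbed hypothesis lower than the original."""
    hits = sum(
        qe_score(s, perturb_numbers(h)) < qe_score(s, h)
        for s, h in zip(srcs, hyps)
        if any(c.isdigit() for c in h)
    )
    total = sum(any(c.isdigit() for c in h) for h in hyps)
    return hits / max(total, 1)
```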
- MDQE: A More Accurate Direct Pretraining for Machine Translation Quality Estimation [4.416484585765028]
We argue that there are still gaps between the predictor and the estimator in both data quality and training objectives.
We propose a novel framework that provides a more accurate direct pretraining for QE tasks.
arXiv Detail & Related papers (2021-07-24T09:48:37Z)
- To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation [5.972205906525993]
We investigate which metrics have the highest accuracy to make system-level quality rankings for pairs of systems.
We show that the sole use of BLEU negatively affected the past development of improved models.
arXiv Detail & Related papers (2021-07-22T17:22:22Z)
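A minimal sketch of system-level pairwise accuracy, the kind of ranking agreement the entry above measures; the dictionary format for metric and human scores is an assumption.

```python
# Sketch: for every pair of MT systems, check whether the metric's ranking agrees with
# the human ranking. Both dicts map system name -> aggregate score over a shared test set.
from itertools import combinations


def pairwise_accuracy(metric_scores: dict, human_scores: dict) -> float:
    pairs = list(combinations(metric_scores, 2))
    agree = sum(
        (metric_scores[a] > metric_scores[b]) == (human_scores[a] > human_scores[b])
        for a, b in pairs
    )
    return agree / max(len(pairs), 1)
```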
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
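One plausible way to operationalize the thresholding idea in the entry above, not necessarily the authors' exact procedure: find the smallest metric improvement at which pairwise agreement with human judgements reaches a target rate.

```python
# Sketch: given metric score deltas for system pairs and whether humans agree that the
# higher-scoring system is better, return the smallest delta that "counts" as a real win.

def improvement_threshold(deltas, human_agrees, target: float = 0.95) -> float:
    """Smallest delta whose agreement rate (over pairs with at least that delta) reaches target."""
    for d in sorted(set(deltas)):
        kept = [a for delta, a in zip(deltas, human_agrees) if delta >= d]
        if kept and sum(kept) / len(kept) >= target:
            return d
    return float("inf")  # no delta reaches the target agreement rate
```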
- Unsupervised Quality Estimation for Neural Machine Translation [63.38918378182266]
Existing approaches require large amounts of expert annotated data, computation and time for training.
We devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required.
We achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models.
arXiv Detail & Related papers (2020-05-21T12:38:06Z)
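A glass-box sketch in the spirit of the unsupervised QE entry above: score a hypothesis using only quantities the MT system itself exposes. Average token log-probability is a single illustrative indicator; the paper combines several, including uncertainty estimates, so treat this as a simplification.

```python
# Sketch: an unsupervised quality indicator computed from the MT system's own output
# distribution, with no training data or external resources.
import math


def avg_logprob_score(token_probs) -> float:
    """Mean token log-probability; values closer to 0 mean the system was more confident."""
    if not token_probs:
        return float("-inf")
    return sum(math.log(max(p, 1e-12)) for p in token_probs) / len(token_probs)
```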
- KPQA: A Metric for Generative Question Answering Using Keyphrase Weights [64.54593491919248]
KPQA-metric is a new metric for evaluating the correctness of generative question answering systems.
Our new metric assigns different weights to each token via keyphrase prediction.
We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
arXiv Detail & Related papers (2020-05-01T03:24:36Z)
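A sketch of keyphrase-weighted token matching in the spirit of KPQA, assuming a weight(token) function supplied by a keyphrase-prediction model; the exact weighting and matching scheme here are illustrative, not the paper's.

```python
# Sketch: tokens receive different weights from a keyphrase predictor, so matching an
# answer's key content counts more than matching filler words.

def weighted_f1(pred_tokens, gold_tokens, weight) -> float:
    """weight(token) -> float, e.g. from a keyphrase-prediction model (assumed)."""
    gold, pred = set(gold_tokens), set(pred_tokens)
    overlap = sum(weight(t) for t in pred & gold)
    p = overlap / max(sum(weight(t) for t in pred), 1e-9)
    r = overlap / max(sum(weight(t) for t in gold), 1e-9)
    return 2 * p * r / max(p + r, 1e-9)
```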
- Revisiting Round-Trip Translation for Quality Estimation [0.0]
Quality estimation (QE) is the task of automatically evaluating the quality of translations without human-translated references.
In this paper, we apply semantic embeddings to RTT-based QE.
Our method achieves the highest correlations with human judgments, compared to previous WMT 2019 quality estimation metric task submissions.
arXiv Detail & Related papers (2020-04-29T03:20:22Z)
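A brief sketch of RTT-based QE with semantic embeddings as described in the entry above, assuming a backtranslate function (target language back to source language) and a multilingual sentence encoder; the model choice (LaBSE) is an assumption for illustration.

```python
# Sketch: translate the hypothesis back into the source language and compare the
# round-trip result with the original source in embedding space.
from sentence_transformers import SentenceTransformer, util


def rtt_qe_score(src: str, hyp: str, backtranslate, embedder=None) -> float:
    """Cosine similarity between the source and the round-trip translation of the hypothesis."""
    embedder = embedder or SentenceTransformer("sentence-transformers/LaBSE")
    rtt = backtranslate(hyp)  # hypothesis (target language) -> source language
    emb = embedder.encode([src, rtt], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```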