Practical Perspectives on Quality Estimation for Machine Translation
- URL: http://arxiv.org/abs/2005.03519v1
- Date: Sat, 2 May 2020 01:50:10 GMT
- Title: Practical Perspectives on Quality Estimation for Machine Translation
- Authors: Junpei Zhou, Ciprian Chelba, Yuezhang (Music) Li
- Abstract summary: Sentence level quality estimation (QE) for machine translation (MT) attempts to predict the translation edit rate (TER) cost of post-editing work required to correct MT output.
We find that consumers of MT output are primarily interested in a binary quality metric: is the translated sentence adequate as-is, or does it need post-editing?
We demonstrate that, while classical QE regression models fare poorly on this task, they can be re-purposed by replacing the output regression layer with a binary classification one, achieving 50-60% recall at 90% precision.
- Score: 6.400178956011897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentence level quality estimation (QE) for machine translation (MT) attempts
to predict the translation edit rate (TER) cost of post-editing work required
to correct MT output. We describe our view on sentence-level QE as dictated by
several practical setups encountered in the industry. We find consumers of MT
output---whether human or algorithmic ones---to be primarily interested in a
binary quality metric: is the translated sentence adequate as-is or does it
need post-editing? Motivated by this we propose a quality classification (QC)
view on sentence-level QE whereby we focus on maximizing recall at precision
above a given threshold. We demonstrate that, while classical QE regression
models fare poorly on this task, they can be re-purposed by replacing the
output regression layer with a binary classification one, achieving 50-60%
recall at 90% precision. For a high-quality MT system producing 75-80%
correct translations, this promises a significant reduction in post-editing
work indeed.
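
The re-purposing described above is easy to sketch. TER is roughly the number of edit operations needed to correct the MT output, normalized by reference length, so a sentence-level QE regressor already maps a source/translation pair into a quality-bearing representation; replacing its scalar TER head with a binary "adequate as-is" head and then choosing the decision threshold that maximizes recall subject to a precision floor (90% in the paper) yields the quality classification (QC) setup. The sketch below is a minimal illustration under those assumptions, not the paper's implementation; the encoder placeholder, the class and function names, and the 0.9 precision target are ours.

```python
# Minimal sketch (not the paper's code): re-purposing a sentence-level QE
# regressor as a binary quality classifier and tuning its decision threshold
# for maximum recall at a target precision.
import numpy as np
import torch
import torch.nn as nn


class QESentenceModel(nn.Module):
    """Generic sentence-level QE model: a sentence-pair encoder plus a scalar head."""

    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder                 # placeholder: any (src, mt) -> vector module
        self.head = nn.Linear(hidden_dim, 1)   # regression head: predicts a TER-like score

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(features)).squeeze(-1)


def repurpose_as_classifier(regressor: QESentenceModel, hidden_dim: int) -> QESentenceModel:
    """Keep the trained encoder, swap the TER-regression head for a fresh binary head.

    The new head's output is interpreted as the logit of "adequate as-is" and
    would be fine-tuned with a binary cross-entropy loss.
    """
    regressor.head = nn.Linear(hidden_dim, 1)
    return regressor


def max_recall_at_precision(scores: np.ndarray, labels: np.ndarray,
                            min_precision: float = 0.9) -> tuple[float, float]:
    """Choose the decision threshold with the highest recall among those whose
    precision is at least `min_precision`.

    `scores`: adequacy scores, higher = more likely adequate as-is.
    `labels`: 1 if the sentence is adequate, 0 if it needs post-editing.
    Returns (threshold, recall); (inf, 0.0) if no threshold meets the bar.
    """
    order = np.argsort(-scores)            # candidates sorted by descending score
    sorted_labels = labels[order]
    tp = np.cumsum(sorted_labels)          # true positives when accepting the top k
    k = np.arange(1, len(labels) + 1)
    precision = tp / k
    recall = tp / max(labels.sum(), 1)
    feasible = precision >= min_precision
    if not feasible.any():
        return float("inf"), 0.0
    best = int(np.argmax(np.where(feasible, recall, -1.0)))
    return float(scores[order][best]), float(recall[best])
```

With binary labels derived from TER (for example, treating sentences with zero or near-zero edit cost as adequate), the selected threshold corresponds to the operating point the paper argues matters in practice: how many adequate translations can be passed through untouched while keeping the precision of that automatic decision at 90%.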
Related papers
- Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean [7.843029855730508]
We develop a 1200-sentence MQM evaluation benchmark for the language pair English-Korean.
We find that the reference-free setup outperforms its reference-based counterpart in the style dimension.
Overall, RemBERT emerges as the most promising model.
arXiv Detail & Related papers (2024-03-19T12:02:38Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- MTUncertainty: Assessing the Need for Post-editing of Machine Translation Outputs by Fine-tuning OpenAI LLMs [6.822926897514793]
Translation quality evaluation (TQE) is critical in assessing both machine translation (MT) and human translation (HT) quality without reference translations.
This work examines whether the state-of-the-art large language models (LLMs) can be used for this purpose.
We take OpenAI models as the best state-of-the-art technology and approach TQE as a binary classification task.
arXiv Detail & Related papers (2023-07-31T21:13:30Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Competency-Aware Neural Machine Translation: Can Machine Translation Know its Own Translation Quality? [61.866103154161884]
Neural machine translation (NMT) is often criticized for failures that occur without the system being aware of them.
We propose a novel competency-aware NMT by extending conventional NMT with a self-estimator.
We show that the proposed method delivers outstanding performance on quality estimation.
arXiv Detail & Related papers (2022-11-25T02:39:41Z)
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, in which expert translators directly annotate poorly translated words.
We propose two tag correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
- PreQuEL: Quality Estimation of Machine Translation Outputs in Advance [32.922128367314194]
A PreQuEL system predicts how well a given sentence will be translated, without recourse to the actual translation.
We develop a baseline model for the task and analyze its performance.
We also show that augmenting the training data can improve performance on the standard quality estimation task.
arXiv Detail & Related papers (2022-05-18T18:55:05Z)
- HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation [0.0]
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output.
It covers only a limited number of commonly occurring error types and uses a scoring model with a geometric progression of error penalty points (EPPs) that reflects the severity level of errors in each translation unit.
The approach has several key advantages: the ability to measure and compare less-than-perfect MT output from different systems, the ability to reflect human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and fast application, and higher inter-rater reliability (IRR).
arXiv Detail & Related papers (2021-12-27T18:47:43Z)
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work carries out motivated research to correctly estimate confidence intervals (Brown et al., 2001) depending on the sample size of the translated text.
The methodology we applied for this work is from Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
- Unsupervised Quality Estimation for Neural Machine Translation [63.38918378182266]
Existing approaches require large amounts of expert-annotated data, computation, and time for training.
We devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required.
We achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models.
arXiv Detail & Related papers (2020-05-21T12:38:06Z)
- Revisiting Round-Trip Translation for Quality Estimation [0.0]
Quality estimation (QE) is the task of automatically evaluating the quality of translations without human-translated references.
In this paper, we apply semantic embeddings to round-trip translation (RTT)-based QE.
Our method achieves the highest correlations with human judgments, compared to previous WMT 2019 quality estimation metric task submissions.
arXiv Detail & Related papers (2020-04-29T03:20:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.