Measuring Uncertainty in Translation Quality Evaluation (TQE)
- URL: http://arxiv.org/abs/2111.07699v1
- Date: Mon, 15 Nov 2021 12:09:08 GMT
- Title: Measuring Uncertainty in Translation Quality Evaluation (TQE)
- Authors: Serge Gladkoff, Irina Sorokina, Lifeng Han, Alexandra Alekseeva
- Abstract summary: This work investigates how to correctly estimate confidence intervals \cite{Brown_etal2001Interval} depending on the sample size of the translated text.
The methodology applied in this work draws on Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
- Score: 62.997667081978825
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: From both human translators (HT) and machine translation (MT) researchers'
point of view, translation quality evaluation (TQE) is an essential task.
Translation service providers (TSPs) have to deliver large volumes of
translations that meet customer specifications under tight constraints on
quality, turnaround time, and cost. MT researchers strive to
make their models better, which also requires reliable quality evaluation.
While automatic machine translation evaluation (MTE) metrics and quality
estimation (QE) tools are widely available and easy to access, existing
automated tools are not good enough, and human assessment by professional
translators (HAP) is often chosen as the gold standard
\cite{han-etal-2021-TQA}. Human evaluations, however, are often accused of
having low reliability and agreement. Is this caused by subjectivity, or is
statistics at play? How can we avoid checking the entire text, make TQE more
efficient in terms of cost and effort, and determine the optimal sample size of
the translated text needed to reliably estimate the translation quality of the
entire material? This work addresses these questions by estimating the
confidence intervals \cite{Brown_etal2001Interval} that correspond to the
sample size of the translated text, e.g. the number of words or sentences,
that needs to be processed in the TQE workflow step for a confident and
reliable evaluation of overall translation quality. The methodology we apply
is based on Bernoulli Statistical Distribution Modelling (BSDM) and Monte
Carlo Sampling Analysis (MCSA).
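As a rough illustration of the kind of calculation that BSDM and MCSA involve, here is a minimal sketch with purely hypothetical numbers (not the paper's implementation): it computes a Wilson score confidence interval \cite{Brown_etal2001Interval} for a per-word error rate estimated from a sample, and uses Monte Carlo simulation to show how the interval width depends on the sample size.

```python
# Hypothetical sketch: confidence interval for a Bernoulli error rate
# estimated from a sample of n translated words, plus a Monte Carlo check
# of how the interval width shrinks with sample size. All numbers are
# illustrative and not taken from the paper.
import math
import random


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (Brown et al., 2001)."""
    p_hat = errors / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def mean_interval_width(true_p: float, n: int, trials: int = 10_000) -> float:
    """Average Wilson interval width when n words are drawn from Bernoulli(true_p)."""
    widths = []
    for _ in range(trials):
        errors = sum(random.random() < true_p for _ in range(n))
        lo, hi = wilson_interval(errors, n)
        widths.append(hi - lo)
    return sum(widths) / trials


if __name__ == "__main__":
    # Example: 12 erroneous words found in a 500-word sample.
    lo, hi = wilson_interval(errors=12, n=500)
    print(f"95% CI for the error rate: [{lo:.3f}, {hi:.3f}]")
    # How does the attainable precision depend on the sample size?
    for n in (100, 500, 2000):
        print(n, round(mean_interval_width(true_p=0.024, n=n), 4))
```

Increasing n in the final loop makes the paper's central question concrete: how large must the evaluated sample be before the interval around the estimated error rate is narrow enough for a reliable judgement of the whole text?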
Related papers
- Lost in the Source Language: How Large Language Models Evaluate the Quality of Machine Translation [64.5862977630713]
This study investigates how Large Language Models (LLMs) leverage source and reference data in the machine translation evaluation task.
We find that reference information significantly enhances evaluation accuracy, while, surprisingly, source information is sometimes counterproductive.
arXiv Detail & Related papers (2024-01-12T13:23:21Z)
- MTUncertainty: Assessing the Need for Post-editing of Machine Translation Outputs by Fine-tuning OpenAI LLMs [6.822926897514793]
TQE is critical in assessing both machine translation (MT) and human translation (HT) quality without reference translations.
This work examines whether state-of-the-art large language models (LLMs) can be used for this purpose.
We take OpenAI models as the state of the art and approach TQE as a binary classification task (see the sketch after this list).
arXiv Detail & Related papers (2023-07-31T21:13:30Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Competency-Aware Neural Machine Translation: Can Machine Translation Know its Own Translation Quality? [61.866103154161884]
Neural machine translation (NMT) is often criticized for failures that happen without awareness.
We propose a novel competency-aware NMT by extending conventional NMT with a self-estimator.
We show that the proposed method delivers outstanding performance on quality estimation.
arXiv Detail & Related papers (2022-11-25T02:39:41Z)
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, in which expert translators directly annotate poorly translated words.
We propose two tag-correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show that our proposed dataset is more consistent with human judgement and confirm the effectiveness of the proposed tag-correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
- A Bayesian approach to translators' reliability assessment [0.0]
We treat the Translation Quality Assessment (TQA) process as a complex process and consider it from the point of view of the physics of complex systems.
We build two Bayesian models that parameterise the features involved in the TQA process, namely the translation difficulty and the characteristics of the translators involved in producing the translation and assessing its quality.
We show that reviewers' reliability cannot be taken for granted even if they are expert translators.
arXiv Detail & Related papers (2022-03-14T14:29:45Z)
- HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation [0.0]
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output.
It covers only a limited number of commonly occurring error types and uses a scoring model with a geometric progression of error penalty points (EPPs) reflecting error severity, applied to each translation unit (a minimal sketch of such a scoring rule follows this list).
The approach has several key advantages: the ability to measure and compare less-than-perfect MT output from different systems, the ability to indicate human perception of quality, an immediate estimate of the labor effort required to bring MT output to premium quality, low-cost and faster application, and higher inter-rater reliability (IRR).
arXiv Detail & Related papers (2021-12-27T18:47:43Z)
- Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods [9.210509295803243]
We present a high-level and concise survey of translation quality assessment (TQA) methods, including both manual judgement criteria and automated evaluation metrics.
We hope that this work will be an asset for both translation model researchers and quality assessment researchers.
arXiv Detail & Related papers (2021-05-05T18:28:10Z)
- Unsupervised Quality Estimation for Neural Machine Translation [63.38918378182266]
Existing approaches require large amounts of expert-annotated data, computation, and time for training.
We devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required.
We achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models.
arXiv Detail & Related papers (2020-05-21T12:38:06Z)
- Revisiting Round-Trip Translation for Quality Estimation [0.0]
Quality estimation (QE) is the task of automatically evaluating the quality of translations without human-translated references.
In this paper, we apply semantic embeddings to round-trip translation (RTT) based QE (see the sketch after this list).
Our method achieves the highest correlations with human judgments compared to previous WMT 2019 quality estimation metric task submissions.
arXiv Detail & Related papers (2020-04-29T03:20:22Z)
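For the MTUncertainty entry above, which frames TQE as a binary classification task for an LLM, a minimal sketch of that framing is given below. The prompt wording, label set, and model name are assumptions for illustration, not the paper's actual setup.

```python
# Hypothetical sketch: TQE as binary classification with an LLM.
# The prompt, labels, and model name are placeholders, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def needs_post_editing(source: str, translation: str,
                       model: str = "gpt-4o-mini") -> bool:
    """Return True if the model judges the translation to need post-editing."""
    prompt = (
        "You are a translation quality evaluator.\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
        "Answer with exactly one word, GOOD or BAD, where BAD means the "
        "translation needs post-editing."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    label = response.choices[0].message.content.strip().upper()
    return label.startswith("BAD")


# Example call (made-up sentence pair):
# needs_post_editing("Das Haus ist rot.", "The house is red.")
```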
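For the HOPE entry, the summary describes a scoring model in which error penalty points grow geometrically with error severity. The error typology, the progression base, and the example annotations below are assumptions chosen only to show the shape of such a score, not the framework's actual values.

```python
# Hypothetical sketch of a HOPE-style score: error penalty points (EPPs)
# grow geometrically with severity. The typology and base-2 progression
# are assumptions for illustration, not the framework's actual values.

# Severity level -> penalty points (geometric progression with base 2).
EPP = {"minor": 1, "major": 2, "severe": 4, "critical": 8}


def penalty_per_unit(errors_per_unit: list[list[str]]) -> list[int]:
    """Total penalty for each translation unit, given its error severity labels."""
    return [sum(EPP[severity] for severity in unit) for unit in errors_per_unit]


if __name__ == "__main__":
    # Three translation units annotated by a post-editor (made-up data).
    annotations = [[], ["minor", "minor"], ["major", "critical"]]
    scores = penalty_per_unit(annotations)
    print(scores)                     # [0, 2, 10]
    print(sum(scores) / len(scores))  # average penalty per unit: 4.0
```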
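For the round-trip translation entry, the summary applies semantic embeddings to RTT-based QE; the sketch below outlines that idea using sentence-transformers for the embeddings and a stub for the MT step, since neither the embedding model nor the MT systems used in the paper are specified here.

```python
# Hypothetical sketch of RTT-based QE with semantic embeddings: translate
# forward and back, then score the similarity between the original source
# and its round-trip reconstruction. The embedding model and the translate()
# stub are assumptions, not the paper's actual components.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model


def translate(text: str, src: str, tgt: str) -> str:
    """Stub for a call to an MT system (forward or backward engine)."""
    raise NotImplementedError("plug in your MT systems here")


def rtt_qe_score(source: str, src_lang: str = "en", tgt_lang: str = "de") -> float:
    """Reference-free quality proxy: cosine similarity of source vs. round trip."""
    forward = translate(source, src_lang, tgt_lang)
    round_trip = translate(forward, tgt_lang, src_lang)
    embeddings = embedder.encode([source, round_trip], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()
```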