Difficulty-Aware Machine Translation Evaluation
- URL: http://arxiv.org/abs/2107.14402v1
- Date: Fri, 30 Jul 2021 02:45:36 GMT
- Title: Difficulty-Aware Machine Translation Evaluation
- Authors: Runzhe Zhan, Xuebo Liu, Derek F. Wong, Lidia S. Chao
- Abstract summary: We propose a novel difficulty-aware machine translation evaluation metric.
A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function.
Our proposed method performs well even when all the MT systems are very competitive.
- Score: 19.973201669851626
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The high-quality translation results produced by machine translation (MT)
systems still pose a huge challenge for automatic evaluation. Current MT
evaluation pays the same attention to each sentence component, while the
questions of real-world examinations (e.g., university examinations) have
different difficulties and weightings. In this paper, we propose a novel
difficulty-aware MT evaluation metric, expanding the evaluation dimension by
taking translation difficulty into consideration. A translation that fails to
be predicted by most MT systems will be treated as a difficult one and assigned
a large weight in the final score function, and vice versa. Experimental
results on the WMT19 English-German Metrics shared tasks show that our proposed
method outperforms commonly used MT metrics in terms of human correlation. In
particular, our proposed method performs well even when all the MT systems are
very competitive, which is when most existing metrics fail to distinguish
between them. The source code is freely available at
https://github.com/NLP2CT/Difficulty-Aware-MT-Evaluation.
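To make the weighting idea concrete, below is a minimal segment-level sketch in Python: a segment's difficulty is taken as one minus the average score that a pool of MT systems obtains on it, and each system's final score is the difficulty-weighted average of its per-segment scores. The function names, the epsilon floor, and the segment-level granularity are illustrative assumptions; the released code linked above applies the idea at a finer, token level on top of BERTScore rather than at this coarse granularity.

```python
"""Hedged sketch of difficulty-aware weighting at the segment level.

Not the paper's exact formulation: `base_score` stands in for any
segment-level metric (e.g. sentence BLEU or BERTScore F1).
"""
from typing import Callable, Dict, List


def difficulty_weights(
    references: List[str],
    system_outputs: Dict[str, List[str]],
    base_score: Callable[[str, str], float],
    eps: float = 1e-6,
) -> List[float]:
    """Difficulty of a reference segment = 1 - mean score achieved across systems."""
    weights = []
    for i, ref in enumerate(references):
        mean_score = sum(
            base_score(outputs[i], ref) for outputs in system_outputs.values()
        ) / len(system_outputs)
        # Segments that most systems translate well get a tiny weight;
        # segments that most systems miss get a large weight.
        weights.append(max(1.0 - mean_score, eps))
    return weights


def difficulty_aware_score(
    hypotheses: List[str],
    references: List[str],
    weights: List[float],
    base_score: Callable[[str, str], float],
) -> float:
    """Difficulty-weighted average of per-segment scores for one system."""
    return sum(
        w * base_score(hyp, ref)
        for hyp, ref, w in zip(hypotheses, references, weights)
    ) / sum(weights)
```

Any segment-level metric can be plugged in as `base_score`; when the pooled systems are all strong, the weight concentrates on the few segments that still separate them, which is exactly the competitive-systems regime the abstract highlights.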
Related papers
- Cyber Risks of Machine Translation Critical Errors: Arabic Mental Health Tweets as a Case Study [3.8779763612314637]
We introduce an authentic dataset of machine translation critical errors to point to the ethical and safety issues involved in the common use of MT.
The dataset comprises mistranslations of Arabic mental health postings manually annotated with critical error types.
We also show how the commonly used quality metrics do not penalise critical errors and highlight this as a critical issue that merits further attention from researchers.
arXiv Detail & Related papers (2024-05-19T20:24:51Z)
- MT-Ranker: Reference-free machine translation evaluation by inter-system ranking [14.188948302661933]
We show that MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21.
MT-Ranker achieves state-of-the-art results against both reference-free and reference-based baselines.
arXiv Detail & Related papers (2024-01-30T15:30:03Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
- Bring More Attention to Syntactic Symmetry for Automatic Postediting of High-Quality Machine Translations [4.217162744375792]
We propose a linguistically motivated method of regularization that is expected to enhance APE models' understanding of the target language.
Our analysis of experimental results demonstrates that the proposed method helps improve the state-of-the-art architecture's APE quality for high-quality MTs.
arXiv Detail & Related papers (2023-05-17T20:25:19Z)
- Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt).
This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment levels.
arXiv Detail & Related papers (2023-03-24T05:05:03Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation [72.6667341525552]
We present a new MMT approach based on a strong text-only MT model, which uses neural adapters and a novel guided self-attention mechanism.
We also introduce CoMMuTE, a Contrastive Multimodal Translation Evaluation set of ambiguous sentences and their possible translations.
Our approach obtains competitive results compared to strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks.
arXiv Detail & Related papers (2022-12-20T10:18:18Z)
- HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation [0.0]
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output.
It covers only a limited number of commonly occurring error types and uses a scoring model that assigns error penalty points (EPPs) to each translation unit on a geometric progression reflecting error severity.
The approach has several key advantages, such as the ability to measure and compare less-than-perfect MT output from different systems, the ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, lower cost and faster application, as well as higher inter-rater reliability (IRR). A toy sketch of such a geometric penalty scheme follows this entry.
arXiv Detail & Related papers (2021-12-27T18:47:43Z)
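As a purely illustrative reading of that scoring model, the sketch below assigns error penalty points that grow geometrically with an assumed three-level severity scale and subtracts them from a per-unit point budget; the base of 2, the severity labels, and the 10-point budget are assumptions for demonstration, not values from the HOPE paper.

```python
# Illustrative sketch of a geometric error-penalty scheme in the spirit of HOPE's
# EPPs; severity scale, base, and point budget are assumed, not from the paper.
from typing import List

SEVERITY_LEVEL = {"minor": 1, "major": 2, "critical": 3}  # assumed severity scale


def penalty_points(error_severities: List[str], base: float = 2.0) -> float:
    """EPPs grow geometrically with severity: 2, 4, 8 points per error."""
    return sum(base ** SEVERITY_LEVEL[s] for s in error_severities)


def unit_score(error_severities: List[str], max_points: float = 10.0) -> float:
    """Score for one translation unit: the point budget minus accumulated EPPs."""
    return max(max_points - penalty_points(error_severities), 0.0)


# A unit with one minor and one major error loses 2 + 4 = 6 of its 10 points.
assert unit_score(["minor", "major"]) == 4.0
```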
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)