Evaluating the Efficacy of Length-Controllable Machine Translation
- URL: http://arxiv.org/abs/2305.02300v1
- Date: Wed, 3 May 2023 17:50:33 GMT
- Title: Evaluating the Efficacy of Length-Controllable Machine Translation
- Authors: Hao Cheng, Meng Zhang, Weixuan Wang, Liangyou Li, Qun Liu and Zhihua Zhang
- Abstract summary: This work is the first attempt to systematically evaluate automatic metrics for length-controllable machine translation.
We conduct a rigorous human evaluation on two translation directions and evaluate 18 summarization or translation evaluation metrics.
We find that BLEURT and COMET have the highest correlation with human evaluation and are most suitable as evaluation metrics for length-controllable machine translation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Length-controllable machine translation is a type of constrained translation. It aims to preserve the original meaning as much as possible while controlling the length of the translation. Automatic summarization or machine translation evaluation metrics can be applied to length-controllable machine translation, but they are not necessarily suitable or accurate for it. This work is the first attempt to systematically evaluate automatic metrics for length-controllable machine translation. We conduct a rigorous human evaluation on two translation directions and evaluate 18 summarization or translation evaluation metrics. We find that BLEURT and COMET have the highest correlation with human evaluation and are the most suitable evaluation metrics for length-controllable machine translation.
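The meta-evaluation step the abstract describes, scoring each candidate metric by how well it correlates with human judgments, can be summarized in a few lines. Below is a minimal sketch with invented toy values, not the paper's data or code:

```python
# Minimal sketch of metric meta-evaluation: correlate automatic metric
# scores with human ratings at the segment level. The numbers below are
# invented toy values, not data from the paper.
from scipy.stats import kendalltau, pearsonr

def meta_evaluate(metric_scores, human_ratings):
    """Return (Pearson r, Kendall tau) between a metric and human ratings."""
    r, _ = pearsonr(metric_scores, human_ratings)
    tau, _ = kendalltau(metric_scores, human_ratings)
    return r, tau

# Hypothetical per-segment scores from one metric and the matching human
# adequacy ratings; a metric such as BLEURT or COMET that ranks segments
# the way humans do will show high correlation.
metric_scores = [0.62, 0.45, 0.81, 0.30, 0.74]
human_ratings = [3.5, 2.0, 4.5, 1.5, 4.0]
print(meta_evaluate(metric_scores, human_ratings))
```

Kendall's tau is the conventional choice for segment-level comparisons when human ratings are ordinal; Pearson's r is common at the system level.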
Related papers
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training (arXiv, 2023-07-06)
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets and tendencies inherent in the metric paradigm.
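To make the "universal translation" defect concrete, the sketch below checks whether one fixed candidate scores highly against every reference. The `score` function here is a token-overlap stand-in for illustration only; the paper's finding concerns learned metrics such as BLEURT and BARTScore, which would take its place:

```python
# Illustrative probe for a "universal translation": a fixed candidate that a
# metric scores highly regardless of the reference. The scorer below is a
# simple token-overlap F1 stand-in, not a learned metric.
def score(candidate: str, reference: str) -> float:
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def is_universal(candidate: str, references: list[str], threshold: float) -> bool:
    """True if the candidate beats `threshold` against every reference,
    i.e. the metric rewards it independently of the input."""
    return all(score(candidate, ref) >= threshold for ref in references)

refs = ["the cat sat on the mat", "parliament adjourned early today"]
print(is_universal("the the the", refs, threshold=0.5))  # False for this stand-in
```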
- Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation (arXiv, 2023-03-07)
Interpretability and efficiency are two important considerations for the adoption of neural automatic metrics.
We develop strong-performing automatic metrics for reference-based summarization evaluation.
- Extrinsic Evaluation of Machine Translation Metrics (arXiv, 2022-12-20)
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
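A minimal sketch of the segment-level setup described above, assuming the sacrebleu package for chrF; the sentences and downstream outcome labels are invented:

```python
# Sketch of segment-level metric scoring (chrF via sacrebleu) correlated
# against hypothetical downstream task outcomes; sentences and labels are
# invented for illustration.
import sacrebleu
from scipy.stats import kendalltau

hyps = ["the cat sat on mat", "he go to school", "rain falls in spring",
        "she closed the the door"]
refs = ["the cat sat on the mat", "he went to school", "rain falls in spring",
        "she closed the door"]
task_success = [1, 0, 1, 1]  # e.g. did a downstream QA system answer correctly?

chrf = [sacrebleu.sentence_chrf(h, [r]).score for h, r in zip(hyps, refs)]
tau, _ = kendalltau(chrf, task_success)
print(chrf, tau)  # the paper reports such correlations are negligible
```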
- An Automatic Evaluation of the WMT22 General Machine Translation Task (arXiv, 2022-09-28)
It evaluates a total of 185 systems for 21 translation directions.
It highlights some of the current limits of state-of-the-art machine translation systems.
- Rethinking Round-Trip Translation for Machine Translation Evaluation (arXiv, 2022-09-15)
We report the surprising finding that round-trip translation can be used for automatic evaluation without references.
We demonstrate that this rectification is overdue, as round-trip translation could benefit multiple machine translation evaluation tasks.
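The round-trip idea is easy to state in code: translate the source into a pivot language and back, then score the back-translation against the original source. In the sketch below, `translate` is an identity stand-in so the snippet runs; a real experiment would plug in an actual MT system:

```python
# Sketch of reference-free round-trip evaluation: translate the source into
# a pivot language and back, then score the back-translation against the
# original source.
import sacrebleu

def translate(text: str, src: str, tgt: str) -> str:
    return text  # identity placeholder: plug in a real MT system or API

def round_trip_score(source: str, src_lang: str, pivot_lang: str) -> float:
    forward = translate(source, src_lang, pivot_lang)
    back = translate(forward, pivot_lang, src_lang)
    # chrF similarity of the round trip to the original source; with the
    # identity stand-in this is trivially 100, a real system would degrade it.
    return sacrebleu.sentence_chrf(back, [source]).score

print(round_trip_score("the cat sat on the mat", "en", "de"))
```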
- Measuring Uncertainty in Translation Quality Evaluation (TQE) (arXiv, 2021-11-15)
This work estimates confidence intervals (Brown et al., 2001) for translation quality as a function of the sample size of the translated text.
The methodology applied is Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
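A minimal sketch of those two ingredients, under one common reading: treat each judged segment as a pass/fail Bernoulli trial, compute a Wilson score interval (the family recommended by Brown et al., 2001) for the success proportion, and check it with Monte Carlo resampling. The counts are invented:

```python
# Wilson score interval for a binomial proportion, plus a Monte Carlo check.
# Each judged segment is modelled as a Bernoulli trial; counts are invented.
import math
import random

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def monte_carlo_interval(successes: int, n: int, trials: int = 10_000):
    """Empirical 95% interval from resampled Bernoulli(p_hat) proportions."""
    p = successes / n
    props = sorted(sum(random.random() < p for _ in range(n)) / n
                   for _ in range(trials))
    return props[int(0.025 * trials)], props[int(0.975 * trials)]

# e.g. 230 acceptable segments out of 400 judged
print(wilson_interval(230, 400))
print(monte_carlo_interval(230, 400))
```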
- Automatic Classification of Human Translation and Machine Translation: A Study from the Perspective of Lexical Diversity (arXiv, 2021-05-10)
We show that machine translation and human translation can be classified with above-chance accuracy.
The classification accuracy for machine translation is much higher than that for human translation.
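One lexical-diversity feature such a classifier can build on is the type-token ratio (TTR): machine translation output tends to repeat words more than human translation does. A toy sketch with an invented threshold (the study itself uses a richer feature set):

```python
# Toy single-feature classifier: type-token ratio (TTR) as a lexical
# diversity signal. The threshold and labels are invented for illustration.
def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def classify(text: str, threshold: float = 0.6) -> str:
    """Label a translation 'HT' (human) or 'MT' (machine) by TTR alone."""
    return "HT" if type_token_ratio(text) > threshold else "MT"

print(classify("the cat and the dog and the bird and the fish"))  # MT (TTR ~0.55)
print(classify("the report was filed and subsequently ratified"))  # HT (TTR 1.0)
```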
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics (arXiv, 2020-06-11)
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
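Bootstrap resampling over segments is the standard machinery for asking whether a metric delta between two systems is stable enough to trust; a sketch with hypothetical per-segment scores:

```python
# Bootstrap resampling over segments: how often does system A still beat
# system B when segments are resampled? Scores are hypothetical per-segment
# values for two systems under one metric.
import random

def bootstrap_win_rate(scores_a, scores_b, samples: int = 10_000) -> float:
    """Fraction of bootstrap resamples in which system A outscores system B."""
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples

a = [0.61, 0.55, 0.72, 0.40, 0.66, 0.58]
b = [0.60, 0.57, 0.69, 0.42, 0.64, 0.55]
print(bootstrap_win_rate(a, b))  # a small metric gain may still be unreliable
```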