BlonD: An Automatic Evaluation Metric for Document-level Machine Translation
- URL: http://arxiv.org/abs/2103.11878v1
- Date: Mon, 22 Mar 2021 14:14:58 GMT
- Title: BlonD: An Automatic Evaluation Metric for Document-level Machine Translation
- Authors: Yuchen Jiang, Shuming Ma, Dongdong Zhang, Jian Yang, Haoyang Huang and
Ming Zhou
- Abstract summary: We propose an automatic metric BlonD for document-level machine translation evaluation.
BlonD takes discourse coherence into consideration by calculating the recall and distance of check-pointing phrases and tags.
- Score: 47.691277066346665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard automatic metrics (such as BLEU) are problematic for document-level
MT evaluation. They can neither distinguish document-level improvements in
translation quality from sentence-level ones nor can they identify the specific
discourse phenomena that caused the translation errors. To address these
problems, we propose an automatic metric BlonD for document-level machine
translation evaluation. BlonD takes discourse coherence into consideration by
calculating the recall and distance of check-pointing phrases and tags, and
further provides a comprehensive evaluation score by combining them with n-gram matching.
Extensive comparisons between BlonD and existing evaluation metrics are
conducted to illustrate their critical distinctions. Experimental results show
that BlonD has a much higher document-level sensitivity than previous metrics.
Human evaluation also reveals a high Pearson correlation between BlonD scores
and manual quality judgments.
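
To make the described combination concrete, below is a minimal sketch of a BlonD-style score in Python. It is not the authors' implementation: the checkpoint list, the single-token matching, the `alpha` weighting, and the geometric combination are illustrative assumptions, and the distance term of the actual metric is omitted here.

```python
from collections import Counter

def checkpoint_recall(hypothesis: str, reference: str, checkpoints: list[str]) -> float:
    """Recall of single-token discourse checkpoints (e.g. pronouns, named
    entities, discourse markers) from the reference that survive in the hypothesis."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_tokens = reference.lower().split()
    hits = total = 0
    for cp in checkpoints:
        cp = cp.lower()
        ref_count = ref_tokens.count(cp)
        total += ref_count
        hits += min(hyp_counts[cp], ref_count)
    return hits / total if total else 1.0

def ngram_precision(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Average modified n-gram precision (BLEU-style) over orders 1..max_n."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
    return sum(precisions) / max_n

def blond_like_score(hypothesis: str, reference: str,
                     checkpoints: list[str], alpha: float = 0.5) -> float:
    """Toy combination: geometric interpolation of checkpoint recall and n-gram
    precision. `alpha` and the combination rule are illustrative assumptions,
    not the paper's formula (which also uses checkpoint distance)."""
    r = checkpoint_recall(hypothesis, reference, checkpoints)
    p = ngram_precision(hypothesis, reference)
    return (r ** alpha) * (p ** (1.0 - alpha))

# Toy usage: discourse checkpoints dropped by the hypothesis lower the score,
# even when sentence-level n-gram overlap stays high.
ref = "She said that she would arrive on Monday . However , she was late ."
hyp = "She said that he would arrive on Monday . He was late ."
print(blond_like_score(hyp, ref, checkpoints=["she", "however"]))
```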
Related papers
- Robustness Tests for Automatic Machine Translation Metrics with
Adversarial Attacks [39.86206454559138]
We experiment with word- and character-level attacks on three popular machine translation metrics: BERTScore, BLEURT, and COMET.
Our human experiments validate that automatic metrics tend to overpenalize adversarially-degraded translations.
We identify patterns of brittleness that motivate more robust metric development.
arXiv Detail & Related papers (2023-11-01T13:14:23Z) - BLEURT Has Universal Translations: An Analysis of Automatic Metrics by
Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z) - BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust
Machine Translation Evaluation [12.407789866525079]
We show that by using additional information during training, such as sentence-level features and word-level tags, the trained metrics improve their capability to penalize translations with specific troublesome phenomena.
arXiv Detail & Related papers (2023-05-30T15:50:46Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained
Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z) - End-to-End Page-Level Assessment of Handwritten Text Recognition [69.55992406968495]
HTR systems increasingly face the end-to-end page-level transcription of a document.
Standard metrics do not take into account the inconsistencies that might appear.
We propose a two-fold evaluation, where transcription accuracy and reading-order (RO) goodness are considered separately.
arXiv Detail & Related papers (2023-01-14T15:43:07Z) - Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z) - DEMETR: Diagnosing Evaluation Metrics for Translation [21.25704103403547]
We release DEMETR, a diagnostic dataset with 31K English examples.
We find that learned metrics perform substantially better than string-based metrics on DEMETR.
arXiv Detail & Related papers (2022-10-25T03:25:44Z) - Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine
Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z) - BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output.
arXiv Detail & Related papers (2020-04-13T16:49:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.