Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task
- URL: http://arxiv.org/abs/2010.04297v3
- Date: Mon, 19 Oct 2020 22:40:08 GMT
- Title: Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task
- Authors: Thibault Sellam, Amy Pu, Hyung Won Chung, Sebastian Gehrmann, Qijun Tan, Markus Freitag, Dipanjan Das, Ankur P. Parikh
- Abstract summary: This paper describes our contribution to the WMT 2020 Metrics Shared Task.
We make several submissions based on BLEURT, a metric based on transfer learning.
We show how to combine BLEURT's predictions with those of YiSi and use alternative reference translations to enhance the performance.
- Score: 30.889496911261677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quality of machine translation systems has dramatically improved over the
last decade, and as a result, evaluation has become an increasingly challenging
problem. This paper describes our contribution to the WMT 2020 Metrics Shared
Task, the main benchmark for automatic evaluation of translation. We make
several submissions based on BLEURT, a previously published metric based on
transfer learning. We extend the metric beyond English and evaluate it on 14
language pairs for which fine-tuning data is available, as well as 4
"zero-shot" language pairs, for which we have no labelled examples.
Additionally, we focus on English to German and demonstrate how to combine
BLEURT's predictions with those of YiSi and use alternative reference
translations to enhance the performance. Empirical results show that the models
achieve competitive results on the WMT Metrics 2019 Shared Task, indicating
their promise for the 2020 edition.
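The abstract does not detail how the BLEURT and YiSi predictions are combined or how the alternative references are exploited. The sketch below is a minimal illustration only: it assumes the two metrics are exposed as (reference, candidate) score functions already rescaled to a comparable range, mixes them with a simple linear weight, and credits each candidate with its best-scoring reference. The function names and the weight are hypothetical, not the authors' implementation.

```python
from typing import Callable, Sequence

# Hypothetical metric interfaces: each maps (reference, candidate) -> quality score.
# Both are assumed to be rescaled to a comparable range beforehand.
BleurtFn = Callable[[str, str], float]
YisiFn = Callable[[str, str], float]

def combined_score(
    bleurt: BleurtFn,
    yisi: YisiFn,
    references: Sequence[str],   # one or more alternative reference translations
    candidate: str,
    bleurt_weight: float = 0.5,  # illustrative mixing weight, not from the paper
) -> float:
    """Mix BLEURT and YiSi linearly for each reference, then keep the
    best reference's score so extra references can only help the candidate."""
    if not references:
        raise ValueError("at least one reference translation is required")
    per_reference = [
        bleurt_weight * bleurt(ref, candidate)
        + (1.0 - bleurt_weight) * yisi(ref, candidate)
        for ref in references
    ]
    return max(per_reference)
```

Taking the maximum over references is only one plausible pooling choice; averaging is equally defensible when the alternative references vary in quality.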
Related papers
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
The dataset is designed to reveal whether metrics can identify 68 categories of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks; a minimal sketch of this style of contrastive check appears after this list.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- KIT's Multilingual Speech Translation System for IWSLT 2023 [58.5152569458259]
We describe our speech translation system for the multilingual track of IWSLT 2023.
The task requires translation into 10 languages with varying amounts of resources.
Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation.
arXiv Detail & Related papers (2023-06-08T16:13:20Z)
- Large Language Models Are State-of-the-Art Evaluators of Translation Quality [7.818228526742237]
GEMBA is a GPT-based metric for assessment of translation quality.
We investigate nine versions of GPT models, including ChatGPT and GPT-4.
Our method achieves state-of-the-art accuracy in both modes (with and without a reference translation) when compared to MQM-based human labels.
arXiv Detail & Related papers (2023-02-28T12:23:48Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Alibaba-Translate China's Submission for WMT 2022 Quality Estimation Shared Task [80.22825549235556]
We present UniTE, our submission to the sentence-level MQM benchmark of the WMT 2022 Quality Estimation Shared Task.
Specifically, our systems employ the UniTE framework, which combines three types of input formats during training with a pre-trained language model.
Results show that our models reach 1st overall ranking in the Multilingual and English-Russian settings, and 2nd overall ranking in English-German and Chinese-English settings.
arXiv Detail & Related papers (2022-10-18T08:55:27Z)
- Tencent AI Lab - Shanghai Jiao Tong University Low-Resource Translation System for the WMT22 Translation Task [49.916963624249355]
This paper describes Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) Low-Resource Translation systems for the WMT22 shared task.
We participate in the general translation task on English⇔Livonian.
Our system is based on M2M100 with novel techniques that adapt it to the target language pair.
arXiv Detail & Related papers (2022-10-17T04:34:09Z)
- QEMind: Alibaba's Submission to the WMT21 Quality Estimation Shared Task [24.668012925628968]
We present our submissions to the WMT 2021 QE shared task.
We propose several useful features that evaluate the uncertainty of the translations and use them to build our QE system, named QEMind.
We show that our multilingual systems outperform the best system in the Direct Assessment QE task of WMT 2020.
arXiv Detail & Related papers (2021-12-30T02:27:29Z)
- The JHU-Microsoft Submission for WMT21 Quality Estimation Shared Task [14.629380601429956]
This paper presents the JHU-Microsoft joint submission for WMT 2021 quality estimation shared task.
We participate only in Task 2 (post-editing effort estimation) of the shared task, focusing on target-side word-level quality estimation.
We demonstrate the competitiveness of our system compared to the widely adopted OpenKiwi-XLM baseline.
arXiv Detail & Related papers (2021-09-17T19:13:31Z)
- Ensemble Fine-tuned mBERT for Translation Quality Estimation [0.0]
In this paper, we discuss our submission to the WMT 2021 QE Shared Task.
Our proposed system is an ensemble of multilingual BERT (mBERT)-based regression models.
It demonstrates comparable performance with respect to Pearson's correlation and beats the baseline system in MAE/RMSE for several language pairs.
arXiv Detail & Related papers (2021-09-08T20:13:06Z)
- Unbabel's Participation in the WMT20 Metrics Shared Task [8.621669980568822]
We present the contribution of the Unbabel team to the WMT 2020 Shared Task on Metrics.
We intend to participate in the segment-level, document-level, and system-level tracks on all language pairs.
We illustrate results of our models in these tracks with reference to test sets from the previous year.
arXiv Detail & Related papers (2020-10-29T12:59:44Z)
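Several of the related papers above, ACES in particular, evaluate metrics on contrastive pairs, asking whether a metric scores a correct translation above a minimally perturbed incorrect one. The sketch below is an illustrative check of that kind, assuming the metric is a simple (reference, candidate) -> score callable; the toy data and the token-overlap "metric" are invented for the example and are not taken from ACES.

```python
from typing import Callable, NamedTuple, Sequence

class ContrastivePair(NamedTuple):
    reference: str
    good_translation: str        # acceptable translation
    incorrect_translation: str   # contains a targeted accuracy error

def contrastive_accuracy(
    metric: Callable[[str, str], float],  # (reference, candidate) -> score
    pairs: Sequence[ContrastivePair],
) -> float:
    """Fraction of pairs where the metric ranks the good translation strictly
    above the incorrect one (a Kendall-tau-like pairwise accuracy)."""
    wins = sum(
        metric(p.reference, p.good_translation) > metric(p.reference, p.incorrect_translation)
        for p in pairs
    )
    return wins / len(pairs)

def token_overlap(reference: str, candidate: str) -> float:
    """Deliberately naive stand-in metric: fraction of candidate tokens
    that also occur in the reference."""
    ref_tokens, cand_tokens = set(reference.split()), set(candidate.split())
    return len(ref_tokens & cand_tokens) / max(len(cand_tokens), 1)

pairs = [
    ContrastivePair(
        reference="The meeting starts at ten in the morning.",
        good_translation="The meeting begins at ten in the morning.",
        incorrect_translation="The meeting begins at ten in the evening.",
    ),
]
print(contrastive_accuracy(token_overlap, pairs))  # 1.0 for this toy pair
```

A real challenge set would replace the toy pair with thousands of targeted examples per error category and the stand-in metric with the metric under study.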