When Does Translation Require Context? A Data-driven, Multilingual
Exploration
- URL: http://arxiv.org/abs/2109.07446v2
- Date: Tue, 27 Jun 2023 17:10:50 GMT
- Title: When Does Translation Require Context? A Data-driven, Multilingual
Exploration
- Authors: Patrick Fernandes, Kayo Yin, Emmy Liu, André F. T. Martins, Graham
Neubig
- Abstract summary: Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
- Score: 71.43817945875433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although proper handling of discourse significantly contributes to the
quality of machine translation (MT), these improvements are not adequately
measured in common translation quality metrics. Recent works in context-aware
MT attempt to target a small set of discourse phenomena during evaluation,
however not in a fully systematic way. In this paper, we develop the
Multilingual Discourse-Aware (MuDA) benchmark, a series of taggers that
identify and evaluate model performance on discourse phenomena in any given
dataset. The choice of phenomena is inspired by a novel methodology to
systematically identify translations requiring context. We confirm the
difficulty of previously studied phenomena while uncovering others that were
previously unaddressed. We find that common context-aware MT models make only
marginal improvements over context-agnostic models, which suggests these models
do not handle these ambiguities effectively. We release code and data for 14
language pairs to encourage the MT community to focus on accurately capturing
discourse phenomena.
Related papers
- Context-Aware Machine Translation with Source Coreference Explanation [26.336947440529713]
We propose a model that explains the decisions made for translation by predicting coreference features in the input.
We evaluate our method in the WMT document-level translation task of English-German dataset, the English-Russian dataset, and the multilingual TED talk dataset.
arXiv Detail & Related papers (2024-04-30T12:41:00Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
- Evaluating and Improving the Coreference Capabilities of Machine Translation Models [30.60934078720647]
Machine translation requires a wide range of linguistic capabilities.
Current end-to-end models are expected to learn implicitly by observing aligned sentences in bilingual corpora.
arXiv Detail & Related papers (2023-02-16T18:16:09Z)
- PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents [40.25277134147149]
We present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation.
Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena.
arXiv Detail & Related papers (2020-11-04T04:44:47Z)
- Can Your Context-Aware MT System Pass the DiP Benchmark Tests? : Evaluation Benchmarks for Discourse Phenomena in Machine Translation [7.993547048820065]
We introduce first-of-their-kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena.
Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.
arXiv Detail & Related papers (2020-04-30T07:15:36Z)
- When Does Unsupervised Machine Translation Work? [23.690875724726908]
We conduct an empirical evaluation of unsupervised machine translation (MT) using dissimilar language pairs, dissimilar domains, diverse datasets, and authentic low-resource languages.
We find that performance rapidly deteriorates when source and target corpora are from different domains.
We additionally find that unsupervised MT performance declines when source and target languages use different scripts, and observe very poor performance on authentic low-resource language pairs.
arXiv Detail & Related papers (2020-04-12T00:57:47Z)