Can Your Context-Aware MT System Pass the DiP Benchmark Tests? :
Evaluation Benchmarks for Discourse Phenomena in Machine Translation
- URL: http://arxiv.org/abs/2004.14607v1
- Date: Thu, 30 Apr 2020 07:15:36 GMT
- Title: Can Your Context-Aware MT System Pass the DiP Benchmark Tests? :
Evaluation Benchmarks for Discourse Phenomena in Machine Translation
- Authors: Prathyusha Jwalapuram, Barbara Rychalska, Shafiq Joty and Dominika
Basaj
- Abstract summary: We introduce the first of their kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena.
Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.
- Score: 7.993547048820065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite increasing instances of machine translation (MT) systems including
contextual information, the evidence for translation quality improvement is
sparse, especially for discourse phenomena. Popular metrics like BLEU are not
expressive or sensitive enough to capture quality improvements or drops that
are minor in size but significant in perception. We introduce the first of
their kind MT benchmark datasets that aim to track and hail improvements across
four main discourse phenomena: anaphora, lexical consistency, coherence and
readability, and discourse connective translation. We also introduce evaluation
methods for these tasks, and evaluate several baseline MT systems on the
curated datasets. Surprisingly, we find that existing context-aware models do
not improve discourse-related translations consistently across languages and
phenomena.
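To see why BLEU struggles here, consider a minimal sketch (not from the paper; it assumes the `sacrebleu` package is installed) in which a single mistranslated pronoun breaks the anaphora yet costs only a few corpus-level BLEU points:

```python
# Toy sketch (not the paper's code; assumes `sacrebleu` is installed):
# one mistranslated pronoun flips the referent, but the corpus-level
# BLEU score barely registers the error.
import sacrebleu

# One reference stream covering three segments.
refs = [[
    "The actress arrived late because she had missed her train.",
    "The committee published its final report on Friday.",
    "He thanked the teachers and praised their patience.",
]]

hyp_good = list(refs[0])
hyp_bad = list(refs[0])
hyp_bad[0] = "The actress arrived late because he had missed her train."

print(sacrebleu.corpus_bleu(hyp_good, refs).score)  # 100.0
print(sacrebleu.corpus_bleu(hyp_bad, refs).score)   # still ~89 despite
                                                    # the flipped referent
```

A reader would call the second output wrong; BLEU calls it nearly perfect, which is exactly the gap the proposed benchmarks target.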
Related papers
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
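One way to operationalize this preference test, sketched here under assumed inputs (segment-level scores from any metric; this is not the paper's released code):

```python
# Sketch (assumed inputs, not the paper's code): given segment-level metric
# scores for an older and a newer system on the same source segments, report
# the fraction of segments where the metric prefers the newer translation.
from typing import Sequence

def recency_preference(old_scores: Sequence[float],
                       new_scores: Sequence[float]) -> float:
    assert len(old_scores) == len(new_scores)
    wins = sum(new > old for old, new in zip(old_scores, new_scores))
    return wins / len(old_scores)

# If commercial systems really do improve over time, a good metric should
# score the newer output higher on most segments.
print(recency_preference([0.61, 0.58, 0.70], [0.66, 0.57, 0.74]))  # ~0.67
```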
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
- Machine Translation Impact in E-commerce Multilingual Search [0.0]
Cross-lingual information retrieval performance correlates strongly with the quality of machine translation.
There may be a threshold beyond which improving query translation quality yields little or no further gain in retrieval performance.
arXiv Detail & Related papers (2023-01-31T21:59:35Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
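The mechanics of such a segment-level extrinsic check can be sketched as follows (toy data; assumes the `sacrebleu` and `scipy` packages; the paper's actual tasks, systems, and metrics differ):

```python
# Toy sketch of an extrinsic check (assumed data, not the paper's setup):
# correlate segment-level chrF with a binary downstream outcome.
import sacrebleu
from scipy.stats import pearsonr

hyps = ["The contract was signed.", "The cat sat on a mat.",
        "He missed the train.", "Reports were filed late."]
refs = ["The contract was signed yesterday.", "The cat sat on the mat.",
        "She missed the train.", "The reports were filed late."]
downstream_ok = [1, 1, 0, 1]  # e.g., did a QA system answer correctly?

chrf = [sacrebleu.sentence_chrf(h, [r]).score for h, r in zip(hyps, refs)]
r, p = pearsonr(chrf, downstream_ok)
print(f"Pearson r = {r:.2f} (p = {p:.2f})")
```

The paper's finding is that, at scale, such correlations come out negligible; the toy above only illustrates how the comparison is set up.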
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
- Document-aligned Japanese-English Conversation Parallel Corpus [4.793904440030568]
Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resourced languages, but not document-level (DL) MT.
We present a document-aligned Japanese-English conversation corpus, including balanced, high-quality business conversation data for tuning and testing.
We train MT models using our corpus to demonstrate how using context leads to improvements.
arXiv Detail & Related papers (2020-12-11T06:03:33Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where source texts are directly compared to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
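A reference-free score of this kind reduces to comparing source and hypothesis embeddings. The sketch below uses sentence-transformers' LaBSE as an assumed stand-in for the M-BERT and LASER encoders studied in the paper:

```python
# Sketch of reference-free scoring via cross-lingual embeddings. The paper
# studies M-BERT and LASER; sentence-transformers' LaBSE is used here only
# as an assumed, readily available stand-in encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

source = "Der Vertrag wurde gestern unterzeichnet."  # German source
translation = "The contract was signed yesterday."   # system output

src_vec, hyp_vec = model.encode([source, translation])
cosine = np.dot(src_vec, hyp_vec) / (
    np.linalg.norm(src_vec) * np.linalg.norm(hyp_vec))
print(f"reference-free cosine score: {cosine:.3f}")
```

The paper's point is precisely that such similarity scores, however convenient, track translation quality poorly.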
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.