DivEMT: Neural Machine Translation Post-Editing Effort Across
Typologically Diverse Languages
- URL: http://arxiv.org/abs/2205.12215v1
- Date: Tue, 24 May 2022 17:22:52 GMT
- Title: DivEMT: Neural Machine Translation Post-Editing Effort Across
Typologically Diverse Languages
- Authors: Gabriele Sarti, Arianna Bisazza, Ana Guerberof Arenas, Antonio Toral
- Abstract summary: DivEMT is the first publicly available post-editing study of Neural Machine Translation (NMT) over a typologically diverse set of target languages.
We assess the impact on translation productivity of two state-of-the-art NMT systems, namely: Google Translate and the open-source multilingual model mBART50.
- Score: 5.367993194110256
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce DivEMT, the first publicly available post-editing study of
Neural Machine Translation (NMT) over a typologically diverse set of target
languages. Using a strictly controlled setup, 18 professional translators were
instructed to translate or post-edit the same set of English documents into
Arabic, Dutch, Italian, Turkish, Ukrainian, and Vietnamese. During the process,
their edits, keystrokes, editing times, pauses, and perceived effort were
recorded, enabling an in-depth, cross-lingual evaluation of NMT quality and its
post-editing process. Using this new dataset, we assess the impact on
translation productivity of two state-of-the-art NMT systems, namely: Google
Translate and the open-source multilingual model mBART50. We find that, while
post-editing is consistently faster than translation from scratch, the
magnitude of its contribution varies largely across systems and languages,
ranging from doubled productivity in Dutch and Italian to marginal gains in
Arabic, Turkish and Ukrainian, for some of the evaluated modalities. Moreover,
the observed cross-language variability appears to partly reflect source-target
relatedness and type of target morphology, while remaining hard to predict even
based on state-of-the-art automatic MT quality metrics. We publicly release the
complete dataset, including all collected behavioural data, to foster new
research on the ability of state-of-the-art NMT systems to generate text in
typologically diverse languages.
Related papers
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z) - An Empirical Study on the Robustness of Massively Multilingual Neural Machine Translation [40.08063412966712]
Massively multilingual neural machine translation (MMNMT) has been proven to enhance the translation quality of low-resource languages.
We create a robustness evaluation benchmark dataset for Indonesian-Chinese translation.
This dataset is automatically translated into Chinese using four NLLB-200 models of different sizes.
arXiv Detail & Related papers (2024-05-13T12:01:54Z) - Cross-lingual neural fuzzy matching for exploiting target-language
monolingual corpora in computer-aided translation [0.0]
In this paper, we introduce a novel neural approach aimed at exploiting in-domain target-language (TL) monolingual corpora.
Our approach relies on cross-lingual sentence embeddings to retrieve translation proposals from TL monolingual corpora, and on a neural model to estimate their post-editing effort.
The paper presents an automatic evaluation of these techniques on four language pairs that shows that our approach can successfully exploit monolingual texts in a TM-based CAT environment.
arXiv Detail & Related papers (2024-01-16T14:00:28Z) - An approach for mistranslation removal from popular dataset for Indic MT
Task [5.4755933832880865]
We propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency.
Two Indic languages (ILs), namely, Hindi (HIN) and Odia (ODI) are chosen for the experiment.
The quality of the translations in the experiment is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
arXiv Detail & Related papers (2024-01-12T06:37:19Z) - Translation-Enhanced Multilingual Text-to-Image Generation [61.41730893884428]
Research on text-to-image generation (TTI) still predominantly focuses on the English language.
In this work, we thus investigate multilingual TTI and the current potential of neural machine translation (NMT) to bootstrap mTTI systems.
We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weigh and consolidate the multilingual text knowledge within the mTTI framework.
arXiv Detail & Related papers (2023-05-30T17:03:52Z) - Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
Most research in the area focuses on the multilingual models rather than the Machine Translation component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine translated text, translate-test can do substantially better than previously assumed.
arXiv Detail & Related papers (2023-05-23T16:56:10Z) - Discourse Centric Evaluation of Machine Translation with a Densely
Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z) - Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a versatile'' model, i.e., the Unified Model Learning for NMT (UMLNMT) that works with data from different tasks.
OurNMT results in substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z) - Improving Multilingual Neural Machine Translation System for Indic
Languages [0.0]
We propose a multilingual neural machine translation (MNMT) system to address the issues related to low-resource language translation.
A state-of-the-art transformer architecture is used to realize the proposed model.
Trials over a good amount of data reveal its superiority over the conventional models.
arXiv Detail & Related papers (2022-09-27T09:51:56Z) - Multilingual Neural Machine Translation:Can Linguistic Hierarchies Help? [29.01386302441015]
Multilingual Neural Machine Translation (MNMT) trains a single NMT model that supports translation between multiple languages.
The performance of an MNMT model is highly dependent on the type of languages used in training, as transferring knowledge from a diverse set of languages degrades the translation performance due to negative transfer.
We propose a Hierarchical Knowledge Distillation (HKD) approach for MNMT which capitalises on language groups generated according to typological features and phylogeny of languages to overcome the issue of negative transfer.
arXiv Detail & Related papers (2021-10-15T02:31:48Z) - Improving Massively Multilingual Neural Machine Translation and
Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
arXiv Detail & Related papers (2020-04-24T17:21:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.