Related papers: Translation of Multifaceted Data without Re-Training of Machine Translation Systems

URL: http://arxiv.org/abs/2404.16257v1
Date: Thu, 25 Apr 2024 00:05:19 GMT
Title: Translation of Multifaceted Data without Re-Training of Machine Translation Systems
Authors: Hyeonseok Moon, Seungyoon Lee, Seongtae Hong, Seungjun Lee, Chanjun Park, Heuiseok Lim
Abstract summary: We argue that translating each component separately often overlooks the interrelation between components within the same data point. In our MT pipeline, all the components in a data point are concatenated to form a single translation sequence. We have achieved a considerable improvement in translation quality itself, along with its effectiveness as training data.
Score: 11.153121299726175
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Translating major language resources to build minor language resources becomes a widely-used approach. Particularly in translating complex data points composed of multiple components, it is common to translate each component separately. However, we argue that this practice often overlooks the interrelation between components within the same data point. To address this limitation, we propose a novel MT pipeline that considers the intra-data relation in implementing MT for training data. In our MT pipeline, all the components in a data point are concatenated to form a single translation sequence and subsequently reconstructed to the data components after translation. We introduce a Catalyst Statement (CS) to enhance the intra-data relation, and Indicator Token (IT) to assist the decomposition of a translated sequence into its respective data components. Through our approach, we have achieved a considerable improvement in translation quality itself, along with its effectiveness as training data. Compared with the conventional approach that translates each data component separately, our method yields better training data that enhances the performance of the trained model by 2.690 points for the web page ranking (WPR) task, and 0.845 for the question generation (QG) task in the XGLUE benchmark.
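The concatenate-translate-decompose idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the marker format `[IT0]`, `[IT1]`, ..., the catalyst wording, the function names, and the `translate` callable are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical Indicator Token format; the abstract does not specify the real one.
_IT_PATTERN = re.compile(r"\[IT(\d+)\]")


def build_sequence(components: List[str], catalyst: str = "") -> str:
    """Join all components of a data point into one translation sequence.

    Each component is preceded by an indicator token, and an optional
    catalyst statement is placed in front of the whole sequence.
    """
    body = " ".join(f"[IT{i}] {text}" for i, text in enumerate(components))
    return f"{catalyst} {body}".strip()


def split_sequence(translated: str, n: int) -> Optional[List[str]]:
    """Recover n components from a translated sequence.

    Returns None when the indicator tokens did not survive translation
    intact and in order, so the caller can discard or retry the data point.
    """
    matches = list(_IT_PATTERN.finditer(translated))
    if [int(m.group(1)) for m in matches] != list(range(n)):
        return None
    parts = []
    for k, m in enumerate(matches):
        end = matches[k + 1].start() if k + 1 < len(matches) else len(translated)
        parts.append(translated[m.end():end].strip())
    # Text before the first indicator (the translated catalyst) is dropped.
    return parts


def translate_data_point(
    fields: Dict[str, str],
    translate: Callable[[str], str],
    catalyst: str = "",
) -> Optional[Dict[str, str]]:
    """Translate every field of one data point in a single MT call."""
    keys = list(fields)
    sequence = build_sequence([fields[k] for k in keys], catalyst)
    parts = split_sequence(translate(sequence), len(keys))
    return None if parts is None else dict(zip(keys, parts))


if __name__ == "__main__":
    # Identity function standing in for a real MT system (assumption).
    example = {"question": "Who wrote it?", "passage": "It was written in 1990."}
    print(translate_data_point(example, lambda s: s, "The following parts are related:"))
```

Returning `None` on a failed decomposition is a design choice of this sketch: a data point whose markers were altered by the MT system cannot be reliably split back into its fields, so it is flagged rather than silently misaligned.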

Related papers

Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data [9.67203800171351]
In many settings, in-domain monolingual target-side corpora are often available. This work explores ways to take advantage of such resources by retrieving relevant segments directly in the target language. In experiments with two RANMT architectures, we first demonstrate the benefits of such cross-lingual objectives in a controlled setting. We then showcase our method on a real-world set-up, where the target monolingual resources far exceed the amount of parallel data.
arXiv Detail & Related papers (2025-04-30T15:41:03Z)
Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems. Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives. To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models. We propose a "versatile" model, i.e., the Unified Model Learning for NMT (UMLNMT) that works with data from different tasks. UMLNMT results in substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision. Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets. This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
Synergy with Translation Artifacts for Training and Inference in Multilingual Tasks [11.871523410051527]
This paper shows that combining the use of both translations simultaneously can synergize the results on various multilingual sentence classification tasks. We propose a cross-lingual fine-tuning algorithm called MUSC, which uses SupCon and MixUp jointly and improves the performance.
arXiv Detail & Related papers (2022-10-18T04:55:24Z)
Multilingual Bidirectional Unsupervised Translation Through Multilingual Finetuning and Back-Translation [23.401781865904386]
We propose a two-stage approach for training a single NMT model to translate unseen languages both to and from English. For the first stage, we initialize an encoder-decoder model to pretrained XLM-R and RoBERTa weights, then perform multilingual fine-tuning on parallel data in 40 languages to English. For the second stage, we leverage this generalization ability to generate synthetic parallel data from monolingual datasets, then bidirectionally train with successive rounds of back-translation.
arXiv Detail & Related papers (2022-09-06T21:20:41Z)
Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT). CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level. Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
Translate & Fill: Improving Zero-Shot Multilingual Semantic Parsing with Synthetic Data [2.225882303328135]
We propose a novel Translate-and-Fill (TaF) method to produce silver training data for a multilingual semantic parsing task. Experimental results on three multilingual semantic parsing datasets show that data augmentation with TaF reaches accuracies competitive with similar systems.
arXiv Detail & Related papers (2021-09-09T14:51:11Z)
Diverse Pretrained Context Encodings Improve Document Translation [31.03899564183553]
We propose a new architecture for adapting a sentence-level sequence-to-sequence transformer incorporating multiple pretrained document context signals. Our best multi-context model consistently outperforms the best existing context-aware transformers.
arXiv Detail & Related papers (2021-06-07T15:28:01Z)
Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation [8.554761233491236]
We analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems. We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems.
arXiv Detail & Related papers (2020-05-01T10:50:53Z)
Learning Contextualized Sentence Representations for Document-Level Neural Machine Translation [59.191079800436114]
Document-level machine translation incorporates inter-sentential dependencies into the translation of a source sentence. We propose a new framework to model cross-sentence dependencies by training neural machine translation (NMT) to predict both the target translation and surrounding sentences of a source sentence.
arXiv Detail & Related papers (2020-03-30T03:38:01Z)
