Machine Translation Customization via Automatic Training Data Selection from the Web
- URL: http://arxiv.org/abs/2102.10243v1
- Date: Sat, 20 Feb 2021 03:29:41 GMT
- Title: Machine Translation Customization via Automatic Training Data Selection from the Web
- Authors: Thuy Vu and Alessandro Moschitti
- Abstract summary: We describe an approach for customizing machine translation systems on specific domains.
We select data similar to the target customer data to train neural translation models.
Finally, we train MT models on our automatically selected data, obtaining a system specialized to the target domain.
- Score: 97.98885151955467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine translation (MT) systems, especially when designed for an industrial
setting, are trained with general parallel data derived from the Web. Thus,
their style is typically driven by word/structure distribution coming from the
average of many domains. In contrast, MT customers want translations to be
specialized to their domain, for which they are typically able to provide text
samples. We describe an approach for customizing MT systems on specific domains
by selecting data similar to the target customer data to train neural
translation models. We build document classifiers using monolingual target
data, e.g., provided by the customers, to select parallel training data from
Web-crawled data. Finally, we train MT models on our automatically selected
data, obtaining a system specialized to the target domain. We tested our
approach on the benchmark from the WMT-18 Translation Task for the News
domain, enabling comparisons with state-of-the-art MT systems. The results
show that our models
outperform the top systems while using less data and smaller models.
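A minimal sketch of the selection recipe described in the abstract: train a binary domain classifier on the customer's monolingual documents versus generic Web documents, then keep only the parallel pairs the classifier scores as in-domain. The TF-IDF features, logistic-regression classifier, probability threshold, and the choice to score the target side of each pair are illustrative assumptions, not the exact pipeline from the paper.

```python
# Hypothetical sketch of classifier-based selection of parallel training data.
# The TF-IDF + logistic-regression classifier, the 0.5 threshold, and scoring
# the target side of each pair are illustrative assumptions, not the exact
# pipeline used in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_domain_classifier(customer_docs, generic_web_docs):
    """Binary classifier: customer (in-domain) documents vs. generic Web documents."""
    texts = list(customer_docs) + list(generic_web_docs)
    labels = [1] * len(customer_docs) + [0] * len(generic_web_docs)
    clf = make_pipeline(TfidfVectorizer(max_features=50_000),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf


def select_parallel_data(clf, parallel_corpus, threshold=0.5):
    """Keep (source, target) pairs whose target side looks in-domain."""
    targets = [tgt for _, tgt in parallel_corpus]
    probs = clf.predict_proba(targets)[:, 1]
    return [pair for pair, p in zip(parallel_corpus, probs) if p >= threshold]
```

The selected pairs can then be used to train or fine-tune the domain-specialized MT model described above.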
Related papers
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, consists of adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- Segment-Based Interactive Machine Translation for Pre-trained Models [2.0871483263418806]
We explore the use of pre-trained large language models (LLMs) in interactive machine translation environments.
The system generates perfect translations interactively using the feedback provided by the user at each iteration.
We compare the performance of mBART, mT5 and a state-of-the-art (SoTA) machine translation model on a benchmark dataset regarding user effort.
arXiv Detail & Related papers (2024-07-09T16:04:21Z)
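For the segment-based interactive setup summarized in the entry above, a rough sketch of the interaction loop is given below; the model interface, the feedback object, and the one-segment-per-iteration protocol are simplified placeholders, not the API of the systems evaluated in the paper (which build on pre-trained models such as mBART and mT5).

```python
# Hypothetical sketch of a segment-based interactive translation loop.
# The model interface, the feedback object, and the one-segment-per-iteration
# protocol are simplified placeholders, not the API used in the paper.

def interactive_translate(model, source, get_user_feedback):
    """Iteratively refine a translation using user-validated segments."""
    validated = []                      # segments the user has already approved
    while True:
        # Placeholder call: generate a hypothesis consistent with the
        # segments validated so far.
        hypothesis = model.complete(source, validated)
        feedback = get_user_feedback(hypothesis)
        if feedback.accepted:
            return hypothesis
        # The user validates or corrects one segment per iteration; the model
        # must keep it verbatim in the next hypothesis.
        validated.append(feedback.corrected_segment)
```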
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Alibaba-Translate China's Submission for WMT 2022 Metrics Shared Task [61.34108034582074]
We build our system based on the core idea of UNITE (Unified Translation Evaluation).
During the model pre-training phase, we first use pseudo-labeled data examples to continue pre-training UNITE.
During the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years' WMT competitions.
arXiv Detail & Related papers (2022-10-18T08:51:25Z)
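The two-phase recipe in the metrics-task entry above (continued pre-training on pseudo-labeled examples, then fine-tuning on DA and MQM annotations) can be outlined roughly as follows; the model, data loaders, MSE objective, and epoch counts are assumptions for illustration, not details from the submission.

```python
# Hypothetical two-phase schedule in the spirit of the entry above: continued
# pre-training on pseudo-labeled examples, then fine-tuning on human DA/MQM
# scores. The model, data loaders, MSE objective, and epoch counts are
# placeholders, not details from the submission.
import torch


def run_phase(model, optimizer, loader, epochs):
    """One training phase: regress the predicted quality onto the given scores."""
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for batch in loader:            # batch: dict with token ids and a score
            optimizer.zero_grad()
            pred = model(batch["input_ids"], batch["attention_mask"])
            loss = loss_fn(pred.squeeze(-1), batch["score"])
            loss.backward()
            optimizer.step()


# Phase 1: pseudo-labeled examples scored by existing metrics/models.
# run_phase(model, optimizer, pseudo_labeled_loader, epochs=1)
# Phase 2: Direct Assessment (DA) and MQM annotations from past WMT editions.
# run_phase(model, optimizer, da_mqm_loader, epochs=3)
```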
- Data Selection Curriculum for Neural Machine Translation [31.55953464971441]
We introduce a two-stage curriculum training framework for NMT models.
We fine-tune a base NMT model on subsets of data, selected by both deterministic scoring using pre-trained methods and online scoring.
Our curriculum strategies consistently deliver better quality (up to +2.2 BLEU improvement) and faster convergence.
arXiv Detail & Related papers (2022-03-25T19:08:30Z)
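A rough sketch of a two-stage data-selection curriculum in the spirit of the entry above: an offline stage that keeps the top examples under a fixed pre-trained scorer, followed by phases that re-rank the data with the emerging NMT model. The scoring functions, keep ratios, and the `model.fit` call are placeholders rather than the paper's implementation.

```python
# Hypothetical sketch of a two-stage data-selection curriculum for NMT.
# "offline_score" stands for a deterministic score from a pre-trained model,
# "online_score" for a score computed with the partially trained NMT model;
# the keep ratios and the model.fit call are placeholders.

def offline_selection(corpus, offline_score, keep_ratio=0.5):
    """Stage 1: keep the best examples according to a fixed, pre-computed scorer."""
    ranked = sorted(corpus, key=offline_score, reverse=True)
    return ranked[: int(len(ranked) * keep_ratio)]


def online_curriculum(model, corpus, online_score, phases=3):
    """Stage 2: periodically re-rank with the emerging model and fine-tune."""
    data = list(corpus)
    for _ in range(phases):
        data.sort(key=lambda ex: online_score(model, ex), reverse=True)
        subset = data[: max(1, len(data) // 2)]   # train on the current best half
        model.fit(subset)                          # placeholder training call
    return model
```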
- Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts [0.0]
We propose a method for selecting in-domain data from generic-domain (parallel text) corpora for the task of machine translation.
The proposed method ranks sentences in parallel general-domain data according to their cosine similarity with a monolingual domain-specific data set.
We then select the top K sentences with the highest similarity score to train a new machine translation system tuned to the specific in-domain data.
arXiv Detail & Related papers (2021-12-11T23:29:26Z)
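The cosine-similarity ranking described in the entry above lends itself to a compact sketch: embed the monolingual in-domain set, build a normalized centroid, score one side of each generic parallel pair against it, and keep the top K. The sentence-transformers encoder, the decision to score the source side, and the value of K are illustrative assumptions.

```python
# Hypothetical sketch of cosine-similarity-based sentence selection. The
# sentence-transformers encoder, the decision to score the source side, and
# the value of K are illustrative choices, not the setup from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer


def select_top_k(general_pairs, in_domain_sentences, k=100_000):
    """Rank generic parallel pairs by similarity to an in-domain centroid."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    domain_vecs = encoder.encode(in_domain_sentences, normalize_embeddings=True)
    centroid = domain_vecs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)

    sources = [src for src, _ in general_pairs]
    src_vecs = encoder.encode(sources, normalize_embeddings=True)
    scores = src_vecs @ centroid            # cosine similarity of unit vectors
    order = np.argsort(-scores)
    return [general_pairs[i] for i in order[:k]]
```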
- Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
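A hedged sketch of domain-cluster-based selection as summarized above: cluster pooled sentence embeddings of the generic corpus without supervision, find which cluster the known in-domain sample falls into, and keep the generic sentences assigned to that cluster. The `embed` function, the Gaussian mixture, and the number of clusters are stand-ins, not the paper's exact configuration.

```python
# Hypothetical sketch of domain-cluster-based data selection. The `embed`
# function (any sentence encoder), the Gaussian mixture, and the number of
# clusters are stand-ins rather than the paper's exact configuration.
import numpy as np
from sklearn.mixture import GaussianMixture


def cluster_and_select(embed, generic_sentences, domain_sentences, n_clusters=5):
    """Cluster generic data without supervision; keep the in-domain cluster."""
    X = np.stack([embed(s) for s in generic_sentences])
    gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(X)

    # Find the cluster that the known in-domain sample mostly maps to.
    D = np.stack([embed(s) for s in domain_sentences])
    domain_cluster = np.bincount(gmm.predict(D)).argmax()

    keep = gmm.predict(X) == domain_cluster
    return [s for s, flag in zip(generic_sentences, keep) if flag]
```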
- A Simple Baseline to Semi-Supervised Domain Adaptation for Machine Translation [73.3550140511458]
State-of-the-art neural machine translation (NMT) systems are data-hungry and perform poorly on new domains with no supervised data.
We propose a simple but effective approach to the semi-supervised domain adaptation scenario of NMT.
This approach iteratively trains a Transformer-based NMT model via three training objectives: language modeling, back-translation, and supervised translation.
arXiv Detail & Related papers (2020-01-22T16:42:06Z)
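The three-objective schedule mentioned in the last entry (language modeling, back-translation, and supervised translation, applied iteratively) is easiest to see as a loop. Every training call below is a placeholder for a real NMT training step, and the data arguments are assumptions about what each objective consumes.

```python
# Hypothetical sketch of the iterative three-objective schedule described in
# the entry above: language modeling, back-translation, and supervised
# translation. Every training call is a placeholder for a real NMT training
# step, and the data arguments are assumptions about what each objective uses.

def adapt_nmt(model, mono_tgt, parallel, rounds=3):
    """Iteratively adapt an NMT model to a new domain."""
    for _ in range(rounds):
        # 1) Language modeling on in-domain monolingual (target-side) text.
        model.train_lm(mono_tgt)

        # 2) Back-translation: translate monolingual target text into the
        #    source language and train on the synthetic (source, target) pairs.
        synthetic = [(model.translate_reverse(t), t) for t in mono_tgt]
        model.train_translation(synthetic)

        # 3) Supervised translation on whatever parallel data is available.
        model.train_translation(parallel)
    return model
```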
This list is automatically generated from the titles and abstracts of the papers on this site.