Machine Translation Customization via Automatic Training Data Selection from the Web
- URL: http://arxiv.org/abs/2102.10243v1
- Date: Sat, 20 Feb 2021 03:29:41 GMT
- Title: Machine Translation Customization via Automatic Training Data Selection from the Web
- Authors: Thuy Vu and Alessandro Moschitti
- Abstract summary: We describe an approach for customizing machine translation systems on specific domains.
We select data similar to the target customer data to train neural translation models.
Finally, we train MT models on our automatically selected data, obtaining a system specialized to the target domain.
- Score: 97.98885151955467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine translation (MT) systems, especially when designed for an industrial
setting, are trained with general parallel data derived from the Web. Thus,
their style is typically driven by word/structure distribution coming from the
average of many domains. In contrast, MT customers want translations to be
specialized to their domain, for which they are typically able to provide text
samples. We describe an approach for customizing MT systems on specific domains
by selecting data similar to the target customer data to train neural
translation models. We build document classifiers using monolingual target
data, e.g., provided by the customers, to select parallel training data from
Web-crawled data. Finally, we train MT models on our automatically selected
data, obtaining a system specialized to the target domain. We tested our
approach on the benchmark from the WMT-18 Translation Task for the News
domain, enabling comparisons with state-of-the-art MT systems. The results
show that our models
outperform the top systems while using less data and smaller models.
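A minimal sketch of the selection recipe described in the abstract: train a binary domain classifier on the customer's monolingual documents versus generic Web documents, then keep only the parallel pairs the classifier scores as in-domain. The TF-IDF features, logistic-regression classifier, probability threshold, and the choice to score the target side of each pair are illustrative assumptions, not the exact pipeline from the paper.

```python
# Hypothetical sketch of classifier-based selection of parallel training data.
# The TF-IDF + logistic-regression classifier, the 0.5 threshold, and scoring
# the target side of each pair are illustrative assumptions, not the exact
# pipeline used in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_domain_classifier(customer_docs, generic_web_docs):
    """Binary classifier: customer (in-domain) documents vs. generic Web documents."""
    texts = list(customer_docs) + list(generic_web_docs)
    labels = [1] * len(customer_docs) + [0] * len(generic_web_docs)
    clf = make_pipeline(TfidfVectorizer(max_features=50_000),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf


def select_parallel_data(clf, parallel_corpus, threshold=0.5):
    """Keep (source, target) pairs whose target side looks in-domain."""
    targets = [tgt for _, tgt in parallel_corpus]
    probs = clf.predict_proba(targets)[:, 1]
    return [pair for pair, p in zip(parallel_corpus, probs) if p >= threshold]
```

The selected pairs can then be used to train or fine-tune the domain-specialized MT model described above.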
Related papers
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, consists of adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- Segment-Based Interactive Machine Translation for Pre-trained Models [2.0871483263418806]
We explore the use of pre-trained large language models (LLMs) in interactive machine translation environments.
The system generates perfect translations interactively using the feedback provided by the user at each iteration.
We compare the performance of mBART, mT5 and a state-of-the-art (SoTA) machine translation model on a benchmark dataset regarding user effort.
arXiv Detail & Related papers (2024-07-09T16:04:21Z)
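For the segment-based interactive setup summarized in the entry above, a rough sketch of the interaction loop is given below; the model interface, the feedback object, and the one-segment-per-iteration protocol are simplified placeholders, not the API of the systems evaluated in the paper (which build on pre-trained models such as mBART and mT5).

```python
# Hypothetical sketch of a segment-based interactive translation loop.
# The model interface, the feedback object, and the one-segment-per-iteration
# protocol are simplified placeholders, not the API used in the paper.

def interactive_translate(model, source, get_user_feedback):
    """Iteratively refine a translation using user-validated segments."""
    validated = []                      # segments the user has already approved
    while True:
        # Placeholder call: generate a hypothesis consistent with the
        # segments validated so far.
        hypothesis = model.complete(source, validated)
        feedback = get_user_feedback(hypothesis)
        if feedback.accepted:
            return hypothesis
        # The user validates or corrects one segment per iteration; the model
        # must keep it verbatim in the next hypothesis.
        validated.append(feedback.corrected_segment)
```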
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Alibaba-Translate China's Submission for WMT 2022 Metrics Shared Task [61.34108034582074]
We build our system based on the core idea of UNITE (Unified Translation Evaluation).
During the model pre-training phase, we first use pseudo-labeled data examples to continue pre-training UNITE.
During the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years' WMT competitions.
arXiv Detail & Related papers (2022-10-18T08:51:25Z)
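The two-phase recipe in the metrics-task entry above (continued pre-training on pseudo-labeled examples, then fine-tuning on DA and MQM annotations) can be outlined roughly as follows; the model, data loaders, MSE objective, and epoch counts are assumptions for illustration, not details from the submission.

```python
# Hypothetical two-phase schedule in the spirit of the entry above: continued
# pre-training on pseudo-labeled examples, then fine-tuning on human DA/MQM
# scores. The model, data loaders, MSE objective, and epoch counts are
# placeholders, not details from the submission.
import torch


def run_phase(model, optimizer, loader, epochs):
    """One training phase: regress the predicted quality onto the given scores."""
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for batch in loader:            # batch: dict with token ids and a score
            optimizer.zero_grad()
            pred = model(batch["input_ids"], batch["attention_mask"])
            loss = loss_fn(pred.squeeze(-1), batch["score"])
            loss.backward()
            optimizer.step()


# Phase 1: pseudo-labeled examples scored by existing metrics/models.
# run_phase(model, optimizer, pseudo_labeled_loader, epochs=1)
# Phase 2: Direct Assessment (DA) and MQM annotations from past WMT editions.
# run_phase(model, optimizer, da_mqm_loader, epochs=3)
```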
- Data Selection Curriculum for Neural Machine Translation [31.55953464971441]
We introduce a two-stage curriculum training framework for NMT models.
We fine-tune a base NMT model on subsets of data, selected by both deterministic scoring using pre-trained methods and online scoring.
Our curriculum strategies consistently deliver better quality (up to +2.2 BLEU improvement) and faster convergence.
arXiv Detail & Related papers (2022-03-25T19:08:30Z)
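A rough sketch of a two-stage data-selection curriculum in the spirit of the entry above: an offline stage that keeps the top examples under a fixed pre-trained scorer, followed by phases that re-rank the data with the emerging NMT model. The scoring functions, keep ratios, and the `model.fit` call are placeholders rather than the paper's implementation.

```python
# Hypothetical sketch of a two-stage data-selection curriculum for NMT.
# "offline_score" stands for a deterministic score from a pre-trained model,
# "online_score" for a score computed with the partially trained NMT model;
# the keep ratios and the model.fit call are placeholders.

def offline_selection(corpus, offline_score, keep_ratio=0.5):
    """Stage 1: keep the best examples according to a fixed, pre-computed scorer."""
    ranked = sorted(corpus, key=offline_score, reverse=True)
    return ranked[: int(len(ranked) * keep_ratio)]


def online_curriculum(model, corpus, online_score, phases=3):
    """Stage 2: periodically re-rank with the emerging model and fine-tune."""
    data = list(corpus)
    for _ in range(phases):
        data.sort(key=lambda ex: online_score(model, ex), reverse=True)
        subset = data[: max(1, len(data) // 2)]   # train on the current best half
        model.fit(subset)                          # placeholder training call
    return model
```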
- Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts [0.0]
We propose a method for selecting in-domain data from generic-domain (parallel text) corpora for the task of machine translation.
The proposed method ranks sentences in parallel general-domain data according to their cosine similarity with a monolingual domain-specific data set.
We then select the top K sentences with the highest similarity score to train a new machine translation system tuned to the specific in-domain data.
arXiv Detail & Related papers (2021-12-11T23:29:26Z)
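The cosine-similarity ranking described in the entry above lends itself to a compact sketch: embed the monolingual in-domain set, build a normalized centroid, score one side of each generic parallel pair against it, and keep the top K. The sentence-transformers encoder, the decision to score the source side, and the value of K are illustrative assumptions.

```python
# Hypothetical sketch of cosine-similarity-based sentence selection. The
# sentence-transformers encoder, the decision to score the source side, and
# the value of K are illustrative choices, not the setup from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer


def select_top_k(general_pairs, in_domain_sentences, k=100_000):
    """Rank generic parallel pairs by similarity to an in-domain centroid."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    domain_vecs = encoder.encode(in_domain_sentences, normalize_embeddings=True)
    centroid = domain_vecs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)

    sources = [src for src, _ in general_pairs]
    src_vecs = encoder.encode(sources, normalize_embeddings=True)
    scores = src_vecs @ centroid            # cosine similarity of unit vectors
    order = np.argsort(-scores)
    return [general_pairs[i] for i in order[:k]]
```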
- Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
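A hedged sketch of domain-cluster-based selection as summarized above: cluster pooled sentence embeddings of the generic corpus without supervision, find which cluster the known in-domain sample falls into, and keep the generic sentences assigned to that cluster. The `embed` function, the Gaussian mixture, and the number of clusters are stand-ins, not the paper's exact configuration.

```python
# Hypothetical sketch of domain-cluster-based data selection. The `embed`
# function (any sentence encoder), the Gaussian mixture, and the number of
# clusters are stand-ins rather than the paper's exact configuration.
import numpy as np
from sklearn.mixture import GaussianMixture


def cluster_and_select(embed, generic_sentences, domain_sentences, n_clusters=5):
    """Cluster generic data without supervision; keep the in-domain cluster."""
    X = np.stack([embed(s) for s in generic_sentences])
    gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(X)

    # Find the cluster that the known in-domain sample mostly maps to.
    D = np.stack([embed(s) for s in domain_sentences])
    domain_cluster = np.bincount(gmm.predict(D)).argmax()

    keep = gmm.predict(X) == domain_cluster
    return [s for s, flag in zip(generic_sentences, keep) if flag]
```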
- A Simple Baseline to Semi-Supervised Domain Adaptation for Machine Translation [73.3550140511458]
State-of-the-art neural machine translation (NMT) systems are data-hungry and perform poorly on new domains with no supervised data.
We propose a simple but effective approach to the semi-supervised domain adaptation scenario of NMT.
This approach iteratively trains a Transformer-based NMT model via three training objectives: language modeling, back-translation, and supervised translation.
arXiv Detail & Related papers (2020-01-22T16:42:06Z)
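The three-objective schedule mentioned in the last entry (language modeling, back-translation, and supervised translation, applied iteratively) is easiest to see as a loop. Every training call below is a placeholder for a real NMT training step, and the data arguments are assumptions about what each objective consumes.

```python
# Hypothetical sketch of the iterative three-objective schedule described in
# the entry above: language modeling, back-translation, and supervised
# translation. Every training call is a placeholder for a real NMT training
# step, and the data arguments are assumptions about what each objective uses.

def adapt_nmt(model, mono_tgt, parallel, rounds=3):
    """Iteratively adapt an NMT model to a new domain."""
    for _ in range(rounds):
        # 1) Language modeling on in-domain monolingual (target-side) text.
        model.train_lm(mono_tgt)

        # 2) Back-translation: translate monolingual target text into the
        #    source language and train on the synthetic (source, target) pairs.
        synthetic = [(model.translate_reverse(t), t) for t in mono_tgt]
        model.train_translation(synthetic)

        # 3) Supervised translation on whatever parallel data is available.
        model.train_translation(parallel)
    return model
```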
This list is automatically generated from the titles and abstracts of the papers on this site.