Synthesizing Monolingual Data for Neural Machine Translation
- URL: http://arxiv.org/abs/2101.12462v1
- Date: Fri, 29 Jan 2021 08:17:40 GMT
- Title: Synthesizing Monolingual Data for Neural Machine Translation
- Authors: Benjamin Marie, Atsushi Fujita
- Abstract summary: In neural machine translation (NMT), monolingual data in the target language are usually exploited to synthesize additional training parallel data.
Large monolingual data in the target domains or languages are not always available to generate large synthetic parallel data.
We propose a new method to generate large synthetic parallel data leveraging very small monolingual data in a specific domain.
- Score: 22.031658738184166
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In neural machine translation (NMT), monolingual data in the target language
are usually exploited through a method called "back-translation" to
synthesize additional parallel training data. Such synthetic data have been
shown to help train better NMT models, especially for low-resource language pairs
and domains. Nonetheless, large monolingual data in the target domains or
languages are not always available to generate large synthetic parallel data.
In this work, we propose a new method to generate large synthetic parallel data
leveraging very small monolingual data in a specific domain. We fine-tune a
pre-trained GPT-2 model on such small in-domain monolingual data and use the
resulting model to generate a large amount of synthetic in-domain monolingual
data. Then, we perform back-translation, or forward translation, to generate
synthetic in-domain parallel data. Our preliminary experiments on three
language pairs and five domains show that our method is effective in
generating fully synthetic but useful in-domain parallel data, improving NMT
in all configurations. We also show promising results in extreme adaptation for
personalized NMT.
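A minimal sketch of the pipeline the abstract describes, assuming the Hugging Face transformers library; the file path, training settings, and the Marian model used for back-translation are illustrative assumptions (English is taken here as the target language and German as the source), not the paper's exact setup:

```python
# Sketch of the pipeline, assuming the Hugging Face transformers library.
# Assumptions (not the paper's exact setup): English is the target
# language, German the source, and hyperparameters are placeholders.
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, TextDataset, Trainer,
                          TrainingArguments, pipeline)

# 1) Fine-tune a pre-trained GPT-2 on the small in-domain monolingual data.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
dataset = TextDataset(tokenizer=tokenizer,
                      file_path="in_domain.en.txt",  # small in-domain corpus
                      block_size=128)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-indomain", num_train_epochs=3),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    train_dataset=dataset,
)
trainer.train()

# 2) Sample a large amount of synthetic in-domain monolingual text.
generate = pipeline("text-generation", model=model, tokenizer=tokenizer)
targets = [g["generated_text"]
           for g in generate(tokenizer.bos_token, max_length=64,
                             do_sample=True, num_return_sequences=100)]

# 3) Back-translate into the source language to obtain fully synthetic
#    in-domain parallel data, e.g., for training a de->en in-domain model.
backward = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
sources = [t["translation_text"] for t in backward(targets)]
synthetic_parallel = list(zip(sources, targets))
```

For the forward-translation variant mentioned in the abstract, the generated text would instead play the source role and be translated with a source-to-target model.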
Related papers
- A Morphologically-Aware Dictionary-based Data Augmentation Technique for Machine Translation of Under-Represented Languages [31.18983138590214]
We propose strategies to synthesize parallel data relying on morpho-syntactic information and using bilingual lexicons.
Our methodology adheres to a realistic scenario backed by small parallel seed data.
It is linguistically informed, as it aims to create augmented data that is more likely to be grammatically correct.
arXiv Detail & Related papers (2024-02-02T22:25:44Z)
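A toy sketch of the dictionary-based substitution idea in the entry above; the lexicon, sentence pair, and substitution rule are invented for illustration, and the paper's actual method adds the morpho-syntactic constraints that make the augmented data more likely to be grammatical:

```python
# Toy sketch of dictionary-based parallel-data augmentation.
# The lexicon and sentence pair below are invented for illustration.
import random

bilingual_lexicon = {"dog": "chien", "cat": "chat", "bird": "oiseau"}

def augment(src_tokens, tgt_tokens, lexicon, n_samples=3):
    """Create new sentence pairs by substituting aligned lexicon entries."""
    pairs = []
    for _ in range(n_samples):
        new_src, new_tgt = list(src_tokens), list(tgt_tokens)
        for i, tok in enumerate(src_tokens):
            if tok in lexicon and lexicon[tok] in tgt_tokens:
                # Swap the word on both sides for another lexicon entry.
                repl = random.choice(list(lexicon))
                j = tgt_tokens.index(lexicon[tok])
                new_src[i], new_tgt[j] = repl, lexicon[repl]
        pairs.append((new_src, new_tgt))
    return pairs

seed_src = "the dog sleeps".split()
seed_tgt = "le chien dort".split()
print(augment(seed_src, seed_tgt, bilingual_lexicon))
```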
- Ngambay-French Neural Machine Translation (sba-Fr) [16.55378462843573]
In Africa, and the world at large, there is an increasing focus on developing Neural Machine Translation (NMT) systems to overcome language barriers.
In this project, we created the first sba-Fr dataset, which is a corpus of Ngambay-to-French translations.
Our experiments show that the M2M100 model outperforms other models, achieving high BLEU scores on both the original and the original+synthetic data.
arXiv Detail & Related papers (2023-08-25T17:13:20Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Synthetic Pre-Training Tasks for Neural Machine Translation [16.6378815054841]
Our goal is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources.
We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge.
Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data.
arXiv Detail & Related papers (2022-12-19T21:34:00Z)
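A minimal sketch of what obfuscated, purely synthetic parallel data can look like: every token is consistently remapped to a meaningless placeholder, so only structural correspondences survive. The corpus and mapping scheme are illustrative assumptions, not the paper's exact tasks:

```python
# Sketch: obfuscate a parallel corpus by remapping each distinct token
# to a meaningless placeholder, preserving only structure.
def obfuscate_corpus(pairs):
    vocab = {}
    def remap(tokens):
        return [vocab.setdefault(t, f"tok{len(vocab)}") for t in tokens]
    return [(remap(s.split()), remap(t.split())) for s, t in pairs]

corpus = [("the cat sleeps", "le chat dort")]
print(obfuscate_corpus(corpus))
# [(['tok0', 'tok1', 'tok2'], ['tok3', 'tok4', 'tok5'])]
```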
- Better Datastore, Better Translation: Generating Datastores from Pre-Trained Models for Nearest Neural Machine Translation [48.58899349349702]
Nearest Neighbor Machine Translation ($k$NN-MT) is a simple and effective method of augmenting neural machine translation (NMT) with a token-level nearest-neighbor retrieval mechanism.
In this paper, we propose PRED, a framework that leverages Pre-trained models for Datastores in kNN-MT.
arXiv Detail & Related papers (2022-12-17T08:34:20Z)
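The token-level retrieval mechanism behind $k$NN-MT can be sketched as follows, assuming a datastore of (decoder hidden state, next token) pairs; the toy sizes, distance temperature, and interpolation weight are illustrative, not PRED's actual configuration:

```python
# Sketch of kNN-MT prediction: interpolate the NMT distribution with a
# distribution derived from nearest-neighbor retrieval over a datastore
# of (decoder hidden state, next token) pairs. Sizes are toy values.
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=4, temp=10.0):
    """Softmax over negative distances to the k nearest datastore keys."""
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temp)
    weights /= weights.sum()
    p = np.zeros(vocab_size)
    for idx, w in zip(nearest, weights):
        p[values[idx]] += w  # mass goes to each neighbor's stored token
    return p

def knn_mt_predict(p_nmt, query, keys, values, lam=0.5):
    """p = lam * p_kNN + (1 - lam) * p_NMT (the kNN-MT interpolation)."""
    p_knn = knn_distribution(query, keys, values, vocab_size=len(p_nmt))
    return lam * p_knn + (1 - lam) * p_nmt

rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 16))        # stored decoder states
values = rng.integers(0, 32, size=100)   # stored next tokens
p_nmt = np.full(32, 1 / 32)              # dummy NMT distribution
print(knn_mt_predict(p_nmt, rng.normal(size=16), keys, values).sum())  # ~1.0
```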
- Robust Domain Adaptation for Pre-trained Multilingual Neural Machine Translation Models [0.0]
We propose a fine-tuning procedure for generic multilingual NMT (mNMT) models that combines embedding freezing and an adversarial loss.
Experiments demonstrate that the procedure improves performance on specialized data with minimal loss of initial performance on the generic domain for all language pairs.
arXiv Detail & Related papers (2022-10-26T18:47:45Z)
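The embedding-freezing half of the recipe might look like the following PyTorch sketch; the parameter-name filter is an assumption about how a typical implementation names its embedding weights, and the adversarial loss is omitted:

```python
# Sketch: freeze embedding parameters before fine-tuning a pre-trained
# multilingual NMT model. The "embed" name filter is an assumption about
# a typical implementation; adapt it to the actual parameter names.
import torch

def freeze_embeddings(model: torch.nn.Module):
    for name, param in model.named_parameters():
        if "embed" in name:
            param.requires_grad = False  # keep generic embeddings intact

def trainable_parameters(model: torch.nn.Module):
    return [p for p in model.parameters() if p.requires_grad]

# Usage: freeze first, then give the optimizer only trainable parameters.
# freeze_embeddings(mnmt_model)
# optimizer = torch.optim.Adam(trainable_parameters(mnmt_model), lr=1e-4)
```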
- Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation [61.27321597981737]
$k$NN-MT has shown promise in directly combining a pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor retrieval.
We propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval.
arXiv Detail & Related papers (2021-09-14T11:50:01Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
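A toy sketch contrasting the two kinds of pretraining inputs compared above: MLM-style masking versus corruptions that keep the input looking like a real, full sentence. The masking probability and swap count are illustrative choices:

```python
# Toy sketch of two pretraining input constructions: MLM-style masking
# versus a "full sentence" corruption that locally reorders words.
import random

def mask_input(tokens, mask_token="<mask>", p=0.35):
    """Replace roughly p of the tokens with a mask symbol."""
    return [mask_token if random.random() < p else t for t in tokens]

def reorder_input(tokens, n_swaps=2):
    """Swap a few adjacent words so the input still resembles a sentence."""
    tokens = list(tokens)
    for _ in range(n_swaps):
        i = random.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

sentence = "the quick brown fox jumps".split()
print(mask_input(sentence))     # input with masked positions
print(reorder_input(sentence))  # input that still looks like a sentence
```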
- AUGVIC: Exploiting BiText Vicinity for Low-Resource NMT [9.797319790710711]
AUGVIC is a novel data augmentation framework for low-resource NMT.
It exploits the vicinal samples of the given bitext without using any extra monolingual data explicitly.
We show that AUGVIC helps to attenuate the discrepancies between relevant and distant-domain monolingual data in traditional back-translation.
arXiv Detail & Related papers (2021-06-09T15:29:18Z)
- Meta Back-translation [111.87397401837286]
We propose a novel method to generate pseudo-parallel data from a pre-trained back-translation model.
Our method is a meta-learning algorithm which adapts a pre-trained back-translation model so that the pseudo-parallel data it generates would train a forward-translation model to do well on a validation set.
arXiv Detail & Related papers (2021-02-15T20:58:32Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions in a diverse setting, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.