Integrating Unsupervised Data Generation into Self-Supervised Neural
Machine Translation for Low-Resource Languages
- URL: http://arxiv.org/abs/2107.08772v1
- Date: Mon, 19 Jul 2021 11:56:03 GMT
- Title: Integrating Unsupervised Data Generation into Self-Supervised Neural
Machine Translation for Low-Resource Languages
- Authors: Dana Ruiter, Dietrich Klakow, Josef van Genabith, Cristina España-Bonet
- Abstract summary: Unsupervised machine translation (UMT) exploits large amounts of monolingual data.
Self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them.
We show that incorporating UMT techniques into SSNMT significantly outperforms both SSNMT and UMT on all tested language pairs.
- Score: 25.33888871213517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For most language combinations, parallel data is either scarce or simply
unavailable. To address this, unsupervised machine translation (UMT) exploits
large amounts of monolingual data by using synthetic data generation techniques
such as back-translation and noising, while self-supervised NMT (SSNMT)
identifies parallel sentences in smaller comparable data and trains on them. To
date, the inclusion of UMT data generation techniques in SSNMT has not been
investigated. We show that incorporating UMT techniques into SSNMT significantly
outperforms both SSNMT and UMT on all tested language pairs, with improvements of up
to +4.3, +50.8 and +51.5 BLEU over SSNMT, statistical UMT and hybrid UMT,
respectively, on Afrikaans to English. We further show that the combination of
multilingual denoising autoencoding, SSNMT with back-translation, and bilingual
finetuning enables us to learn machine translation even for distant language
pairs for which only small amounts of monolingual data are available, e.g.
yielding BLEU scores of 11.6 (English to Swahili).
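The abstract combines two data generation ideas: SSNMT-style extraction of parallel sentence pairs from comparable corpora via similarity scoring, and UMT-style back-translation of monolingual data. The sketch below is a minimal illustration of these two steps, not the authors' implementation: the `embed` function, the `threshold` value, and the `translate_tgt2src` callback are hypothetical placeholders standing in for the NMT model's internal representations, its filtering criteria, and the target-to-source model described in the paper.

```python
import numpy as np

def embed(sentence: str) -> np.ndarray:
    """Hypothetical placeholder: in SSNMT, sentence representations come from
    the model's own encoder; here a pseudo-random vector derived from the
    sentence stands in."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.standard_normal(128)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two sentence vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def mine_parallel_pairs(src_sents, tgt_sents, threshold=0.5):
    """SSNMT-style pair extraction from comparable corpora: accept a pair if
    the two sentences are mutual nearest neighbours and their similarity
    exceeds a threshold (an assumed simplification of the paper's filtering)."""
    src_vecs = [embed(s) for s in src_sents]
    tgt_vecs = [embed(t) for t in tgt_sents]
    pairs = []
    for i, sv in enumerate(src_vecs):
        sims = [cosine(sv, tv) for tv in tgt_vecs]
        j = int(np.argmax(sims))
        # mutual-nearest-neighbour check in the reverse direction
        back = int(np.argmax([cosine(tgt_vecs[j], v) for v in src_vecs]))
        if back == i and sims[j] >= threshold:
            pairs.append((src_sents[i], tgt_sents[j]))
    return pairs

def back_translate(tgt_monolingual, translate_tgt2src):
    """UMT-style back-translation: create synthetic (source, target) pairs by
    translating monolingual target sentences with the current model."""
    return [(translate_tgt2src(t), t) for t in tgt_monolingual]

if __name__ == "__main__":
    mono_en = ["the cat sat on the mat", "rain is expected tomorrow"]
    mono_af = ["die kat sit op die mat", "more reën word verwag"]
    # With the toy embeddings above, a very low threshold is needed to see output.
    print(mine_parallel_pairs(mono_en, mono_af, threshold=-1.0))
    # Dummy target-to-source "translator" standing in for the NMT model.
    print(back_translate(mono_af, lambda t: "<synthetic source for> " + t))
```

In the paper's setup, mined and back-translated pairs are mixed into the same training data, together with a multilingual denoising autoencoding objective and bilingual finetuning; the mutual-nearest-neighbour-plus-threshold rule above is only one simple way to realize the pair selection step.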
Related papers
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- Machine Translation for Ge'ez Language [0.0]
Machine translation for low-resource languages such as Ge'ez faces challenges such as out-of-vocabulary words, domain mismatches, and lack of labeled training data.
We develop a multilingual neural machine translation (MNMT) model based on language relatedness.
We also experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches.
arXiv Detail & Related papers (2023-11-24T14:55:23Z)
- Improving Simultaneous Machine Translation with Monolingual Data [94.1085601198393]
Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model.
We propose to leverage monolingual data to improve SiMT, which trains a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD.
arXiv Detail & Related papers (2022-12-02T14:13:53Z)
- Towards Making the Most of Multilingual Pretraining for Zero-Shot Neural Machine Translation [74.158365847236]
SixT++ is a strong many-to-English NMT model that supports 100 source languages but is trained once with a parallel dataset from only six source languages.
It significantly outperforms CRISS and m2m-100, two strong multilingual NMT systems, with average gains of 7.2 and 5.0 BLEU, respectively.
arXiv Detail & Related papers (2021-10-16T10:59:39Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Synthesizing Monolingual Data for Neural Machine Translation [22.031658738184166]
In neural machine translation (NMT), monolingual data in the target language are usually exploited to synthesize additional training parallel data.
Large monolingual data in the target domains or languages are not always available to generate large synthetic parallel data.
We propose a new method to generate large synthetic parallel data leveraging very small monolingual data in a specific domain.
arXiv Detail & Related papers (2021-01-29T08:17:40Z)
- SJTU-NICT's Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task [111.91077204077817]
We participated in four translation directions of three language pairs: English-Chinese, English-Polish, and German-Upper Sorbian.
Based on different conditions of language pairs, we have experimented with diverse neural machine translation (NMT) techniques.
In our submissions, the primary systems won first place in the English-to-Chinese, Polish-to-English, and German-to-Upper-Sorbian translation directions.
arXiv Detail & Related papers (2020-10-11T00:40:05Z)
- Cross-lingual Supervision Improves Unsupervised Neural Machine Translation [97.84871088440102]
We introduce a multilingual unsupervised NMT framework to leverage weakly supervised signals from high-resource language pairs to zero-resource translation directions.
The method significantly improves translation quality by more than 3 BLEU on six benchmark unsupervised translation directions.
arXiv Detail & Related papers (2020-04-07T05:46:49Z)