PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for
Translation with Semi-Supervised Pseudo-Parallel Document Generation
- URL: http://arxiv.org/abs/2304.01282v2
- Date: Fri, 14 Apr 2023 17:54:58 GMT
- Title: PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for
Translation with Semi-Supervised Pseudo-Parallel Document Generation
- Authors: Alireza Salemi, Amirhossein Abaskohi, Sara Tavakoli, Yadollah
Yaghoobzadeh, Azadeh Shakery
- Abstract summary: This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training.
Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks.
- Score: 5.004814662623874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual pre-training significantly improves many multilingual NLP tasks,
including machine translation. Most existing methods are based on some variants
of masked language modeling and text-denoising objectives on monolingual data.
Pre-training on monolingual data alone, however, ignores the parallel data
available for many language pairs. Other works do integrate human-generated
parallel translation data into their pre-training; such data is certainly
helpful, but it remains limited even in
high-resource language pairs. This paper introduces a novel semi-supervised
method, SPDG, that generates high-quality pseudo-parallel data for multilingual
pre-training. First, a denoising model is pre-trained on monolingual data to
reorder, add, remove, and substitute words, enhancing the pre-training
documents' quality. Then, we generate different pseudo-translations for each
pre-training document using dictionaries for word-by-word translation and
applying the pre-trained denoising model. The resulting pseudo-parallel data is
then used to pre-train our multilingual sequence-to-sequence model, PEACH. Our
experiments show that PEACH outperforms existing approaches used in training
mT5 and mBART on various translation tasks, including supervised, zero- and
few-shot scenarios. Moreover, PEACH's ability to transfer knowledge between
similar languages makes it particularly useful for low-resource languages. Our
results demonstrate that with high-quality dictionaries for generating accurate
pseudo-parallel data, PEACH can be valuable for low-resource languages.
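
To make the abstract's pipeline concrete, the sketch below illustrates the two generation stages in Python: corrupting monolingual documents to train the denoiser, and producing a pseudo-translation via word-by-word dictionary lookup followed by denoising. The function names, noise probabilities, and denoiser interface are illustrative assumptions, not the paper's actual implementation.

```python
import random
from typing import Callable, Dict, List

def noise_document(tokens: List[str], vocab: List[str],
                   p_drop: float = 0.1, p_add: float = 0.1,
                   p_sub: float = 0.1, shuffle_window: int = 3) -> List[str]:
    """Corrupt a monolingual document by removing, substituting, adding, and
    locally reordering words; the denoiser is trained to recover the original.
    All probabilities and the window size are assumed values."""
    out: List[str] = []
    for tok in tokens:
        r = random.random()
        if r < p_drop:
            continue                          # remove the word
        elif r < p_drop + p_sub:
            out.append(random.choice(vocab))  # substitute a random word
        else:
            out.append(tok)                   # keep the word
        if random.random() < p_add:
            out.append(random.choice(vocab))  # insert a spurious word
    for i in range(0, len(out), shuffle_window):
        window = out[i:i + shuffle_window]
        random.shuffle(window)                # reorder within a small window
        out[i:i + shuffle_window] = window
    return out

def pseudo_translate(doc: List[str],
                     bilingual_dict: Dict[str, str],
                     denoiser: Callable[[str], str]) -> str:
    """Word-by-word dictionary translation followed by the pre-trained
    denoising model, which turns the rough gloss into a fluent
    pseudo-translation; out-of-dictionary tokens are kept unchanged."""
    gloss = [bilingual_dict.get(tok, tok) for tok in doc]
    return denoiser(" ".join(gloss))
```

The resulting (document, pseudo-translation) pairs would then serve as the pseudo-parallel examples used to pre-train the sequence-to-sequence model.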
Related papers
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives [13.581385765600265]
Pretrained language models (PLMs) display impressive performances and have captured the attention of the NLP community.
This paper proposes a comparison of multilingual pretraining objectives in a controlled methodological environment.
arXiv Detail & Related papers (2024-07-22T09:16:30Z)
- Language Agnostic Multilingual Information Retrieval with Contrastive Learning [59.26316111760971]
We present an effective method to train multilingual information retrieval systems.
We leverage parallel and non-parallel corpora to improve the pretrained multilingual language models.
Our model can work well even with a small number of parallel sentences.
arXiv Detail & Related papers (2022-10-12T23:53:50Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restoration task to bridge the gap between the pretraining and finetuning stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
- Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
- Translate & Fill: Improving Zero-Shot Multilingual Semantic Parsing with Synthetic Data [2.225882303328135]
We propose a novel Translate-and-Fill (TaF) method to produce silver training data for a multilingual semantic parsing task.
Experimental results on three multilingual semantic parsing datasets show that data augmentation with TaF reaches accuracies competitive with similar systems.
arXiv Detail & Related papers (2021-09-09T14:51:11Z)
- PARADISE: Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining [19.785343302320918]
We present PARADISE (PARAllel & Denoising Integration in SEquence-to-sequence models).
It extends the conventional denoising objective used to train these models by (i) replacing words in the noised sequence according to a multilingual dictionary, and (ii) predicting the reference translation according to a parallel corpus.
Our experiments show that integrating parallel data into pretraining yields average improvements of 2.0 BLEU points on machine translation and 6.7 accuracy points on cross-lingual natural language inference.
arXiv Detail & Related papers (2021-08-04T07:32:56Z)
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- Bilingual Alignment Pre-training for Zero-shot Cross-lingual Transfer [33.680292990007366]
In this paper, we aim to improve the zero-shot cross-lingual transfer performance by aligning the embeddings better.
We propose a pre-training task named Alignment Language Model (AlignLM) which uses the statistical alignment information as the prior knowledge to guide bilingual word prediction.
The results show AlignLM can improve the zero-shot performance significantly on MLQA and XNLI datasets.
arXiv Detail & Related papers (2021-06-03T10:18:43Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.