JASS: Japanese-specific Sequence to Sequence Pre-training for Neural
Machine Translation
- URL: http://arxiv.org/abs/2005.03361v1
- Date: Thu, 7 May 2020 09:53:25 GMT
- Title: JASS: Japanese-specific Sequence to Sequence Pre-training for Neural
Machine Translation
- Authors: Zhuoyuan Mao, Fabien Cromieres, Raj Dabre, Haiyue Song, Sadao
Kurohashi
- Abstract summary: JASS is joint BMASS (Bunsetsu MASS) and BRSS (Bunsetsu Reordering Sequence to Sequence) pre-training.
We show for the first time that joint MASS and JASS pre-training gives results that significantly surpass the individual methods.
We will release our code, pre-trained models, and bunsetsu-annotated data as resources for researchers to use in their own NLP tasks.
- Score: 27.364702152624034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural machine translation (NMT) needs large parallel corpora for
state-of-the-art translation quality. Low-resource NMT is typically addressed
by transfer learning which leverages large monolingual or parallel corpora for
pre-training. Monolingual pre-training approaches such as MASS (MAsked Sequence
to Sequence) are extremely effective in boosting NMT quality for languages with
small parallel corpora. However, they do not account for linguistic information
obtained using syntactic analyzers, which is known to be invaluable for several
Natural Language Processing (NLP) tasks. To this end, we propose JASS,
Japanese-specific Sequence to Sequence, as a novel pre-training alternative to
MASS for NMT involving Japanese as the source or target language. JASS is joint
BMASS (Bunsetsu MASS) and BRSS (Bunsetsu Reordering Sequence to Sequence)
pre-training which focuses on Japanese linguistic units called bunsetsus. In
our experiments on ASPEC Japanese--English and News Commentary
Japanese--Russian translation we show that JASS can give results that are
competitive with, if not better than, those given by MASS. Furthermore, we show
for the first time that joint MASS and JASS pre-training gives results that
significantly surpass the individual methods, indicating their complementary
nature. We will release our code, pre-trained models, and bunsetsu-annotated
data as resources for researchers to use in their own NLP tasks.
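The abstract does not give the exact masking ratio or reordering scheme, so the following is only a minimal sketch, assuming a 50% contiguous mask and a full random shuffle, of how BMASS- and BRSS-style pre-training pairs could be built from a bunsetsu-segmented sentence; it is illustrative and not the authors' released code.

```python
# Minimal sketch (not the authors' released implementation) of building
# BMASS- and BRSS-style pre-training pairs from a bunsetsu-segmented
# sentence. The [MASK] token, 50% masking ratio, and full random shuffle
# are assumptions made purely for illustration.
import random

MASK = "[MASK]"

def bmass_pair(bunsetsu, mask_ratio=0.5, rng=random):
    """MASS-style span masking aligned to bunsetsu boundaries: the encoder
    sees a contiguous run of bunsetsu replaced by [MASK]; the decoder is
    trained to reconstruct the masked bunsetsu."""
    n_mask = max(1, int(len(bunsetsu) * mask_ratio))
    start = rng.randrange(0, len(bunsetsu) - n_mask + 1)
    source = [MASK if start <= i < start + n_mask else b
              for i, b in enumerate(bunsetsu)]
    target = bunsetsu[start:start + n_mask]
    return " ".join(source), " ".join(target)

def brss_pair(bunsetsu, rng=random):
    """Reordering objective: the encoder sees the bunsetsu in shuffled
    order; the decoder is trained to emit them in the original order."""
    shuffled = list(bunsetsu)
    rng.shuffle(shuffled)
    return " ".join(shuffled), " ".join(bunsetsu)

if __name__ == "__main__":
    # Toy bunsetsu segmentation; in practice a Japanese analyzer such as
    # KNP or CaboCha would produce these units from raw text.
    bunsetsu = ["私は", "昨日", "新しい", "本を", "買った"]
    print(bmass_pair(bunsetsu))
    print(brss_pair(bunsetsu))
```

Because these objectives only change how source/target pairs are synthesized from monolingual text, they can be combined with vanilla MASS pre-training before fine-tuning on the parallel corpus, which is the joint setting the abstract reports as most effective.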
Related papers
- PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for
Translation with Semi-Supervised Pseudo-Parallel Document Generation [5.004814662623874]
This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training.
Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks.
arXiv Detail & Related papers (2023-04-03T18:19:26Z)
- $\varepsilon$ KÚ <MASK>: Integrating Yorùbá cultural greetings
into machine translation [14.469047518226708]
We present IkiniYorùbá, a Yorùbá-English translation dataset containing some Yorùbá greetings, and sample use cases.
We show that different multilingual NMT systems, including Google and NLLB, struggle to accurately translate Yorùbá greetings into English.
In addition, we trained a Yorùbá-English model by fine-tuning an existing NMT model on the training split of IkiniYorùbá, and this achieved better performance than the pre-trained multilingual NMT models.
arXiv Detail & Related papers (2023-03-31T11:16:20Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for
Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- Understanding and Improving Sequence-to-Sequence Pretraining for Neural
Machine Translation [48.50842995206353]
We study the impact of the jointly pretrained decoder, which is the main difference between Seq2Seq pretraining and previous encoder-based pretraining approaches for NMT.
We propose simple and effective strategies, named in-domain pretraining and input adaptation, to remedy the domain and objective discrepancies.
arXiv Detail & Related papers (2022-03-16T07:36:28Z)
- Linguistically-driven Multi-task Pre-training for Low-resource Neural
Machine Translation [31.225252462128626]
We propose Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English.
JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is proposed based on phrase structure masking and reordering tasks.
arXiv Detail & Related papers (2022-01-20T09:10:08Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context (a toy sketch of such full-sentence noising appears after this list).
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- SJTU-NICT's Supervised and Unsupervised Neural Machine Translation
Systems for the WMT20 News Translation Task [111.91077204077817]
We participated in four translation directions of three language pairs: English-Chinese, English-Polish, and German-Upper Sorbian.
Based on different conditions of language pairs, we have experimented with diverse neural machine translation (NMT) techniques.
In our submissions, the primary systems won first place in the English-to-Chinese, Polish-to-English, and German-to-Upper-Sorbian translation directions.
arXiv Detail & Related papers (2020-10-11T00:40:05Z)
- Reusing a Pretrained Language Model on Languages with Limited Corpora
for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq).
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
- Pre-training via Leveraging Assisting Languages and Data Selection for
Neural Machine Translation [49.51278300110449]
We propose to exploit monolingual corpora of other languages to complement the scarcity of monolingual corpora for the languages of interest.
A case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora.
arXiv Detail & Related papers (2020-01-23T02:47:39Z)
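For contrast with the bunsetsu-aware objectives sketched above, here is an equally minimal sketch of the kind of full-sentence noising objective described in the "Exploring Unsupervised Pretraining Objectives" entry, where the encoder input is a reordered and partially replaced sentence rather than a masked one; the window size, replacement rate, and random (rather than context-based) substitutions are assumptions for illustration only.

```python
# Toy sketch of a full-sentence noising objective: locally reorder tokens
# and randomly replace a fraction of them, then train the decoder to
# reconstruct the original sentence. Real systems replace words based on
# context (e.g. with a language model); random substitution here is a
# deliberate simplification.
import random

def noisy_pair(tokens, window=3, replace_prob=0.15, vocab=None, rng=random):
    """Return (corrupted sentence, original sentence) for seq2seq training."""
    vocab = vocab or tokens
    # Local reordering: sort by position plus a bounded random offset, so
    # each token moves at most a few positions from where it started.
    keys = [i + rng.uniform(0, window) for i in range(len(tokens))]
    reordered = [tok for _, tok in sorted(zip(keys, tokens))]
    # Word replacement: swap some tokens for others drawn from a vocabulary.
    corrupted = [rng.choice(vocab) if rng.random() < replace_prob else tok
                 for tok in reordered]
    return " ".join(corrupted), " ".join(tokens)

if __name__ == "__main__":
    sentence = "unsupervised pretraining has achieved strong results in nmt".split()
    print(noisy_pair(sentence))
```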
This list is automatically generated from the titles and abstracts of the papers on this site.