Transfer learning and subword sampling for asymmetric-resource
one-to-many neural translation
- URL: http://arxiv.org/abs/2004.04002v2
- Date: Wed, 9 Dec 2020 08:03:58 GMT
- Title: Transfer learning and subword sampling for asymmetric-resource
one-to-many neural translation
- Authors: Stig-Arne Grönroos and Sami Virpioja and Mikko Kurimo
- Abstract summary: Methods for improving neural machine translation for low-resource languages are reviewed.
Tests are carried out on three artificially restricted translation tasks and one real-world task.
Experiments show positive effects especially for scheduled multi-task learning, denoising autoencoder, and subword sampling.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There are several approaches for improving neural machine translation for
low-resource languages: Monolingual data can be exploited via pretraining or
data augmentation; Parallel corpora on related language pairs can be used via
parameter sharing or transfer learning in multilingual models; Subword
segmentation and regularization techniques can be applied to ensure high
coverage of the vocabulary. We review these approaches in the context of an
asymmetric-resource one-to-many translation task, in which the pair of target
languages are related, with one being a very low-resource and the other a
higher-resource language. We test various methods on three artificially
restricted translation tasks -- English to Estonian (low-resource) and Finnish
(high-resource), English to Slovak and Czech, English to Danish and Swedish --
and one real-world task, Norwegian to North Sámi and Finnish. The experiments
show positive effects especially for scheduled multi-task learning, denoising
autoencoder, and subword sampling.
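One common way to realize the subword sampling mentioned above is BPE-dropout (Provilkov et al.): during segmentation, each applicable merge is randomly skipped, so the same word is split differently across training epochs. Below is a minimal, self-contained sketch of that idea; the merge set, function name, and parameters are illustrative, not the paper's exact implementation (which samples segmentations from a Morfessor-style model).

```python
import random

def sample_segmentation(word, merges, dropout=0.1, rng=None):
    """Segment `word` greedily with a set of known merged units,
    skipping each applicable merge with probability `dropout`.
    Note: a skipped merge may still fire on a later pass, so this
    is an approximation of true BPE-dropout."""
    rng = rng or random.Random()
    pieces = list(word)
    while True:
        # adjacent pairs whose concatenation is a known unit,
        # each independently surviving the dropout coin flip
        candidates = [i for i in range(len(pieces) - 1)
                      if pieces[i] + pieces[i + 1] in merges
                      and rng.random() >= dropout]
        if not candidates:
            break
        i = candidates[0]
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]
    return pieces

# dropout=0.0 gives the deterministic segmentation;
# dropout=1.0 falls back to characters.
print(sample_segmentation("lower", {"lo", "low", "er"}, dropout=0.0))
```

With intermediate dropout values, each call may return a different segmentation, which acts as a regularizer for the translation model's subword vocabulary.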
Related papers
- Transfer to a Low-Resource Language via Close Relatives: The Case Study
on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Progressive Sentiment Analysis for Code-Switched Text Data [26.71396390928905]
We focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data.
We propose a framework that takes the distinction between resource-rich and low-resource language into account.
arXiv Detail & Related papers (2022-10-25T23:13:53Z)
- Low-resource Neural Machine Translation with Cross-modal Alignment [15.416659725808822]
We propose a cross-modal contrastive learning method to learn a shared space for all languages.
Experimental results and further analysis show that our method can effectively learn the cross-modal and cross-lingual alignment with a small amount of image-text pairs.
arXiv Detail & Related papers (2022-10-13T04:15:43Z)
- High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose the multilingual translation model with the high-resource language-specific training (HLT-MT) to alleviate the negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
- Refining Low-Resource Unsupervised Translation by Language Disentanglement of Multilingual Model [16.872474334479026]
We propose a simple refinement procedure to disentangle languages from a pre-trained multilingual UMT model.
Our method achieves the state of the art in the fully unsupervised translation tasks of English to Nepali, Sinhala, Gujarati, Latvian, Estonian and Kazakh.
arXiv Detail & Related papers (2022-05-31T05:14:50Z)
- Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data [40.11208706647032]
The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages.
In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data.
Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation.
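The denoising-autoencoding objective mentioned here (and in the main paper's experiments) trains a model to reconstruct a sentence from a corrupted copy. A standard corruption scheme, following Lample et al.'s unsupervised NMT recipe, combines word dropout with local shuffling. The sketch below is illustrative; the function and parameter names are not from NMT-Adapt itself.

```python
import random

def add_noise(tokens, drop_prob=0.1, max_shuffle_dist=3, rng=None):
    """Corrupt a token sequence for denoising-autoencoder training:
    drop each word with probability `drop_prob`, then locally shuffle
    survivors so no token moves more than ~`max_shuffle_dist` places."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept:
        kept = [rng.choice(tokens)]  # never emit an empty source side
    # local shuffle: sort by original index plus bounded random jitter
    keys = [i + rng.uniform(0, max_shuffle_dist) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]

noisy = add_noise("the cat sat on the mat".split())
```

The model is then trained to map `noisy` back to the original sentence, which teaches the decoder a language-model-like prior over the target language using monolingual data alone.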
arXiv Detail & Related papers (2021-05-31T16:01:18Z)
- Extremely low-resource machine translation for closely related languages [0.0]
This work focuses on closely related languages from the Uralic language family: Estonian and Finnish.
We find that multilingual learning and synthetic corpora increase the translation quality in every language pair.
We show that transfer learning and fine-tuning are very effective for doing low-resource machine translation and achieve the best results.
arXiv Detail & Related papers (2021-05-27T11:27:06Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages jointly.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.