Dict-NMT: Bilingual Dictionary based NMT for Extremely Low Resource Languages
- URL: http://arxiv.org/abs/2206.04439v1
- Date: Thu, 9 Jun 2022 12:03:29 GMT
- Authors: Nalin Kumar, Deepak Kumar, Subhankar Mishra
- Abstract summary: We present a detailed analysis of the effects of dictionary quality, training dataset size, language family, etc., on translation quality.
Results on multiple low-resource test languages show a clear advantage of our bilingual dictionary-based method over the baselines.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural Machine Translation (NMT) models have been effective on large
bilingual datasets. However, existing methods and techniques show that a
model's performance depends heavily on the number of examples in the training
data. For many languages, corpora of that size remain out of reach. Taking
inspiration from monolingual speakers who explore new languages using
bilingual dictionaries, we investigate the applicability of bilingual
dictionaries for languages with extremely small bilingual corpora, or none at
all. In this paper, we explore methods that combine bilingual dictionaries
with an NMT model to improve translations for extremely low-resource
languages. We extend this work to multilingual systems, which exhibit
zero-shot properties. We present a detailed analysis of how dictionary
quality, training dataset size, language family, and other factors affect
translation quality. Results on multiple low-resource test languages show a
clear advantage of our bilingual dictionary-based method over the baselines.
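The abstract does not spell out the mechanism, but a common way to exploit a bilingual dictionary when parallel text is scarce is to synthesize pseudo-parallel pairs by word-for-word lookup over monolingual sentences. The sketch below illustrates that general idea; the function names and toy dictionary are hypothetical, not the authors' pipeline:

```python
# Minimal sketch: synthesize pseudo-parallel pairs from monolingual text
# via word-for-word dictionary lookup. Illustrative only -- not the exact
# Dict-NMT pipeline.

def word_for_word_translate(sentence, dictionary):
    """Translate token by token; copy through words the dictionary lacks."""
    return " ".join(dictionary.get(tok, tok) for tok in sentence.split())

def build_pseudo_parallel(monolingual_sentences, dictionary):
    """Pair each monolingual sentence with its dictionary translation."""
    return [(src, word_for_word_translate(src, dictionary))
            for src in monolingual_sentences]

# Toy Hindi->English dictionary (hypothetical entries).
toy_dict = {"kitab": "book", "acchi": "good", "hai": "is"}
print(build_pseudo_parallel(["kitab acchi hai"], toy_dict))
# [('kitab acchi hai', 'book good is')]
```

Pairs produced this way are noisy (word order and morphology are wrong), so in practice they supplement rather than replace whatever genuine parallel data exists.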
Related papers
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Sinhala-English Parallel Word Dictionary Dataset
We introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) that support multilingual Natural Language Processing (NLP) tasks involving English and Sinhala.
arXiv Detail & Related papers (2023-08-04T10:21:35Z)
- Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation
This work explores a cheap and abundant resource to combat data scarcity: bilingual lexica.
We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text.
We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements; and (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones.
arXiv Detail & Related papers (2023-03-27T14:54:43Z)
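One of the augmentation families such work compares is "codeswitching": replacing a random subset of source tokens in existing training sentences with their lexicon translations. A minimal sketch, assuming a lexicon that maps a word to a list of candidate translations (the exact recipe in the paper may differ):

```python
import random

def codeswitch(src_tokens, lexicon, swap_prob=0.3, rng=random):
    """With probability swap_prob, replace a source token that appears in
    the lexicon with one of its listed translations."""
    return [rng.choice(lexicon[tok])
            if tok in lexicon and rng.random() < swap_prob else tok
            for tok in src_tokens]

# Hypothetical English->French lexicon with candidate translations.
lexicon = {"house": ["maison"], "red": ["rouge"]}
print(codeswitch("the red house".split(), lexicon, swap_prob=1.0))
# ['the', 'rouge', 'maison']
```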
- Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data
The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages.
Many low-resource languages are closely related to a high-resource language, sharing vocabulary and structure; in this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language using only monolingual data.
Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation.
arXiv Detail & Related papers (2021-05-31T16:01:18Z)
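Of the three objectives NMT-Adapt combines, back-translation is the simplest to sketch: a target-to-source model manufactures synthetic source sentences for monolingual target text, and the forward model trains on the resulting pairs. The `translate` interface below is hypothetical:

```python
# Schematic back-translation loop; the model interface is hypothetical,
# not the NMT-Adapt implementation.

def back_translate(target_monolingual, reverse_model):
    """Use a target->source model to manufacture synthetic (source, target)
    pairs; the forward model is then trained on these pairs."""
    return [(reverse_model.translate(tgt), tgt) for tgt in target_monolingual]

class IdentityModel:
    """Stand-in for a trained target->source model."""
    def translate(self, sentence):
        return sentence  # a real model would emit a source-language sentence

print(back_translate(["der rote Hund"], IdentityModel()))
# [('der rote Hund', 'der rote Hund')]
```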
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
It works well for similar languages with abundant corpora, but previous research has shown that it performs poorly for low-resource and distant language pairs because the representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
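Type-level cross-lingual subword embeddings are commonly obtained by mapping two monolingual embedding spaces together with an orthogonal (Procrustes) transform fit on a seed dictionary. The sketch below shows that standard construction; whether the paper uses exactly this mapping is not stated in the summary:

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F, where row i of X and Y
    holds the embeddings of the i-th seed-dictionary pair.
    Closed form: W = U @ Vt from the SVD of X.T @ Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Sanity check: recover a known rotation from paired points.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # ground-truth orthogonal map
W = procrustes_align(X, X @ Q)
print(np.allclose(W, Q, atol=1e-6))  # True
```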
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
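Adapting a pretrained model to an unseen script usually means adding vocabulary items and new embedding rows while reusing the rest of the network. A minimal PyTorch sketch of that step, with mean initialization of the new rows as one common heuristic (the paper's own initializers may differ):

```python
import torch

def extend_embeddings(old_emb: torch.nn.Embedding, num_new: int) -> torch.nn.Embedding:
    """Grow the vocabulary by num_new tokens; new rows start at the mean of
    the existing embeddings (one common initialization heuristic)."""
    old_weight = old_emb.weight.data
    new_rows = old_weight.mean(dim=0, keepdim=True).repeat(num_new, 1)
    new_emb = torch.nn.Embedding(old_weight.size(0) + num_new, old_weight.size(1))
    new_emb.weight.data = torch.cat([old_weight, new_rows], dim=0)
    return new_emb

emb = torch.nn.Embedding(1000, 64)
emb = extend_embeddings(emb, num_new=200)  # e.g., subwords for an unseen script
print(emb.weight.shape)  # torch.Size([1200, 64])
```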
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
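A schematic of the architecture described above: a shared LSTM encoder whose hidden states serve as contextualised embeddings, with one decoder head for translation and one for reconstruction. Dimensions, decoder inputs, and the absence of attention are simplifications, not the authors' exact model:

```python
import torch
import torch.nn as nn

class DualTaskSeq2Seq(nn.Module):
    """Shared LSTM encoder with two decoders: one translates the input,
    one reconstructs it. Schematic sketch only."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.translate_dec = nn.LSTM(dim, dim, batch_first=True)
        self.reconstruct_dec = nn.LSTM(dim, dim, batch_first=True)
        self.to_tgt = nn.Linear(dim, tgt_vocab)   # translation logits
        self.to_src = nn.Linear(dim, src_vocab)   # reconstruction logits

    def forward(self, src_ids):
        x = self.src_emb(src_ids)
        enc_out, state = self.encoder(x)  # enc_out rows = contextual embeddings
        trans_h, _ = self.translate_dec(enc_out, state)
        recon_h, _ = self.reconstruct_dec(enc_out, state)
        return self.to_tgt(trans_h), self.to_src(recon_h), enc_out

model = DualTaskSeq2Seq(src_vocab=8000, tgt_vocab=8000)
logits_tgt, logits_src, contextual_emb = model(torch.randint(0, 8000, (2, 7)))
print(contextual_emb.shape)  # torch.Size([2, 7, 256])
```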
- Assessing the Bilingual Knowledge Learned by Neural Machine Translation Models
We bridge this gap by assessing the bilingual knowledge learned by NMT models using phrase tables.
We find that NMT models learn patterns from simple to complex and distill essential bilingual knowledge from the training examples.
arXiv Detail & Related papers (2020-04-28T03:44:34Z)
- Balancing Training for Multilingual Neural Machine Translation
Multilingual machine translation (MT) models can translate to and from multiple languages, but their training sets are typically imbalanced across languages.
Standard practice is to up-sample less-resourced languages to increase their representation.
We propose a method that instead automatically learns how to weight training data through a data scorer.
arXiv Detail & Related papers (2020-04-14T18:23:28Z)
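For reference, the fixed up-sampling rule that the learned data scorer replaces is usually temperature-based: a language with D_i training examples is sampled with probability proportional to D_i**(1/T). A worked sketch:

```python
def temperature_sampling_probs(dataset_sizes, T=5.0):
    """p_i is proportional to D_i**(1/T); T=1 is proportional sampling,
    larger T flattens toward uniform (up-sampling low-resource languages)."""
    scaled = [d ** (1.0 / T) for d in dataset_sizes]
    total = sum(scaled)
    return [s / total for s in scaled]

sizes = {"fr": 1_000_000, "gl": 10_000}  # high- vs low-resource (toy numbers)
print(temperature_sampling_probs(list(sizes.values()), T=1.0))  # ~[0.990, 0.010]
print(temperature_sampling_probs(list(sizes.values()), T=5.0))  # ~[0.715, 0.285]
```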