Finetuning a Kalaallisut-English machine translation system using
web-crawled data
- URL: http://arxiv.org/abs/2206.02230v1
- Date: Sun, 5 Jun 2022 17:56:55 GMT
- Title: Finetuning a Kalaallisut-English machine translation system using
web-crawled data
- Authors: Alex Jones
- Abstract summary: West Greenlandic, known by native speakers as Kalaallisut, is an extremely low-resource polysynthetic language spoken by around 56,000 people in Greenland.
Here, we attempt to finetune a pretrained Kalaallisut-to-English neural machine translation (NMT) system using web-crawled pseudoparallel sentences from around 30 multilingual websites.
- Score: 6.85316573653194
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: West Greenlandic, known by native speakers as Kalaallisut, is an extremely
low-resource polysynthetic language spoken by around 56,000 people in
Greenland. Here, we attempt to finetune a pretrained Kalaallisut-to-English
neural machine translation (NMT) system using web-crawled pseudoparallel
sentences from around 30 multilingual websites. We compile a corpus of over
93,000 Kalaallisut sentences and over 140,000 Danish sentences, then use
cross-lingual sentence embeddings and approximate nearest-neighbors search in
an attempt to mine near-translations from these corpora. Finally, we translate
the Danish sentence to English to obtain a synthetic Kalaallisut-English
aligned corpus. Although the resulting dataset is too small and noisy to
improve the pretrained MT model, we believe that with additional resources, we
could construct a better pseudoparallel corpus and achieve more promising
results on MT. We also note other possible uses of the monolingual Kalaallisut
data and discuss directions for future work. We make the code and data for our
experiments publicly available.
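The mining step described in the abstract (cross-lingual sentence embeddings plus approximate nearest-neighbors search) can be sketched as follows. This is an illustrative toy, not the paper's released code: the `embed` function is a stand-in for a real cross-lingual encoder such as LASER, and the similarity threshold is an assumed hyperparameter.

```python
# Sketch of pseudoparallel mining: embed sentences from two monolingual
# corpora into a shared space, then pair each source sentence with its
# nearest neighbor on the target side if similarity clears a threshold.
import numpy as np

def embed(sentences):
    # Stand-in for a cross-lingual sentence encoder (e.g. LASER).
    # Fake deterministic unit vectors keep the sketch self-contained.
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(sentences), 8))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def mine_pairs(src_sents, tgt_sents, threshold=0.5):
    src = embed(src_sents)
    tgt = embed(tgt_sents)
    sims = src @ tgt.T           # cosine similarity (rows are unit-norm)
    best = sims.argmax(axis=1)   # exact NN here; FAISS would approximate
    return [(s, tgt_sents[j], float(sims[i, j]))
            for i, (s, j) in enumerate(zip(src_sents, best))
            if sims[i, j] >= threshold]
```

In practice the brute-force `argmax` would be replaced by an approximate nearest-neighbors index (e.g. FAISS) to scale to the ~93,000 Kalaallisut and ~140,000 Danish sentences mentioned above, and the mined Danish side would then be translated into English.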
Related papers
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]

Multilingual neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z)
- Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese [47.45957604683302]
Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English.
We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language.
Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic).
We show that their downstream performance is only 3.56% poorer on natural language understanding (NLU) tasks and 1.51% poorer on generative (NLG) tasks than that of LMs pre-trained on clean data.
arXiv Detail & Related papers (2024-03-20T14:41:01Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Monolingual and Parallel Corpora for Kangri Low Resource Language [0.0]
This paper presents a dataset for Kangri (ISO 639-3: xnr), a low-resource Himachali language listed as endangered by the United Nations Educational, Scientific and Cultural Organization (UNESCO).
The corpus contains 181,552 monolingual sentences and 27,362 Hindi-Kangri parallel sentences.
arXiv Detail & Related papers (2021-03-22T05:52:51Z)
- Unsupervised Transfer Learning in Multilingual Neural Machine Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language independent multilingual sentence representation to easily generalize to a new language.
Blindly decoding from Portuguese using a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
arXiv Detail & Related papers (2021-03-11T14:22:08Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Central Yup'ik and Machine Translation of Low-Resource Polysynthetic Languages [42.3635848780518]
Machine translation tools do not yet exist for the Yup'ik language, a polysynthetic language spoken by around 8,000 people who live primarily in Southwest Alaska.
We compiled a parallel text corpus for Yup'ik and English and developed a morphological parser for Yup'ik based on grammar rules.
We trained a seq2seq neural machine translation model with attention to translate Yup'ik input into English.
arXiv Detail & Related papers (2020-09-09T03:11:43Z)
- An Augmented Translation Technique for low Resource language pair: Sanskrit to Hindi translation [0.0]
In this work, Zero-Shot Translation (ZST) is investigated for a low-resource language pair.
The same architecture is tested for Sanskrit to Hindi translation for which data is sparse.
Dimensionality reduction of word embedding is performed to reduce the memory usage for data storage.
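The dimensionality-reduction step mentioned above can be sketched with PCA via SVD. This is an illustrative example under assumed dimensions (300-d embeddings reduced to 50-d), not the cited paper's actual code or settings:

```python
# Reduce word-embedding dimensionality with PCA to cut storage:
# center the embedding matrix, then project onto the top-k
# principal components obtained from its SVD.
import numpy as np

def pca_reduce(embeddings, k):
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, strongest first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Assumed shapes for illustration: 1,000 words, 300-d embeddings -> 50-d.
emb = np.random.default_rng(1).normal(size=(1000, 300))
reduced = pca_reduce(emb, 50)
```

Reducing 300 dimensions to 50 cuts storage by 6x at the cost of discarding the lowest-variance directions.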
arXiv Detail & Related papers (2020-06-09T17:01:55Z)
- Neural Machine Translation for Low-Resourced Indian Languages [4.726777092009554]
Machine translation is an effective approach to convert text to a different language without any human involvement.
In this paper, we apply NMT to two of the most morphologically rich Indian languages, Tamil and Malayalam, i.e. the English-Tamil and English-Malayalam pairs.
We propose a novel NMT model using multi-head self-attention along with pre-trained Byte-Pair-Encoded (BPE) and MultiBPE embeddings to develop an efficient translation system.
arXiv Detail & Related papers (2020-04-19T17:29:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.