Combining Pretrained High-Resource Embeddings and Subword
Representations for Low-Resource Languages
- URL: http://arxiv.org/abs/2003.04419v3
- Date: Tue, 21 Apr 2020 09:43:53 GMT
- Title: Combining Pretrained High-Resource Embeddings and Subword
Representations for Low-Resource Languages
- Authors: Machel Reid, Edison Marrese-Taylor and Yutaka Matsuo
- Abstract summary: We explore techniques that exploit the qualities of morphologically rich languages (MRLs).
We show that a meta-embedding approach combining both pretrained and morphologically-informed word embeddings performs best in the downstream task of Xhosa-English translation.
- Score: 24.775371434410328
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The contrast between the need for large amounts of data for current Natural
Language Processing (NLP) techniques, and the lack thereof, is accentuated in
the case of African languages, most of which are considered low-resource. To
help circumvent this issue, we explore techniques exploiting the qualities of
morphologically rich languages (MRLs), while leveraging pretrained word vectors
in well-resourced languages. In our exploration, we show that a meta-embedding
approach combining both pretrained and morphologically-informed word embeddings
performs best in the downstream task of Xhosa-English translation.
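As an illustration of the core idea, here is a minimal sketch of one common meta-embedding baseline: projecting a pretrained high-resource vector and a subword-derived vector into a shared space and averaging them. The abstract does not specify the exact combination method, so averaging, the dimensions, and all names below are illustrative assumptions.
```python
# Minimal sketch of an averaging (AVG) meta-embedding, assuming two
# embedding sources of different dimensionality are first projected
# into a shared space. All shapes and names are hypothetical.
import numpy as np

def project(vec: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Linearly map a source embedding into the shared meta-embedding space."""
    return w @ vec

def average_meta_embedding(vectors: list[np.ndarray],
                           projections: list[np.ndarray]) -> np.ndarray:
    """Average the projected source embeddings."""
    projected = [project(v, w) for v, w in zip(vectors, projections)]
    return np.mean(projected, axis=0)

# Toy usage: a 300-d pretrained vector (e.g. from high-resource fastText)
# and a 100-d morphologically-informed subword vector, both mapped into
# a 200-d meta-embedding space.
rng = np.random.default_rng(0)
pretrained_vec = rng.normal(size=300)
subword_vec = rng.normal(size=100)
w_pre, w_sub = rng.normal(size=(200, 300)), rng.normal(size=(200, 100))
meta = average_meta_embedding([pretrained_vec, subword_vec], [w_pre, w_sub])
assert meta.shape == (200,)
```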
Related papers
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
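A hypothetical sketch of the prompting idea follows: assemble few-shot exemplars from several high-resource languages, then ask the model to translate a low-resource sentence into English. The prompt template and exemplar selection here are assumptions, not the paper's exact method.
```python
# Build a linguistically-diverse few-shot translation prompt.
# Exemplars come from high-resource languages; the query is low-resource.
def build_prompt(exemplars: list[tuple[str, str]], query: str) -> str:
    """exemplars: (source sentence, English translation) pairs drawn
    from several high-resource languages."""
    lines = []
    for src, eng in exemplars:
        lines.append(f"Sentence: {src}\nEnglish: {eng}\n")
    lines.append(f"Sentence: {query}\nEnglish:")
    return "\n".join(lines)

prompt = build_prompt(
    [("Bonjour le monde.", "Hello world."),            # French exemplar
     ("Hola, ¿cómo estás?", "Hello, how are you?")],   # Spanish exemplar
    "Molo, unjani?")                                   # low-resource query
print(prompt)
```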
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Morphological Processing of Low-Resource Languages: Where We Are and What's Next [23.7371787793763]
We focus on approaches suitable for languages with minimal or no annotated resources.
We argue that the field is ready to tackle the logical next challenge: understanding a language's morphology from raw text alone.
arXiv Detail & Related papers (2022-03-16T19:47:04Z)
- Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data [40.11208706647032]
The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages.
In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data.
Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation.
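To make two of these monolingual objectives concrete, here is a minimal sketch of denoising autoencoding (reconstruct a corrupted sentence) and back-translation (pair target-side monolingual text with a model's reverse translation). The model interface is hypothetical, and the adversarial objective is omitted.
```python
# Build synthetic training pairs from target-side monolingual text,
# in the spirit of the two objectives named in the summary.
import random

def add_noise(tokens: list[str], drop_prob: float = 0.1,
              shuffle_window: int = 3) -> list[str]:
    """Corrupt a sentence by word dropout plus local shuffling."""
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

def training_pairs(mono_tgt: list[str], translate_to_src) -> list[tuple]:
    """translate_to_src is an assumed reverse-translation callable."""
    pairs = []
    for sent in mono_tgt:
        # Denoising autoencoding: noisy input -> clean sentence.
        pairs.append((" ".join(add_noise(sent.split())), sent))
        # Back-translation: synthetic source -> clean target sentence.
        pairs.append((translate_to_src(sent), sent))
    return pairs
```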
arXiv Detail & Related papers (2021-05-31T16:01:18Z)
- How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages [1.7625363344837164]
We introduce the first cross-lingual information extraction pipeline for Sumerian.
We also curate InterpretLR, an interpretability toolkit for low-resource NLP.
Most components of our pipeline can be generalised to other languages while keeping execution interpretable.
arXiv Detail & Related papers (2021-05-30T12:09:59Z)
- MetaXL: Meta Representation Transformation for Low-resource Cross-lingual Learning [91.5426763812547]
Cross-lingual transfer learning is one of the most effective methods for building functional NLP systems for low-resource languages.
We propose MetaXL, a meta-learning based framework that learns to transform representations judiciously from auxiliary languages to a target one.
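A rough PyTorch sketch of the core idea as summarized above: a small transformation network reshapes auxiliary-language representations before they reach the task model. The meta-learning loop that trains it against target-language loss is only indicated in a comment; the layer placement, sizes, and residual form are assumptions.
```python
import torch
import torch.nn as nn

class RepresentationTransform(nn.Module):
    """Bottleneck MLP applied to auxiliary-language hidden states."""
    def __init__(self, hidden: int = 768, bottleneck: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, hidden))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.net(h)  # residual transformation

# In training, these parameters would be meta-updated so that task-model
# steps on transformed auxiliary data reduce loss on a small
# target-language batch (a bi-level objective, omitted here).
aux_hidden = torch.randn(8, 128, 768)  # (batch, seq, hidden)
transformed = RepresentationTransform()(aux_hidden)
```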
arXiv Detail & Related papers (2021-04-16T06:15:52Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
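One simplified recipe in the spirit of this line of work: give an unseen script its own small embedding matrix, initialized near the pretrained embedding distribution, while keeping the transformer body frozen. The vocabulary size, dimensions, and initialization below are illustrative assumptions, not the paper's exact procedure.
```python
import torch
import torch.nn as nn

vocab_new, hidden = 8000, 768            # assumed new-script tokenizer size
new_embeddings = nn.Embedding(vocab_new, hidden)
# Stand-in for the pretrained model's embedding matrix (e.g. XLM-R's).
old_matrix = torch.randn(250_000, hidden)
with torch.no_grad():
    # Initialize new rows from the mean/std of the original matrix so
    # they start in the same region of representation space.
    mean, std = old_matrix.mean(0), old_matrix.std(0)
    new_embeddings.weight.copy_(mean + std * torch.randn(vocab_new, hidden))
# Only the new embeddings (and possibly small adapters) would be trained;
# the pretrained transformer body stays frozen.
```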
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
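A minimal PyTorch sketch of that architecture, assuming one shared LSTM encoder with two decoders trained jointly to translate and to reconstruct the input, so the encoder states double as contextualised embeddings. The single-layer setup and the simplified decoding (encoder states fed directly to the decoders, rather than teacher-forced target embeddings) are assumptions for brevity.
```python
import torch
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    def __init__(self, src_vocab: int, tgt_vocab: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.translate_dec = nn.LSTM(dim, dim, batch_first=True)
        self.reconstruct_dec = nn.LSTM(dim, dim, batch_first=True)
        self.to_tgt = nn.Linear(dim, tgt_vocab)
        self.to_src = nn.Linear(dim, src_vocab)

    def forward(self, src_ids: torch.Tensor):
        h, state = self.encoder(self.embed(src_ids))
        # h holds the contextualised embeddings the method extracts.
        trans_logits = self.to_tgt(self.translate_dec(h, state)[0])
        recon_logits = self.to_src(self.reconstruct_dec(h, state)[0])
        return h, trans_logits, recon_logits

model = TranslateAndReconstruct(src_vocab=5000, tgt_vocab=5000)
h, trans, recon = model(torch.randint(0, 5000, (4, 20)))
```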
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation [14.116412358534442]
Methods for improving neural machine translation for low-resource languages are reviewed.
Tests are carried out on three artificially restricted translation tasks and one real-world task.
Experiments show positive effects especially for scheduled multi-task learning, denoising autoencoder, and subword sampling.
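One standard way to realise subword sampling is SentencePiece's unigram model, which draws a different segmentation of the same sentence each time, acting as regularisation for low-resource NMT. This is an illustrative toolkit choice, not necessarily the one used in the paper, and the model file path is an assumption.
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="xh.unigram.model")  # assumed file
sentence = "ndiyakuthanda"
for _ in range(3):
    # nbest_size=-1 samples over all segmentations; alpha controls the
    # sharpness of the sampling distribution.
    pieces = sp.encode(sentence, out_type=str,
                       enable_sampling=True, alpha=0.1, nbest_size=-1)
    print(pieces)  # a different subword split on each draw
```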
arXiv Detail & Related papers (2020-04-08T14:19:05Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages together.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
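A condensed sketch of that setup: a character-level recurrent tagger whose character embeddings are shared across a high-resource and a low-resource language, with a language-ID embedding to distinguish them. The tagset size, dimensions, and language embedding are assumptions for illustration.
```python
import torch
import torch.nn as nn

class CharJointTagger(nn.Module):
    def __init__(self, n_chars=200, n_langs=2, n_tags=50, dim=128):
        super().__init__()
        self.chars = nn.Embedding(n_chars, dim)  # shared across languages
        self.lang = nn.Embedding(n_langs, dim)   # tells the model the language
        self.rnn = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, n_tags)

    def forward(self, char_ids, lang_id):
        x = self.chars(char_ids) + self.lang(lang_id).unsqueeze(1)
        h, _ = self.rnn(x)
        return self.out(h)  # per-character morphological tag scores

tagger = CharJointTagger()
scores = tagger(torch.randint(0, 200, (4, 30)), torch.tensor([0, 1, 0, 1]))
```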
arXiv Detail & Related papers (2017-08-30T08:14:34Z)