Anchor-based Bilingual Word Embeddings for Low-Resource Languages
- URL: http://arxiv.org/abs/2010.12627v2
- Date: Tue, 27 Jul 2021 11:06:05 GMT
- Title: Anchor-based Bilingual Word Embeddings for Low-Resource Languages
- Authors: Tobias Eder, Viktor Hangya, Alexander Fraser
- Abstract summary: Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned into a bilingual space using only a few thousand word translation pairs.
This paper proposes a new approach for building bilingual word embeddings (BWEs) in which the vector space of the high-resource source language is used as a starting point.
- Score: 76.48625630211943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Good-quality monolingual word embeddings (MWEs) can be built for
languages which have large amounts of unlabeled text. MWEs can be aligned into
bilingual spaces using only a few thousand word translation pairs. For
low-resource languages, however, training MWEs monolingually results in
poor-quality MWEs, and thus in poor bilingual word embeddings (BWEs) as well.
This paper proposes a new approach for building BWEs in which the vector space
of the high-resource source language is used as a starting point for training
an embedding space for the low-resource target language. By using the source
vectors as anchors, the vector spaces are automatically aligned during
training. We experiment on English-German, English-Hiligaynon and
English-Macedonian. We show that our approach results not only in improved
BWEs and bilingual lexicon induction performance, but also in improved
target-language MWE quality, as measured using monolingual word similarity.
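For context, the standard pipeline that the abstract contrasts with trains the two MWE spaces independently and then maps one onto the other using the seed dictionary, most commonly by solving an orthogonal Procrustes problem. A minimal numpy sketch of that mapping step (the solver choice, shapes, and names are illustrative assumptions, not details from this abstract):

    import numpy as np

    def procrustes_align(X_src, X_tgt):
        # Solve min_W ||X_src @ W - X_tgt||_F subject to W being orthogonal.
        # Rows i of X_src and X_tgt hold the embeddings of the i-th
        # seed-dictionary translation pair. The closed-form solution is
        # W = U V^T from the SVD of the cross-covariance matrix.
        U, _, Vt = np.linalg.svd(X_src.T @ X_tgt)
        return U @ Vt  # (dim, dim) rotation mapping source space -> target space

    # Toy usage: 5,000 seed pairs of 300-dimensional embeddings.
    rng = np.random.default_rng(0)
    X_src = rng.standard_normal((5000, 300))
    X_tgt = rng.standard_normal((5000, 300))
    W = procrustes_align(X_src, X_tgt)
    aligned = X_src @ W  # source vectors can now be compared to target ones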
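The anchor-based approach proposed here avoids that post-hoc mapping: the target space is trained directly inside the source space by initializing dictionary words at their translations' vectors. A minimal sketch of the initialization step, assuming skip-gram-style training afterwards; the helper name and the gradient-freezing variant are illustrative assumptions rather than the authors' exact recipe:

    import numpy as np

    def init_target_embeddings(tgt_vocab, src_vectors, seed_dict, dim=300, seed=0):
        # Target words with a seed-dictionary entry start at (and anchor to)
        # their source translation's vector; all other words start random.
        rng = np.random.default_rng(seed)
        E = rng.uniform(-0.5 / dim, 0.5 / dim, size=(len(tgt_vocab), dim))
        anchored = np.zeros(len(tgt_vocab), dtype=bool)
        for i, w in enumerate(tgt_vocab):
            s = seed_dict.get(w)
            if s in src_vectors:
                E[i] = src_vectors[s]
                anchored[i] = True
        return E, anchored

    # Toy usage: two anchored German words inside an English space.
    src = {"house": np.ones(300), "water": -np.ones(300)}
    E, anchored = init_target_embeddings(
        ["haus", "wasser", "hund"], src, {"haus": "house", "wasser": "water"})
    # During subsequent skip-gram training on target text, one simple variant
    # keeps the anchors fixed by zeroing their gradient rows:
    #     grad[anchored] = 0.0

Because anchored words stay in the source space, the remaining target vectors are pulled into that space during training, which is why no separate alignment step is needed.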
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
Transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment [13.997006139875563]
Cross-lingual word representations for low-resource languages are notably under-aligned with those for high-resource languages in current models.
We introduce a novel framework that explicitly aligns words between English and eight low-resource languages, utilizing off-the-shelf word alignment models.
arXiv Detail & Related papers (2024-04-03T05:58:53Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time, starting from the resource-rich source and sequentially adding each language in the chain until we reach the target (a toy sketch of this chaining appears after this list).
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Dict-NMT: Bilingual Dictionary based NMT for Extremely Low Resource Languages [1.8787713898828164]
We present a detailed analysis of the effects of dictionary quality, training dataset size, language family, etc., on translation quality.
Results on multiple low-resource test languages show a clear advantage of our bilingual dictionary-based method over the baselines.
arXiv Detail & Related papers (2022-06-09T12:03:29Z)
- Isomorphic Cross-lingual Embeddings for Low-Resource Languages [1.5076964620370268]
Cross-Lingual Word Embeddings (CLWEs) are a key component to transfer linguistic information learnt from higher-resource settings into lower-resource ones.
We introduce a framework to learn CLWEs, without assuming isometry, for low-resource pairs via joint exploitation of a related higher-resource language.
We show consistent gains over current methods in both quality and degree of isomorphism, as measured by bilingual lexicon induction (BLI) and eigenvalue similarity, respectively (a sketch of the eigenvalue-similarity metric appears after this list).
arXiv Detail & Related papers (2022-03-28T10:39:07Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework for mBART to adapt it effectively to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts; a schematic sketch of the embedding-extension step appears after this list.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
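The chain-based approach above ("Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages") iterates this kind of anchor training hop by hop along a path of related languages. A toy Python sketch of the chaining logic; train_anchored is a drastically simplified stand-in for a real anchor-based training step, and the languages and dictionaries are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    DIM = 4  # toy dimensionality

    def train_anchored(prev_space, vocab, seed_dict):
        # Toy stand-in: dictionary words are pinned to their translation's
        # vector in the previous language's space; the rest get random
        # vectors (real training would run skip-gram on monolingual text).
        space = {}
        for w in vocab:
            t = seed_dict.get(w)
            space[w] = prev_space[t].copy() if t in prev_space else rng.standard_normal(DIM)
        return space

    # Illustrative chain: resource-rich English -> German -> Dutch "target".
    en = {w: rng.standard_normal(DIM) for w in ["house", "water", "dog"]}
    de = train_anchored(en, ["haus", "wasser", "hund"],
                        {"haus": "house", "wasser": "water"})
    nl = train_anchored(de, ["huis", "water", "hond"],
                        {"huis": "haus", "water": "wasser"})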
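The "Isomorphic Cross-lingual Embeddings" entry measures the degree of isomorphism with eigenvalue similarity. A sketch of one common formulation of that metric (comparing the top eigenvalues of the graph Laplacians of each space's nearest-neighbour graph, in the spirit of Søgaard et al., 2018); the k-NN size and the 90% spectral-mass threshold are conventional defaults assumed here, not values from the cited paper:

    import numpy as np

    def knn_graph_laplacian(X, k=10):
        # Unnormalized Laplacian L = D - A of the symmetrized k-NN graph
        # built from cosine similarities between the row vectors of X.
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        S = Xn @ Xn.T
        np.fill_diagonal(S, -np.inf)              # exclude self-edges
        A = np.zeros_like(S)
        idx = np.argsort(-S, axis=1)[:, :k]       # k nearest neighbours per node
        A[np.repeat(np.arange(len(X)), k), idx.ravel()] = 1.0
        A = np.maximum(A, A.T)                    # symmetrize
        return np.diag(A.sum(axis=1)) - A

    def eigenvalue_similarity(X1, X2, threshold=0.9, k_nn=10):
        # Lower = more isomorphic: sum of squared differences between the
        # largest Laplacian eigenvalues covering `threshold` of the spectrum.
        spectra = []
        for X in (X1, X2):
            lam = np.sort(np.linalg.eigvalsh(knn_graph_laplacian(X, k_nn)))[::-1]
            k = int(np.searchsorted(np.cumsum(lam) / lam.sum(), threshold)) + 1
            spectra.append((lam, k))
        k = min(spectra[0][1], spectra[1][1])
        return float(np.sum((spectra[0][0][:k] - spectra[1][0][:k]) ** 2))

    # Toy usage on two random 100-word, 50-dimensional "spaces".
    rng = np.random.default_rng(0)
    print(eigenvalue_similarity(rng.standard_normal((100, 50)),
                                rng.standard_normal((100, 50))))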
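Finally, the "UNKs Everywhere" entry adapts pretrained multilingual models to unseen scripts, which in practice starts with extending the tokenizer and the embedding matrix before continuing training. A minimal Hugging Face Transformers sketch; the added characters and the mean-initialization heuristic are illustrative assumptions, not the cited paper's exact method:

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

    # Add tokens for a script unseen in pretraining (here a few Baybayin
    # characters, purely as an illustration).
    num_added = tokenizer.add_tokens(["ᜀ", "ᜃ", "ᜈ"])

    # Grow the embedding matrix to cover the new tokens, then re-initialize
    # the fresh rows to the mean of the existing embeddings (a common
    # heuristic, assumed here).
    model.resize_token_embeddings(len(tokenizer))
    with torch.no_grad():
        emb = model.get_input_embeddings().weight
        emb[-num_added:] = emb[:-num_added].mean(dim=0)

    # From here, continued masked-language-model training on target-language
    # text makes the new embeddings meaningful.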
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content above and is not responsible for any consequences of its use.