Multilingual Word Embeddings for Low-Resource Languages using Anchors
and a Chain of Related Languages
- URL: http://arxiv.org/abs/2311.12489v1
- Date: Tue, 21 Nov 2023 09:59:29 GMT
- Title: Multilingual Word Embeddings for Low-Resource Languages using Anchors
and a Chain of Related Languages
- Authors: Viktor Hangya, Silvia Severini, Radoslav Ralev, Alexander Fraser,
Hinrich Schütze
- Abstract summary: We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (<5M tokens) and 4 moderately low-resource (<50M) target languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Very low-resource languages, having only a few million tokens' worth
of data, are not well supported by multilingual NLP approaches due to
poor-quality cross-lingual word representations. Recent work showed that good
cross-lingual performance can be achieved if a source language is related to
the low-resource target language. However, not all language pairs are related.
In this paper, we propose to build multilingual word embeddings (MWEs) via a
novel language chain-based approach that incorporates intermediate related
languages to bridge the gap between the distant source and target. We build
MWEs one language at a time, starting from the resource-rich source and
sequentially adding each language in the chain until we reach the target. We
extend a semi-joint bilingual approach to multiple languages in order to
eliminate the main weakness of previous work, i.e., independently trained
monolingual embeddings, by anchoring the target language around the
multilingual space. We
evaluate our method on bilingual lexicon induction for 4 language families,
involving 4 very low-resource (<5M tokens) and 4 moderately low-resource (<50M)
target languages, showing improved performance in both categories.
Additionally, our analysis reveals the importance of good quality embeddings
for intermediate languages as well as the importance of leveraging anchor
points from all languages in the multilingual space.
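To make the recipe concrete, here is a minimal sketch of the chain idea: map each language in turn into the growing shared space using a seed dictionary with its predecessor, then pin the seed words as anchors. It substitutes a plain orthogonal-Procrustes mapping for the paper's semi-joint training, so it illustrates the chain-and-anchor structure rather than the authors' exact method; the function names and the {word: vector} input format are assumptions.

```python
import numpy as np

def procrustes(x, y):
    # Orthogonal map W minimizing ||x @ W - y||_F over paired rows of x and y.
    u, _, vt = np.linalg.svd(x.T @ y)
    return u @ vt

def chain_mwes(embeddings, seed_dicts):
    # embeddings: list of {word: np.ndarray}, ordered source -> ... -> target.
    # seed_dicts[i]: {word in language i: its translation in language i+1}.
    # Returns every language expressed in the source language's vector space.
    shared = [embeddings[0]]                      # source space = shared space
    for i, seed in enumerate(seed_dicts):
        prev, nxt = shared[-1], embeddings[i + 1]
        pairs = [(s, t) for s, t in seed.items() if s in prev and t in nxt]
        x = np.stack([nxt[t] for _, t in pairs])  # newly added language side
        y = np.stack([prev[s] for s, _ in pairs]) # already-aligned side
        w = procrustes(x, y)
        aligned = {t: v @ w for t, v in nxt.items()}
        for s, t in pairs:                        # crude anchoring: pin each
            aligned[t] = prev[s]                  # seed word onto its translation
        shared.append(aligned)
    return shared
```

The last element of the returned list holds the target language in the source space; since every intermediate language lands in that space as well, anchor points can in principle be drawn from all languages, which is what the paper's analysis argues for.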
Related papers
- Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
Lens is a novel approach to enhancing the multilingual capabilities of large language models (LLMs).
It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs.
It achieves superior results with much fewer computational resources compared to existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- xCoT: Cross-lingual Instruction Tuning for Cross-lingual Chain-of-Thought Reasoning [36.34986831526529]
Chain-of-thought (CoT) has emerged as a powerful technique to elicit reasoning in large language models.
We propose a cross-lingual instruction fine-tuning framework (xCOT) to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2024-01-13T10:53:53Z)
- When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages [25.52470575274251]
We pre-train over 10,000 monolingual and multilingual language models for over 250 languages.
We find that in moderation, adding multilingual data improves low-resource language modeling performance.
As dataset sizes increase, adding multilingual data begins to hurt performance for both low-resource and high-resource languages.
arXiv Detail & Related papers (2023-11-15T18:47:42Z)
- Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model [13.730152819942445]
Cross-lingual transfer learning can be particularly effective for improving performance in low-resource languages.
This suggests that cross-lingual transfer can be inexpensive and effective for developing a text-to-speech (TTS) front-end in resource-poor languages.
arXiv Detail & Related papers (2023-06-05T04:10:04Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good-quality monolingual word embeddings can be built for languages which have large amounts of unlabeled text.
Such monolingual embeddings can be aligned to bilingual spaces (BWEs) using only a few thousand word translation pairs.
This paper proposes a new approach for building BWEs in which the vector space of the high-resource source language is used as a starting point (a minimal sketch follows below).
arXiv Detail & Related papers (2020-10-23T19:17:00Z)
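Since the entry above is the bilingual precursor of the chain-based method, it is worth noting that once two spaces are aligned, bilingual lexicon induction, the evaluation task of the main paper, reduces to nearest-neighbour retrieval. Below is a minimal precision@1 scorer under the same assumed {word: vector} format; the function name and plain cosine retrieval are illustrative choices, not either paper's exact protocol.

```python
import numpy as np

def bli_precision_at_1(src_vecs, tgt_vecs, test_dict):
    # src_vecs, tgt_vecs: {word: np.ndarray} in one shared, aligned space.
    # test_dict: {source word: gold target translation}.
    tgt_words = list(tgt_vecs)
    m = np.stack([tgt_vecs[w] for w in tgt_words])
    m = m / np.linalg.norm(m, axis=1, keepdims=True)   # unit-normalise once
    hits = total = 0
    for src, gold in test_dict.items():
        if src not in src_vecs:
            continue                                   # skip out-of-vocabulary queries
        q = src_vecs[src] / np.linalg.norm(src_vecs[src])
        pred = tgt_words[int(np.argmax(m @ q))]        # cosine = dot on unit vectors
        hits += pred == gold
        total += 1
    return hits / max(total, 1)
```

Published BLI results often use CSLS retrieval rather than raw cosine to counter hubness; cosine keeps the sketch short.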