When Word Embeddings Become Endangered
- URL: http://arxiv.org/abs/2103.13275v1
- Date: Wed, 24 Mar 2021 15:42:53 GMT
- Title: When Word Embeddings Become Endangered
- Authors: Khalid Alnajjar
- Abstract summary: We present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and translation dictionaries of resource-poor languages.
All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
- Score: 0.685316573653194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Big languages such as English and Finnish have many natural language
processing (NLP) resources and models, but this is not the case for
low-resourced and endangered languages, for which such resources remain scarce
despite the great advantages they would bring to the language communities. The
most common types of resources available for low-resourced and endangered
languages are translation dictionaries and universal dependencies. In this
paper, we present a method for constructing word embeddings for endangered
languages using existing word embeddings of different resource-rich languages
and the translation dictionaries of resource-poor languages. Thereafter, the
embeddings are fine-tuned using the sentences in the universal dependencies and
aligned to match the semantic spaces of the big languages, resulting in
cross-lingual embeddings. The endangered languages we work with here are Erzya,
Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment
analysis model for all the languages that are part of this study, whether
endangered or not, by utilizing the cross-lingual word embeddings. The
evaluation shows that our word embeddings for the endangered languages are well
aligned with those of the resource-rich languages and are suitable for training
task-specific models, as demonstrated by our sentiment analysis model, which
achieved high accuracy. All our cross-lingual word embeddings and the sentiment
analysis model have been released openly via an easy-to-use Python library.
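The abstract outlines a three-step pipeline: seed the endangered-language embeddings from a translation dictionary, fine-tune them on the universal-dependencies sentences, and align the result with a big language's semantic space. Below is a minimal sketch of the first step, assuming a gensim KeyedVectors model for the resource-rich language and a plain {endangered_word: [translations]} dictionary; the function name and the example data are illustrative, not part of the paper's released library.

```python
import numpy as np
from gensim.models import KeyedVectors

def build_endangered_embeddings(rich_vectors: KeyedVectors,
                                translation_dict: dict) -> KeyedVectors:
    """Seed each endangered-language word with the mean vector of its translations."""
    words, vectors = [], []
    for word, translations in translation_dict.items():
        # Keep only translations the resource-rich model actually covers.
        known = [t for t in translations if t in rich_vectors]
        if not known:
            continue
        words.append(word)
        vectors.append(np.mean([rich_vectors[t] for t in known], axis=0))
    seeded = KeyedVectors(vector_size=rich_vectors.vector_size)
    seeded.add_vectors(words, np.asarray(vectors))
    return seeded

# Hypothetical usage: Skolt Sami vectors seeded from Finnish fastText embeddings.
# finnish = KeyedVectors.load_word2vec_format("cc.fi.300.vec")
# sms = build_endangered_embeddings(finnish, {"ǩeʹrjj": ["kirja"]})
```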
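The abstract does not spell out how the alignment is computed; orthogonal Procrustes is a standard way to map one embedding space onto another given dictionary-aligned word pairs, so the sketch below uses it as a stand-in for whatever alignment the paper actually employs. The seed pairs would come from the same translation dictionaries.

```python
import numpy as np

def procrustes_alignment(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W minimising ||src @ W - tgt||_F.

    src and tgt are (n_pairs, dim) matrices of vectors for translation pairs:
    row i of src is an endangered-language word whose translation is row i of tgt.
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Applying W to every endangered-language vector places it in the big
# language's semantic space, yielding cross-lingual embeddings:
# W = procrustes_alignment(src_pairs, tgt_pairs)
# aligned_matrix = endangered_matrix @ W
```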
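Once the spaces are shared, a single sentiment model can serve every language in the study: train on labelled sentences in a big language and apply it unchanged to Erzya, Moksha, Komi-Zyrian or Skolt Sami input. The paper's released model is neural; the linear classifier below is only a compact stand-in to show why the alignment makes this transfer possible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_vector(tokens, vectors):
    """Mean of the word vectors for the tokens the embedding model covers."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(vectors.vector_size)

# Hypothetical usage: train on English, predict on Skolt Sami.
# X = np.stack([sentence_vector(s, english_vectors) for s in english_sentences])
# clf = LogisticRegression(max_iter=1000).fit(X, english_labels)
# Because the Skolt Sami vectors were aligned to the English space, the same
# classifier scores Skolt Sami sentences directly:
# clf.predict([sentence_vector(sms_tokens, sms_vectors)])
```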
Related papers
- LowREm: A Repository of Word Embeddings for 87 Low-Resource Languages Enhanced with Multilingual Graph Knowledge [0.6317163123651698]
We present LowREm, a repository of static embeddings for 87 low-resource languages.
We also propose a novel method to enhance GloVe-based embeddings by integrating multilingual graph knowledge.
arXiv Detail & Related papers (2024-09-26T18:10:26Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Contextualising Levels of Language Resourcedness affecting Digital Processing of Text [0.5620321106679633]
We argue that the dichotomous typology of LRLs and HRLs for all languages is problematic.
Our characterisation is based on a typology of contextual features for each category, rather than on counting tools.
arXiv Detail & Related papers (2023-09-29T07:48:24Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages [1.0312968200748118]
We present an approach for translating word embeddings from a majority language into 4 minority languages.
Furthermore, we present a novel neural network model that is trained on English data to conduct sentiment analysis.
Our research shows that state-of-the-art neural models can be used with endangered languages.
arXiv Detail & Related papers (2023-05-24T17:40:20Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS), and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Creating Lexical Resources for Endangered Languages [2.363388546004777]
Our algorithms construct bilingual dictionaries and multilingual thesauruses using public Wordnets and a machine translator (MT).
Since our work relies on only one bilingual dictionary between an endangered language and an "intermediate helper" language, it is applicable to languages that lack many existing resources.
arXiv Detail & Related papers (2022-08-08T02:31:28Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages together.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.