Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages
- URL: http://arxiv.org/abs/2305.15380v1
- Date: Wed, 24 May 2023 17:40:20 GMT
- Title: Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages
- Authors: Khalid Alnajjar, Mika Hämäläinen, Jack Rueter
- Abstract summary: We present an approach for translating word embeddings from a majority language into 4 minority languages.
Furthermore, we present a novel neural network model that is trained on English data to conduct sentiment analysis.
Our research shows that state-of-the-art neural models can be used with endangered languages.
- Score: 1.0312968200748118
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present an approach for translating word embeddings from a
majority language into 4 minority languages: Erzya, Moksha, Udmurt and
Komi-Zyrian. Furthermore, we align these word embeddings and present a novel
neural network model that is trained on English data to conduct sentiment
analysis and then applied on endangered language data through the aligned word
embeddings. To test our model, we annotated a small sentiment analysis corpus
for the 4 endangered languages and Finnish. Our method reached at least 56%
accuracy for each endangered language. The models and the sentiment corpus will
be released together with this paper. Our research shows that state-of-the-art
neural models can be used with endangered languages with the only requirement
being a dictionary between the endangered language and a majority language.
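The pipeline described in the abstract (map an endangered language's word embeddings into the English embedding space using a bilingual dictionary, then apply an English-trained sentiment classifier to the mapped vectors) can be sketched roughly as below. This is a minimal illustration only: the orthogonal Procrustes mapping, the file names, the dictionary format, and the logistic-regression classifier are assumptions for the sake of the example, not the authors' released code or model.

```python
# Minimal sketch: align an endangered-language embedding space to English
# with a bilingual dictionary (orthogonal Procrustes), then reuse an
# English-trained sentiment classifier on the aligned vectors.
# File names, dictionary entries, and the classifier are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def load_vectors(path):
    """Read word2vec/fastText text format: 'word v1 v2 ... vn' per line."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "count dim" header line
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vecs

def procrustes_mapping(src_vecs, tgt_vecs, dictionary):
    """Learn an orthogonal matrix W such that W @ src ≈ tgt for dictionary pairs."""
    pairs = [(s, t) for s, t in dictionary if s in src_vecs and t in tgt_vecs]
    X = np.stack([src_vecs[s] for s, _ in pairs])   # source-language words
    Y = np.stack([tgt_vecs[t] for _, t in pairs])   # their English translations
    U, _, Vt = np.linalg.svd(Y.T @ X)               # SVD of the cross-covariance
    return U @ Vt                                   # closed-form Procrustes solution

def sentence_vector(tokens, vecs, W=None):
    """Average the (optionally mapped) word vectors of a tokenised sentence."""
    words = [vecs[t] for t in tokens if t in vecs]
    v = np.mean(words, axis=0)
    return W @ v if W is not None else v

# --- hypothetical usage ------------------------------------------------------
# en = load_vectors("wiki.en.vec")
# myv = load_vectors("wiki.myv.vec")                  # Erzya embeddings (assumed file)
# dictionary = [("кудо", "house"), ("паро", "good")]  # toy Erzya->English pairs
# W = procrustes_mapping(myv, en, dictionary)
#
# clf = LogisticRegression(max_iter=1000)
# clf.fit([sentence_vector(s, en) for s in en_train_sents], en_train_labels)
# preds = clf.predict([sentence_vector(s, myv, W) for s in erzya_sents])
```

Because the classifier only ever sees vectors in the shared (English-aligned) space, the same trained model can score sentences in any language for which a dictionary-based mapping W is available, which is the transfer setup the abstract describes.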
Related papers
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z)
- Ensemble Language Models for Multilingual Sentiment Analysis [0.0]
We explore sentiment analysis on tweet texts from SemEval-17 and the Arabic Sentiment Tweet dataset.
Our findings include monolingual models exhibiting superior performance and ensemble models outperforming the baseline.
arXiv Detail & Related papers (2024-03-10T01:39:10Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance; a brief sketch of this measure appears after the list below.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- When Word Embeddings Become Endangered [0.685316573653194]
We present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and translation dictionaries of resource-poor languages.
All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
arXiv Detail & Related papers (2021-03-24T15:42:53Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network [3.0168410626760034]
We build the largest Bengali word embedding models to date based on 250 million articles, which we call BengFastText.
We incorporate word embeddings into a Multichannel Convolutional-LSTM network for predicting different types of hate speech, document classification, and sentiment analysis.
arXiv Detail & Related papers (2020-04-11T22:17:04Z)
- Investigating Language Impact in Bilingual Approaches for Computational Language Documentation [28.838960956506018]
This paper investigates how the choice of translation language affects the posterior documentation work.
We create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment.
Our results suggest that incorporating clues into the neural models' input representation increases their translation and alignment quality.
arXiv Detail & Related papers (2020-03-30T10:30:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.