Learning and Evaluating Emotion Lexicons for 91 Languages
- URL: http://arxiv.org/abs/2005.05672v1
- Date: Tue, 12 May 2020 10:32:03 GMT
- Title: Learning and Evaluating Emotion Lexicons for 91 Languages
- Authors: Sven Buechel, Susanna Rücker, Udo Hahn
- Abstract summary: We introduce a methodology for creating almost arbitrarily large emotion lexicons for any target language.
We generate representationally rich high-coverage lexicons comprising eight emotional variables with more than 100k lexical entries each.
Our approach produces results in line with state-of-the-art monolingual approaches to lexicon creation and even surpasses human reliability for some languages and variables.
- Score: 10.06987680744477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotion lexicons describe the affective meaning of words and thus constitute
a centerpiece for advanced sentiment and emotion analysis. Yet, manually
curated lexicons are only available for a handful of languages, leaving most
languages of the world without such a precious resource for downstream
applications. Even worse, their coverage is often limited both in terms of the
lexical units they contain and the emotional variables they feature. In order
to break this bottleneck, we here introduce a methodology for creating almost
arbitrarily large emotion lexicons for any target language. Our approach
requires nothing but a source language emotion lexicon, a bilingual word
translation model, and a target language embedding model. Fulfilling these
requirements for 91 languages, we are able to generate representationally rich
high-coverage lexicons comprising eight emotional variables with more than 100k
lexical entries each. We evaluated the automatically generated lexicons against
human judgment from 26 datasets, spanning 12 typologically diverse languages,
and found that our approach produces results in line with state-of-the-art
monolingual approaches to lexicon creation and even surpasses human reliability
for some languages and variables. Code and data are available at
https://github.com/JULIELab/MEmoLon archived under DOI
https://doi.org/10.5281/zenodo.3779901.
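The pipeline the abstract describes (translate a source-language emotion lexicon into the target language, then generalize over target-language embeddings) can be sketched roughly as follows. All function names, the translation interface, and the use of plain least squares are illustrative assumptions, not the authors' exact implementation:

```python
# Hypothetical sketch of a lexicon-expansion pipeline in the spirit of the
# abstract; least squares stands in for whatever supervised model is used.
import numpy as np


def expand_lexicon(source_lexicon, translate, target_embeddings, target_vocab):
    """Create a target-language emotion lexicon.

    source_lexicon:    dict mapping source words to emotion-value vectors
    translate:         callable, source word -> target word (bilingual model)
    target_embeddings: dict mapping target words to embedding vectors
    target_vocab:      iterable of target words to score
    """
    # Step 1: translate the gold lexicon into the target language.
    seed = {}
    for word, values in source_lexicon.items():
        tgt = translate(word)
        if tgt in target_embeddings:
            seed[tgt] = values

    # Step 2: fit a linear map from embeddings to emotion values.
    X = np.array([target_embeddings[w] for w in seed])
    Y = np.array([seed[w] for w in seed])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # Step 3: score the full target vocabulary, far beyond the seed entries.
    return {w: target_embeddings[w] @ W
            for w in target_vocab if w in target_embeddings}
```

This is how a seed of translated entries can be grown into a high-coverage lexicon: any target word with an embedding receives predicted emotion values.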
Related papers
- Human-LLM Collaborative Construction of a Cantonese Emotion Lexicon [1.3074442742310615]
This study proposes to develop an emotion lexicon for Cantonese, a low-resource language.
The study leverages existing linguistic resources by integrating emotion labels provided by Large Language Models (LLMs) and human annotators.
The consistency of the proposed lexicon for emotion extraction was assessed by adapting and applying three distinct emotion text datasets.
arXiv Detail & Related papers (2024-10-15T11:57:34Z)
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model [33.87586041774359]
Aya is a massively multilingual generative language model that follows instructions in 101 languages, over 50% of which are considered lower-resourced.
We introduce extensive new evaluation suites that broaden the state of the art for multilingual evaluation across 99 languages.
We conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models.
arXiv Detail & Related papers (2024-02-12T17:34:13Z)
- English Prompts are Better for NLI-based Zero-Shot Emotion Classification than Target-Language Prompts [17.099269597133265]
Our experiments with natural language inference-based language models show that it is consistently better to use English prompts even if the data is in a different language.
arXiv Detail & Related papers (2024-02-05T17:36:19Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes [15.870989191524094]
We develop a general approach that requires only unlabelled text to detect which languages are not well understood by a cross-lingual model.
Our approach is derived from the hypothesis that if a model's understanding is insensitive to perturbations to text in a language, it is likely to have a limited understanding of that language.
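The hypothesis above can be illustrated with a toy probe: perturb the local structure of a text and measure how much a representation changes. The character-bigram `encode` function below is a stand-in assumption for a real cross-lingual model, used only to keep the sketch self-contained:

```python
# Sketch of a local-structure probe: perturb text with small adjacent
# character swaps and check whether representations change. A sensitivity
# near zero would suggest limited understanding of the language.
import random
from collections import Counter


def perturb(text, rng):
    """Swap two adjacent characters at a random position."""
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def encode(text):
    # Toy encoder: character-bigram counts (stand-in for a real model).
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def sensitivity(text, n_trials=50, seed=0):
    """Average representation change under random local perturbations."""
    rng = random.Random(seed)
    base = encode(text)
    diffs = []
    for _ in range(n_trials):
        pert = encode(perturb(text, rng))
        diffs.append(sum((base - pert).values()) + sum((pert - base).values()))
    return sum(diffs) / n_trials
```

Crucially, this needs only unlabelled text: no annotations are required to compare a model's sensitivity across languages.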
arXiv Detail & Related papers (2022-11-09T16:45:16Z)
- Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into language-specific semantic space, and then the projected embeddings will be fed into the Transformer model.
Experiments show that XLP can freely and significantly boost the model performance on extensive multilingual benchmark datasets.
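The core XLP idea, as summarized above, can be sketched as replacing an additive language embedding with a per-language projection of the shared word embeddings. The shapes and the plain matrix product below are illustrative assumptions:

```python
# Minimal sketch of Cross-lingual Language Projection (XLP): each language
# gets its own projection matrix applied to shared word embeddings before
# the result is fed into the Transformer.
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Shared word-embedding table and one projection matrix per language.
word_embeddings = {"hello": rng.normal(size=d_model),
                   "hallo": rng.normal(size=d_model)}
projections = {lang: rng.normal(size=(d_model, d_model))
               for lang in ("en", "de")}


def embed(word, lang):
    """Project a shared word embedding into the language-specific
    semantic space; the result would then enter the Transformer."""
    return projections[lang] @ word_embeddings[word]
```

The same word thus receives a different, language-conditioned representation depending on which projection is applied.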
arXiv Detail & Related papers (2021-02-16T18:47:10Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
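Approach (ii) above amounts to standardizing each language's embedding matrix. The per-dimension normalization shown here is an assumption about the exact form used:

```python
# Sketch of approach (ii): remove each language's mean and variance from
# its embedding matrix so representations become more language-agnostic.
import numpy as np


def standardize_per_language(embeddings):
    """embeddings: (n_words, dim) matrix for ONE language."""
    mean = embeddings.mean(axis=0)
    std = embeddings.std(axis=0) + 1e-8  # avoid division by zero
    return (embeddings - mean) / std
```

Applied separately per language, this removes language-specific offsets and scales that would otherwise dominate cross-lingual comparisons.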
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.