Related papers: MASIVE: Open-Ended Affective State Identification in English and Spanish

MASIVE: Open-Ended Affective State Identification in English and Spanish

URL: http://arxiv.org/abs/2407.12196v1
Date: Tue, 16 Jul 2024 21:43:47 GMT
Title: MASIVE: Open-Ended Affective State Identification in English and Spanish
Authors: Nicholas Deas, Elsbeth Turcan, Iván Pérez Mejía, Kathleen McKeown,
Abstract summary: In this work, we broaden our scope to a practically unbounded set of textitaffective states, which includes any terms that humans use to describe their experiences of feeling. We collect and publish MASIVE, a dataset of Reddit posts in English and Spanish containing over 1,000 unique affective states each. On this task, we find that smaller finetuned multilingual models outperform much larger LLMs, even on region-specific Spanish affective states.
Score: 10.41502827362741
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the field of emotion analysis, much NLP research focuses on identifying a limited number of discrete emotion categories, often applied across languages. These basic sets, however, are rarely designed with textual data in mind, and culture, language, and dialect can influence how particular emotions are interpreted. In this work, we broaden our scope to a practically unbounded set of \textit{affective states}, which includes any terms that humans use to describe their experiences of feeling. We collect and publish MASIVE, a dataset of Reddit posts in English and Spanish containing over 1,000 unique affective states each. We then define the new problem of \textit{affective state identification} for language generation models framed as a masked span prediction task. On this task, we find that smaller finetuned multilingual models outperform much larger LLMs, even on region-specific Spanish affective states. Additionally, we show that pretraining on MASIVE improves model performance on existing emotion benchmarks. Finally, through machine translation experiments, we find that native speaker-written data is vital to good performance on this task.

Related papers

The Emergence of Abstract Thought in Large Language Models Beyond Any Language [95.50197866832772]
Large language models (LLMs) function effectively across a diverse range of languages.<n>Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts.<n>Recent results show strong multilingual performance, even surpassing English performance on specific tasks in other languages.
arXiv Detail & Related papers (2025-06-11T16:00:54Z)
BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages [93.92804151830744]
We present BRIGHTER -- a collection of multi-labeled datasets in 28 different languages. We describe the data collection and annotation processes and the challenges of building these datasets. We show that BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition.
arXiv Detail & Related papers (2025-02-17T15:39:50Z)
English Prompts are Better for NLI-based Zero-Shot Emotion Classification than Target-Language Prompts [17.099269597133265]
We show that it is consistently better to use English prompts even if the data is in a different language. Our experiments with natural language inference-based language models show that it is consistently better to use English prompts even if the data is in a different language.
arXiv Detail & Related papers (2024-02-05T17:36:19Z)
Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification [8.010713141364752]
We study the effect of language on emotion prediction across 3 PLMs on a Hinglish emotion classification dataset. We find that models do learn these associations between language choice and emotional expression. Having code-mixed data present in the pre-training can augment that learning when task-specific data is scarce.
arXiv Detail & Related papers (2024-02-05T16:05:32Z)
Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP. We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba. Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region. All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing. Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z)
Few-Shot Cross-Lingual Stance Detection with Sentiment-Based Pre-Training [32.800766653254634]
We present the most comprehensive study of cross-lingual stance detection to date. We use 15 diverse datasets in 12 languages from 6 language families. For our experiments, we build on pattern-exploiting training, proposing the addition of a novel label encoder.
arXiv Detail & Related papers (2021-09-13T15:20:06Z)
AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
Improving Indonesian Text Classification Using Multilingual Language Model [0.0]
This paper investigates the effect of combining English and Indonesian data on building Indonesian text classification models. The experiment showed that the addition of English data, especially if the amount of Indonesian data is small, improves performance.
arXiv Detail & Related papers (2020-09-12T03:16:25Z)
Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations. We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.