BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages
- URL: http://arxiv.org/abs/2502.11926v1
- Date: Mon, 17 Feb 2025 15:39:50 GMT
- Title: BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages
- Authors: Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine de Kock, Nirmal Surange, Daniela Teodorescu, Ibrahim Said Ahmad, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino D. M. A. Ali, Ilseyar Alimova, Vladimir Araujo, Nikolay Babakov, Naomi Baes, Ana-Maria Bucur, Andiswa Bukula, Guanqun Cao, Rodrigo Tufino Cardenas, Rendi Chevi, Chiamaka Ijeoma Chukwuneke, Alexandra Ciobotaru, Daryna Dementieva, Murja Sani Gadanya, Robert Geislinger, Bela Gipp, Oumaima Hourrane, Oana Ignat, Falalu Ibrahim Lawan, Rooweither Mabuya, Rahmad Mahendra, Vukosi Marivate, Andrew Piper, Alexander Panchenko, Charles Henrique Porto Ferreira, Vitaly Protasov, Samuel Rutunda, Manish Shrivastava, Aura Cristina Udrea, Lilian Diana Awuor Wanzare, Sophie Wu, Florian Valentin Wunderlich, Hanif Muhammad Zhafran, Tianhui Zhang, Yi Zhou, Saif M. Mohammad
- Abstract summary: We present BRIGHTER, a collection of emotion-annotated datasets in 28 different languages.
We describe the data collection and annotation processes and the challenges of building these datasets.
We show that BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition.
- Score: 93.92804151830744
- Abstract: People worldwide use language in subtle and complex ways to express emotions. While emotion recognition, an umbrella term for several NLP tasks, significantly impacts different applications in NLP and other fields, most work in the area focuses on high-resource languages. This focus has led to major disparities in research and proposed solutions, especially for low-resource languages that lack high-quality datasets. In this paper, we present BRIGHTER, a collection of multi-label emotion-annotated datasets in 28 different languages. BRIGHTER covers predominantly low-resource languages from Africa, Asia, Eastern Europe, and Latin America, with instances from various domains annotated by fluent speakers. We describe the data collection and annotation processes and the challenges of building these datasets. We then report experimental results for monolingual and cross-lingual multi-label emotion identification, as well as intensity-level emotion recognition. We investigate results with and without LLMs and analyse the large variability in performance across languages and text domains. We show that the BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition and discuss their impact and utility.
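To make the multi-label setup concrete, the short Python sketch below scores per-text emotion predictions with macro F1 over binary label indicators. The emotion inventory and the example gold/predicted annotations are illustrative assumptions for this sketch, not values taken from the BRIGHTER data or the paper's evaluation protocol.

```python
# Minimal sketch of multi-label emotion scoring (illustrative only; the label
# set and the example annotations are assumptions, not BRIGHTER data).
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]  # assumed inventory

# Hypothetical gold labels and model predictions for three texts; a text may
# carry zero, one, or several emotions at once.
gold = [{"joy"}, {"anger", "disgust"}, set()]
pred = [{"joy", "surprise"}, {"anger"}, set()]

mlb = MultiLabelBinarizer(classes=EMOTIONS)
y_true = mlb.fit_transform(gold)   # (n_texts, n_emotions) 0/1 matrix
y_pred = mlb.transform(pred)

# Macro F1 averages per-emotion F1, so rare emotions weigh as much as frequent ones.
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
```

Macro-averaged F1 is a common choice for this kind of multi-label evaluation because it keeps high-frequency emotions from dominating the score.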
Related papers
- Akan Cinematic Emotions (ACE): A Multimodal Multi-party Dataset for Emotion Recognition in Movie Dialogues [4.894647740789939]
The Akan Conversation Emotion dataset is the first multimodal emotion dialogue dataset for an African language.
It contains 385 emotion-labeled dialogues and 6,162 utterances across audio, visual, and textual modalities.
The presence of prosodic labels in this dataset also makes it the first prosodically annotated African language dataset.
arXiv Detail & Related papers (2025-02-16T03:24:33Z)
- Evaluating the Capabilities of Large Language Models for Multi-label Emotion Understanding [20.581470997286146]
We present EthioEmo, a multi-label emotion classification dataset for four Ethiopian languages.
We perform extensive experiments with an additional English multi-label emotion dataset from SemEval 2018 Task 1.
The results show that multi-label emotion classification accuracy remains insufficient, even for high-resource languages.
arXiv Detail & Related papers (2024-12-17T07:42:39Z)
- SCOPE: Sign Language Contextual Processing with Embedding from LLMs [49.5629738637893]
Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information.
Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information.
We introduce SCOPE, a novel context-aware vision-based SLR and SLT framework.
arXiv Detail & Related papers (2024-09-02T08:56:12Z)
- MASIVE: Open-Ended Affective State Identification in English and Spanish [10.41502827362741]
In this work, we broaden our scope to a practically unbounded set of affective states, which includes any terms that humans use to describe their experiences of feeling.
We collect and publish MASIVE, a dataset of Reddit posts in English and Spanish containing over 1,000 unique affective states each.
On this task, we find that smaller finetuned multilingual models outperform much larger LLMs, even on region-specific Spanish affective states.
arXiv Detail & Related papers (2024-07-16T21:43:47Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Presentation and Analysis of a Multimodal Dataset for Grounded Language Learning [32.28310581819443]
Grounded language acquisition involves learning how language-based interactions refer to the world around them.
In practice the data used for learning tends to be cleaner, clearer, and more grammatical than actual human interactions.
We present a dataset of common household objects described by people using either spoken or written language.
arXiv Detail & Related papers (2020-07-29T17:58:04Z)