XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection
- URL: http://arxiv.org/abs/2011.01612v2
- Date: Fri, 6 Nov 2020 10:50:06 GMT
- Title: XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection
- Authors: Emily \"Ohman, Marc P\`amies, Kaisla Kajava, J\"org Tiedemann
- Abstract summary: The dataset consists of human-annotated Finnish (25k) and English sentences (30k)
We use Plutchik's core emotions to annotate the dataset with the addition of neutral to create a multilabel multiclass dataset.
The dataset is carefully evaluated using language-specific BERT models and SVMs to show that XED performs on par with other similar datasets.
- Score: 0.42056926734482064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce XED, a multilingual fine-grained emotion dataset. The dataset
consists of human-annotated Finnish (25k) and English sentences (30k), as well
as projected annotations for 30 additional languages, providing new resources
for many low-resource languages. We use Plutchik's core emotions to annotate
the dataset with the addition of neutral to create a multilabel multiclass
dataset. The dataset is carefully evaluated using language-specific BERT models
and SVMs to show that XED performs on par with other similar datasets and is
therefore a useful tool for sentiment analysis and emotion detection.
Related papers
- SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection [76.18321723846616]
Task covers more than 30 languages from seven distinct language families.
Data instances are multi-labeled with six emotional classes, with additional datasets in 11 languages annotated for emotion intensity.
Participants were asked to predict labels in three tracks: (a) multilabel emotion detection, (b) emotion intensity score detection, and (c) cross-lingual emotion detection.
arXiv Detail & Related papers (2025-03-10T12:49:31Z) - BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages [93.92804151830744]
We present BRIGHTER -- a collection of multi-labeled datasets in 28 different languages.
We describe the data collection and annotation processes and the challenges of building these datasets.
We show that BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition.
arXiv Detail & Related papers (2025-02-17T15:39:50Z) - Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language [7.59001382786429]
This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German.
Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset.
We train several state-of-the-art machine learning models on the automatically created dataset and release them as well.
arXiv Detail & Related papers (2024-03-25T19:40:26Z) - SER_AMPEL: a multi-source dataset for speech emotion recognition of
Italian older adults [58.49386651361823]
SER_AMPEL is a multi-source dataset for speech emotion recognition (SER)
It is collected with the aim of providing a reference for speech emotion recognition in case of Italian older adults.
The evidence of the need for such a dataset emerges from the analysis of the state of the art.
arXiv Detail & Related papers (2023-11-24T13:47:25Z) - Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages [40.01333053375582]
We aim to create a text classification dataset encompassing a large number of languages.
We leverage parallel translations of the Bible to construct such a dataset.
By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages.
arXiv Detail & Related papers (2023-05-15T09:43:32Z) - UIO at SemEval-2023 Task 12: Multilingual fine-tuning for sentiment
classification in low-resource languages [0.0]
We show how a multilingual large language model can be a resource for sentiment analysis in languages not seen during pretraining.
The languages are to various degrees related to languages used during pretraining, and the language data contain various degrees of code-switching.
We experiment with both monolingual and multilingual datasets for the final fine-tuning, and find that with the provided datasets that contain samples in the thousands, monolingual fine-tuning yields the best results.
arXiv Detail & Related papers (2023-04-27T13:51:18Z) - MINION: a Large-Scale and Diverse Dataset for Multilingual Event
Detection [65.46122357928041]
Event Detection (ED) is the task of identifying and classifying trigger words of event mentions in text.
Main questions include how well existing ED models perform on different languages, how challenging ED is in other languages, and how well ED knowledge and annotation can be transferred across languages.
We introduce a new large-scale multilingual dataset for ED (called MINION) that consistently annotates events for 8 different languages.
arXiv Detail & Related papers (2022-11-11T02:09:51Z) - Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on English dataset and then applied on summarization datasets of other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z) - A Data Bootstrapping Recipe for Low Resource Multilingual Relation
Classification [38.83366564843953]
IndoRE is a dataset with 21K entity and relation tagged gold sentences in three Indian languages, plus English.
We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information.
We study the accuracy efficiency tradeoff between expensive gold instances vs. translated and aligned'silver' instances.
arXiv Detail & Related papers (2021-10-18T18:40:46Z) - DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation
Extraction [15.649929244635269]
We propose a new dataset, DiS-ReX, which alleviates these issues.
Our dataset has more than 1.5 million sentences, spanning across 4 languages with 36 relation classes + 1 no relation (NA) class.
We also modify the widely used bag attention models by encoding sentences using mBERT and provide the first benchmark results on multilingual DS-RE.
arXiv Detail & Related papers (2021-04-17T22:44:38Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - A Study of Cross-Lingual Ability and Language-specific Information in
Multilingual BERT [60.9051207862378]
multilingual BERT works remarkably well on cross-lingual transfer tasks.
Datasize and context window size are crucial factors to the transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z) - XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training,
Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.