SenWave: A Fine-Grained Multi-Language Sentiment Analysis Dataset Sourced from COVID-19 Tweets
- URL: http://arxiv.org/abs/2510.08214v1
- Date: Thu, 09 Oct 2025 13:38:05 GMT
- Title: SenWave: A Fine-Grained Multi-Language Sentiment Analysis Dataset Sourced from COVID-19 Tweets
- Authors: Qiang Yang, Xiuying Chen, Changsheng Ma, Rui Yin, Xin Gao, Xiangliang Zhang
- Abstract summary: SenWave is a novel fine-grained multi-language sentiment analysis dataset specifically designed for analyzing COVID-19 tweets. The dataset comprises 10,000 annotated tweets each in English and Arabic, along with 30,000 translated tweets in Spanish, French, and Italian, derived from English tweets. Our study provides an in-depth analysis of the evolving emotional landscape across languages, countries, and topics, revealing significant insights over time.
- Score: 42.98177831933239
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The global impact of the COVID-19 pandemic has highlighted the need for a comprehensive understanding of public sentiment and reactions. Despite the availability of numerous public datasets on COVID-19, some reaching volumes of up to 100 billion data points, challenges persist regarding the availability of labeled data and the presence of coarse-grained or inappropriate sentiment labels. In this paper, we introduce SenWave, a novel fine-grained multi-language sentiment analysis dataset specifically designed for analyzing COVID-19 tweets, featuring ten sentiment categories across five languages. The dataset comprises 10,000 annotated tweets each in English and Arabic, along with 30,000 translated tweets in Spanish, French, and Italian, derived from English tweets. Additionally, it includes over 105 million unlabeled tweets collected during various COVID-19 waves. To enable accurate fine-grained sentiment classification, we fine-tuned pre-trained transformer-based language models using the labeled tweets. Our study provides an in-depth analysis of the evolving emotional landscape across languages, countries, and topics, revealing significant insights over time. Furthermore, we assess the compatibility of our dataset with ChatGPT, demonstrating its robustness and versatility in various applications. Our dataset and accompanying code are publicly accessible at https://github.com/gitdevqiang/SenWave. We anticipate that this work will foster further exploration into fine-grained sentiment analysis for complex events within the NLP community, promoting more nuanced understanding and research innovations.
Related papers
- Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data [1.0025691625593705]
This study investigates how different approaches for cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean.
arXiv Detail & Related papers (2026-02-19T03:46:11Z) - EmoHopeSpeech: An Annotated Dataset of Emotions and Hope Speech in English and Arabic [0.021665899581403608]
This research introduces a bilingual dataset comprising 23,456 entries for Arabic and 10,036 entries for English, annotated for emotions and hope speech. The dataset provides comprehensive annotations capturing emotion intensity, complexity, and causes, alongside detailed classifications and subcategories for hope speech.
arXiv Detail & Related papers (2025-05-17T11:21:58Z) - SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection [76.18321723846616]
The task covers more than 30 languages from seven distinct language families. Data instances are multi-labeled with six emotional classes, with additional datasets in 11 languages annotated for emotion intensity. Participants were asked to predict labels in three tracks: (a) multilabel emotion detection, (b) emotion intensity score detection, and (c) cross-lingual emotion detection.
arXiv Detail & Related papers (2025-03-10T12:49:31Z) - BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages [93.92804151830744]
We present BRIGHTER, a collection of multi-labeled, emotion-annotated datasets in 28 different languages. We highlight the challenges related to the data collection and annotation processes. We show that the BRIGHTER datasets represent a meaningful step towards addressing the gap in text-based emotion recognition.
arXiv Detail & Related papers (2025-02-17T15:39:50Z) - Bridging the Data Provenance Gap Across Text, Speech and Video [67.72097952282262]
We conduct the largest and first-of-its-kind longitudinal audit across modalities of popular text, speech, and video datasets. Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets.
arXiv Detail & Related papers (2024-12-19T01:30:19Z) - WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines [74.25764182510295]
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English. We introduce World Cuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points.
arXiv Detail & Related papers (2024-10-16T16:11:49Z) - NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis [5.048355865260207]
We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria.
The dataset consists of around 30,000 annotated tweets per language.
We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
arXiv Detail & Related papers (2022-01-20T16:28:06Z) - Extracting Feelings of People Regarding COVID-19 by Social Network Mining [0.0]
A dataset of COVID-related tweets in English is collected.
More than two million tweets from March 23 to June 23 of 2020 are analyzed.
arXiv Detail & Related papers (2021-10-12T16:45:33Z) - Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z) - AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News and Hate Speech Detection Dataset [0.0]
"AraCOVID19-MFH" is a manually annotated multi-label Arabic COVID-19 fake news and hate speech detection dataset.
Our dataset contains 10,828 Arabic tweets annotated with 10 different labels.
It can also be used for hate speech detection, opinion/news classification, dialect identification, and many other tasks.
arXiv Detail & Related papers (2021-05-07T09:52:44Z) - SenWave: Monitoring the Global Sentiments under the COVID-19 Pandemic [26.109661374693935]
We introduce SenWave, a novel sentiment analysis work using 105+ million collected tweets and Weibo messages.
SenWave reveals the sentiment of global conversation in six different languages on COVID-19.
Overall, SenWave shows that optimistic and positive sentiments increased over time, foretelling a desire to seek, together, a reset for an improved COVID-19 world.
arXiv Detail & Related papers (2020-06-18T20:33:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.