MultiProSE: A Multi-label Arabic Dataset for Propaganda, Sentiment, and Emotion Detection
- URL: http://arxiv.org/abs/2502.08319v1
- Date: Wed, 12 Feb 2025 11:35:20 GMT
- Title: MultiProSE: A Multi-label Arabic Dataset for Propaganda, Sentiment, and Emotion Detection
- Authors: Lubna Al-Henaki, Hend Al-Khalifa, Abdulmalik Al-Salman, Hajar Alqubayshi, Hind Al-Twailay, Gheeda Alghamdi, Hawra Aljasim
- Abstract summary: This dataset comprises 8,000 annotated news articles, making it the largest propaganda dataset to date.
For each task, several baselines have been developed using large language models (LLMs), such as GPT-4o-mini, and pre-trained language models (PLMs)
The dataset, annotation guidelines, and source code are all publicly released to facilitate future research and development in Arabic language models.
- Abstract: Propaganda is a form of persuasion that has been used throughout history with the goal of influencing people's opinions through rhetorical and psychological persuasion techniques for determined ends. Although Arabic ranks as the fourth most used language on the internet, resources for propaganda detection in languages other than English, especially Arabic, remain extremely limited. To address this gap, the first Arabic dataset for Multi-label Propaganda, Sentiment, and Emotion (MultiProSE) has been introduced. MultiProSE is an open-source extension of the existing Arabic propaganda dataset, ArPro, with the addition of sentiment and emotion annotations for each text. This dataset comprises 8,000 annotated news articles, making it the largest propaganda dataset to date. For each task, several baselines have been developed using large language models (LLMs), such as GPT-4o-mini, and pre-trained language models (PLMs), including three BERT-based models. The dataset, annotation guidelines, and source code are all publicly released to facilitate future research and development in Arabic language models and contribute to a deeper understanding of how various opinion dimensions interact in news media.
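A PLM baseline for the tasks described above would typically attach a multi-label classification head to a pooled text encoding. The following is a minimal sketch of that setup; the label names are illustrative placeholders, not the dataset's actual annotation schema, and a random tensor stands in for the BERT-style encoding so the sketch runs without pretrained weights.

```python
import torch
from torch import nn

# Hypothetical label set for a multi-label propaganda task
# (illustrative only, not MultiProSE's real taxonomy).
PROPAGANDA_LABELS = ["loaded_language", "exaggeration", "doubt"]

class MultiLabelHead(nn.Module):
    """Multi-label classifier over a pooled text encoding.

    In a real baseline the encoding would be a BERT-style Arabic
    PLM's [CLS] vector; here it is just a stand-in tensor.
    """
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, encoding: torch.Tensor) -> torch.Tensor:
        return self.classifier(encoding)  # raw logits, one per label

hidden_size = 768
head = MultiLabelHead(hidden_size, len(PROPAGANDA_LABELS))
encoding = torch.randn(2, hidden_size)  # batch of 2 pooled encodings
targets = torch.tensor([[1., 0., 1.], [0., 1., 0.]])

# Multi-label training uses an independent sigmoid per label
# (binary cross-entropy), not a softmax over mutually exclusive classes.
loss = nn.BCEWithLogitsLoss()(head(encoding), targets)

# At inference, every label whose probability exceeds 0.5 is assigned,
# so an article can carry several techniques at once.
predicted = torch.sigmoid(head(encoding)) > 0.5
print(loss.item(), predicted.shape)
```

The key design point is the sigmoid-per-label output: unlike single-label sentiment classification, a propagandistic article can exhibit several techniques simultaneously.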
Related papers
- BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages [93.92804151830744]
We present BRIGHTER, a collection of emotion-annotated datasets in 28 different languages.
We describe the data collection and annotation processes and the challenges of building these datasets.
We show that BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition.
arXiv Detail & Related papers (2025-02-17T15:39:50Z)
- AIN: The Arabic INclusive Large Multimodal Model [71.29419186696138]
AIN is an English-Arabic bilingual LMM designed to excel in English and Arabic.
AIN demonstrates state-of-the-art Arabic performance, while also possessing strong English-language visual capabilities.
AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools.
arXiv Detail & Related papers (2025-01-31T18:58:20Z)
- Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLMs) in the Arab world.
One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding.
Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
- Can GPT-4 Identify Propaganda? Annotation and Detection of Propaganda Spans in News Articles [11.64165958410489]
We develop the largest propaganda dataset to date, comprising 8K paragraphs from newspaper articles, labeled at the text-span level following a taxonomy of 23 propagandistic techniques.
Our work offers the first attempt to understand the performance of large language models (LLMs), using GPT-4, for fine-grained propaganda detection from text.
Results showed that GPT-4's performance degrades as the task moves from simply classifying a paragraph as propagandistic or not, to the fine-grained task of detecting propaganda techniques and their manifestation in text.
arXiv Detail & Related papers (2024-02-27T13:02:19Z)
- Exposing propaganda: an analysis of stylistic cues comparing human annotations and machine classification [0.7749297275724032]
This paper investigates the language of propaganda and its stylistic features.
It presents the PPN dataset, composed of news articles extracted from websites identified as propaganda sources.
We propose different NLP techniques to identify the cues used by the annotators, and to compare them with machine classification.
arXiv Detail & Related papers (2024-02-06T07:51:54Z)
- Large Language Models for Propaganda Span Annotation [10.358271919023903]
This study investigates whether Large Language Models, such as GPT-4, can effectively extract propagandistic spans.
The experiments are performed over a large-scale in-house manually annotated dataset.
arXiv Detail & Related papers (2023-11-16T11:37:54Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Dataset of Propaganda Techniques of the State-Sponsored Information Operation of the People's Republic of China [0.0]
This research aims to bridge the information gap by providing a multi-labeled propaganda techniques dataset in Mandarin based on a state-backed information operation dataset provided by Twitter.
In addition to presenting the dataset, we apply a multi-label text classification using fine-tuned BERT.
arXiv Detail & Related papers (2021-06-14T16:11:13Z)
- Sentiment Classification in Swahili Language Using Multilingual BERT [0.04297070083645048]
This study uses the current state-of-the-art model, multilingual BERT, to perform sentiment classification on Swahili datasets.
The data was created by extracting and annotating 8.2k reviews and comments on different social media platforms and the ISEAR emotion dataset.
The model was fine-tuned and achieved a best accuracy of 87.59%.
arXiv Detail & Related papers (2021-04-19T01:47:00Z)
- LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for Multi-Granular Propaganda Span Identification [70.1903083747775]
This paper describes our submission for the task of Propaganda Span Identification in news articles.
We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda.
arXiv Detail & Related papers (2020-08-11T16:14:47Z)
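Several of the entries above describe span-level propaganda detection, where the model decides per token whether it falls inside a propagandistic span. A minimal sketch of a BiLSTM token tagger of that general kind follows; it is not the LTIatCMU system's actual architecture, and a random tensor stands in for BERT's contextual token embeddings so the sketch runs without pretrained weights.

```python
import torch
from torch import nn

class SpanTagger(nn.Module):
    """BiLSTM token tagger for span-level propaganda detection.

    In a real system the inputs would be contextual embeddings from
    BERT; here they are random stand-ins. Each token gets a binary
    decision: inside a propaganda span, or not.
    """
    def __init__(self, hidden_size: int = 768, lstm_size: int = 256):
        super().__init__()
        self.bilstm = nn.LSTM(hidden_size, lstm_size,
                              batch_first=True, bidirectional=True)
        # Bidirectional LSTM doubles the feature size; 2 output classes.
        self.tagger = nn.Linear(2 * lstm_size, 2)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        out, _ = self.bilstm(token_embeddings)
        return self.tagger(out)  # (batch, seq_len, 2) logits per token

tagger = SpanTagger()
tokens = torch.randn(1, 16, 768)   # 1 sentence, 16 subword tokens
logits = tagger(tokens)
spans = logits.argmax(dim=-1)      # per-token in-span / out-of-span decision
print(logits.shape, spans.shape)
```

Contiguous runs of in-span tokens are then merged into character-level spans, which is what makes the task "multi-granular": decisions are made per token but evaluated per span.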
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.