AgoraSpeech: A multi-annotated comprehensive dataset of political discourse through the lens of humans and AI
- URL: http://arxiv.org/abs/2501.06265v1
- Date: Thu, 09 Jan 2025 18:17:59 GMT
- Title: AgoraSpeech: A multi-annotated comprehensive dataset of political discourse through the lens of humans and AI
- Authors: Pavlos Sermpezis, Stelios Karamanidis, Eva Paraschou, Ilias Dimitriadis, Sofia Yfantidou, Filitsa-Ioanna Kouskouveli, Thanasis Troboukis, Kelly Kiki, Antonis Galanopoulos, Athena Vakali,
- Abstract summary: AgoraSpeech is a meticulously curated, high-quality dataset of 171 political speeches from six parties during the Greek national elections in 2023.
The dataset includes annotations (per paragraph) for six natural language processing (NLP) tasks: text classification, topic identification, sentiment analysis, named entity recognition, polarization and populism detection.
- Score: 1.3060410279656598
- License:
- Abstract: Political discourse datasets are important for gaining political insights, analyzing communication strategies or social science phenomena. Although numerous political discourse corpora exist, comprehensive, high-quality, annotated datasets are scarce. This is largely due to the substantial manual effort, multidisciplinarity, and expertise required for the nuanced annotation of rhetorical strategies and ideological contexts. In this paper, we present AgoraSpeech, a meticulously curated, high-quality dataset of 171 political speeches from six parties during the Greek national elections in 2023. The dataset includes annotations (per paragraph) for six natural language processing (NLP) tasks: text classification, topic identification, sentiment analysis, named entity recognition, polarization and populism detection. A two-step annotation was employed, starting with ChatGPT-generated annotations and followed by exhaustive human-in-the-loop validation. The dataset was initially used in a case study to provide insights during the pre-election period. However, it has general applicability by serving as a rich source of information for political and social scientists, journalists, or data scientists, while it can be used for benchmarking and fine-tuning NLP and large language models (LLMs).
Related papers
- Political-LLM: Large Language Models in Political Science [159.95299889946637]
Large language models (LLMs) have been widely adopted in political science tasks.
Political-LLM aims to advance the comprehensive understanding of integrating LLMs into computational political science.
arXiv Detail & Related papers (2024-12-09T08:47:50Z) - SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments [0.12277343096128711]
We provide the SpeakGer data set, consisting of German parliament debates from all 16 federal states of Germany as well as the German Bundestag from 1947-2023.
This data set includes rich meta data in form of information on both reactions from the audience towards the speech as well as information about the speaker's party, their age, their constituency and their party's political alignment.
arXiv Detail & Related papers (2024-10-23T14:00:48Z) - Language Models Learn Metadata: Political Stance Detection Case Study [1.2277343096128712]
This paper investigates the optimal way to incorporate metadata into a political stance detection task.
We show that our simple baseline, using only party membership information, surpasses the current state-of-the-art.
arXiv Detail & Related papers (2024-09-15T14:57:41Z) - Combining Objective and Subjective Perspectives for Political News Understanding [5.741243797283764]
We introduce a text analysis framework which integrates both perspectives and provides a fine-grained processing of subjective aspects.
We illustrate its functioning with insights on news outlets, political orientations, topics, individual entities, and demographic segments.
arXiv Detail & Related papers (2024-08-20T20:13:19Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - Improving Mandarin Prosodic Structure Prediction with Multi-level
Contextual Information [68.89000132126536]
This work proposes to use inter-utterance linguistic information to improve the performance of prosodic structure prediction (PSP)
Our method achieves better F1 scores in predicting prosodic word (PW), prosodic phrase (PPH) and intonational phrase (IPH)
arXiv Detail & Related papers (2023-08-31T09:19:15Z) - An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP.
We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z) - A Greek Parliament Proceedings Dataset for Computational Linguistics and
Political Analysis [4.396860522241306]
We introduce a curated dataset of the Greek Parliament Proceedings that extends chronologically from 1989 up to 2020.
It consists of more than 1 million speeches with extensive metadata, extracted from 5,355 parliamentary record files.
arXiv Detail & Related papers (2022-10-23T23:23:28Z) - O-Dang! The Ontology of Dangerous Speech Messages [53.15616413153125]
We present O-Dang!: The Ontology of Dangerous Speech Messages, a systematic and interoperable Knowledge Graph (KG)
O-Dang! is designed to gather and organize Italian datasets into a structured KG, according to the principles shared within the Linguistic Linked Open Data community.
It provides a model for encoding both gold standard and single-annotator labels in the KG.
arXiv Detail & Related papers (2022-07-13T11:50:05Z) - Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset
for Personality Assessment [50.15466026089435]
We present a novel peer-to-peer Hindi conversation dataset- Vyaktitv.
It consists of high-quality audio and video recordings of the participants, with Hinglish textual transcriptions for each conversation.
The dataset also contains a rich set of socio-demographic features, like income, cultural orientation, amongst several others, for all the participants.
arXiv Detail & Related papers (2020-08-31T17:44:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.