SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments
- URL: http://arxiv.org/abs/2410.17886v1
- Date: Wed, 23 Oct 2024 14:00:48 GMT
- Title: SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments
- Authors: Kai-Robin Lange, Carsten Jentsch
- Abstract summary: We provide the SpeakGer data set, consisting of German parliament debates from all 16 federal states of Germany as well as the German Bundestag from 1947-2023.
This data set includes rich meta data in the form of information on both reactions from the audience towards the speech as well as information about the speaker's party, their age, their constituency and their party's political alignment.
- Score: 0.12277343096128711
- Abstract: The application of natural language processing to political texts and speeches has become increasingly relevant in political science because it makes it possible to analyze text corpora too large for a single person to read. But such text corpora often lack critical meta information, detailing for instance the party, age or constituency of the speaker, that can be used to provide an analysis tailored to more fine-grained research questions. To enable researchers to answer such questions with quantitative approaches such as natural language processing, we provide the SpeakGer data set, consisting of German parliament debates from all 16 federal states of Germany as well as the German Bundestag from 1947-2023, split into a total of 10,806,105 speeches. This data set includes rich meta data in the form of information on both reactions from the audience towards the speech as well as information about the speaker's party, their age, their constituency and their party's political alignment, which enables a deeper analysis. We further provide three exploratory analyses: topic shares of different parties over time, a descriptive analysis of the development of the average speaker's age, and a sentiment analysis of speeches of different parties with regard to the COVID-19 pandemic.
Related papers
- L(u)PIN: LLM-based Political Ideology Nowcasting [1.124958340749622]
We present a method to analyze ideological positions of individual parliamentary representatives by leveraging the latent knowledge of LLMs.
The method lets us evaluate the stance of politicians along an axis of our choice, so that their position on any chosen topic or controversy can be measured flexibly.
arXiv Detail & Related papers (2024-05-12T16:14:07Z)
- Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval [62.82448161570428]
This dataset is designed to investigate fairness in a multilingual information retrieval context.
It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages.
It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
arXiv Detail & Related papers (2023-11-03T12:29:11Z)
- Speaker attribution in German parliamentary debates with QLoRA-adapted large language models [0.0]
We study the potential of the large language model family Llama 2 to automate speaker attribution in German parliamentary debates from 2017-2021.
Our results shed light on the capabilities of large language models in automating speaker attribution, revealing a promising avenue for computational analysis of political discourse and the development of semantic role labeling systems.
arXiv Detail & Related papers (2023-09-18T16:06:16Z)
- The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings [0.0]
The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment.
The paper additionally introduces the first domain-specific multilingual transformer language model for political science applications.
arXiv Detail & Related papers (2023-09-18T14:01:06Z)
- Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information [68.89000132126536]
This work proposes to use inter-utterance linguistic information to improve the performance of prosodic structure prediction (PSP).
Our method achieves better F1 scores in predicting prosodic word (PW), prosodic phrase (PPH) and intonational phrase (IPH).
arXiv Detail & Related papers (2023-08-31T09:19:15Z)
- Natural Language Decompositions of Implicit Content Enable Better Text Representations [56.85319224208865]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account.
We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed.
Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
- Multi-aspect Multilingual and Cross-lingual Parliamentary Speech Analysis [1.759288298635146]
We apply advanced NLP methods to a joint and comparative analysis of six national parliaments between 2017 and 2020.
We analyze emotions and sentiment in the transcripts from the ParlaMint dataset collection.
The results show some commonalities and many surprising differences among the analyzed countries.
arXiv Detail & Related papers (2022-07-03T14:31:32Z)
- BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions [3.4447242282168777]
We release the first version of a newly compiled corpus from Basque parliamentary transcripts.
The corpus is characterized by heavy Basque-Spanish code-switching, and represents an interesting resource to study political discourse in contrasting languages such as Basque and Spanish.
arXiv Detail & Related papers (2022-05-03T14:02:24Z)
- German Parliamentary Corpus (GerParCor) [63.17616047204443]
We introduce the German Parliament Corpus (GerParCor).
GerParCor is a genre-specific corpus of German-language parliamentary protocols from three centuries and four countries.
All protocols were preprocessed by means of the NLP pipeline of spaCy3 and automatically annotated with metadata regarding their session date.
arXiv Detail & Related papers (2022-04-21T22:06:55Z)
- Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment [50.15466026089435]
We present a novel peer-to-peer Hindi conversation dataset, Vyaktitv.
It consists of high-quality audio and video recordings of the participants, with Hinglish textual transcriptions for each conversation.
The dataset also contains a rich set of socio-demographic features, such as income and cultural orientation, among several others, for all the participants.
arXiv Detail & Related papers (2020-08-31T17:44:28Z)
- Unsupervised Speech Decomposition via Triple Information Bottleneck [63.55007056410914]
Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.
We propose SpeechSplit, which can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks.
arXiv Detail & Related papers (2020-04-23T16:12:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.