The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings
- URL: http://arxiv.org/abs/2405.18115v1
- Date: Tue, 28 May 2024 12:23:39 GMT
- Title: The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings
- Authors: Gili Goldin, Nick Howell, Noam Ordan, Ella Rabinovich, Shuly Wintner,
- Abstract summary: We present the Corpus Knesset, a corpus of Hebrew parliamentary proceedings from 1998 to 2022.
We show that the corpus can be used to examine historical developments in the style of political discussions.
We also investigate some differences between the styles of men and women speakers.
- Score: 3.2405928866433067
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings containing over 30 million sentences (over 384 million tokens) from all the (plenary and committee) protocols held in the Israeli parliament between 1998 and 2022. Sentences are annotated with morpho-syntactic information and are associated with detailed meta-information reflecting demographic and political properties of the speakers, based on a large database of parliament members and factions that we compiled. We discuss the structure and composition of the corpus and the various processing steps we applied to it. To demonstrate the utility of this novel dataset we present two use cases. We show that the corpus can be used to examine historical developments in the style of political discussions by showing a reduction in lexical richness in the proceedings over time. We also investigate some differences between the styles of men and women speakers. These use cases exemplify the potential of the corpus to shed light on important trends in the Israeli society, supporting research in linguistics, political science, communication, law, etc.
Related papers
- SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments [0.12277343096128711]
We provide the SpeakGer data set, consisting of German parliament debates from all 16 federal states of Germany as well as the German Bundestag from 1947-2023.
This data set includes rich meta data in form of information on both reactions from the audience towards the speech as well as information about the speaker's party, their age, their constituency and their party's political alignment.
arXiv Detail & Related papers (2024-10-23T14:00:48Z) - IsraParlTweet: The Israeli Parliamentary and Twitter Resource [2.249916681499244]
We introduce IsraParlTweet, a new linked corpus of Hebrew-language parliamentary discussions from the Knesset (Israeli Parliament) between the years 1992-2023 and Twitter posts made by Members of the Knesset between the years 2008-2023.
IsraParlTweet can be used to conduct a wide variety of quantitative and qualitative analyses and provide valuable insights into political discourse in Israel.
arXiv Detail & Related papers (2024-05-30T17:21:15Z) - Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features can not be fully predicted from text, suggesting that prosody carries information above and beyond the words.
arXiv Detail & Related papers (2023-11-28T21:15:24Z) - A Greek Parliament Proceedings Dataset for Computational Linguistics and
Political Analysis [4.396860522241306]
We introduce a curated dataset of the Greek Parliament Proceedings that extends chronologically from 1989 up to 2020.
It consists of more than 1 million speeches with extensive metadata, extracted from 5,355 parliamentary record files.
arXiv Detail & Related papers (2022-10-23T23:23:28Z) - The Open corpus of the Veps and Karelian languages: overview and
applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps created in 2009.
The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced system of search.
Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z) - BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions [3.4447242282168777]
We release the first version of a newly compiled corpus from Basque parliamentary transcripts.
The corpus is characterized by heavy Basque-Spanish code-switching, and represents an interesting resource to study political discourse in contrasting languages such as Basque and Spanish.
arXiv Detail & Related papers (2022-05-03T14:02:24Z) - German Parliamentary Corpus (GerParCor) [63.17616047204443]
We introduce the German Parliament Corpus (GerParCor)
GerParCor is a genre-specific corpus of German-language parliamentary protocols from three centuries and four countries.
All protocols were preprocessed by means of the NLP pipeline of spaCy3 and automatically annotated with metadata regarding their session date.
arXiv Detail & Related papers (2022-04-21T22:06:55Z) - A Novel Corpus of Discourse Structure in Humans and Computers [55.74664144248097]
We present a novel corpus of 445 human- and computer-generated documents, comprising about 27,000 clauses.
The corpus covers both formal and informal discourse, and contains documents generated using fine-tuned GPT-2.
arXiv Detail & Related papers (2021-11-10T20:56:08Z) - Speech Resynthesis from Discrete Disentangled Self-Supervised
Representations [49.48053138928408]
We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.
arXiv Detail & Related papers (2021-04-01T09:20:33Z) - A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.