German Parliamentary Corpus (GerParCor)
- URL: http://arxiv.org/abs/2204.10422v1
- Date: Thu, 21 Apr 2022 22:06:55 GMT
- Title: German Parliamentary Corpus (GerParCor)
- Authors: Giuseppe Abrami, Mevlüt Bagci, Leon Hammerla, Alexander Mehler
- Abstract summary: We introduce the German Parliament Corpus (GerParCor).
GerParCor is a genre-specific corpus of German-language parliamentary protocols from three centuries and four countries.
All protocols were preprocessed by means of the NLP pipeline of spaCy3 and automatically annotated with metadata regarding their session date.
- Score: 63.17616047204443
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parliamentary debates represent a large and partly unexploited treasure trove
of publicly accessible texts. In the German-speaking area, there is a certain
deficit of uniformly accessible and annotated corpora covering all
German-speaking parliaments at the national and federal level. To address this
gap, we introduce the German Parliament Corpus (GerParCor). GerParCor is a
genre-specific corpus of (predominantly historical) German-language
parliamentary protocols from three centuries and four countries, including
state and federal level data. In addition, GerParCor contains conversions of
scanned protocols and, in particular, of protocols in Fraktur converted via an
OCR process based on Tesseract. All protocols were preprocessed by means of the
NLP pipeline of spaCy3 and automatically annotated with metadata regarding
their session date. GerParCor is made available in the XMI format of the UIMA
project. In this way, GerParCor can be used as a large corpus of historical
texts in the field of political communication for various tasks in NLP.
Related papers
- SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments [0.12277343096128711]
We provide the SpeakGer data set, consisting of German parliament debates from all 16 federal states of Germany as well as the German Bundestag from 1947-2023.
The data set includes rich metadata: audience reactions to each speech, as well as the speaker's party, age, constituency, and the party's political alignment.
arXiv Detail & Related papers (2024-10-23T14:00:48Z)
- The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings [0.0]
We present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages.
We focus on three Slavic languages, namely Croatian, Polish, and Serbian.
The results of this pilot run are three high-quality datasets that span more than 5,000 hours of speech and accompanying text transcripts.
arXiv Detail & Related papers (2024-09-23T10:12:18Z)
- The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings [3.2405928866433067]
We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings from 1998 to 2022.
We show that the corpus can be used to examine historical developments in the style of political discussions.
We also investigate some differences between the styles of men and women speakers.
arXiv Detail & Related papers (2024-05-28T12:23:39Z)
- MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present the first multi-dialect Bavarian treebank (MaiBaam), manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD).
We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z)
- MUG: A General Meeting Understanding and Generation Benchmark [60.09540662936726]
We build the AliMeeting4MUG Corpus, which consists of 654 recorded Mandarin meeting sessions with diverse topic coverage.
In this paper, we provide a detailed introduction to this corpus, the SLP tasks and evaluation methods, and the baseline systems and their performance.
arXiv Detail & Related papers (2023-03-24T11:52:25Z)
- BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions [3.4447242282168777]
We release the first version of a newly compiled corpus from Basque parliamentary transcripts.
The corpus is characterized by heavy Basque-Spanish code-switching, and represents an interesting resource to study political discourse in contrasting languages such as Basque and Spanish.
arXiv Detail & Related papers (2022-05-03T14:02:24Z)
- DEBACER: a method for slicing moderated debates [55.705662163385966]
Partitioning debates into blocks with the same subject is essential for understanding.
We propose a new algorithm, DEBACER, which partitions moderated debates.
arXiv Detail & Related papers (2021-12-10T10:39:07Z)
- Persian Rhetorical Structure Theory [2.610470075814367]
We present a discourse-annotated corpus for the Persian language built in the framework of Rhetorical Structure Theory.
Our corpus consists of 150 journalistic texts, each text having an average of around 400 words.
Our text-level discourse parser is trained using gold segmentation and is built upon the DPLP discourse parser.
arXiv Detail & Related papers (2021-06-25T18:15:47Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) takes audio in a source language and outputs the text in a target language.
Existing methods are limited by the amount of parallel corpus.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)
- Unsupervised Speech Decomposition via Triple Information Bottleneck [63.55007056410914]
Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.
We propose SpeechSplit, which can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks.
arXiv Detail & Related papers (2020-04-23T16:12:42Z)