BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions
- URL: http://arxiv.org/abs/2205.01506v1
- Date: Tue, 3 May 2022 14:02:24 GMT
- Title: BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions
- Authors: Nayla Escribano, Jon Ander González, Julen Orbegozo-Terradillos,
Ainara Larrondo-Ureta, Simón Peña-Fernández, Olatz Perez-de-Viñaspre
and Rodrigo Agerri
- Abstract summary: We release the first version of a newly compiled corpus from Basque parliamentary transcripts.
The corpus is characterized by heavy Basque-Spanish code-switching, and represents an interesting resource to study political discourse in contrasting languages such as Basque and Spanish.
- Score: 3.4447242282168777
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Parliamentary transcripts provide a valuable resource for
understanding the most important events that occur over time in our
societies. Furthermore, the political debates captured in these transcripts
facilitate research on political discourse from a computational social science
perspective. In this paper we release the first version of a newly compiled
corpus from Basque parliamentary transcripts. The corpus is characterized by
heavy Basque-Spanish code-switching, and represents an interesting resource to
study political discourse in contrasting languages such as Basque and Spanish.
We enrich the corpus with metadata related to relevant attributes of the
speakers and speeches (language, gender, party...) and process the text to
obtain named entities and lemmas. The obtained metadata is then used to perform
a detailed corpus analysis which provides interesting insights about the
language use of the Basque political representatives across time, parties and
gender.
Related papers
- SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments [0.12277343096128711]
We provide the SpeakGer data set, consisting of German parliament debates from all 16 federal states of Germany as well as the German Bundestag from 1947-2023.
This data set includes rich metadata: audience reactions to each speech, as well as the speaker's party, age, constituency and the party's political alignment.
(2024-10-23T14:00:48Z)
- The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings [3.2405928866433067]
We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings from 1998 to 2022.
We show that the corpus can be used to examine historical developments in the style of political discussions.
We also investigate some differences between the styles of men and women speakers.
(2024-05-28T12:23:39Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
(2024-05-08T00:18:56Z)
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
(2023-11-28T21:15:24Z)
- Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information [68.89000132126536]
This work proposes to use inter-utterance linguistic information to improve the performance of prosodic structure prediction (PSP).
Our method achieves better F1 scores in predicting prosodic word (PW), prosodic phrase (PPH) and intonational phrase (IPH).
(2023-08-31T09:19:15Z)
- Political corpus creation through automatic speech recognition on EU debates [4.670305538969914]
We present a transcribed corpus of the LIBE committee of the EU parliament, totalling 3.6 million running words.
The meetings of parliamentary committees of the EU are a potentially valuable source of information for political scientists, but the data is not readily available because it is only disclosed as speech recordings together with limited metadata.
We investigated the most appropriate Automatic Speech Recognition (ASR) model to create an accurate text transcription of the audio recordings of the meetings in order to make their content available for research and analysis.
(2023-04-17T10:41:59Z)
- Multi-aspect Multilingual and Cross-lingual Parliamentary Speech Analysis [1.759288298635146]
We apply advanced NLP methods to a joint and comparative analysis of six national parliaments between 2017 and 2020.
We analyze emotions and sentiment in the transcripts from the ParlaMint dataset collection.
The results show some commonalities and many surprising differences among the analyzed countries.
(2022-07-03T14:31:32Z)
- Who is we? Disambiguating the referents of first person plural pronouns in parliamentary debates [9.09904590211839]
We present an annotation schema for disambiguating pronoun references and use our schema to create an annotated corpus of debates from the German Bundestag.
We then use our corpus to learn to automatically resolve pronoun referents in parliamentary debates.
(2022-05-27T18:18:04Z)
- German Parliamentary Corpus (GerParCor) [63.17616047204443]
We introduce the German Parliament Corpus (GerParCor).
GerParCor is a genre-specific corpus of German-language parliamentary protocols from three centuries and four countries.
All protocols were preprocessed by means of the NLP pipeline of spaCy3 and automatically annotated with metadata regarding their session date.
(2022-04-21T22:06:55Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
(2022-02-19T11:55:40Z)
- A Novel Corpus of Discourse Structure in Humans and Computers [55.74664144248097]
We present a novel corpus of 445 human- and computer-generated documents, comprising about 27,000 clauses.
The corpus covers both formal and informal discourse, and contains documents generated using fine-tuned GPT-2.
(2021-11-10T20:56:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.