Related papers: BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions

BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions

URL: http://arxiv.org/abs/2205.01506v1
Date: Tue, 3 May 2022 14:02:24 GMT
Title: BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions
Authors: Nayla Escribano, Jon Ander Gonz\'alez, Julen Orbegozo-Terradillos, Ainara Larrondo-Ureta, Sim\'on Pe\~na-Fern\'andez, Olatz Perez-de-Vi\~naspre and Rodrigo Agerri
Abstract summary: We release the first version of a newly compiled corpus from Basque parliamentary transcripts. The corpus is characterized by heavy Basque-Spanish code-switching, and represents an interesting resource to study political discourse in contrasting languages such as Basque and Spanish.
Score: 3.4447242282168777
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Parliamentary transcripts provide a valuable resource to understand the reality and know about the most important facts that occur over time in our societies. Furthermore, the political debates captured in these transcripts facilitate research on political discourse from a computational social science perspective. In this paper we release the first version of a newly compiled corpus from Basque parliamentary transcripts. The corpus is characterized by heavy Basque-Spanish code-switching, and represents an interesting resource to study political discourse in contrasting languages such as Basque and Spanish. We enrich the corpus with metadata related to relevant attributes of the speakers and speeches (language, gender, party...) and process the text to obtain named entities and lemmas. The obtained metadata is then used to perform a detailed corpus analysis which provides interesting insights about the language use of the Basque political representatives across time, parties and gender.

Related papers

A Multilingual Human Annotated Corpus of Original and Easy-to-Read Texts to Support Access to Democratic Participatory Processes [1.4745280175321207]
We present a corpus of original texts for Spanish, Catalan and Italian languages.<n>It was developed within the iDEM project to assess the impact of Easy-to-Read (E2R) language for democratic participation.
arXiv Detail & Related papers (2026-03-05T16:21:25Z)
SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments [0.12277343096128711]
We provide the SpeakGer data set, consisting of German parliament debates from all 16 federal states of Germany as well as the German Bundestag from 1947-2023. This data set includes rich meta data in form of information on both reactions from the audience towards the speech as well as information about the speaker's party, their age, their constituency and their party's political alignment.
arXiv Detail & Related papers (2024-10-23T14:00:48Z)
The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings [3.2405928866433067]
We present the Corpus Knesset, a corpus of Hebrew parliamentary proceedings from 1998 to 2022. We show that the corpus can be used to examine historical developments in the style of political discussions. We also investigate some differences between the styles of men and women speakers.
arXiv Detail & Related papers (2024-05-28T12:23:39Z)
Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies. We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
KamerRaad: Enhancing Information Retrieval in Belgian National Politics through Hierarchical Summarization and Conversational Interfaces [55.00702535694059]
KamerRaad is an AI tool that leverages large language models to help citizens interactively engage with Belgian political information. The tool extracts and concisely summarizes key excerpts from parliamentary proceedings, followed by the potential for interaction based on generative AI.
arXiv Detail & Related papers (2024-04-22T15:01:39Z)
Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves. We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features. Still, we observe that prosodic features can not be fully predicted from text, suggesting that prosody carries information above and beyond the words.
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information [68.89000132126536]
This work proposes to use inter-utterance linguistic information to improve the performance of prosodic structure prediction (PSP) Our method achieves better F1 scores in predicting prosodic word (PW), prosodic phrase (PPH) and intonational phrase (IPH)
arXiv Detail & Related papers (2023-08-31T09:19:15Z)
Political corpus creation through automatic speech recognition on EU debates [4.670305538969914]
We present a transcribed corpus of the LIBE committee of the EU parliament, totalling 3.6 Million running words. The meetings of parliamentary committees of the EU are a potentially valuable source of information for political scientists but the data is not readily available because only disclosed as speech recordings together with limited metadata. We investigated the most appropriate Automatic Speech Recognition (ASR) model to create an accurate text transcription of the audio recordings of the meetings in order to make their content available for research and analysis.
arXiv Detail & Related papers (2023-04-17T10:41:59Z)
Multi-aspect Multilingual and Cross-lingual Parliamentary Speech Analysis [1.759288298635146]
We apply advanced NLP methods to a joint and comparative analysis of six national parliaments between 2017 and 2020. We analyze emotions and sentiment in the transcripts from the ParlaMint dataset collection. The results show some commonalities and many surprising differences among the analyzed countries.
arXiv Detail & Related papers (2022-07-03T14:31:32Z)
Who is we? Disambiguating the referents of first person plural pronouns in parliamentary debates [9.09904590211839]
We present an annotation schema for disambiguating pronoun references and use our schema to create an annotated corpus of debates from the German Bundestag. We then use our corpus to learn to automatically resolve pronoun referents in parliamentary debates.
arXiv Detail & Related papers (2022-05-27T18:18:04Z)
German Parliamentary Corpus (GerParCor) [63.17616047204443]
We introduce the German Parliament Corpus (GerParCor) GerParCor is a genre-specific corpus of German-language parliamentary protocols from three centuries and four countries. All protocols were preprocessed by means of the NLP pipeline of spaCy3 and automatically annotated with metadata regarding their session date.
arXiv Detail & Related papers (2022-04-21T22:06:55Z)
A Novel Corpus of Discourse Structure in Humans and Computers [55.74664144248097]
We present a novel corpus of 445 human- and computer-generated documents, comprising about 27,000 clauses. The corpus covers both formal and informal discourse, and contains documents generated using fine-tuned GPT-2.
arXiv Detail & Related papers (2021-11-10T20:56:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.