Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification
- URL: http://arxiv.org/abs/2602.16516v1
- Date: Wed, 18 Feb 2026 15:04:30 GMT
- Title: Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification
- Authors: Taja Kuzman Pungeršek, Peter Rupnik, Daniela Širinić, Nikola Ljubešić,
- Abstract summary: This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe.
- Score: 0.5666456827479577
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions, we follow a teacher-student framework in which a high-performing large language model (LLM) annotates in-domain training data and a multilingual encoder model is fine-tuned on these annotations for scalable data annotation. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data. In addition to the CAP annotations, the ParlaCAP dataset offers rich speaker and party metadata, as well as sentiment predictions coming from the ParlaSent multilingual transformer model, enabling comparative research on political attention and representation across countries. We illustrate the analytical potential of the dataset with three use cases, examining the distribution of parliamentary attention across policy topics, sentiment patterns in parliamentary speech, and gender differences in policy attention.
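The teacher-student framework described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `teacher_label` is a keyword-rule stand-in for the LLM annotator, `NaiveBayesStudent` is a toy bag-of-words classifier standing in for the fine-tuned multilingual encoder, and the speeches and the three labels are invented rather than taken from ParlaMint or the CAP codebook. Only the overall flow (teacher labels unlabeled in-domain text, student trains on those labels, student annotates new text cheaply) follows the abstract.

```python
from collections import Counter, defaultdict
import math


def teacher_label(speech: str) -> str:
    """Hypothetical teacher: keyword rules standing in for an LLM annotator."""
    text = speech.lower()
    if "hospital" in text or "doctor" in text:
        return "Health"
    if "school" in text or "teacher" in text:
        return "Education"
    return "Other"


class NaiveBayesStudent:
    """Toy student model (multinomial naive Bayes with add-one smoothing).

    The paper fine-tunes a multilingual encoder; this class only mimics
    the role of a cheap model trained on teacher-produced labels.
    """

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Step 1: unlabeled in-domain speeches (invented examples).
unlabeled = [
    "the hospital needs more doctors",
    "every doctor in the hospital is overworked",
    "the school needs more teachers",
    "every teacher in the school is underpaid",
]

# Step 2: the teacher produces "silver" labels.
silver_labels = [teacher_label(s) for s in unlabeled]

# Step 3: the student is trained on the teacher's labels.
student = NaiveBayesStudent().fit(unlabeled, silver_labels)

# Step 4: the student annotates new text without calling the teacher.
prediction = student.predict("funding for the hospital")
print(prediction)
```

In the setting the abstract describes, the teacher is called only on a training sample, while the student is the model applied to the full corpus of over 8 million speeches.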
Related papers
- ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework [78.07201802874529]
ShifCon is a Shift-based multilingual Contrastive framework that aligns the internal forward process of other languages toward that of the dominant one. Experiments demonstrate that our ShifCon framework significantly enhances the performance of non-dominant languages.
arXiv Detail & Related papers (2024-10-25T10:28:59Z)
- The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings [0.0]
We present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages.
We focus on three Slavic languages, namely Croatian, Polish, and Serbian.
The results of this pilot run are three high-quality datasets that span more than 5,000 hours of speech and accompanying text transcripts.
arXiv Detail & Related papers (2024-09-23T10:12:18Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval [62.82448161570428]
This dataset is designed to investigate fairness in a multilingual information retrieval context.
It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages.
It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
arXiv Detail & Related papers (2023-11-03T12:29:11Z)
- Towards a Deep Understanding of Multilingual End-to-End Speech Translation [52.26739715012842]
We analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages.
We derive three major findings from our analysis.
arXiv Detail & Related papers (2023-10-31T13:50:55Z)
- Multilingual estimation of political-party positioning: From label aggregation to long-input Transformers [3.651047982634467]
We implement and compare two approaches to automatic scaling analysis of political-party manifestos.
We find that the task can be efficiently solved by state-of-the-art models, with label aggregation producing the best results.
arXiv Detail & Related papers (2023-10-19T08:34:48Z)
- Speaker attribution in German parliamentary debates with QLoRA-adapted large language models [0.0]
We study the potential of the large language model family Llama 2 to automate speaker attribution in German parliamentary debates from 2017-2021.
Our results shed light on the capabilities of large language models in automating speaker attribution, revealing a promising avenue for computational analysis of political discourse and the development of semantic role labeling systems.
arXiv Detail & Related papers (2023-09-18T16:06:16Z)
- The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings [0.0]
The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment.
The paper additionally introduces the first domain-specific multilingual transformer language model for political science applications.
arXiv Detail & Related papers (2023-09-18T14:01:06Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Political corpus creation through automatic speech recognition on EU debates [4.670305538969914]
We present a transcribed corpus of the LIBE committee of the EU parliament, totalling 3.6 million running words.
The meetings of parliamentary committees of the EU are a potentially valuable source of information for political scientists, but the data is not readily available because it is disclosed only as speech recordings together with limited metadata.
We investigated the most appropriate Automatic Speech Recognition (ASR) model to create an accurate text transcription of the audio recordings of the meetings in order to make their content available for research and analysis.
arXiv Detail & Related papers (2023-04-17T10:41:59Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)