ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech
- URL: http://arxiv.org/abs/2511.08247v1
- Date: Wed, 12 Nov 2025 01:48:30 GMT
- Title: ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech
- Authors: Marios Koniaris, Argyro Tsipi, Panayiotis Tsanakas
- Abstract summary: Parliamentary speeches require not only linguistic quality but also political authenticity and ideological consistency. Current language models lack specialized training for parliamentary contexts. We present ParliaBench, a benchmark for parliamentary speech generation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parliamentary speech generation presents specific challenges for large language models beyond standard text generation tasks. Unlike general text generation, parliamentary speeches require not only linguistic quality but also political authenticity and ideological consistency. Current language models lack specialized training for parliamentary contexts, and existing evaluation methods focus on standard NLP metrics rather than political authenticity. To address this, we present ParliaBench, a benchmark for parliamentary speech generation. We constructed a dataset of speeches from UK Parliament to enable systematic model training. We introduce an evaluation framework combining computational metrics with LLM-as-a-judge assessments for measuring generation quality across three dimensions: linguistic quality, semantic coherence, and political authenticity. We propose two novel embedding-based metrics, Political Spectrum Alignment and Party Alignment, to quantify ideological positioning. We fine-tuned five large language models (LLMs), generated 28k speeches, and evaluated them using our framework, comparing baseline and fine-tuned models. Results show that fine-tuning produces statistically significant improvements across the majority of metrics and our novel metrics demonstrate strong discriminative power for political dimensions.
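The abstract names two embedding-based metrics but does not spell out their formulas. As a rough illustration only, a score like Party Alignment can be sketched as the cosine similarity between a generated speech's embedding and the centroid of a party's reference-speech embeddings; the paper's exact formulation and embedding model may differ.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def party_alignment(speech_emb, party_embs):
    """Illustrative Party Alignment score: similarity of a generated
    speech embedding to the centroid of a party's reference speeches.
    This is a sketch, not the paper's exact metric."""
    centroid = np.mean(party_embs, axis=0)
    return cosine(speech_emb, centroid)

# Toy 4-dimensional embeddings for one party and one generated speech
party = np.array([[1.0, 0.0, 0.2, 0.0],
                  [0.9, 0.1, 0.0, 0.0]])
speech = np.array([0.95, 0.05, 0.1, 0.0])
score = party_alignment(speech, party)
```

A speech embedded near the party centroid scores close to 1, while an ideologically distant one scores lower, which gives the metric its discriminative power.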
Related papers
- MPCEval: A Benchmark for Multi-Party Conversation Generation [23.227067535888768]
We introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker--content consistency. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations.
arXiv Detail & Related papers (2026-03-05T09:07:00Z)
- On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation [88.77441715819366]
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content. We propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity.
arXiv Detail & Related papers (2026-01-09T22:01:56Z)
- Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation [10.488285141408253]
We introduce a novel framework that models counter-speech generation as a knowledge-wise text generation process. Our framework integrates advanced Retrieval-Augmented Generation (RAG) pipelines to ensure the generation of trustworthy counter-speech for 8 main target groups. We use the MultiTarget-CONAN dataset to empirically assess the quality of the generated counter-speech, both through standard metrics and a human evaluation.
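The retrieve-then-generate flow this summary describes can be illustrated with a toy pipeline. The word-overlap retriever and example facts below are stand-ins: real counter-speech systems use dense retrieval over curated knowledge bases.

```python
def retrieve(query, corpus, k=1):
    """Toy retriever: rank documents by word overlap with the query.
    Illustrates the retrieve step of a RAG pipeline; production systems
    use dense embeddings rather than bag-of-words overlap."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

# Hypothetical knowledge base of verified facts
facts = ["Immigrants are net contributors to public finances in many studies.",
         "The weather is mild in spring."]

# Retrieve evidence for the harmful claim, then ground the generation on it
evidence = retrieve("immigrants drain public finances", facts)
prompt = f"Using this evidence, write a factual counter-speech: {evidence[0]}"
```

Grounding the prompt on retrieved evidence, rather than on the model's parametric memory alone, is what makes the generated counter-speech "trustworthy" in the sense the paper targets.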
arXiv Detail & Related papers (2025-10-14T09:20:01Z)
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models [60.72029578488467]
SpeechR is a unified benchmark for evaluating reasoning over speech in large audio-language models. It evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities.
arXiv Detail & Related papers (2025-08-04T03:28:04Z)
- KOKKAI DOC: An LLM-driven framework for scaling parliamentary representatives [0.0]
This paper introduces an LLM-driven framework designed to accurately scale the political issue stances of parliamentary representatives. By leveraging advanced natural language processing techniques and large language models, the proposed methodology refines and enhances previous approaches. The framework incorporates three major innovations: (1) de-noising parliamentary speeches via summarization to produce cleaner, more consistent opinion embeddings; (2) automatic extraction of axes of political controversy from legislators' speech summaries; and (3) a diachronic analysis that tracks the evolution of party positions over time.
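Innovation (2), extracting an axis of controversy from opinion embeddings, can be sketched as finding the first principal component of the embedding matrix. The toy vectors and the plain SVD below are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

# Hypothetical 2-d opinion embeddings for four legislators (one row each),
# e.g. produced by encoding LLM summaries of their speeches.
emb = np.array([[ 1.0,  0.1],
                [ 0.9, -0.1],
                [-1.0,  0.2],
                [-0.9,  0.0]])

# The main axis of controversy is the direction of greatest variance:
# the first right singular vector of the mean-centered embeddings.
centered = emb - emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
axis = vt[0]                  # direction of maximal disagreement
positions = centered @ axis   # scalar stance of each legislator on that axis
```

Projecting each legislator onto this axis yields the scalar stance scores the scaling literature works with; repeating the projection per legislative session gives the diachronic view of innovation (3).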
arXiv Detail & Related papers (2025-05-11T21:03:53Z)
- Positioning Political Texts with Large Language Models by Asking and Averaging [0.0]
We ask an LLM where a tweet or a sentence of a political text stands on the focal dimension and take the average of the LLM responses to position political actors.
The correlations between the position estimates obtained with the best LLMs and benchmarks based on text coding by experts, crowdworkers, or roll call votes exceed .90.
Using instruction-tuned LLMs to position texts in policy and ideological spaces is fast, cost-efficient, reliable, and reproducible (in the case of open LLMs) even if the texts are short and written in different languages.
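The ask-and-average procedure is simple enough to sketch directly. The code below assumes any chat-completion callable that returns a numeric placement; the prompt wording and the 0-10 scale are illustrative, not the paper's exact protocol.

```python
from statistics import mean

def position_text(text, ask_llm, n_calls=5):
    """Ask-and-average sketch: query the model several times for a scalar
    left-right placement and average the replies. `ask_llm` is a
    placeholder for any chat-completion call that returns a number."""
    prompt = ("On a 0-10 left-right scale, where does this text stand? "
              "Reply with one number.\n\n" + text)
    scores = [float(ask_llm(prompt)) for _ in range(n_calls)]
    return mean(scores)

# Stub model for illustration: always answers "7"
estimate = position_text("We must cut taxes.", lambda prompt: "7")
```

Averaging over repeated calls (and, in the paper's setting, over the sentences or tweets attributable to one actor) smooths out the sampling noise of any single LLM response.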
arXiv Detail & Related papers (2023-11-28T09:45:02Z)
- Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
- The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings [0.0]
The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment.
The paper additionally introduces the first domain-specific multilingual transformer language model for political science applications.
arXiv Detail & Related papers (2023-09-18T14:01:06Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [56.93604813379634]
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels.
We propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels.
We highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
arXiv Detail & Related papers (2023-06-02T12:54:38Z)
- Political corpus creation through automatic speech recognition on EU debates [4.670305538969914]
We present a transcribed corpus of the LIBE committee of the EU parliament, totalling 3.6 million running words.
The meetings of parliamentary committees of the EU are a potentially valuable source of information for political scientists, but the data is not readily available because it is only disclosed as speech recordings with limited metadata.
We investigated the most appropriate Automatic Speech Recognition (ASR) model to create an accurate text transcription of the audio recordings of the meetings in order to make their content available for research and analysis.
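Comparing candidate ASR models typically comes down to word error rate (WER) on a held-out transcribed sample. Below is a self-contained WER implementation using the standard edit-distance definition; it is a generic sketch, not the authors' specific evaluation tooling.

```python
def wer(ref, hyp):
    """Word error rate: minimum edits (substitutions, insertions,
    deletions) to turn the hypothesis into the reference, divided by
    the number of reference words."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming edit-distance table
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

# One substitution in four reference words -> WER of 0.25
score = wer("the committee meets today", "the committee met today")
```

Running each candidate ASR system over the same transcribed excerpt and picking the lowest WER is the usual way to choose the model for transcribing the full recording archive.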
arXiv Detail & Related papers (2023-04-17T10:41:59Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.