Related papers: ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

URL: http://arxiv.org/abs/2511.01619v1
Date: Mon, 03 Nov 2025 14:27:42 GMT
Title: ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian
Authors: Nikola Ljubešić, Peter Rupnik, Ivan Porupski, Taja Kuzman Pungeršek,
Abstract summary: ParlaSpeech is a collection of spoken parliamentary corpora spanning four Slavic languages - Croatian, Czech, Polish and Serbian.<n>The corpora were built in an automatic fashion from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each corresponding parliament.
Score: 0.5666456827479577
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages - Croatian, Czech, Polish and Serbian - all together 6 thousand hours in size. The corpora were built in an automatic fashion from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each corresponding parliament. In this release of the dataset, each of the corpora is significantly enriched with various automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions. Similar to that, their spoken modality has been automatically enriched with occurrences of filled pauses, the most frequent disfluency in typical speech. Two out of the four languages have been additionally enriched with detailed word- and grapheme-level alignments, and the automatic annotation of the position of primary stress in multisyllabic words. With these enrichments, the usefulness of the underlying corpora has been drastically increased for downstream research across multiple disciplines, which we showcase through an analysis of acoustic correlates of sentiment. All the corpora are made available for download in JSONL and TextGrid formats, as well as for search through a concordancer.

Related papers

MENASpeechBank: A Reference Voice Bank with Persona-Conditioned Multi-Turn Conversations for AudioLLMs [13.58291341556655]
We introduce MENASpeechBank, a reference speech bank comprising about 18K high-quality utterances from 124 speakers spanning multiple MENA countries.<n>We build a controllable synthetic data pipeline that: (i) constructs persona profiles enriched with World Values Survey-inspired attributes, (ii) defines a taxonomy of about 5K conversational scenarios, (iii) matches personas to scenarios via semantic similarity, and (iv) generates about 417K role-play conversations.
arXiv Detail & Related papers (2026-02-03T10:22:27Z)
Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings [0.45498315114762483]
We present a corpus of cross-lingual dialogues facilitated by automatic simultaneous speech translation.<n>The corpus consists of 5 hours of speech recordings with ASR and gold transcripts in 12 original languages and automatic and corrected translations into English.<n>For an overview of this task and its complexity, we attempt to quantify misunderstandings in cross-lingual meetings.
arXiv Detail & Related papers (2025-12-23T09:56:23Z)
PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs [58.2469845374385]
We introduce Progressive Alignment Representation Training (PART)<n>PART is a multi-stage and multi-task framework that separates within-language from cross-language alignment.<n>Experiments on CommonVoice 15, Fleurs, Wenetspeech, and CoVoST2 show that PART surpasses conventional approaches.
arXiv Detail & Related papers (2025-09-24T03:54:14Z)
The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings [0.0]
We present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages. We focus on three Slavic languages, namely Croatian, Polish, and Serbian. The results of this pilot run are three high-quality datasets that span more than 5,000 hours of speech and accompanying text transcripts.
arXiv Detail & Related papers (2024-09-23T10:12:18Z)
Wav2Gloss: Generating Interlinear Glossed Text from Speech [78.64412090339044]
We propose Wav2Gloss, a task in which four linguistic annotation components are extracted automatically from speech. We provide various baselines to lay the groundwork for future research on Interlinear Glossed Text generation from speech.
arXiv Detail & Related papers (2024-03-19T21:45:29Z)
CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation [4.450536872346658]
This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian. The collection of these corpora comprises a total of 13 billion tokens of texts from 26 million documents. All the corpora were linguistically annotated with the state-of-the-art CLASSLA-Stanza linguistic processing pipeline.
arXiv Detail & Related papers (2024-03-19T13:30:47Z)
The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language [7.0944623704102625]
We show that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages. We propose CLAP-IPA, a multi-lingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences.
arXiv Detail & Related papers (2023-11-14T17:09:07Z)
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages. We developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations [38.058120432870126]
SpeechMatrix is a large-scale multilingual corpus of speech-to-speech translations. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech.
arXiv Detail & Related papers (2022-11-08T19:09:27Z)
XTREME-S: Evaluating Cross-lingual Speech Representations [88.78720838743772]
XTREME-S is a new benchmark to evaluate universal cross-lingual speech representations in many languages. This paper describes the new benchmark and establishes the first speech-only and speech-text baselines.
arXiv Detail & Related papers (2022-03-21T06:50:21Z)
Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously. We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.