Related papers: SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data

URL: http://arxiv.org/abs/2509.19270v1
Date: Tue, 23 Sep 2025 17:33:57 GMT
Title: SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data
Authors: Erik Božík, Marek Šuppa
Abstract summary: SloPalSpeech is a large-scale Slovak ASR dataset containing 2,806 hours of speech from parliamentary proceedings. We use this dataset to fine-tune several OpenAI Whisper models. To foster future research in low-resource speech recognition, we publicly release the complete SloPalSpeech dataset.
Score: 0.00954904463032233
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Automatic Speech Recognition (ASR) for low-resource languages like Slovak is hindered by the scarcity of training data. To address this, we introduce SloPalSpeech, a new, large-scale Slovak ASR dataset containing 2,806 hours of speech from parliamentary proceedings. We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medium, large-v3, and large-v3-turbo), achieving significant Word Error Rate (WER) reductions on standard Slovak benchmarks like Common Voice and FLEURS. For instance, the fine-tuned Whisper-small model's WER dropped by up to 70%, approaching the baseline performance of the much larger Whisper-large-v3 model. To foster future research in low-resource speech recognition, we publicly release the complete SloPalSpeech dataset, the fully segmented transcripts (60 million words), and all our fine-tuned models.
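
For readers unfamiliar with the metric, the sketch below shows how WER is conventionally computed (word-level edit distance divided by the number of reference words) and how a relative reduction such as the one quoted above could be derived. It is an illustration only, not the authors' evaluation code: the function names and the example WER values are hypothetical, and reading "dropped by up to 70%" as a relative reduction is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction by which the WER fell relative to the baseline."""
    return (baseline_wer - finetuned_wer) / baseline_wer


# Hypothetical numbers chosen for illustration, not taken from the paper.
baseline, finetuned = 0.40, 0.12
print(f"relative WER reduction: {relative_reduction(baseline, finetuned):.0%}")
```

Published evaluations typically normalize text (casing, punctuation) before scoring; that step is omitted here.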

Related papers

Efficient Interleaved Speech Modeling through Knowledge Distillation [5.389972857470079]
Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations.
arXiv Detail & Related papers (2025-06-30T09:47:37Z)
Long-Form Speech Generation with Spoken Language Models [64.29591880693468]
Textless spoken language models struggle to generate plausible speech past tens of seconds. We derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency.
arXiv Detail & Related papers (2024-12-24T18:56:46Z)
GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement [36.29371629234269]
GigaSpeech 2 is a large-scale, multi-domain, multilingual speech recognition corpus. It comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese.
arXiv Detail & Related papers (2024-06-17T13:44:20Z)
Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness. We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets. Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models. It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
Textless Speech-to-Speech Translation With Limited Parallel Data [51.3588490789084]
PFB is a framework for training textless S2ST models that require just dozens of hours of parallel speech data. We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z)
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation [55.1650189699753]
Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date. Current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech. We present AV-TranSpeech, the first audio-visual speech-to-speech model without relying on intermediate text.
arXiv Detail & Related papers (2023-05-24T17:59:03Z)
Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech [0.9653976364051563]
We present our progress in pretraining Czech monolingual audio transformers from a large dataset containing more than 80 thousand hours of unlabeled speech. We are presenting a large palette of experiments with various fine-tuning setups evaluated on two public datasets.
arXiv Detail & Related papers (2022-06-15T16:14:37Z)
Large-Scale Self- and Semi-Supervised Learning for Speech Translation [48.06478781295623]
We explore both pretraining and self-training by using the large Libri-Light speech audio corpus and language modeling with CommonCrawl. Our experiments improve over the previous state of the art by 2.6 BLEU on average on all four considered CoVoST 2 language pairs.
arXiv Detail & Related papers (2021-04-14T07:44:52Z)
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network [45.59907668722702]
We present SpeechStew, a speech recognition model that is trained on a combination of publicly available speech recognition datasets. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% on WSJ. We also demonstrate that SpeechStew learns powerful transfer learning representations.
arXiv Detail & Related papers (2021-04-05T20:13:36Z)
Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations. Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.