EuroSpeech: A Multilingual Speech Corpus
- URL: http://arxiv.org/abs/2510.00514v2
- Date: Sun, 26 Oct 2025 18:50:17 GMT
- Title: EuroSpeech: A Multilingual Speech Corpus
- Authors: Samuel Pfisterer, Florian Grötschla, Luca A. Lanzendörfer, Florian Yan, Roger Wattenhofer
- Abstract summary: We introduce a scalable pipeline for constructing speech datasets from parliamentary recordings. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments. We obtain an average 41.8% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for most languages. Thus, trained models perform poorly on the majority of the supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.
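The two-stage alignment idea described in the abstract can be sketched in a simplified form: a coarse stage finds stretches of ASR output that match the (possibly non-verbatim) transcript, and a refinement stage converts those matches into timed segments. This is an illustrative approximation, not the authors' actual algorithm; `coarse_align`, `refine_segments`, and the toy timestamps are hypothetical names and data. (For scale on the reported result: a 41.8% relative WER reduction means, e.g., a baseline WER of 20.0% falling to roughly 11.6%.)

```python
import difflib

def coarse_align(asr_words, transcript_words, min_block=5):
    """Stage 1 (sketch): find long runs of words shared between the ASR
    hypothesis and the reference transcript; short matches are discarded
    because non-verbatim transcripts produce many spurious small overlaps."""
    matcher = difflib.SequenceMatcher(a=asr_words, b=transcript_words, autojunk=False)
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks()
            if m.size >= min_block]

def refine_segments(anchors, word_times):
    """Stage 2 (sketch): turn anchor word indices into (start, end) time
    spans using per-word timestamps from the ASR system."""
    segments = []
    for a_start, _b_start, size in anchors:
        start_t = word_times[a_start][0]
        end_t = word_times[a_start + size - 1][1]
        segments.append((start_t, end_t))
    return segments

# Toy example: the transcript omits the greeting and closing remarks.
asr = "good morning the session of the parliament is now open thank you".split()
ref = "the session of the parliament is now open".split()
times = [(round(i * 0.5, 1), round(i * 0.5 + 0.4, 1)) for i in range(len(asr))]

anchors = coarse_align(asr, ref)        # → [(2, 0, 8)]
segments = refine_segments(anchors, times)  # → [(1.0, 4.9)]
```

A real pipeline would additionally handle long-form audio by windowing, re-running alignment recursively on unmatched gaps, and filtering segments by quality, but the anchor-then-refine structure is the core of the two-stage idea.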
Related papers
- Granary: Speech Recognition and Translation Dataset in 25 European Languages [37.561934855489504]
Granary is a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation.
arXiv Detail & Related papers (2025-05-19T17:40:58Z)
- The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings [0.0]
We present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages.
We focus on three Slavic languages, namely Croatian, Polish, and Serbian.
The results of this pilot run are three high-quality datasets that span more than 5,000 hours of speech and accompanying text transcripts.
arXiv Detail & Related papers (2024-09-23T10:12:18Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Textless Speech-to-Speech Translation With Limited Parallel Data [51.3588490789084]
PFB is a framework for training textless S2ST models that require just dozens of hours of parallel speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations [38.058120432870126]
SpeechMatrix is a large-scale multilingual corpus of speech-to-speech translations.
It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech.
arXiv Detail & Related papers (2022-11-08T19:09:27Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining our pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide between high- and low-resource languages.
Current cross-lingual algorithms have shown success on text-based and speech-related tasks for some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- CoVoST 2 and Massively Multilingual Speech-to-Text Translation [24.904548615918355]
CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages.
This represents the largest open dataset available to date in terms of total volume and language coverage.
arXiv Detail & Related papers (2020-07-20T17:53:35Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
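The XLSR blurb above describes pretraining via a contrastive task over masked latent speech representations. A heavily simplified sketch of that objective is shown below; it is illustrative only and omits key parts of the real wav2vec 2.0 loss (quantized targets, a diversity term, learned projections). The function name and toy data are hypothetical.

```python
import numpy as np

def contrastive_loss(context, targets, masked_idx, num_distractors=5,
                     temperature=0.1, rng=None):
    """Sketch of a wav2vec 2.0-style objective: for each masked timestep,
    the context vector must identify the true latent among distractors
    drawn from other timesteps, scored by cosine similarity."""
    rng = rng or np.random.default_rng(0)

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    T = len(targets)
    losses = []
    for t in masked_idx:
        # Sample distractor latents from other timesteps of the same utterance.
        others = [i for i in range(T) if i != t]
        distractors = rng.choice(others, size=num_distractors, replace=False)
        candidates = [targets[t]] + [targets[i] for i in distractors]
        logits = np.array([cos(context[t], c) for c in candidates]) / temperature
        # Cross-entropy with the true latent at index 0.
        log_probs = logits - np.log(np.sum(np.exp(logits)))
        losses.append(-log_probs[0])
    return float(np.mean(losses))
```

A context network that reproduces the masked latents gets a low loss, while an uninformative one scores near chance over the candidate set, which is the signal that drives pretraining without any transcripts.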
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.