ASR Bundestag: A Large-Scale Political Debate Dataset in German
- URL: http://arxiv.org/abs/2302.06008v1
- Date: Sun, 12 Feb 2023 21:45:18 GMT
- Title: ASR Bundestag: A Large-Scale Political Debate Dataset in German
- Authors: Johannes Wirth, René Peinl
- Abstract summary: We present ASR Bundestag, a dataset for automatic speech recognition in German.
The dataset consists of 610 hours of aligned audio-transcript pairs for supervised training as well as 1,038 hours of unlabeled audio snippets for self-supervised learning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present ASR Bundestag, a dataset for automatic speech recognition in
German, consisting of 610 hours of aligned audio-transcript pairs for
supervised training as well as 1,038 hours of unlabeled audio snippets for
self-supervised learning, based on raw audio data and transcriptions from
plenary sessions and committee meetings of the German parliament. In addition,
we discuss the approaches used for the automated creation of speech datasets
and assess the quality of the resulting dataset through evaluations and
fine-tuning of a pre-trained, state-of-the-art model. We make the dataset
publicly available, including all subsets.
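As a concrete illustration of the evaluation the abstract describes, the following is a minimal sketch (not the authors' code) that transcribes a single audio snippet with a pre-trained German ASR checkpoint and scores it against its aligned transcript via word error rate (WER). The checkpoint name, file name, and reference text are illustrative assumptions.

```python
# Minimal sketch: score a pre-trained German ASR checkpoint on one
# aligned audio-transcript pair via word error rate (WER).
# The checkpoint, file name, and reference text are illustrative
# assumptions, not artifacts released with the paper.
import torch
import torchaudio
from jiwer import wer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

CHECKPOINT = "facebook/wav2vec2-large-xlsr-53-german"  # assumed public model
processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT)
model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT).eval()

# Load one audio snippet and resample to the 16 kHz the model expects.
waveform, sample_rate = torchaudio.load("snippet.wav")  # hypothetical file
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# Transcribe: CTC logits -> greedy decoding -> text.
inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16_000,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
hypothesis = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

reference = "hypothetical reference transcript from the aligned pair"
print(f"WER: {wer(reference.lower(), hypothesis.lower()):.3f}")
```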
Related papers
- Towards measuring fairness in speech recognition: Fair-Speech dataset [14.703638352216132]
This paper introduces a novel dataset, Fair-Speech, a publicly released corpus to help researchers evaluate their ASR models for accuracy across a diverse set of self-reported demographic information.
Our dataset includes approximately 26.5K utterances recorded by 593 people in the United States, who were paid to record and submit audio of themselves saying voice commands.
arXiv Detail & Related papers (2024-08-22T20:55:17Z)
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
- Political corpus creation through automatic speech recognition on EU debates [4.670305538969914]
We present a transcribed corpus of the LIBE committee of the EU parliament, totalling 3.6 million running words.
The meetings of the EU's parliamentary committees are a potentially valuable source of information for political scientists, but the data is not readily available because it is only disclosed as speech recordings with limited metadata.
We investigated the most appropriate Automatic Speech Recognition (ASR) model to create an accurate text transcription of the audio recordings of the meetings in order to make their content available for research and analysis.
arXiv Detail & Related papers (2023-04-17T10:41:59Z)
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
arXiv Detail & Related papers (2023-03-02T07:47:18Z)
- Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech [0.9653976364051563]
We present our progress in pretraining Czech monolingual audio transformers from a large dataset containing more than 80 thousand hours of unlabeled speech.
We present a large palette of experiments with various fine-tuning setups, evaluated on two public datasets.
arXiv Detail & Related papers (2022-06-15T16:14:37Z)
- Finnish Parliament ASR corpus - Analysis, benchmarks and statistics [11.94655679070282]
The Finnish Parliament ASR corpus is the largest publicly available collection of manually transcribed speech data for Finnish, with over 3,000 hours of speech and 449 speakers.
The corpus builds on earlier work and, as a result, has a natural split into two training subsets from two time periods.
We develop a complete Kaldi-based data-preparation pipeline, along with hidden Markov model (HMM), hybrid HMM-deep neural network (HMM-DNN), and attention-based encoder-decoder (AED) ASR recipes.
arXiv Detail & Related papers (2022-03-28T16:29:49Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM builds on the HuBERT framework, with an emphasis on both spoken-content modeling and speaker-identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
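Several of the papers above (WavLM, UniSpeech) and ASR Bundestag itself rely on unlabeled audio for self-supervised learning, which is consumed as raw snippets rather than aligned pairs. Below is a hedged sketch, not taken from the paper, of how a long session recording could be segmented into fixed-length snippets; the file names and the 30-second snippet length are illustrative assumptions.

```python
# Hypothetical sketch: segment a long recording into fixed-length,
# unlabeled snippets suitable for self-supervised pre-training.
# The file names and the 30-second length are assumptions, not
# values taken from the ASR Bundestag paper.
from pathlib import Path

import torchaudio

SNIPPET_SECONDS = 30
out_dir = Path("unlabeled_snippets")
out_dir.mkdir(exist_ok=True)

# Load the full session recording (hypothetical file name).
waveform, sample_rate = torchaudio.load("plenary_session.wav")
snippet_len = SNIPPET_SECONDS * sample_rate

# Write non-overlapping, fixed-length chunks; the trailing partial
# chunk is dropped for simplicity.
for i in range(0, waveform.shape[1] - snippet_len + 1, snippet_len):
    chunk = waveform[:, i : i + snippet_len]
    torchaudio.save(str(out_dir / f"snippet_{i // snippet_len:05d}.wav"),
                    chunk, sample_rate)
```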
This list is automatically generated from the titles and abstracts of the papers on this site.