Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use
- URL: http://arxiv.org/abs/2505.21578v1
- Date: Tue, 27 May 2025 08:40:28 GMT
- Title: Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use
- Authors: Titouan Parcollet, Yuan Tseng, Shucong Zhang, Rogier van Dalen,
- Abstract summary: This work presents the Loquacious Set, a 25,000-hour curated collection of commercially usable English speech. The Loquacious Set is designed so that both academics and industry researchers can build ASR systems for real-world scenarios.
- Score: 15.302106458232878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition (ASR) research is driven by the availability of common datasets shared between industrial researchers and academics, encouraging comparisons and evaluations. LibriSpeech, despite its long success as an ASR benchmark, is now limited by its size and its focus on clean, read speech, leading to near-zero word error rates. More recent datasets, including MOSEL, YODAS, GigaSpeech, OWSM, Libriheavy, and People's Speech, suffer from major limitations: licenses that industry researchers cannot use, unreliable transcriptions, incorrect audio data, or the lack of evaluation sets. This work presents the Loquacious Set, a 25,000-hour curated collection of commercially usable English speech. Featuring hundreds of thousands of speakers with diverse accents and a wide range of speech types (read, spontaneous, talks, clean, noisy), the Loquacious Set is designed so that both academics and industry researchers can build ASR systems for real-world scenarios.
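The word error rate mentioned above is the standard ASR evaluation metric: the word-level Levenshtein distance between a reference transcript and a hypothesis, divided by the reference length. A minimal sketch (not tied to any particular toolkit; production systems typically use a tested library):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table for Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i  # prev holds d[i-1][j-1]
        for j in range(1, len(hyp) + 1):
            cur = d[j]  # d[i-1][j], needed as prev for the next column
            d[j] = min(d[j] + 1,                             # deletion
                       d[j - 1] + 1,                         # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))    # substitution
            prev = cur
    return d[len(hyp)] / len(ref)
```

On a clean benchmark like LibriSpeech test-clean, modern systems push this ratio close to zero, which is precisely why larger and noisier corpora are needed.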
Related papers
- BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition [0.5224038339798622]
We present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset, containing almost 4 hours of English speech from 98 actors with varying regional and non-native accents. We provide initial benchmarks for ASR and SER tasks, and find that ASR degrades with both increased distance and shout level, and shows varied performance depending on the intended emotion.
arXiv Detail & Related papers (2025-04-30T14:08:14Z)
- Long-Form Speech Generation with Spoken Language Models [64.29591880693468]
Textless spoken language models struggle to generate plausible speech past tens of seconds. We derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency.
arXiv Detail & Related papers (2024-12-24T18:56:46Z)
- GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement [36.29371629234269]
GigaSpeech 2 is a large-scale, multi-domain, multilingual speech recognition corpus. It comprises about 30,000 hours of automatically transcribed speech in Thai, Indonesian, and Vietnamese.
arXiv Detail & Related papers (2024-06-17T13:44:20Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora. We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. This innovative model surpasses the performance of previous unsupervised ASR models in the lexicon-free setting.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
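The de-duplication step mentioned above collapses consecutive repeats of the same discrete unit, since self-supervised quantizers often emit the same token for many adjacent frames. A minimal sketch (the unit values are hypothetical examples, not from any specific model):

```python
from itertools import groupby

def deduplicate(units: list[int]) -> list[int]:
    """Collapse runs of consecutive identical discrete units into one token."""
    return [unit for unit, _run in groupby(units)]

# e.g. a frame-level sequence [4, 4, 4, 7, 7, 2, 4] shrinks to [4, 7, 2, 4];
# subword modeling (e.g. BPE over the unit vocabulary) can then compress further.
```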
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Earnings-22: A Practical Benchmark for Accents in the Wild [0.8039067099377079]
We present Earnings-22, a 125-file, 119-hour corpus of English-language earnings calls gathered from global companies.
By examining Individual Word Error Rate (IWER), we find that key speech features impact model performance more for certain accents than others.
arXiv Detail & Related papers (2022-03-29T14:02:57Z)
- The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage [1.5213617014998604]
We show that a model trained on this dataset achieves a 9.98% word error rate on LibriSpeech's test-clean set.
We discuss the legal and ethical issues surrounding the creation of sizable machine learning corpora.
arXiv Detail & Related papers (2021-11-17T19:14:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.