The People's Speech: A Large-Scale Diverse English Speech Recognition
Dataset for Commercial Usage
- URL: http://arxiv.org/abs/2111.09344v1
- Date: Wed, 17 Nov 2021 19:14:40 GMT
- Title: The People's Speech: A Large-Scale Diverse English Speech Recognition
Dataset for Commercial Usage
- Authors: Daniel Galvez, Greg Diamos, Juan Ciro, Juan Felipe Cerón, Keith
Achorn, Anjali Gopi, David Kanter, Maximilian Lam, Mark Mazumder, Vijay
Janapa Reddi
- Abstract summary: We show that a model trained on this dataset achieves a 9.98% word error rate on Librispeech's test-clean test set.
We discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpus.
- Score: 1.5213617014998604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The People's Speech is a free-to-download 30,000-hour and growing supervised
conversational English speech recognition dataset licensed for academic and
commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected
via searching the Internet for appropriately licensed audio data with existing
transcriptions. We describe our data collection methodology and release our
data collection system under the Apache 2.0 license. We show that a model
trained on this dataset achieves a 9.98% word error rate on Librispeech's
test-clean test set. Finally, we discuss the legal and ethical issues
surrounding the creation of a sizable machine learning corpus and plans for
continued maintenance of the project under MLCommons's sponsorship.
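Since the headline result above is a word error rate (WER), a short illustrative sketch of how WER is computed may be useful: WER is the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The Python function and example strings below are a minimal sketch for illustration only, not code from the paper's released data collection system.

# Minimal sketch of word error rate (WER), the metric reported in the abstract
# (9.98% on Librispeech test-clean). WER = word-level edit distance between the
# reference and the hypothesis, normalized by the reference length.
# Function name and example strings are illustrative, not from the paper's code.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    ref = "the people's speech is a large scale dataset"
    hyp = "the peoples speech is large scale dataset"
    print(f"WER: {word_error_rate(ref, hyp):.2%}")   # 2 errors / 8 words = 25.00%

In this toy example, one substitution and one deletion against an eight-word reference give a WER of 25%; the paper reports 9.98% on Librispeech test-clean for a model trained on The People's Speech.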
Related papers
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
- YODAS: Youtube-Oriented Dataset for Audio and Speech [47.60574092241447]
YODAS is a large-scale, multilingual dataset comprising over 500k hours of speech data in more than 100 languages.
The labeled subsets, including manual or automatic subtitles, facilitate supervised model training.
YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license.
arXiv Detail & Related papers (2024-06-02T23:43:27Z)
- Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages [20.25236081418051]
Zambezi Voice is an open-source multilingual speech resource for Zambian languages.
To our knowledge, this is the first multilingual speech dataset created for Zambian languages.
arXiv Detail & Related papers (2023-06-07T13:36:37Z)
- Textless Low-Resource Speech-to-Speech Translation With Unit Language Models [56.1058530241461]
We present a new framework for training textless low-resource speech-to-speech translation (S2ST) systems.
We start by pretraining a model on large-scale monolingual speech data, then finetune it as a unit-to-unit seq2seq S2ST translation task.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It spans the philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus [11.113497373432411]
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain.
This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel.
arXiv Detail & Related papers (2021-06-24T13:20:40Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under the CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)