The People's Speech: A Large-Scale Diverse English Speech Recognition
Dataset for Commercial Usage
- URL: http://arxiv.org/abs/2111.09344v1
- Date: Wed, 17 Nov 2021 19:14:40 GMT
- Title: The People's Speech: A Large-Scale Diverse English Speech Recognition
Dataset for Commercial Usage
- Authors: Daniel Galvez, Greg Diamos, Juan Ciro, Juan Felipe Cerón, Keith
Achorn, Anjali Gopi, David Kanter, Maximilian Lam, Mark Mazumder, Vijay
Janapa Reddi
- Abstract summary: We show that a model trained on this dataset achieves a 9.98% word error rate on Librispeech's test-clean test set.
We discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpus.
- Score: 1.5213617014998604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The People's Speech is a free-to-download 30,000-hour and growing supervised
conversational English speech recognition dataset licensed for academic and
commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected
via searching the Internet for appropriately licensed audio data with existing
transcriptions. We describe our data collection methodology and release our
data collection system under the Apache 2.0 license. We show that a model
trained on this dataset achieves a 9.98% word error rate on Librispeech's
test-clean test set. Finally, we discuss the legal and ethical issues
surrounding the creation of a sizable machine learning corpus and plans for
continued maintenance of the project under MLCommons's sponsorship.
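Since the headline result above is a word error rate (WER), a short illustrative sketch of how WER is computed may be useful: WER is the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The Python function and example strings below are a minimal sketch for illustration only, not code from the paper's released data collection system.

# Minimal sketch of word error rate (WER), the metric reported in the abstract
# (9.98% on Librispeech test-clean). WER = word-level edit distance between the
# reference and the hypothesis, normalized by the reference length.
# Function name and example strings are illustrative, not from the paper's code.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    ref = "the people's speech is a large scale dataset"
    hyp = "the peoples speech is large scale dataset"
    print(f"WER: {word_error_rate(ref, hyp):.2%}")   # 2 errors / 8 words = 25.00%

In this toy example, one substitution and one deletion against an eight-word reference give a WER of 25%; the paper reports 9.98% on Librispeech test-clean for a model trained on The People's Speech.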
Related papers
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
- YODAS: Youtube-Oriented Dataset for Audio and Speech [47.60574092241447]
YODAS is a large-scale, multilingual dataset comprising over 500k hours of speech data in more than 100 languages.
The labeled subsets, including manual or automatic subtitles, facilitate supervised model training.
YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license.
arXiv Detail & Related papers (2024-06-02T23:43:27Z)
- Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages [20.25236081418051]
Zambezi Voice is an open-source multilingual speech resource for Zambian languages.
To our knowledge, this is the first multilingual speech dataset created for Zambian languages.
arXiv Detail & Related papers (2023-06-07T13:36:37Z)
- Textless Low-Resource Speech-to-Speech Translation With Unit Language Models [56.1058530241461]
We present a new framework for training textless low-resource speech-to-speech translation (S2ST) systems.
We start by pretraining a model on large-scale monolingual speech data, then finetune it as a unit-to-unit seq2seq S2ST translation task.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It spans the philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus [11.113497373432411]
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain.
This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel.
arXiv Detail & Related papers (2021-06-24T13:20:40Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under the CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)