GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of
Transcribed Audio
- URL: http://arxiv.org/abs/2106.06909v1
- Date: Sun, 13 Jun 2021 04:09:16 GMT
- Title: GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of
Transcribed Audio
- Authors: Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang,
Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev
Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen
Yao, Yongqing Wang, Yujun Wang, Zhao You, Zhiyong Yan
- Abstract summary: GigaSpeech is a multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training.
Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles.
For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h.
- Score: 88.20960848885575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces GigaSpeech, an evolving, multi-domain English speech
recognition corpus with 10,000 hours of high quality labeled audio suitable for
supervised training, and 40,000 hours of total audio suitable for
semi-supervised and unsupervised training. Around 40,000 hours of transcribed
audio is first collected from audiobooks, podcasts and YouTube, covering both
read and spontaneous speaking styles, and a variety of topics, such as arts,
science, sports, etc. A new forced alignment and segmentation pipeline is
proposed to create sentence segments suitable for speech recognition training,
and to filter out segments with low-quality transcription. For system training,
GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h,
and 10000h. For our 10,000-hour XL training subset, we cap the word error rate
at 4% during the filtering/validation stage, and for all our other smaller
training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the
other hand, are re-processed by professional human transcribers to ensure high
transcription quality. Baseline systems are provided for popular speech
recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.
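
To make the filtering/validation criterion above concrete, here is a minimal, hypothetical sketch of capping segments by word error rate; it is not the paper's actual pipeline, and the segment structure, field names, example data, and use of the `jiwer` package are assumptions for illustration.

```python
# Hypothetical sketch of WER-capped segment filtering (not the paper's
# actual pipeline). A segment passes if the WER between its transcript
# and a decoding hypothesis stays within the cap: 4% for the XL subset,
# 0% (exact match) for the smaller training subsets.
from jiwer import wer

WER_CAP_XL = 0.04     # cap for the 10,000-hour XL training subset
WER_CAP_SMALL = 0.00  # cap for the smaller training subsets

def filter_segments(segments, cap):
    """Keep segments whose reference/hypothesis WER is within the cap."""
    return [s for s in segments if wer(s["reference"], s["hypothesis"]) <= cap]

# Invented example segments (field names are assumptions).
segments = [
    {"reference": "the quick brown fox", "hypothesis": "the quick brown fox"},
    {"reference": "jumps over the lazy dog", "hypothesis": "jumps over a lazy dog"},
]
print(len(filter_segments(segments, WER_CAP_XL)))     # 1 (second segment: WER 0.2)
print(len(filter_segments(segments, WER_CAP_SMALL)))  # 1 (exact matches only)
```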
Related papers
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, reduces speech sequences to lengths comparable to those of text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named method-500k which includes nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
- GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement [36.29371629234269]
GigaSpeech 2 is a large-scale, multi-domain, multilingual speech recognition corpus.
It comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese.
arXiv Detail & Related papers (2024-06-17T13:44:20Z)
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition [25.31180901037065]
WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech.
The data is collected from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions.
arXiv Detail & Related papers (2021-10-07T12:05:29Z)
- Injecting Text in Self-Supervised Speech Pretraining [33.676479965610774]
We propose to jointly learn representations during pretraining from two different modalities: speech and text.
tts4pretrain complements the power of contrastive learning in self-supervision.
We demonstrate relative Word Error Rate (WER) reductions of 10% on the well-benchmarked LibriSpeech task.
arXiv Detail & Related papers (2021-08-27T11:36:40Z)
- QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus [11.113497373432411]
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain.
This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel.
arXiv Detail & Related papers (2021-06-24T13:20:40Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
- SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition [38.96077127913159]
In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters.
Here we propose a new STT task: end-to-end neural transcription with fully formatted text for target labels.
We present baseline Conformer-based models trained on a corpus of 5,000 hours of professionally transcribed earnings calls, achieving a CER of 1.7; a toy CER computation is sketched below.
arXiv Detail & Related papers (2021-04-05T17:05:28Z)
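
As a toy illustration of the CER metric reported by SPGISpeech applied to fully formatted text (where casing, punctuation, and numeral formatting all count as characters), the sketch below uses the `jiwer` package with invented strings; it is not the paper's evaluation code.

```python
# Toy CER computation on fully formatted text. With formatted targets,
# casing, punctuation, and numeral formatting all contribute to the
# character-level edit distance. Example strings are invented.
from jiwer import cer

reference = "Revenue grew 12% year-over-year, to $4.2 billion."
hypothesis = "Revenue grew 12% year over year, to $4.2 billion."

# CER = character edit distance / number of characters in the reference.
print(f"CER: {cer(reference, hypothesis):.3f}")  # small but nonzero
```

Here only the hyphenation differs, so the error is small but nonzero, which is exactly the kind of formatting detail that a fully formatted transcription target penalizes and a conventional uncased, unpunctuated target would ignore.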