WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech
Recognition
- URL: http://arxiv.org/abs/2110.03370v1
- Date: Thu, 7 Oct 2021 12:05:29 GMT
- Title: WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech
Recognition
- Authors: Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie,
Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, Zhendong Peng
- Abstract summary: WenetSpeech is a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech.
We collect the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions.
- Score: 25.31180901037065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present WenetSpeech, a multi-domain Mandarin corpus
consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of
weakly labeled speech, and about 10000 hours of unlabeled speech, 22400+ hours
in total. We collect the data from YouTube and podcasts, covering a variety of
speaking styles, scenarios, domains, topics, and noisy conditions. An optical
character recognition (OCR) based method is introduced to generate audio/text
segmentation candidates for the YouTube data from the corresponding video
captions, while a high-quality ASR transcription system is used to generate
audio/text pair candidates for the podcast data. We then propose a novel
end-to-end label error detection approach to further validate and filter the
candidates. We also provide three manually labeled high-quality test sets
along with WenetSpeech for evaluation -- Dev, for cross-validation during
training; Test_Net, collected from the Internet as a matched test; and
Test_Meeting, recorded from real meetings as a more challenging mismatched
test. Baseline systems trained with WenetSpeech are provided for three popular
speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition
results on the three test sets are provided as benchmarks. To the best of our
knowledge, WenetSpeech is currently the largest open-source Mandarin speech
corpus with transcriptions, which benefits research on production-level speech
recognition.
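To make the filtering step above concrete, here is a minimal sketch of a candidate validation pass. It assumes each candidate pairs an audio segment with an OCR- or ASR-derived transcript plus a hypothesis from an independent ASR model, and keeps only pairs whose character error rate (CER) stays under a threshold. The Candidate fields, function names, and the 0.15 threshold are illustrative assumptions; the paper's actual end-to-end label error detection model is not reproduced here.

```python
# Illustrative sketch of candidate validation by transcript agreement.
# Assumption: each candidate carries an OCR/ASR transcript and a hypothesis
# from a separate validation ASR system (both hypothetical fields).
from dataclasses import dataclass

@dataclass
class Candidate:
    audio_path: str      # path to the audio segment
    transcript: str      # OCR caption text or podcast ASR transcript
    asr_hypothesis: str  # output of an independent validation ASR model

def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate; Mandarin text is compared character by character."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def filter_candidates(candidates, max_cer=0.15):
    """Keep pairs whose transcript agrees closely with the ASR hypothesis.

    The 0.15 threshold is a made-up placeholder, not the paper's value.
    """
    return [c for c in candidates if cer(c.transcript, c.asr_hypothesis) <= max_cer]

if __name__ == "__main__":
    demo = [
        Candidate("seg_001.wav", "今天天气很好", "今天天气很好"),
        Candidate("seg_002.wav", "欢迎收听本期节目", "欢迎收看本期节目"),
        Candidate("seg_003.wav", "完全不相关的字幕", "模型听到的其他内容"),
    ]
    kept = filter_candidates(demo)
    print([c.audio_path for c in kept])  # seg_001 and seg_002 pass; seg_003 is dropped
```

In the paper itself the validation signal comes from a dedicated end-to-end label error detection approach rather than a plain CER cutoff; the sketch only shows where such a filter sits between candidate generation and the released corpus.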
Related papers
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- TALCS: An Open-Source Mandarin-English Code-Switching Corpus and a Speech Recognition Baseline [0.0]
This paper introduces TALCS, a new corpus for Mandarin-English code-switching speech recognition.
The TALCS corpus is derived from real online one-to-one English teaching scenes in the TAL education group.
To the best of our knowledge, the TALCS corpus is the largest well-labeled open-source Mandarin-English code-switching automatic speech recognition dataset in the world.
arXiv Detail & Related papers (2022-06-27T09:30:25Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data from the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus [11.113497373432411]
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain.
This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel.
arXiv Detail & Related papers (2021-06-24T13:20:40Z)
- GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio [88.20960848885575]
GigaSpeech is a multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training.
Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts, and YouTube, covering both read and spontaneous speaking styles.
For system training, GigaSpeech provides five subsets of different sizes: 10h, 250h, 1000h, 2500h, and 10000h.
arXiv Detail & Related papers (2021-06-13T04:09:16Z)
- What shall we do with an hour of data? Speech recognition for the un- and under-served languages of Common Voice [0.20774268785384567]
This report describes the methods and results of a three-week sprint to produce deployable speech recognition models for 31 under-served languages of the Common Voice project.
arXiv Detail & Related papers (2021-05-10T21:16:28Z)
- Generative Spoken Language Modeling from Raw Audio [42.153136032037175]
Generative spoken language modeling involves jointly learning the acoustic and linguistic characteristics of a language from raw audio only, without text or labels.
We introduce metrics to automatically evaluate the generated output in terms of acoustic and linguistic quality in two associated end-to-end tasks.
We test baseline systems consisting of a discrete speech encoder (returning discrete, low-bitrate pseudo-text units), a generative language model (trained on pseudo-text units), and a speech decoder; a toy sketch of this pipeline follows this entry.
arXiv Detail & Related papers (2021-02-01T21:41:40Z)
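The three-stage pipeline in the last entry (discrete speech encoder, unit language model, speech decoder) can be sketched in miniature. In the toy below, random feature frames stand in for a real self-supervised encoder's output, k-means quantization produces the pseudo-text units, and a bigram model stands in for the unit language model; the decoder stage is omitted. Everything here is an illustrative assumption, not code from the paper.

```python
# Toy sketch of the discrete-unit ("pseudo-text") idea behind generative
# spoken language modeling. Synthetic features stand in for a real
# self-supervised speech encoder; a bigram model stands in for the unit LM.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# 1) "Encoder": frames of acoustic features (here random, 20-dim).
frames = rng.normal(size=(2000, 20))

# 2) Quantize frames into K discrete pseudo-text units.
K = 50
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(frames)
units = kmeans.predict(frames)  # one unit id per frame

# 3) Fit a bigram "language model" over the unit sequence (add-one smoothing).
counts = np.ones((K, K))
for a, b in zip(units[:-1], units[1:]):
    counts[a, b] += 1
bigram = counts / counts.sum(axis=1, keepdims=True)

# 4) Sample a new unit sequence from the bigram model; a real system would
#    hand this sequence to a unit-to-waveform decoder.
seq = [int(units[0])]
for _ in range(30):
    seq.append(int(rng.choice(K, p=bigram[seq[-1]])))
print(seq)
```

Real systems replace the random frames with features from self-supervised encoders and the bigram with a neural language model; the sampled unit sequence would then drive a unit-to-waveform decoder.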