Automatic Speech Recognition Datasets in Cantonese Language: A Survey
and a New Dataset
- URL: http://arxiv.org/abs/2201.02419v1
- Date: Fri, 7 Jan 2022 12:09:15 GMT
- Title: Automatic Speech Recognition Datasets in Cantonese Language: A Survey
and a New Dataset
- Authors: Tiezheng Yu, Rita Frieske, Peng Xu, Samuel Cahyawijaya, Cheuk Tung
Shadow Yiu, Holy Lovenia, Wenliang Dai, Elham J. Barezi, Qifeng Chen,
Xiaojuan Ma, Bertram E. Shi, Pascale Fung
- Abstract summary: Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It combines philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
- Score: 85.52036362232688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech recognition (ASR) on low-resource languages improves
linguistic minorities' access to the technological advantages provided by
Artificial Intelligence (AI). In this paper, we address the problem of data
scarcity in Hong Kong Cantonese by creating a new Cantonese dataset. Our dataset,
Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read
speech paired with transcripts, collected from Cantonese audiobooks from Hong
Kong. It combines philosophy, politics, education, culture, lifestyle and
family domains, covering a wide range of topics. We also review all existing
Cantonese datasets and perform experiments on the two biggest datasets (MDCC
and Common Voice zh-HK). We analyze the existing datasets according to their
speech type, data source, total size and availability. The results of
experiments conducted with Fairseq S2T Transformer, a state-of-the-art ASR
model, show the effectiveness of our dataset. In addition, we create a powerful
and robust Cantonese ASR model by applying multi-dataset learning on MDCC and
Common Voice zh-HK.
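The multi-dataset learning described above pools MDCC and Common Voice zh-HK for training. A minimal sketch of that pooling step, assuming Fairseq-S2T-style TSV manifests; the column names and toy rows below are illustrative, not the exact MDCC schema:

```python
import csv
import io

def merge_manifests(manifest_files, out):
    """Pool several Fairseq-S2T-style TSV manifests (e.g. MDCC and
    Common Voice zh-HK) into a single training manifest.
    Assumes all manifests share the same header row."""
    writer = None
    for f in manifest_files:
        rows = csv.DictReader(f, delimiter="\t")
        if writer is None:
            # Take the column layout from the first manifest.
            writer = csv.DictWriter(out, fieldnames=rows.fieldnames, delimiter="\t")
            writer.writeheader()
        for row in rows:
            writer.writerow(row)

# Two toy manifests sharing the header: id, audio, n_frames, tgt_text
mdcc = io.StringIO("id\taudio\tn_frames\ttgt_text\nm1\tm1.wav\t120\t你好\n")
cv_hk = io.StringIO("id\taudio\tn_frames\ttgt_text\nc1\tc1.wav\t90\t早晨\n")
merged = io.StringIO()
merge_manifests([mdcc, cv_hk], merged)
```

In practice, Fairseq would then be pointed at the merged manifest as a single training split; more sophisticated schemes (temperature sampling, per-dataset weighting) are common but not described in the abstract.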
Related papers
- A Large-scale Dataset for Audio-Language Representation Learning [54.933479346870506]
We present an innovative and automatic audio caption generation pipeline based on a series of public tools or APIs.
We construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.9M audio-text pairs.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- An Automatic Speech Recognition System for Bengali Language based on Wav2Vec2 and Transfer Learning [0.0]
This paper aims to improve the speech recognition performance of the Bengali language by adopting speech recognition technology on the E2E structure based on the transfer learning framework.
The proposed method effectively models the Bengali language and achieves a Levenshtein Mean Distance of 3.819 on the 7,747-sample test set, using only 1,000 training samples.
arXiv Detail & Related papers (2022-09-16T18:20:16Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition systems for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- Bengali Common Voice Speech Dataset for Automatic Speech Recognition [0.9218853132156671]
Bengali is one of the most spoken languages in the world with over 300 million speakers globally.
Despite its popularity, research into the development of Bengali speech recognition systems is hindered due to the lack of diverse open-source datasets.
We present insights obtained from the dataset and discuss key linguistic challenges that need to be addressed in future versions.
arXiv Detail & Related papers (2022-06-28T14:52:08Z)
- Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech [0.9653976364051563]
We present our progress in pretraining Czech monolingual audio transformers from a large dataset containing more than 80 thousand hours of unlabeled speech.
We present a broad set of experiments with various fine-tuning setups, evaluated on two public datasets.
arXiv Detail & Related papers (2022-06-15T16:14:37Z)
- CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR).
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z)
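Both the "Levenshtein Mean Distance" reported for the Bengali system above and the character error rate (CER) commonly used to evaluate Cantonese ASR reduce to the same edit-distance computation. A minimal pure-Python sketch; normalizing CER by total reference length follows the usual convention, which the abstracts themselves do not spell out:

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (characters here), computed
    with the classic dynamic-programming recurrence, row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def mean_levenshtein(refs, hyps):
    """Unnormalized edit distance averaged over utterances
    (the 'Levenshtein Mean Distance' style of metric)."""
    return sum(levenshtein(r, h) for r, h in zip(refs, hyps)) / len(refs)

def cer(refs, hyps):
    """Character error rate: total edits over total reference characters."""
    edits = sum(levenshtein(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)
```

CER rather than word error rate is the natural choice for Cantonese, since written Cantonese has no whitespace word boundaries; the mean-distance variant differs only in skipping the length normalization.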
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.