Automatic Speech Recognition Datasets in Cantonese Language: A Survey
and a New Dataset
- URL: http://arxiv.org/abs/2201.02419v1
- Date: Fri, 7 Jan 2022 12:09:15 GMT
- Authors: Tiezheng Yu, Rita Frieske, Peng Xu, Samuel Cahyawijaya, Cheuk Tung
Shadow Yiu, Holy Lovenia, Wenliang Dai, Elham J. Barezi, Qifeng Chen,
Xiaojuan Ma, Bertram E. Shi, Pascale Fung
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech recognition (ASR) for low-resource languages improves
linguistic minorities' access to the technological benefits of Artificial
Intelligence (AI). In this paper, we address the data scarcity of Hong Kong
Cantonese by creating a new Cantonese dataset. Our dataset, the Multi-Domain
Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with
transcripts, collected from Cantonese audiobooks from Hong Kong. It spans the
philosophy, politics, education, culture, lifestyle, and family domains,
covering a wide range of topics. We also review all existing Cantonese datasets
and perform experiments on the two largest (MDCC and Common Voice zh-HK),
analyzing the existing datasets by speech type, data source, total size, and
availability. The results of experiments conducted with the Fairseq S2T
Transformer, a state-of-the-art ASR model, show the effectiveness of our
dataset. In addition, we create a powerful and robust Cantonese ASR model by
applying multi-dataset learning on MDCC and Common Voice zh-HK.
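The multi-dataset learning setup described above, pooling MDCC with Common Voice zh-HK before training, can be sketched as follows. This is a minimal illustration only: the manifest format, file paths, and transcripts are hypothetical placeholders, not the paper's actual pipeline or the Fairseq data format.

```python
import random

def merge_manifests(manifests, seed=0):
    """Pool utterances from several ASR corpora into one training list.

    Each manifest is a list of (audio_path, transcript) pairs; tagging every
    utterance with its source corpus preserves provenance for later analysis.
    """
    pooled = []
    for corpus_name, utterances in manifests.items():
        for audio_path, transcript in utterances:
            pooled.append({"source": corpus_name,
                           "audio": audio_path,
                           "text": transcript})
    # Shuffle deterministically so training batches mix both corpora
    # instead of seeing one corpus first and the other last.
    random.Random(seed).shuffle(pooled)
    return pooled

# Hypothetical manifests standing in for MDCC and Common Voice zh-HK.
mdcc = [("mdcc/0001.wav", "text a"), ("mdcc/0002.wav", "text b")]
cv_zh_hk = [("cv/0001.mp3", "text c")]
train = merge_manifests({"mdcc": mdcc, "cv_zh_hk": cv_zh_hk})
print(len(train))  # 3
```

The combined list would then be fed to a single training run; keeping the `source` tag also makes it easy to report per-corpus error rates afterward.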
Related papers
- The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings [0.0]
We present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages.
We focus on three Slavic languages, namely Croatian, Polish, and Serbian.
The results of this pilot run are three high-quality datasets that span more than 5,000 hours of speech and accompanying text transcripts.
arXiv Detail & Related papers (2024-09-23T10:12:18Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST), which translates speech from one language into another.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining it with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- Bengali Common Voice Speech Dataset for Automatic Speech Recognition [0.9218853132156671]
Bengali is one of the most widely spoken languages in the world, with over 300 million speakers globally.
Despite its popularity, research into Bengali speech recognition systems is hindered by the lack of diverse open-source datasets.
We present insights obtained from the dataset and discuss key linguistic challenges that need to be addressed in future versions.
arXiv Detail & Related papers (2022-06-28T14:52:08Z)
- Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech [0.9653976364051563]
We present our progress in pretraining Czech monolingual audio transformers from a large dataset containing more than 80 thousand hours of unlabeled speech.
We present a large palette of experiments with various fine-tuning setups evaluated on two public datasets.
arXiv Detail & Related papers (2022-06-15T16:14:37Z)
- Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z)
- CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR).
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.