Bengali Common Voice Speech Dataset for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2206.14053v2
- Date: Wed, 29 Jun 2022 15:34:23 GMT
- Title: Bengali Common Voice Speech Dataset for Automatic Speech Recognition
- Authors: Samiul Alam, Asif Sushmit, Zaowad Abdullah, Shahrin Nakkhatra, Md.
Nazmuddoha Ansary, Syed Mobassir Hossen, Sazia Morshed Mehnaz, Tahsin Reasat,
Ahmed Imtiaz Humayun
- Abstract summary: Bengali is one of the most spoken languages in the world with over 300 million speakers globally.
Despite its popularity, research into the development of Bengali speech recognition systems is hindered by the lack of diverse open-source datasets.
We present insights obtained from the dataset and discuss key linguistic challenges that need to be addressed in future versions.
- Score: 0.9218853132156671
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Bengali is one of the most spoken languages in the world with over 300
million speakers globally. Despite its popularity, research into the
development of Bengali speech recognition systems is hindered by the lack
of diverse open-source datasets. As a way forward, we have crowdsourced the
Bengali Common Voice Speech Dataset, which is a sentence-level automatic speech
recognition corpus. Collected on the Mozilla Common Voice platform, the dataset
is part of an ongoing campaign that has led to the collection of over 400 hours
of data in 2 months and is growing rapidly. Our analysis shows that this
dataset has greater speaker, phoneme, and environmental diversity than the
OpenSLR Bengali ASR dataset, the largest existing open-source Bengali speech
dataset. We present insights obtained from the dataset and discuss key
linguistic challenges that need to be addressed in future versions.
Additionally, we report the current performance of a few Automatic Speech
Recognition (ASR) algorithms and set a benchmark for future research.
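As a rough illustration of how such a corpus is typically consumed, here is a minimal sketch of loading the Bengali split of Common Voice with the Hugging Face `datasets` library. The dataset id and version below are assumptions, not confirmed by the paper (the snapshot described here may live under a different release), and gated Common Voice releases require accepting terms on the Hub.

```python
# Minimal sketch: loading the Bengali subset of Common Voice via the
# Hugging Face `datasets` library. The dataset id/version are assumptions;
# gated Common Voice releases also require accepting terms on the Hub.
from datasets import load_dataset

bn_cv = load_dataset(
    "mozilla-foundation/common_voice_11_0",  # assumed release name
    "bn",                                    # Bengali language config
    split="train",
)

sample = bn_cv[0]
print(sample["sentence"])                # sentence-level transcript
print(sample["audio"]["sampling_rate"])  # decoded audio metadata
```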
Related papers
- BIG-C: a Multimodal Multi-Purpose Dataset for Bemba [30.058814706934147]
The dataset comprises multi-turn dialogues between Bemba speakers based on images, transcribed and translated into English.
There are more than 92,000 utterances/sentences, amounting to more than 180 hours of audio data with corresponding transcriptions and English translations.
arXiv Detail & Related papers (2023-05-26T18:49:55Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredients are a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
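For context, a hedged sketch of running ASR inference with one of the MMS checkpoints described above, via Hugging Face `transformers`. The checkpoint name `facebook/mms-1b-all` and the ISO 639-3 code `ben` (Bengali) are assumptions drawn from the public MMS release, not from this abstract.

```python
# Hedged sketch: ASR inference with an MMS wav2vec 2.0 checkpoint via
# `transformers`. Checkpoint name and language code are assumptions.
import numpy as np
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

processor = AutoProcessor.from_pretrained("facebook/mms-1b-all")
model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")

# Switch the tokenizer and adapter weights to Bengali ("ben" in ISO 639-3).
processor.tokenizer.set_target_lang("ben")
model.load_adapter("ben")

# Placeholder input: one second of silence; substitute a real 16 kHz recording.
audio = np.zeros(16_000, dtype=np.float32)

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = torch.argmax(logits, dim=-1)[0]
print(processor.decode(pred_ids))
```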
- OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking [1.277758355297812]
OOD-Speech is the first out-of-distribution benchmarking dataset for Bengali automatic speech recognition (ASR).
Our training dataset was collected via massive online crowdsourcing campaigns, resulting in 1177.94 hours curated from 22,645 native Bengali speakers from South Asia.
arXiv Detail & Related papers (2023-05-15T18:00:39Z)
- An Automatic Speech Recognition System for Bengali Language based on Wav2Vec2 and Transfer Learning [0.0]
This paper aims to improve Bengali speech recognition performance by adopting an end-to-end (E2E) architecture within a transfer learning framework.
The proposed method effectively models the Bengali language and achieves a Levenshtein Mean Distance of 3.819 on a test set of 7747 samples, using only 1000 training samples.
arXiv Detail & Related papers (2022-09-16T18:20:16Z)
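The Levenshtein Mean Distance reported above is a mean character-level edit distance over the test set; a minimal sketch of that computation follows. The paper does not specify its exact normalization, so plain averaging over utterances is an assumption.

```python
# Minimal sketch of a mean Levenshtein distance over a test set. Plain
# averaging of character-level edit distances is an assumption; the paper's
# exact normalization is not specified.
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]

def mean_levenshtein(refs: list[str], hyps: list[str]) -> float:
    """Average edit distance between paired references and hypotheses."""
    return sum(levenshtein(r, h) for r, h in zip(refs, hyps)) / len(refs)

print(mean_levenshtein(["hello world"], ["helo world"]))  # 1.0
```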
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition systems for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions for building speech synthesis systems with minimal technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It spans the philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
- Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
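The Allosaurus recognizer used above is distributed as an open-source toolkit, so its zero-shot entry point can be sketched as follows. The wav path is a placeholder, and the paper's fine-tuning step uses the toolkit's own adaptation workflow, which is not shown here.

```python
# Hedged sketch of phone recognition with the open-source Allosaurus toolkit
# (pip install allosaurus). The wav path is a placeholder; fine-tuning as in
# the paper is done through the toolkit's separate adaptation workflow.
from allosaurus.app import read_recognizer

model = read_recognizer()                # fetches the default universal model
phones = model.recognize("sample.wav")   # path to a 16 kHz mono recording
print(phones)                            # space-separated IPA phone sequence
```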
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising training and test sets, in these languages.
We also provide a baseline recipe for both tasks, with WERs of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
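The WER figures above are standard word-level edit-distance rates; a minimal sketch using the open-source `jiwer` package follows, with illustrative strings rather than data from any of these papers.

```python
# Minimal WER computation with the open-source `jiwer` package
# (pip install jiwer). The reference/hypothesis strings are illustrative.
import jiwer

reference = "speech recognition for indian languages"
hypothesis = "speech recognitions for indian language"

# WER = (substitutions + deletions + insertions) / reference word count
print(jiwer.wer(reference, hypothesis))  # 0.4 here: 2 errors over 5 words
```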