OOD-Speech: A Large Bengali Speech Recognition Dataset for
Out-of-Distribution Benchmarking
- URL: http://arxiv.org/abs/2305.09688v1
- Date: Mon, 15 May 2023 18:00:39 GMT
- Title: OOD-Speech: A Large Bengali Speech Recognition Dataset for
Out-of-Distribution Benchmarking
- Authors: Fazle Rabbi Rakib, Souhardya Saha Dip, Samiul Alam, Nazia Tasnim, Md.
Istiak Hossain Shihab, Md. Nazmuddoha Ansary, Syed Mobassir Hossen, Marsia
Haque Meghla, Mamunur Mamun, Farig Sadeque, Sayma Sultana Chowdhury, Tahsin
Reasat, Asif Sushmit, Ahmed Imtiaz Humayun
- Abstract summary: OOD-Speech is the first out-of-distribution benchmarking dataset for Bengali automatic speech recognition (ASR).
Our training dataset is collected via massively online crowdsourcing campaigns which resulted in 1177.94 hours collected and curated from 22,645 native Bengali speakers from South Asia.
- Score: 1.277758355297812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present OOD-Speech, the first out-of-distribution (OOD) benchmarking
dataset for Bengali automatic speech recognition (ASR). Being one of the most
spoken languages globally, Bengali portrays large diversity in dialects and
prosodic features, which demands ASR frameworks to be robust towards
distribution shifts. For example, Islamic religious sermons in Bengali are
delivered with a tonality that is significantly different from regular speech.
Our training dataset is collected via massively online crowdsourcing campaigns
which resulted in 1177.94 hours collected and curated from 22,645 native
Bengali speakers from South Asia. Our test dataset comprises 23.03 hours of
speech collected and manually annotated from 17 different sources, e.g.,
Bengali TV drama, Audiobook, Talk show, Online class, and Islamic sermons to
name a few. OOD-Speech is jointly the largest publicly available speech
dataset, as well as the first out-of-distribution ASR benchmarking dataset for
Bengali.
Related papers
- LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems [16.143694951047024]
We create a benchmark, LAHAJA, which contains read and extempore speech on a diverse set of topics and use cases.
We evaluate existing open-source and commercial models on LAHAJA and find their performance to be poor.
We train models using different datasets and find that our model trained on multilingual data with good speaker diversity outperforms existing models by a significant margin.
arXiv Detail & Related papers (2024-08-21T08:51:00Z)
- Predicting positive transfer for improved low-resource speech
recognition using acoustic pseudo-tokens [31.83988006684616]
We show that supplementing the target language with data from a similar, higher-resource 'donor' language can help.
For example, continued pre-training on only 10 hours of low-resource Punjabi supplemented with 60 hours of donor Hindi is almost as good as continued pretraining on 70 hours of Punjabi.
arXiv Detail & Related papers (2024-02-03T23:54:03Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text
Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- An Automatic Speech Recognition System for Bengali Language based on
Wav2Vec2 and Transfer Learning [0.0]
This paper aims to improve the speech recognition performance of the Bengali language by adopting speech recognition technology on the E2E structure based on the transfer learning framework.
The proposed method effectively models the Bengali language and achieves a score of 3.819 in 'Levenshtein Mean Distance' on the test dataset of 7747 samples, when only 1000 samples of the train dataset were used to train.
arXiv Detail & Related papers (2022-09-16T18:20:16Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Bengali Common Voice Speech Dataset for Automatic Speech Recognition [0.9218853132156671]
Bengali is one of the most spoken languages in the world with over 300 million speakers globally.
Despite its popularity, research into the development of Bengali speech recognition systems is hindered due to the lack of diverse open-source datasets.
We present insights obtained from the dataset and discuss key linguistic challenges that need to be addressed in future versions.
arXiv Detail & Related papers (2022-06-28T14:52:08Z)
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey
and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It combines philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
- QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic
Speech Corpus [11.113497373432411]
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain.
This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from Aljazeera news channel.
arXiv Detail & Related papers (2021-06-24T13:20:40Z)
- Multilingual and code-switching ASR challenges for low resource Indian
languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.