An Automatic Speech Recognition System for Bengali Language based on
Wav2Vec2 and Transfer Learning
- URL: http://arxiv.org/abs/2209.08119v2
- Date: Tue, 20 Sep 2022 02:22:56 GMT
- Title: An Automatic Speech Recognition System for Bengali Language based on
Wav2Vec2 and Transfer Learning
- Authors: Tushar Talukder Showrav
- Abstract summary: This paper aims to improve Bengali speech recognition performance by applying an end-to-end (E2E) speech recognition architecture within a transfer learning framework.
The proposed method effectively models the Bengali language and achieves a score of 3.819 in 'Levenshtein Mean Distance' on the test dataset of 7747 samples, when only 1000 samples of the training dataset were used for training.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech recognition (ASR) is an independent, automated method of
decoding and transcribing spoken language. A typical ASR system extracts
features from audio recordings or streams and runs one or more algorithms to map
the features to the corresponding text. A great deal of research has been done in the
field of speech signal processing in recent years. When given adequate
resources, both conventional ASR and emerging end-to-end (E2E) speech
recognition have produced promising results. However, for low-resource
languages like Bengali, the current state of ASR lags behind, even though the
language is spoken by over 500 million people all over the world. Despite its
popularity, few diverse open-source datasets are available, which makes it
difficult to conduct research on Bengali speech recognition systems. This paper
is part of the competition named `BUET CSE Fest DL Sprint'. The purpose of this paper is
to improve Bengali speech recognition performance by applying an E2E speech
recognition architecture within a transfer learning framework. The proposed
method effectively models the Bengali language and achieves a score of 3.819 in
`Levenshtein Mean Distance' on the test dataset of 7747 samples, when only 1000
samples of the training dataset were used for training.
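The abstract reports its result as a mean Levenshtein distance but does not spell out how the metric is computed. The sketch below shows one plausible reading, assuming a character-level edit distance (insertions, deletions, and substitutions each cost 1) averaged over all test samples. The function names `levenshtein` and `mean_levenshtein` are illustrative and do not come from the paper or the competition.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted over Unicode code points."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def mean_levenshtein(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Average edit distance over paired reference/hypothesis transcripts."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("at least one sample is required")
    total = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total / len(references)


if __name__ == "__main__":
    refs = ["kitten", "abc"]
    hyps = ["sitting", "abc"]
    print(mean_levenshtein(refs, hyps))
```

Lower values are better, and a score of 0 means every transcript matched exactly. Python strings are sequences of code points, so a Bengali consonant followed by a vowel sign counts as two units here. A grapheme-level or word-level variant would give different numbers.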
Related papers
- Automatic Speech Recognition for Hindi [0.6292138336765964]
The research involved developing a web application and designing a web interface for speech recognition.
The web application manages large volumes of audio files and their transcriptions, facilitating human correction of ASR transcripts.
The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine.
arXiv Detail & Related papers (2024-06-26T07:39:20Z)
- GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement [36.29371629234269]
GigaSpeech 2 is a large-scale, multi-domain, multilingual speech recognition corpus.
It comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese.
arXiv Detail & Related papers (2024-06-17T13:44:20Z)
- Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach [0.6445605125467574]
This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks.
The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments.
We propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training.
arXiv Detail & Related papers (2024-06-03T15:38:40Z)
- Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens [31.83988006684616]
We show that supplementing the target language with data from a similar, higher-resource 'donor' language can help.
For example, continued pre-training on only 10 hours of low-resource Punjabi supplemented with 60 hours of donor Hindi is almost as good as continued pretraining on 70 hours of Punjabi.
arXiv Detail & Related papers (2024-02-03T23:54:03Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Bengali Common Voice Speech Dataset for Automatic Speech Recognition [0.9218853132156671]
Bengali is one of the most spoken languages in the world with over 300 million speakers globally.
Despite its popularity, research into the development of Bengali speech recognition systems is hindered due to the lack of diverse open-source datasets.
We present insights obtained from the dataset and discuss key linguistic challenges that need to be addressed in future versions.
arXiv Detail & Related papers (2022-06-28T14:52:08Z)
- CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR).
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge this digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- Acoustics Based Intent Recognition Using Discovered Phonetic Units for Low Resource Languages [51.0542215642794]
We propose a novel acoustics based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic languages and Romance languages, on two different intent recognition tasks.
arXiv Detail & Related papers (2020-11-07T00:35:31Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- Meta-Transfer Learning for Code-Switched Speech Recognition [72.84247387728999]
We propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting.
Our model learns to recognize individual languages, and transfer them so as to better recognize mixed-language speech by conditioning the optimization on the code-switching data.
arXiv Detail & Related papers (2020-04-29T14:27:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.