BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects
- URL: http://arxiv.org/abs/2510.06188v1
- Date: Tue, 07 Oct 2025 17:47:39 GMT
- Title: BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects
- Authors: Jakir Hasan, Shubhashis Roy Dipta
- Abstract summary: We present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows the client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. It can operate at a low bandwidth of 24 kbps while maintaining an average end-to-end delay of 4.9 seconds.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-time speech assistants are becoming increasingly popular for ensuring improved accessibility to information. Bengali, being a low-resource language with a high regional dialectal diversity, has seen limited progress in developing such systems. Existing systems are not optimized for real-time use and focus only on standard Bengali. In this work, we present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows the client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. To address dialectal variation, we introduce a dialect-aware ASR system, BRDialect, developed by fine-tuning the IndicWav2Vec model in ten Bengali regional dialects. It outperforms the baseline ASR models by 12.41-33.98% on the RegSpeech12 dataset. Furthermore, BanglaTalk can operate at a low bandwidth of 24 kbps while maintaining an average end-to-end delay of 4.9 seconds. Low bandwidth usage and minimal end-to-end delay make the system both cost-effective and interactive for real-time use cases, enabling inclusive and accessible speech technology for the diverse community of Bengali speakers.
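The paper's transport code is not included here; as a rough illustration of the RTP framing the abstract refers to, the following Python sketch packs a minimal RFC 3550 header around a hypothetical 60-byte payload (roughly one 20 ms Opus frame at the quoted 24 kbps). The payload type (111) and SSRC are arbitrary assumptions, not values from the paper.

```python
import struct

def make_rtp_packet(payload: bytes, seq: int, timestamp: int,
                    ssrc: int = 0x12345678, payload_type: int = 111) -> bytes:
    """Build a minimal RTP packet (RFC 3550): 12-byte header + payload.

    Version=2; no padding, extension, CSRC list, or marker bit.
    """
    header = struct.pack(
        "!BBHII",
        0x80,                    # V=2, P=0, X=0, CC=0
        payload_type & 0x7F,     # M=0, PT (111 is a common dynamic type for Opus)
        seq & 0xFFFF,            # sequence number, wraps at 2**16
        timestamp & 0xFFFFFFFF,  # media timestamp in sample-clock units
        ssrc,                    # synchronization source identifier
    )
    return header + payload

# One hypothetical 20 ms frame: 60 bytes of payload, 12-byte header.
pkt = make_rtp_packet(b"\x00" * 60, seq=1, timestamp=960)
```

The client and server would exchange such packets over UDP; the sequence number and timestamp let the receiver reorder packets and schedule playback, which is what keeps end-to-end delay low.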
Related papers
- An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization [0.0]
This paper presents a multistage approach developed for the "DL Sprint 4.0 - Bengali Long-Form Speech Recognition" and "DL Sprint 4.0 - Bengali Speaker Diarization" competitions on Kaggle. We fine-tuned Whisper Medium on Bengali data for transcription and integrated pyannote/speaker-diarization-community-1 with a custom-trained segmentation model. Results show that targeted tuning and strategic data utilization can significantly improve AI for South Asian languages.
arXiv Detail & Related papers (2026-03-03T17:00:42Z)
- A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment [0.0]
This paper presents a robust framework specifically engineered for extended Bangla content. The approach combines Voice Activity Detection (VAD) optimization with Connectionist Temporal Classification (CTC) segmentation. By bridging the performance gap in complex, multi-speaker environments, this work provides a scalable solution for real-world, long-form Bangla speech applications.
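The summary does not specify which VAD the authors optimize; as a toy illustration of what voice activity detection does, the following Python sketch flags frames whose RMS energy exceeds a decibel threshold. The sample rate, frame length, and threshold are invented defaults, not the paper's settings.

```python
import numpy as np

def energy_vad(samples: np.ndarray, sr: int = 16000,
               frame_ms: int = 30, threshold_db: float = -35.0) -> list:
    """Mark each fixed-length frame as speech if its RMS energy (in dB)
    exceeds a threshold; real VADs are far more robust to noise."""
    frame_len = sr * frame_ms // 1000
    n_frames = len(samples) // frame_len
    flags = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12  # avoid log(0)
        flags.append(bool(20 * np.log10(rms) > threshold_db))
    return flags

# One second of silence followed by one second of a 440 Hz tone:
# only the tone half should be flagged as speech.
sr = 16000
t = np.arange(sr) / sr
audio = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 440 * t)])
flags = energy_vad(audio, sr)
```

Segmenting long recordings on such non-speech gaps, then aligning transcripts with CTC, is the general pattern the summary describes.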
arXiv Detail & Related papers (2026-02-26T12:26:04Z)
- DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation [111.94720088481614]
Can multimodal generative models effectively produce content given dialectal textual input? We construct a new large-scale benchmark spanning six common English dialects. We design a general encoder-based mitigation strategy for multimodal generative models.
arXiv Detail & Related papers (2025-10-16T17:56:55Z)
- A2TTS: TTS for Low Resource Indian Languages [16.782842482372427]
We present a speaker-conditioned text-to-speech (TTS) system aimed at generating speech for unseen speakers. Using a diffusion-based TTS architecture, a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multispeaker generation. We employ a cross-attention based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker-consistent timing.
arXiv Detail & Related papers (2025-07-21T06:20:27Z)
- BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization [7.059964549363294]
The study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech.
Bangla is the fifth most spoken language, with around 55 distinct dialects used by 160 million people; addressing these dialects is crucial for developing inclusive communication tools.
Our experiments demonstrated that fine-tuning the Whisper ASR model achieved a CER of 0.8% and WER of 1.5%, while the BanglaT5 model attained a BLEU score of 41.6% for dialect-to-standard text translation.
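The CER and WER figures quoted above are Levenshtein edit distances normalized by reference length, computed over characters and words respectively. A minimal Python sketch of both metrics (toy English strings for illustration):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via two-row DP."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,               # deletion
                         cur[j - 1] + 1,            # insertion
                         prev[j - 1] + (r != h))    # substitution (0 if equal)
        prev = cur
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edits / reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

So a CER of 0.8% means fewer than one character edit per hundred reference characters, which is why it sits well below the 1.5% WER.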
arXiv Detail & Related papers (2024-11-16T20:20:15Z)
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
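BLEU, the metric behind the "3.5 BLEU points" claim above, is the brevity-penalized geometric mean of modified n-gram precisions. A minimal unsmoothed sentence-level sketch (published results typically use corpus-level BLEU, e.g. via sacrebleu):

```python
import math
from collections import Counter

def sentence_bleu(ref: str, hyp: str, max_n: int = 4) -> float:
    """Unsmoothed BLEU: geometric mean of modified 1..max_n-gram
    precisions, times a brevity penalty for short hypotheses."""
    ref, hyp = ref.split(), hyp.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Modified precision: clip each hypothesis n-gram count by its
        # count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0          # without smoothing, any zero precision kills BLEU
        log_prec += math.log(overlap / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(log_prec)
```

A gain of 3.5 points on this 0-100-style scale (here expressed in [0, 1]) is a substantial margin for speech translation.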
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking [1.277758355297812]
OOD-Speech is the first out-of-distribution benchmarking dataset for Bengali automatic speech recognition (ASR).
Our training dataset was collected via massive online crowdsourcing campaigns, yielding 1177.94 hours of speech curated from 22,645 native Bengali speakers across South Asia.
arXiv Detail & Related papers (2023-05-15T18:00:39Z)
- Bangla-Wave: Improving Bangla Automatic Speech Recognition Utilizing N-gram Language Models [0.0]
We show how to significantly improve the performance of an ASR model by adding an n-gram language model as a post-processor.
The resulting Bangla ASR model outperforms existing ASR models.
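The exact rescoring setup is not given in this summary; as a toy sketch of n-gram post-processing, the following Python trains an add-one-smoothed bigram LM on a tiny corpus and uses it to rerank an ASR n-best list. The corpus and hypothesis strings (romanized Bangla) and the interpolation weight are invented for illustration.

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Return a log-probability function for an add-one-smoothed bigram LM."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])          # history counts
        bigrams.update(zip(toks, toks[1:]))
    V = len(vocab)

    def logprob(sent):
        toks = ["<s>"] + sent.split() + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
                   for a, b in zip(toks, toks[1:]))
    return logprob

def rescore(hypotheses, lm_logprob, lm_weight=0.5):
    """Pick the (text, acoustic_score) pair maximizing the combined score."""
    return max(hypotheses, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))

corpus = ["ami bhalo achi", "tumi bhalo acho"]          # invented examples
lm = train_bigram_lm(corpus)
# Hypothetical n-best list: the acoustically preferred hypothesis contains
# a misspelling the LM has never seen, so rescoring overrides it.
nbest = [("ami bhalo achi", -5.0), ("ami balo achi", -4.8)]
best = rescore(nbest, lm, lm_weight=0.5)
```

In practice such post-processing is usually done with a KenLM-style model over a large text corpus during beam-search decoding, but the interpolation idea is the same.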
arXiv Detail & Related papers (2022-09-13T17:59:21Z)
- Bengali Common Voice Speech Dataset for Automatic Speech Recognition [0.9218853132156671]
Bengali is one of the most spoken languages in the world with over 300 million speakers globally.
Despite its popularity, research into the development of Bengali speech recognition systems is hindered due to the lack of diverse open-source datasets.
We present insights obtained from the dataset and discuss key linguistic challenges that need to be addressed in future versions.
arXiv Detail & Related papers (2022-06-28T14:52:08Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
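BPE-dropout stochastically skips merges during subword segmentation, so the model sees multiple decompositions of the same word at training time. A toy Python sketch of the mechanism (the merge table here is invented, not one learned from data):

```python
import random

def bpe_dropout_encode(word, merges, p_drop=0.1, rng=None):
    """Apply BPE merges in priority order, skipping each applicable
    merge with probability p_drop (Provilkov et al.'s BPE-dropout).
    With p_drop=0 this reduces to deterministic BPE."""
    rng = rng or random.Random(0)
    symbols = list(word)
    for pair in merges:                      # merges sorted by priority
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair and rng.random() >= p_drop:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1
    return symbols

merges = [("l", "o"), ("lo", "w")]           # invented toy merge table
full = bpe_dropout_encode("low", merges, p_drop=0.0)   # deterministic BPE
chars = bpe_dropout_encode("low", merges, p_drop=1.0)  # all merges dropped
```

Intermediate dropout rates yield a mixture of coarse and fine segmentations, which is the "dynamic acoustic unit augmentation" the summary refers to.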
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.