RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects
- URL: http://arxiv.org/abs/2510.24096v1
- Date: Tue, 28 Oct 2025 06:08:42 GMT
- Title: RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects
- Authors: Md. Rezuwan Hassan, Azmol Hossain, Kanij Fatema, Rubayet Sabbir Faruque, Tanmoy Shome, Ruwad Naswan, Trina Chakraborty, Md. Foriduzzaman Zihad, Tawsif Tashwar Dipto, Nazia Tasnim, Nazmuddoha Ansary, Md. Mehedi Hasan Shawon, Ahmed Imtiaz Humayun, Md. Golam Rabiul Alam, Farig Sadeque, Asif Sushmit
- Abstract summary: The Bengali language is spoken extensively across South Asia and among diasporic communities. Five principal dialect groups are identified: Eastern Bengali, Manbhumi, Rangpuri, Varendri, and Rarhi. Research on the computational processing of Bengali dialects remains limited.
- Score: 5.805745873296805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Bengali language, spoken extensively across South Asia and among diasporic communities, exhibits considerable dialectal diversity shaped by geography, culture, and history. Phonological and pronunciation-based classifications broadly identify five principal dialect groups: Eastern Bengali, Manbhumi, Rangpuri, Varendri, and Rarhi. Within Bangladesh, further distinctions emerge through variation in vocabulary, syntax, and morphology, as observed in regions such as Chittagong, Sylhet, Rangpur, Rajshahi, Noakhali, and Barishal. Despite this linguistic richness, systematic research on the computational processing of Bengali dialects remains limited. This study seeks to document and analyze the phonetic and morphological properties of these dialects while exploring the feasibility of building computational models, particularly Automatic Speech Recognition (ASR) systems, tailored to regional varieties. Such efforts hold potential for applications in virtual assistants and broader language technologies, contributing to both the preservation of dialectal diversity and the advancement of inclusive digital tools for Bengali-speaking communities. The dataset created for this study is released for public use.
Related papers
- Bridging Dialects: Translating Standard Bangla to Regional Variants Using Neural Models [1.472830326343432]
The work is motivated by the need to preserve linguistic diversity and improve communication among dialect speakers. The models were fine-tuned using the "Vashantor" dataset, containing 32,500 sentences across various dialects. BanglaT5 demonstrated superior performance with a CER of 12.3% and WER of 15.7%, highlighting its effectiveness in capturing dialectal nuances.
arXiv Detail & Related papers (2025-01-10T06:50:51Z) - Unification of Balti and trans-border sister dialects in the essence of LLMs and AI Technology [19.282867207168565]
Balti belongs to the Sino-Tibetan language family, specifically its Tibeto-Burman branch.
It is understood, with variations, across populations in India, China, Pakistan, Nepal, Tibet, Burma, and Bhutan.
Considering the diverse cultural, socio-political, religious, and geographical impacts, it is important to take steps toward unifying the dialects.
arXiv Detail & Related papers (2024-11-20T15:48:21Z) - BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization [7.059964549363294]
The study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech.
Bangla is the fifth most spoken language, with around 55 distinct dialects spoken by 160 million people, so addressing its dialects is crucial for developing inclusive communication tools.
Our experiments demonstrated that fine-tuning the Whisper ASR model achieved a CER of 0.8% and WER of 1.5%, while the BanglaT5 model attained a BLEU score of 41.6% for dialect-to-standard text translation.
arXiv Detail & Related papers (2024-11-16T20:20:15Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for advancing the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z) - Comparing Spoken Languages using Paninian System of Sounds and Finite State Machines [0.0]
We propose an Ecosystem Model for Linguistic Development with Sanskrit at the core. We represent words across languages as state transitions on the phonetic map and construct corresponding Morphological Finite Automata.
arXiv Detail & Related papers (2023-01-29T15:22:10Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Phoneme Recognition through Fine Tuning of Phonetic Representations: a
Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z) - Speaker Recognition in Bengali Language from Nonlinear Features [0.0]
The study of Bengali speech recognition and speaker identification is scarce in the literature.
In this work, we extract acoustic features of speech using nonlinear multifractal analysis.
The Multifractal Detrended Fluctuation Analysis reveals essentially the complexity associated with the speech signals taken.
arXiv Detail & Related papers (2020-04-15T22:38:54Z)