Bridging Dialects: Translating Standard Bangla to Regional Variants Using Neural Models
- URL: http://arxiv.org/abs/2501.05749v1
- Date: Fri, 10 Jan 2025 06:50:51 GMT
- Title: Bridging Dialects: Translating Standard Bangla to Regional Variants Using Neural Models
- Authors: Md. Arafat Alam Khandaker, Ziyan Shirin Raha, Bidyarthi Paul, Tashreef Muhammad,
- Abstract summary: The work is motivated by the need to preserve linguistic diversity and improve communication among dialect speakers.
The models were fine-tuned using the "Vashantor" dataset, containing 32,500 sentences across various dialects.
BanglaT5 demonstrated superior performance with a CER of 12.3% and WER of 15.7%, highlighting its effectiveness in capturing dialectal nuances.
- Score: 1.472830326343432
- License:
- Abstract: The Bangla language includes many regional dialects, adding to its cultural richness. The translation of Bangla Language into regional dialects presents a challenge due to significant variations in vocabulary, pronunciation, and sentence structure across regions like Chittagong, Sylhet, Barishal, Noakhali, and Mymensingh. These dialects, though vital to local identities, lack of representation in technological applications. This study addresses this gap by translating standard Bangla into these dialects using neural machine translation (NMT) models, including BanglaT5, mT5, and mBART50. The work is motivated by the need to preserve linguistic diversity and improve communication among dialect speakers. The models were fine-tuned using the "Vashantor" dataset, containing 32,500 sentences across various dialects, and evaluated through Character Error Rate (CER) and Word Error Rate (WER) metrics. BanglaT5 demonstrated superior performance with a CER of 12.3% and WER of 15.7%, highlighting its effectiveness in capturing dialectal nuances. The outcomes of this research contribute to the development of inclusive language technologies that support regional dialects and promote linguistic diversity.
Related papers
- BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization [7.059964549363294]
The study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech.
Being the fifth most spoken language with around 55 distinct dialects spoken by 160 million people, addressing Bangla dialects is crucial for developing inclusive communication tools.
Our experiments demonstrated that fine-tuning the Whisper ASR model achieved a CER of 0.8% and WER of 1.5%, while the BanglaT5 model attained a BLEU score of 41.6% for dialect-to-standard text translation.
arXiv Detail & Related papers (2024-11-16T20:20:15Z) - Exploring Diachronic and Diatopic Changes in Dialect Continua: Tasks, Datasets and Challenges [2.572144535177391]
We critically assess nine tasks and datasets across five dialects from three language families (Slavic, Romance, and Germanic)
We outline five open challenges regarding changes in dialect use over time, the reliability of dialect datasets, the importance of speaker characteristics, limited coverage of dialects, and ethical considerations in data collection.
We hope that our work sheds light on future research towards inclusive computational methods and datasets for language varieties and dialects.
arXiv Detail & Related papers (2024-07-04T15:38:38Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorub'a is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - Transcribing Bengali Text with Regional Dialects to IPA using District Guided Tokens [0.0]
This paper introduces the District Guided Tokens (DGT) technique on a new dataset spanning six districts of Bangladesh.
The DGT technique is applied to fine-tune several transformer-based models, on this new dataset.
Experimental results demonstrate the effectiveness of DGT, with the ByT5 model achieving superior performance over word-based models.
arXiv Detail & Related papers (2024-03-26T05:55:21Z) - What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z) - Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated
Translation of Bangla Regional Dialects to Bangla Language [0.0]
There has been a noticeable gap in translating Bangla regional dialects into standard Bangla.
Our aim is to translate these regional dialects into standard Bangla and detect regions accurately.
This is the first large-scale investigation of Bangla regional dialects to Bangla machine translation.
arXiv Detail & Related papers (2023-11-18T18:36:16Z) - Task-Agnostic Low-Rank Adapters for Unseen English Dialects [52.88554155235167]
Large Language Models (LLMs) are trained on corpora disproportionally weighted in favor of Standard American English.
By disentangling dialect-specific and cross-dialectal information, HyperLoRA improves generalization to unseen dialects in a task-agnostic fashion.
arXiv Detail & Related papers (2023-11-02T01:17:29Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Multi-VALUE: A Framework for Cross-Dialectal English NLP [49.55176102659081]
Multi- Dialect is a controllable rule-based translation system spanning 50 English dialects.
Stress tests reveal significant performance disparities for leading models on non-standard dialects.
We partner with native speakers of Chicano and Indian English to release new gold-standard variants of the popular CoQA task.
arXiv Detail & Related papers (2022-12-15T18:17:01Z) - Phoneme Recognition through Fine Tuning of Phonetic Representations: a
Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.