Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated
Translation of Bangla Regional Dialects to Bangla Language
- URL: http://arxiv.org/abs/2311.11142v1
- Date: Sat, 18 Nov 2023 18:36:16 GMT
- Title: Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated
Translation of Bangla Regional Dialects to Bangla Language
- Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Ahmed Al Wase, Mehidi
Ahmmed, Md. Rabius Sani, Tashreef Muhammad
- Abstract summary: There has been a noticeable gap in translating Bangla regional dialects into standard Bangla.
Our aim is to translate these regional dialects into standard Bangla and detect regions accurately.
This is the first large-scale investigation of Bangla regional dialects to Bangla machine translation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Bangla linguistic variety is a fascinating mix of regional dialects that
adds to the cultural diversity of the Bangla-speaking community. Despite
extensive study into translating Bangla to English, English to Bangla, and
Banglish to Bangla in the past, there has been a noticeable gap in translating
Bangla regional dialects into standard Bangla. In this study, we set out to
fill this gap by creating a collection of 32,500 sentences, encompassing
Bangla, Banglish, and English, representing five regional Bangla dialects. Our
aim is to translate these regional dialects into standard Bangla and detect
regions accurately. To achieve this, we fine-tuned the mT5 and BanglaT5
models to translate regional dialects into standard Bangla. Additionally,
we employed mBERT and Bangla-bert-base to identify the regions from which
these dialects originate. Our experimental results showed the highest
BLEU score of 69.06 for Mymensingh regional dialects and the lowest BLEU score
of 36.75 for Chittagong regional dialects. We also observed the lowest average
word error rate of 0.1548 for Mymensingh regional dialects and the highest of
0.3385 for Chittagong regional dialects. For region detection, we achieved an
accuracy of 85.86% for Bangla-bert-base and 84.36% for mBERT. This is the first
large-scale investigation of Bangla regional dialects to Bangla machine
translation. We believe our findings will not only pave the way for future work
on Bangla regional dialects to Bangla machine translation, but will also be
useful in solving similar language-related challenges in low-resource language
conditions.
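The word error rates reported above (0.1548 for Mymensingh, 0.3385 for Chittagong) follow the standard definition: word-level edit distance between hypothesis and reference, normalized by reference length. A minimal illustrative implementation in pure Python (a sketch of the metric only, not the authors' evaluation code; libraries such as jiwer or sacrebleu are typically used in practice):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words (classic dynamic-programming table,
    # kept one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (free when words match)
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, a hypothesis that substitutes one word in a four-word reference scores a WER of 0.25; an exact match scores 0.0.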
Related papers
- Bridging Dialects: Translating Standard Bangla to Regional Variants Using Neural Models [1.472830326343432]
The work is motivated by the need to preserve linguistic diversity and improve communication among dialect speakers.
The models were fine-tuned using the "Vashantor" dataset, containing 32,500 sentences across various dialects.
BanglaT5 demonstrated superior performance with a CER of 12.3% and WER of 15.7%, highlighting its effectiveness in capturing dialectal nuances.
arXiv Detail & Related papers (2025-01-10T06:50:51Z) - BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization [7.059964549363294]
The study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech.
Bangla is the fifth most spoken language, with around 55 distinct dialects and 160 million speakers, so addressing its dialects is crucial for developing inclusive communication tools.
Our experiments demonstrated that fine-tuning the Whisper ASR model achieved a CER of 0.8% and WER of 1.5%, while the BanglaT5 model attained a BLEU score of 41.6% for dialect-to-standard text translation.
arXiv Detail & Related papers (2024-11-16T20:20:15Z) - BongLLaMA: LLaMA for Bangla Language [0.0]
BongLLaMA is an open-source large language model fine-tuned exclusively on large Bangla corpora and instruction-tuning datasets.
We present our methodology, data augmentation techniques, fine-tuning details, and comprehensive benchmarking results showcasing the utility of BongLLaMA on BLP tasks.
arXiv Detail & Related papers (2024-10-28T16:44:02Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present the first multi-dialect Bavarian treebank (MaiBaam), manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD).
We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z) - What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z) - Task-Agnostic Low-Rank Adapters for Unseen English Dialects [52.88554155235167]
Large Language Models (LLMs) are trained on corpora disproportionately weighted in favor of Standard American English.
By disentangling dialect-specific and cross-dialectal information, HyperLoRA improves generalization to unseen dialects in a task-agnostic fashion.
arXiv Detail & Related papers (2023-11-02T01:17:29Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models
for Sentiment Analysis of Bangla Social Media Posts [0.46040036610482665]
This paper presents our submission to Task 2 (Sentiment Analysis of Bangla Social Media Posts) of the BLP Workshop.
Our quantitative results show that transfer learning substantially improves model performance in this low-resource language scenario.
We obtain a micro-F1 of 67.02% on the test set, and our submission ranked 21st on the shared-task leaderboard.
arXiv Detail & Related papers (2023-10-13T16:46:38Z) - On Evaluation of Bangla Word Analogies [0.8658596218544772]
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings.
Despite being the 7th most-spoken language in the world, Bangla is a low-resource language and popular NLP models fail to perform well.
arXiv Detail & Related papers (2023-04-10T14:27:35Z) - Phoneme Recognition through Fine Tuning of Phonetic Representations: a
Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.