Related papers: Improving Bangla Linguistics: Advanced LSTM, Bi-LSTM, and Seq2Seq Models for Translating Sylheti to Modern Bangla

URL: http://arxiv.org/abs/2505.18709v1
Date: Sat, 24 May 2025 14:13:45 GMT
Title: Improving Bangla Linguistics: Advanced LSTM, Bi-LSTM, and Seq2Seq Models for Translating Sylheti to Modern Bangla
Authors: Sourav Kumar Das, Md. Julkar Naeen, MD. Jahidul Islam, Md. Anisul Haque Sajeeb, Narayan Ranjan Chakraborty, Mayen Uddin Mojumdar
Abstract summary: Every division of Bangladesh has its own local language like Sylheti, Chittagong etc. This research is for the local language and this particular paper is on Sylheti language. It presented a comprehensive system using Natural Language Processing or NLP techniques for translating Pure or Modern Bangla to locally spoken Sylheti Bangla language.
Score: 3.11717505289722
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Bangla, or Bengali, is the national language of Bangladesh, but people from different regions do not speak standard Bangla. Every division of Bangladesh has its own local language, such as Sylheti, Chittagong, etc. In recent years, some papers have been published on Bangla language tasks such as sentiment analysis and fake news detection and classification, but only a few of them addressed Bangla's local languages. This research focuses on local languages, and this particular paper is on the Sylheti language. It presents a comprehensive system using Natural Language Processing (NLP) techniques for translating Pure or Modern Bangla to locally spoken Sylheti Bangla. A total of 1,200 data points were used to train three models (LSTM, Bi-LSTM, and Seq2Seq), and LSTM performed best with 89.3% accuracy. The findings of this research may help Bangla NLP researchers pursue more advanced innovations in the future.

Related papers

Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language [1.4206084598312039]
As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. We introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature, and validated by native speakers. Experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma.
arXiv Detail & Related papers (2025-10-10T06:07:14Z)
TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla [37.210208249613]
Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs). This primarily stems from the scarcity of high-quality data to pre-train and/or finetune such models. We offer three major contributions: (1) comprehensive Bangla code instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder-family of Code LLMs, achieving significant 11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs.
arXiv Detail & Related papers (2025-09-11T02:25:49Z)
LoraxBench: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages [45.640417004733166]
We introduce LoraxBench, a benchmark that focuses on low-resource languages of Indonesia. Our dataset covers 20 languages, with the addition of two formality registers for three languages. We show that a change in register affects model performance, especially with registers not commonly found in social media.
arXiv Detail & Related papers (2025-08-17T18:07:57Z)
BongLLaMA: LLaMA for Bangla Language [0.0]
BongLLaMA is an open-source large language model fine-tuned exclusively on large Bangla corpora and instruction-tuning datasets. We present our methodology, data augmentation techniques, fine-tuning details, and comprehensive benchmarking results showcasing the utility of BongLLaMA on BLP tasks.
arXiv Detail & Related papers (2024-10-28T16:44:02Z)
Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated Translation of Bangla Regional Dialects to Bangla Language [0.0]
There has been a noticeable gap in translating Bangla regional dialects into standard Bangla. Our aim is to translate these regional dialects into standard Bangla and detect regions accurately. This is the first large-scale investigation of Bangla regional dialects to Bangla machine translation.
arXiv Detail & Related papers (2023-11-18T18:36:16Z)
BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models for Sentiment Analysis of Bangla Social Media Posts [0.46040036610482665]
This paper presents our submission to Task 2 (Sentiment Analysis of Bangla Social Media Posts) of the BLP Workshop. Our quantitative results show that transfer learning helps the models learn better in this low-resource language scenario. We obtain a micro-F1 of 67.02% on the test set, and our submission ranked 21st on the shared task leaderboard.
arXiv Detail & Related papers (2023-10-13T16:46:38Z)
A Benchmark for Learning to Translate a New Language from One Grammar Book [41.1108119653453]
MTOB is a benchmark for learning to translate between English and Kalamang. It asks a model to learn a language from a single human-readable book of grammar explanations. We demonstrate that baselines using current LLMs are promising but fall short of human performance.
arXiv Detail & Related papers (2023-09-28T16:32:28Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Baichuan 2: Open Large-scale Language Models [51.34140526283222]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z)
On Evaluation of Bangla Word Analogies [0.8658596218544772]
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings. Despite being the 7th most-spoken language in the world, Bangla is a low-resource language and popular NLP models fail to perform well.
arXiv Detail & Related papers (2023-04-10T14:27:35Z)
Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA). We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
Pretrained Models for Multilingual Federated Learning [38.19507070702635]
We study how multilingual text impacts Federated Learning (FL) algorithms. We explore three multilingual language tasks: language modeling, machine translation, and text classification. Our results show that using pretrained models reduces the negative effects of FL, helping them to perform near or better than centralized (no privacy) learning.
arXiv Detail & Related papers (2022-06-06T00:20:30Z)
Transferring Knowledge Distillation for Multilingual Social Event Detection [42.663309895263666]
Recently published graph neural networks (GNNs) show promising performance at social event detection tasks. We present a GNN that incorporates cross-lingual word embeddings for detecting events in multilingual data streams. Experiments on both synthetic and real-world datasets show the framework to be highly effective at detection in both multilingual data and in languages where training samples are scarce.
arXiv Detail & Related papers (2021-08-06T12:38:42Z)
End-to-End Natural Language Understanding Pipeline for Bangla Conversational Agents [0.43012765978447565]
We propose a novel approach to build a business assistant which can communicate in Bangla and Bangla Transliteration in English with high confidence consistently. We use Rasa Open Source Framework, fastText embeddings, Polyglot embeddings, Flask, and other systems as building blocks. We present a pipeline for intent classification and entity extraction which achieves reasonable performance.
arXiv Detail & Related papers (2021-07-12T16:09:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.