Improving Bangla Linguistics: Advanced LSTM, Bi-LSTM, and Seq2Seq Models for Translating Sylheti to Modern Bangla
- URL: http://arxiv.org/abs/2505.18709v1
- Date: Sat, 24 May 2025 14:13:45 GMT
- Title: Improving Bangla Linguistics: Advanced LSTM, Bi-LSTM, and Seq2Seq Models for Translating Sylheti to Modern Bangla
- Authors: Sourav Kumar Das, Md. Julkar Naeen, MD. Jahidul Islam, Md. Anisul Haque Sajeeb, Narayan Ranjan Chakraborty, Mayen Uddin Mojumdar
- Abstract summary: Every division of Bangladesh has its own local language like Sylheti, Chittagong etc. This research is for the local language and this particular paper is on Sylheti language. It presented a comprehensive system using Natural Language Processing or NLP techniques for translating Pure or Modern Bangla to locally spoken Sylheti Bangla language.
- Score: 3.11717505289722
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bangla, or Bengali, is the national language of Bangladesh, but people from different regions do not speak standard Bangla. Every division of Bangladesh has its own local language, such as Sylheti or Chittagong. In recent years, papers have been published on Bangla-language tasks such as sentiment analysis and fake news detection and classification, but few of them addressed Bangla's local languages. This research targets local languages, and this particular paper focuses on Sylheti. It presents a comprehensive system that uses Natural Language Processing (NLP) techniques to translate Pure or Modern Bangla into the locally spoken Sylheti Bangla language. A total of 1,200 data points were used to train three models (LSTM, Bi-LSTM, and Seq2Seq), and LSTM performed best, with 89.3% accuracy. The findings may help Bangla NLP researchers pursue more advanced innovations in the future.
Related papers
- BongLLaMA: LLaMA for Bangla Language [0.0]
BongLLaMA is an open-source large language model fine-tuned exclusively on large Bangla corpora and instruction-tuning datasets.
We present our methodology, data augmentation techniques, fine-tuning details, and comprehensive benchmarking results showcasing the utility of BongLLaMA on BLP tasks.
arXiv Detail & Related papers (2024-10-28T16:44:02Z)
- Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated Translation of Bangla Regional Dialects to Bangla Language [0.0]
There has been a noticeable gap in translating Bangla regional dialects into standard Bangla.
Our aim is to translate these regional dialects into standard Bangla and detect regions accurately.
This is the first large-scale investigation of Bangla regional dialects to Bangla machine translation.
arXiv Detail & Related papers (2023-11-18T18:36:16Z)
- BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models for Sentiment Analysis of Bangla Social Media Posts [0.46040036610482665]
This paper presents our submission to Task 2 (Sentiment Analysis of Bangla Social Media Posts) of the BLP Workshop.
Our quantitative results show that transfer learning really helps in better learning of the models in this low-resource language scenario.
We obtain a micro-F1 of 67.02% on the test set and our performance in this shared task is ranked at 21 in the leaderboard.
arXiv Detail & Related papers (2023-10-13T16:46:38Z)
- A Benchmark for Learning to Translate a New Language from One Grammar Book [41.1108119653453]
MTOB is a benchmark for learning to translate between English and Kalamang.
It asks a model to learn a language from a single human-readable book of grammar explanations.
We demonstrate that baselines using current LLMs are promising but fall short of human performance.
arXiv Detail & Related papers (2023-09-28T16:32:28Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Baichuan 2: Open Large-scale Language Models [51.34140526283222]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z)
- On Evaluation of Bangla Word Analogies [0.8658596218544772]
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings.
Despite being the 7th most-spoken language in the world, Bangla is a low-resource language and popular NLP models fail to perform well.
arXiv Detail & Related papers (2023-04-10T14:27:35Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- Pretrained Models for Multilingual Federated Learning [38.19507070702635]
We study how multilingual text impacts Federated Learning (FL) algorithms.
We explore three multilingual language tasks: language modeling, machine translation, and text classification.
Our results show that using pretrained models reduces the negative effects of FL, helping them to perform near or better than centralized (no privacy) learning.
arXiv Detail & Related papers (2022-06-06T00:20:30Z)
- Transferring Knowledge Distillation for Multilingual Social Event Detection [42.663309895263666]
Recently published graph neural networks (GNNs) show promising performance at social event detection tasks.
We present a GNN that incorporates cross-lingual word embeddings for detecting events in multilingual data streams.
Experiments on both synthetic and real-world datasets show the framework to be highly effective at detection in both multilingual data and in languages where training samples are scarce.
arXiv Detail & Related papers (2021-08-06T12:38:42Z)
- End-to-End Natural Language Understanding Pipeline for Bangla Conversational Agents [0.43012765978447565]
We propose a novel approach to build a business assistant which can communicate in Bangla and Bangla Transliteration in English with high confidence consistently.
We use Rasa Open Source Framework, fastText embeddings, Polyglot embeddings, Flask, and other systems as building blocks.
We present a pipeline for intent classification and entity extraction which achieves reasonable performance.
arXiv Detail & Related papers (2021-07-12T16:09:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.