Enhancing Bangla Language Next Word Prediction and Sentence Completion through Extended RNN with Bi-LSTM Model On N-gram Language
- URL: http://arxiv.org/abs/2405.01873v1
- Date: Fri, 3 May 2024 06:06:01 GMT
- Title: Enhancing Bangla Language Next Word Prediction and Sentence Completion through Extended RNN with Bi-LSTM Model On N-gram Language
- Authors: Md Robiul Islam, Al Amin, Aniqua Nusrat Zereen,
- Abstract summary: This paper introduces a Bi-LSTM model that effectively handles Bangla next-word prediction and Bangla sentence generation.
We constructed a corpus dataset from various news portals, including bdnews24, BBC News Bangla, and Prothom Alo.
The proposed approach achieved superior results in word prediction, reaching 99% accuracy for both 4-gram and 5-gram word predictions.
- Score: 1.3693860189056777
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Texting stands out as the most prominent form of communication worldwide. Individual spend significant amount of time writing whole texts to send emails or write something on social media, which is time consuming in this modern era. Word prediction and sentence completion will be suitable and appropriate in the Bangla language to make textual information easier and more convenient. This paper expands the scope of Bangla language processing by introducing a Bi-LSTM model that effectively handles Bangla next-word prediction and Bangla sentence generation, demonstrating its versatility and potential impact. We proposed a new Bi-LSTM model to predict a following word and complete a sentence. We constructed a corpus dataset from various news portals, including bdnews24, BBC News Bangla, and Prothom Alo. The proposed approach achieved superior results in word prediction, reaching 99\% accuracy for both 4-gram and 5-gram word predictions. Moreover, it demonstrated significant improvement over existing methods, achieving 35\%, 75\%, and 95\% accuracy for uni-gram, bi-gram, and tri-gram word prediction, respectively
Related papers
- BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla
Lemmatizer [3.1742013359102175]
We propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer for Bangla.
Our system aims to lemmatize words based on their parts of speech class within a given sentence.
The lemmatizer achieves an accuracy of 96.36% when tested against a manually annotated test dataset by trained.
arXiv Detail & Related papers (2023-11-06T13:02:07Z) - ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text
Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z) - On Evaluation of Bangla Word Analogies [0.8658596218544772]
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings.
Despite being the 7th most-spoken language in the world, Bangla is a low-resource language and popular NLP models fail to perform well.
arXiv Detail & Related papers (2023-04-10T14:27:35Z) - Pretraining without Wordpieces: Learning Over a Vocabulary of Millions
of Words [50.11559460111882]
We explore the possibility of developing BERT-style pretrained model over a vocabulary of words instead of wordpieces.
Results show that, compared to standard wordpiece-based BERT, WordBERT makes significant improvements on cloze test and machine reading comprehension.
Since the pipeline is language-independent, we train WordBERT for Chinese language and obtain significant gains on five natural language understanding datasets.
arXiv Detail & Related papers (2022-02-24T15:15:48Z) - SimulSLT: End-to-End Simultaneous Sign Language Translation [55.54237194555432]
Existing sign language translation methods need to read all the videos before starting the translation.
We propose SimulSLT, the first end-to-end simultaneous sign language translation model.
SimulSLT achieves BLEU scores that exceed the latest end-to-end non-simultaneous sign language translation model.
arXiv Detail & Related papers (2021-12-08T11:04:52Z) - HS-BAN: A Benchmark Dataset of Social Media Comments for Hate Speech
Detection in Bangla [2.055204980188575]
In this paper, we present HS-BAN, a binary class hate speech dataset in Bangla language consisting of more than 50,000 labeled comments.
We explore traditional linguistic features and neural network-based methods to develop a benchmark system for hate speech detection.
Our benchmark shows that a Bi-LSTM model on top of the FastText informal word embedding achieved 86.78% F1-score.
arXiv Detail & Related papers (2021-12-03T13:35:18Z) - End-to-End Natural Language Understanding Pipeline for Bangla
Conversational Agents [0.43012765978447565]
We propose a novel approach to build a business assistant which can communicate in Bangla and Bangla Transliteration in English with high confidence consistently.
We use Rasa Open Source Framework, fastText embeddings, Polyglot embeddings, Flask, and other systems as building blocks.
We present a pipeline for intent classification and entity extraction which achieves reasonable performance.
arXiv Detail & Related papers (2021-07-12T16:09:22Z) - Simple or Complex? Learning to Predict Readability of Bengali Texts [6.860272388539321]
We present a readability analysis tool capable of analyzing text written in the Bengali language.
Despite being the 7th most spoken language in the world with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing.
arXiv Detail & Related papers (2020-12-09T01:41:35Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-12T19:42:25Z) - LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for
Multi-Granular Propaganda Span Identification [70.1903083747775]
This paper describes our submission for the task of Propaganda Span Identification in news articles.
We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda.
arXiv Detail & Related papers (2020-08-11T16:14:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.