Bangla Natural Language Processing: A Comprehensive Review of Classical,
Machine Learning, and Deep Learning Based Methods
- URL: http://arxiv.org/abs/2105.14875v1
- Date: Mon, 31 May 2021 10:58:58 GMT
- Title: Bangla Natural Language Processing: A Comprehensive Review of Classical,
Machine Learning, and Deep Learning Based Methods
- Authors: Ovishake Sen, Mohtasim Fuad, MD. Nazrul Islam, Jakaria Rabbi, MD.
Kamrul Hasan, Awal Ahmed Fime, Md. Tahmid Hasan Fuad, Delowar Sikder, and MD.
Akil Raihan Iftee
- Abstract summary: The Bangla language is the seventh most spoken language, with 265 million native and non-native speakers worldwide.
English is the predominant language for online resources and technical knowledge, journals, and documentation.
Many efforts are also ongoing to make it easy to use the Bangla language in the online and technical domains.
- Score: 3.441093402715499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Bangla language is the seventh most spoken language, with 265 million
native and non-native speakers worldwide. However, English is the predominant
language for online resources and technical knowledge, journals, and
documentation. Consequently, many Bangla-speaking people with limited command
of English face hurdles in utilizing English resources. To bridge the
gap between limited support and increasing demand, researchers conducted many
experiments and developed valuable tools and techniques to create and process
Bangla language materials. Many efforts are also ongoing to make it easy to use
the Bangla language in the online and technical domains. A few review papers
examine past, present, and future Bangla Natural Language Processing (BNLP)
trends, but these studies concentrate mainly on specific
domains of BNLP, such as sentiment analysis, speech recognition, optical
character recognition, and text summarization. There is an apparent scarcity of
resources that contain a comprehensive study of the recent BNLP tools and
methods. Therefore, in this paper, we present a thorough review of 71 BNLP
research papers and categorize them into 11 categories, namely Information
Extraction, Machine Translation, Named Entity Recognition, Parsing, Parts of
Speech Tagging, Question Answering System, Sentiment Analysis, Spam and Fake
Detection, Text Summarization, Word Sense Disambiguation, and Speech Processing
and Recognition. We study articles published between 1999 and 2021, and 50% of
the papers were published after 2015. We discuss Classical, Machine Learning,
and Deep Learning approaches on different datasets while addressing the
limitations and the current and future trends of BNLP.
Related papers
- Multilingual Evaluation of Semantic Textual Relatedness [0.0]
Semantic Textual Relatedness (STR) goes beyond superficial word overlap, considering linguistic elements and non-linguistic factors like topic, sentiment, and perspective.
Prior NLP research has predominantly focused on English, limiting its applicability across languages.
We explore STR in Marathi, Hindi, Spanish, and English, unlocking the potential for information retrieval, machine translation, and more.
arXiv Detail & Related papers (2024-04-13T17:16:03Z)
- Connecting the Dots: Leveraging Spatio-Temporal Graph Neural Networks for Accurate Bangla Sign Language Recognition [2.624902795082451]
We present a new word-level Bangla Sign Language dataset, BdSL40, consisting of 611 videos covering 40 words.
This is the first study on word-level BdSL recognition, and the dataset was transcribed from Indian Sign Language (ISL) using the Bangla Sign Language Dictionary (1997).
The study highlights the significant lexical and semantic similarity between BdSL, West Bengal Sign Language, and ISL, and the lack of word-level datasets for BdSL in the literature.
arXiv Detail & Related papers (2024-01-22T18:52:51Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past NLP research on dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation [133.7313847857935]
Our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.
For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text, respectively.
arXiv Detail & Related papers (2022-03-17T16:48:22Z)
- A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models [2.5768647103950357]
We provide a review of Bangla NLP tasks, resources, and tools available to the research community.
We benchmark datasets collected from various platforms for nine NLP tasks using current state-of-the-art algorithms.
We report our results using both individual and consolidated datasets and provide data for future research.
arXiv Detail & Related papers (2021-07-08T13:49:46Z)
- BanglaBERT: Combating Embedding Barrier for Low-Resource Language Understanding [1.7000879291900044]
We build a Bangla natural language understanding model pre-trained on 18.6 GB of data crawled from top Bangla sites on the internet.
Our model outperforms multilingual baselines and previous state-of-the-art results by 1-6%.
We identify a major shortcoming of multilingual models that hurts performance for low-resource languages that do not share a writing script with any high-resource language.
arXiv Detail & Related papers (2021-01-01T09:28:45Z)
- Deep Learning for Text Style Transfer: A Survey [71.8870854396927]
Text style transfer, an important task in natural language generation, aims to control certain attributes of the generated text.
We present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017.
We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data.
arXiv Detail & Related papers (2020-11-01T04:04:43Z)
- Deep Learning for Hindi Text Classification: A Comparison [6.8629257716723]
Research on classifying text in Hindi, a morphologically rich and low-resource language written in Devanagari script, has been limited by the absence of a large labeled corpus.
In this work, we used translated versions of English datasets to evaluate models based on CNN, LSTM, and Attention.
The paper also serves as a tutorial for popular text classification techniques.
arXiv Detail & Related papers (2020-01-19T09:29:12Z)