A Review of Bangla Natural Language Processing Tasks and the Utility of
Transformer Models
- URL: http://arxiv.org/abs/2107.03844v2
- Date: Sun, 11 Jul 2021 06:43:33 GMT
- Title: A Review of Bangla Natural Language Processing Tasks and the Utility of
Transformer Models
- Authors: Firoj Alam, Arid Hasan, Tanvirul Alam, Akib Khan, Janntatul Tajrin,
Naira Khan, Shammur Absar Chowdhury
- Abstract summary: We provide a review of Bangla NLP tasks, resources, and tools available to the research community.
We benchmark datasets collected from various platforms for nine NLP tasks using current state-of-the-art algorithms.
We report our results using both individual and consolidated datasets and provide data for future research.
- Score: 2.5768647103950357
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Bangla -- ranked as the 6th most widely spoken language across the world
(https://www.ethnologue.com/guides/ethnologue200), with 230 million native
speakers -- is still considered as a low-resource language in the natural
language processing (NLP) community. With three decades of research, Bangla NLP
(BNLP) is still lagging behind mainly due to the scarcity of resources and the
challenges that come with it. There is sparse work in different areas of BNLP;
however, a thorough survey reporting previous work and recent advances is yet
to be done. In this study, we first provide a review of Bangla NLP tasks,
resources, and tools available to the research community; we benchmark datasets
collected from various platforms for nine NLP tasks using current
state-of-the-art algorithms (i.e., transformer-based models). We provide
comparative results for the studied NLP tasks by comparing monolingual vs.
multilingual models of varying sizes. We report our results using both
individual and consolidated datasets and provide data splits for future
research. We reviewed a total of 108 papers and conducted 175 sets of
experiments. Our results show promising performance using transformer-based
models while highlighting the trade-off with computational costs. We hope that
such a comprehensive survey will motivate the community to build on and further
advance the research on Bangla NLP.
Related papers
- The Nature of NLP: Analyzing Contributions in NLP Papers [77.31665252336157]
We quantitatively investigate what constitutes NLP research by examining research papers.
Our findings reveal a rising involvement of machine learning in NLP since the early nineties.
In post-2020, there has been a resurgence of focus on language and people.
arXiv Detail & Related papers (2024-09-29T01:29:28Z) - DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models
for Sentiment Analysis of Bangla Social Media Posts [0.46040036610482665]
This paper presents our submission to Task 2 (Sentiment Analysis of Bangla Social Media Posts) of the BLP Workshop.
Our quantitative results show that transfer learning really helps in better learning of the models in this low-resource language scenario.
We obtain a micro-F1 of 67.02% on the test set and our performance in this shared task is ranked at 21 in the leaderboard.
arXiv Detail & Related papers (2023-10-13T16:46:38Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - On Evaluation of Bangla Word Analogies [0.8658596218544772]
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings.
Despite being the 7th most-spoken language in the world, Bangla is a low-resource language and popular NLP models fail to perform well.
arXiv Detail & Related papers (2023-04-10T14:27:35Z) - NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z) - LaoPLM: Pre-trained Language Models for Lao [3.2146309563776416]
Pre-trained language models (PLMs) can capture different levels of concepts in context and hence generate universal language representations.
Although PTMs have been widely used in most NLP applications, it is under-represented in Lao NLP research.
We construct a text classification dataset to alleviate the resource-scare situation of the Lao language.
We present the first transformer-based PTMs for Lao with four versions: BERT-small, BERT-base, ELECTRA-small and ELECTRA-base, and evaluate it over two downstream tasks: part-of-speech tagging and text classification.
arXiv Detail & Related papers (2021-10-12T11:13:07Z) - FedNLP: A Research Platform for Federated Learning in Natural Language
Processing [55.01246123092445]
We present the FedNLP, a research platform for federated learning in NLP.
FedNLP supports various popular task formulations in NLP such as text classification, sequence tagging, question answering, seq2seq generation, and language modeling.
Preliminary experiments with FedNLP reveal that there exists a large performance gap between learning on decentralized and centralized datasets.
arXiv Detail & Related papers (2021-04-18T11:04:49Z) - BanglaBERT: Combating Embedding Barrier for Low-Resource Language
Understanding [1.7000879291900044]
We build a Bangla natural language understanding model pre-trained on 18.6 GB data we crawled from top Bangla sites on the internet.
Our model outperforms multilingual baselines and previous state-of-the-art results by 1-6%.
We identify a major shortcoming of multilingual models that hurt performance for low-resource languages that don't share writing scripts with any high resource one.
arXiv Detail & Related papers (2021-01-01T09:28:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.