Stemming -- The Evolution and Current State with a Focus on Bangla
- URL: http://arxiv.org/abs/2508.15711v1
- Date: Thu, 21 Aug 2025 16:54:24 GMT
- Title: Stemming -- The Evolution and Current State with a Focus on Bangla
- Authors: Abhijit Paul, Mashiat Amin Farin, Sharif Md. Abdullah, Ahmedul Kabir, Zarif Masud, Shebuti Rayana,
- Abstract summary: Bangla, the seventh most widely spoken language worldwide, faces digital under-representation due to limited resources and lack of annotated datasets.<n>This paper conducts a comprehensive survey of stemming approaches, emphasizing the importance of handling morphological variants effectively.<n>It concludes by advocating for robust Bangla stemmers and continued research in the field to enhance language analysis and processing.
- Score: 0.02199065293049185
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Bangla, the seventh most widely spoken language worldwide with 300 million native speakers, faces digital under-representation due to limited resources and lack of annotated datasets. Stemming, a critical preprocessing step in language analysis, is essential for low-resource, highly-inflectional languages like Bangla, because it can reduce the complexity of algorithms and models by significantly reducing the number of words the algorithm needs to consider. This paper conducts a comprehensive survey of stemming approaches, emphasizing the importance of handling morphological variants effectively. While exploring the landscape of Bangla stemming, it becomes evident that there is a significant gap in the existing literature. The paper highlights the discontinuity from previous research and the scarcity of accessible implementations for replication. Furthermore, it critiques the evaluation methodologies, stressing the need for more relevant metrics. In the context of Bangla's rich morphology and diverse dialects, the paper acknowledges the challenges it poses. To address these challenges, the paper suggests directions for Bangla stemmer development. It concludes by advocating for robust Bangla stemmers and continued research in the field to enhance language analysis and processing.
Related papers
- BOISHOMMO: Holistic Approach for Bangla Hate Speech [0.0]
Comprehensive datasets are the main problem among the constrained resource languages, such as Bangla.<n>With over two thousand annotated examples, BOISHOMMO provides a nuanced understanding of hate speech in Bangla.
arXiv Detail & Related papers (2025-04-11T10:14:40Z) - Summarizing Speech: A Comprehensive Survey [76.13011304983458]
Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content.<n>This survey examines existing datasets and evaluation protocols, which are crucial for assessing the quality of summarization approaches.
arXiv Detail & Related papers (2025-04-10T17:50:53Z) - Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects [0.6554326244334868]
This review highlights the scarcity of annotated corpora, limited availability of pre-trained language models, and linguistic challenges like tonal complexity and diacritic dependency as significant obstacles.<n>The findings reveal a growing body of multilingual and monolingual resources, even though the field is constrained by socio-cultural factors such as code-switching and desertion of language for digital usage.
arXiv Detail & Related papers (2025-02-24T17:41:48Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - Robust Sentiment Analysis for Low Resource languages Using Data
Augmentation Approaches: A Case Study in Marathi [0.9553673944187253]
Sentiment analysis plays a crucial role in understanding the sentiment expressed in text data.
There exists a significant gap in research efforts for sentiment analysis in low-resource languages.
We present an exhaustive study of data augmentation approaches for the low-resource Indic language Marathi.
arXiv Detail & Related papers (2023-10-01T17:09:31Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Analyzing and Reducing the Performance Gap in Cross-Lingual Transfer
with Fine-tuning Slow and Fast [50.19681990847589]
Existing research has shown that a multilingual pre-trained language model fine-tuned with one (source) language also performs well on downstream tasks for non-source languages.
This paper analyzes the fine-tuning process, discovers when the performance gap changes and identifies which network weights affect the overall performance most.
arXiv Detail & Related papers (2023-05-19T06:04:21Z) - On Evaluation of Bangla Word Analogies [0.8658596218544772]
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings.
Despite being the 7th most-spoken language in the world, Bangla is a low-resource language and popular NLP models fail to perform well.
arXiv Detail & Related papers (2023-04-10T14:27:35Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - Bangla Natural Language Processing: A Comprehensive Review of Classical,
Machine Learning, and Deep Learning Based Methods [3.441093402715499]
The Bangla language is the seventh most spoken language, with 265 million native and non-native speakers worldwide.
English is the predominant language for online resources and technical knowledge, journals, and documentation.
Many efforts are also ongoing to make it easy to use the Bangla language in the online and technical domains.
arXiv Detail & Related papers (2021-05-31T10:58:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.