BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla
Lemmatizer
- URL: http://arxiv.org/abs/2311.03078v1
- Date: Mon, 6 Nov 2023 13:02:07 GMT
- Title: BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla
Lemmatizer
- Authors: Sadia Afrin, Md. Shahad Mahmud Chowdhury, Md. Ekramul Islam, Faisal
Ahamed Khan, Labib Imam Chowdhury, MD. Motahar Mahtab, Nazifa Nuha Chowdhury,
Massud Forkan, Neelima Kundu, Hakim Arif, Mohammad Mamun Or Rashid, Mohammad
Ruhul Amin, Nabeel Mohammed
- Abstract summary: We propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer for Bangla.
Our system aims to lemmatize words based on their parts of speech class within a given sentence.
The lemmatizer achieves an accuracy of 96.36% when tested against a test dataset manually annotated by trained linguists.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lemmatization holds significance in both natural language processing (NLP)
and linguistics, as it effectively decreases data density and aids in
comprehending contextual meaning. However, due to the highly inflected nature
and morphological richness of the language, lemmatization of Bangla text poses
a complex challenge. In this study, we propose linguistic rules for lemmatization and
utilize a dictionary along with the rules to design a lemmatizer specifically
for Bangla. Our system aims to lemmatize words based on their parts of speech
class within a given sentence. Unlike previous rule-based approaches, we
analyzed the suffix marker occurrence according to the morpho-syntactic values
and then utilized sequences of suffix markers instead of entire suffixes. To
develop our rules, we analyze a large corpus of Bangla text from various
domains, sources, and time periods to observe the word formation of inflected
words. The lemmatizer achieves an accuracy of 96.36% when tested against a
manually annotated test dataset by trained linguists and demonstrates
competitive performance on three previously published Bangla lemmatization
datasets. We are making the code and datasets publicly available at
https://github.com/eblict-gigatech/BanLemma in order to contribute to the
further advancement of Bangla NLP.
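To make the rule-and-dictionary pipeline concrete, the following is a minimal sketch of a PoS-dependent lemmatizer that strips sequences of suffix markers and validates candidate stems against a dictionary. The dictionary entries and marker sequences are hypothetical English placeholders, not the paper's Bangla resources or exact rules.

```python
# Minimal sketch of a dictionary- and rule-based lemmatizer in the spirit of
# BanLemma. Dictionary entries and suffix markers are illustrative
# placeholders, not the actual Bangla resources used in the paper.

# Known lemmas, keyed by part-of-speech class.
DICTIONARY = {
    "noun": {"book", "friend"},
    "verb": {"walk", "talk"},
}

# Suffix-marker sequences per PoS class. Sequences of markers are stripped
# one marker at a time (rightmost first) rather than as whole suffixes.
SUFFIX_MARKERS = {
    "noun": [["ship", "s"], ["s"]],
    "verb": [["ed"], ["ing"]],
}

def lemmatize(word: str, pos: str) -> str:
    """Return the lemma of `word`, given its PoS class within the sentence."""
    lemmas = DICTIONARY.get(pos, set())
    if word in lemmas:  # already a dictionary lemma
        return word
    # Try marker sequences, longest first; validate each stem in the dictionary.
    for markers in sorted(SUFFIX_MARKERS.get(pos, []), key=len, reverse=True):
        stem = word
        for marker in reversed(markers):  # peel markers from the right edge
            if stem.endswith(marker):
                stem = stem[: -len(marker)]
        if stem != word and stem in lemmas:
            return stem
    return word  # fall back to the surface form

print(lemmatize("books", "noun"))        # -> book
print(lemmatize("friendships", "noun"))  # -> friend
print(lemmatize("walked", "verb"))       # -> walk
```

Validating each candidate stem against the dictionary is what keeps marker stripping from over-applying to words that merely look inflected.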
Related papers
- BanTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla (arXiv 2024-10-17)
We introduce BanTH, the first multi-label transliterated Bangla hate speech dataset comprising 37.3k samples.
The samples are sourced from YouTube comments, where each instance is labeled with one or more target groups.
Experiments reveal that our further pre-trained encoders achieve state-of-the-art performance on the BanTH dataset.
- Urdu Dependency Parsing and Treebank Development: A Syntactic and Morphological Perspective (arXiv 2024-06-13)
We use dependency parsing to analyze news articles in Urdu.
We achieve a best labeled accuracy (LA) of 70% and an unlabeled attachment score (UAS) of 84%.
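For reference, both metrics have simple definitions: UAS is the fraction of tokens whose predicted head is correct, while the labeled score also requires the correct dependency label. The sketch below computes them over toy gold and predicted arcs, which are invented for illustration.

```python
# Toy illustration of unlabeled (UAS) and labeled (LA) attachment scores.
# Each token is scored on its predicted head index and dependency label.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]  # (head, label) per token
pred = [(2, "nsubj"), (0, "root"), (1, "obj")]  # one wrong head

uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
la = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"UAS = {uas:.2f}, LA = {la:.2f}")        # UAS = 0.67, LA = 0.67
```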
- Homonym Sense Disambiguation in the Georgian Language (arXiv 2024-04-24)
This research proposes a novel approach to the Word Sense Disambiguation (WSD) task in the Georgian language.
It is based on supervised fine-tuning of a pre-trained Large Language Model (LLM) on a dataset formed by filtering the Georgian Common Crawl corpus.
- Pixel Sentence Representation Learning (arXiv 2024-02-13)
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods such as typos and word order shuffling, which resonate with human cognitive patterns and allow perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
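The two perturbations named above are easy to illustrate. The functions below are a hedged sketch of typo injection and word-order shuffling, not the paper's actual implementation.

```python
import random

def inject_typo(text: str, rate: float = 0.1) -> str:
    """Randomly swap adjacent characters to simulate typos."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def shuffle_words(text: str) -> str:
    """Randomly permute word order while keeping the words themselves."""
    words = text.split()
    random.shuffle(words)
    return " ".join(words)

random.seed(0)
sentence = "the quick brown fox jumps over the lazy dog"
print(inject_typo(sentence))    # e.g. "the quikc brown fox ..."
print(shuffle_words(sentence))  # same words, permuted order
```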
- Quantifying the redundancy between prosody and text (arXiv 2023-11-28)
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
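As a conceptual stand-in for the paper's LLM-based estimates, the snippet below computes empirical mutual information between a discretized prosodic feature and word identity on toy counts; the observations and the discretization are invented for illustration only.

```python
import math
from collections import Counter

# Toy (word, prosody-bin) observations; the paper's real estimates come from
# large language models over aligned speech corpora, not counts like these.
pairs = [("yes", "high"), ("yes", "high"), ("yes", "low"),
         ("no", "low"), ("no", "low"), ("no", "high")]

n = len(pairs)
p_xy = Counter(pairs)
p_x = Counter(w for w, _ in pairs)
p_y = Counter(b for _, b in pairs)

# Empirical mutual information I(word; prosody) in bits: the shared
# (redundant) information between the two variables.
mi = sum((c / n) * math.log2((c / n) / ((p_x[w] / n) * (p_y[b] / n)))
         for (w, b), c in p_xy.items())
print(f"I(word; prosody) = {mi:.3f} bits")
```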
- Suffix Retrieval-Augmented Language Modeling (arXiv 2022-11-06)
Causal language modeling (LM) uses word history to predict the next word.
BERT, on the other hand, makes use of bi-directional word information in a sentence to predict words at masked positions.
We propose a novel model that simulates a bi-directional contextual effect in an autoregressive manner.
- Word Order Does Matter (And Shuffled Language Models Know It) (arXiv 2022-03-21)
Recent studies have shown that language models pretrained and/or fine-tuned on randomly permuted sentences exhibit competitive performance on GLUE.
We investigate what position embeddings learned from shuffled text encode, showing that these models retain information pertaining to the original, naturalistic word order.
- Contextualized Semantic Distance between Highly Overlapped Texts (arXiv 2021-10-04)
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common subsequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions.
Experiments on Semantic Textual Similarity show the resulting neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
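A minimal sketch of the mask-and-predict idea, using a generic masked LM from the transformers library: mask a shared word in each of the two overlapping texts and compare the predicted distributions at its position. The model choice, the KL comparison, and the single-token assumption are illustrative simplifications, not the paper's exact NDD formulation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def masked_distribution(text: str, word: str) -> torch.Tensor:
    """Mask the first occurrence of `word` and return the MLM's predicted
    probability distribution at that position."""
    enc = tokenizer(text, return_tensors="pt")
    ids = enc["input_ids"][0]
    pos = (ids == tokenizer.convert_tokens_to_ids(word)).nonzero()[0, 0]
    ids[pos] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits[0, pos], dim=-1)

# Compare the distributions predicted for a shared word in two overlapping texts.
p = masked_distribution("the cat sat on the mat", "cat")
q = masked_distribution("the cat slept on the mat", "cat")
kl = torch.sum(p * (p.log() - q.log()))
print(float(kl))  # larger divergence ~ larger contextual semantic distance
```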
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging (arXiv 2021-05-28)
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
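The silver-label idea can be sketched as simple per-document mining of word sequences that recur consistently; the tokenization and the frequency threshold below are invented placeholders, far cruder than UCPhrase's actual procedure.

```python
from collections import Counter

def mine_silver_phrases(doc: str, max_len: int = 3, min_count: int = 2) -> set:
    """Collect multi-word sequences that recur within a single document,
    treating them as silver phrase labels."""
    words = doc.lower().split()
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {" ".join(gram) for gram, c in counts.items() if c >= min_count}

doc = ("deep learning models need data . deep learning models "
       "generalize when data is plentiful .")
print(mine_silver_phrases(doc))
# -> {'deep learning', 'learning models', 'deep learning models'}
```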
- Match-Ignition: Plugging PageRank into Transformer for Long-form Text Matching (arXiv 2021-01-16)
We propose a novel hierarchical noise filtering model, namely Match-Ignition, to tackle the effectiveness and efficiency problem.
The basic idea is to plug the well-known PageRank algorithm into the Transformer, to identify and filter both sentence and word level noisy information.
Noisy sentences are usually easy to detect because the sentence is the basic unit of a long-form text, so we directly use PageRank to filter such information.
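A hedged sketch of the sentence-level filtering idea: run PageRank over a sentence-similarity graph and keep the top-ranked sentences. The word-overlap similarity and damping value are generic choices for illustration, not the paper's exact setup.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def pagerank_filter(sentences, keep=2, damping=0.85, iters=50):
    n = len(sentences)
    # Edge weights from pairwise sentence similarity.
    w = [[word_overlap(a, b) if i != j else 0.0
          for j, b in enumerate(sentences)] for i, a in enumerate(sentences)]
    out = [sum(row) or 1.0 for row in w]  # out-degree normalizer
    rank = [1.0 / n] * n
    for _ in range(iters):                # power iteration
        rank = [(1 - damping) / n +
                damping * sum(rank[j] * w[j][i] / out[j] for j in range(n))
                for i in range(n)]
    top = sorted(range(n), key=lambda i: rank[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]  # keep original order

doc = ["match ignition filters noise", "the weather is nice today",
       "long form text matching needs noise filtering",
       "noise filtering helps text matching"]
print(pagerank_filter(doc))  # off-topic sentence is ranked out
```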
- Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention (arXiv 2020-12-30)
Sindhi word segmentation is a challenging task due to space omission and insertion issues.
Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features.
We propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task.
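As a sketch of the sequence-labeling formulation (not the SGNWS model itself), the decoder below turns per-character B/I/E/S tags into words; the space-free input string and its tags are invented.

```python
def decode_segmentation(chars: str, tags: list) -> list:
    """Rebuild words from per-character BIES tags:
    B=begin, I=inside, E=end, S=single-character word."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # word boundary reached
            words.append(current)
            current = ""
    if current:                # tolerate a dangling partial word
        words.append(current)
    return words

# Space-free input, as in the space-omission case described above.
print(decode_segmentation("thecat", ["B", "I", "E", "B", "I", "E"]))
# -> ['the', 'cat']
```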
This list is automatically generated from the titles and abstracts of the papers on this site.