Curating Stopwords in Marathi: A TF-IDF Approach for Improved Text Analysis and Information Retrieval
- URL: http://arxiv.org/abs/2406.11029v1
- Date: Sun, 16 Jun 2024 17:59:05 GMT
- Title: Curating Stopwords in Marathi: A TF-IDF Approach for Improved Text Analysis and Information Retrieval
- Authors: Rohan Chavan, Gaurav Patil, Vishal Madle, Raviraj Joshi
- Abstract summary: Stopwords are commonly used words in a language that are considered to be of little value in determining the meaning or significance of a document.
Our work targets the curation of stopwords in the Marathi language using the MahaCorpus, with 24.8 million sentences.
- Score: 0.4499833362998489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stopwords are commonly used words in a language that are often considered to be of little value in determining the meaning or significance of a document. These words occur frequently in most texts and don't provide much useful information for tasks like sentiment analysis and text classification. English, a high-resource language, benefits from readily available, standardized stopword lists in common packages, whereas for low-resource Indian languages like Marathi only a few such lists exist, and the number of words they contain is low. Our work targets the curation of stopwords in the Marathi language using the MahaCorpus, with 24.8 million sentences. We make use of the TF-IDF approach coupled with human evaluation to curate a strong stopword list of 400 words. We apply stopword removal to the text classification task and show its efficacy. The work also presents a simple recipe for stopword curation in a low-resource language. The stopwords are integrated into the mahaNLP library and publicly available at https://github.com/l3cube-pune/MarathiNLP .
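The general TF-IDF recipe the abstract describes can be sketched in a few lines. The sketch below is a minimal illustration only: the toy English sentences stand in for the 24.8M-sentence Marathi MahaCorpus, and ranking words by lowest average TF-IDF is one plausible reading of the approach, not the authors' exact pipeline (which also includes a human-evaluation pass before fixing the final 400-word list).

```python
import math
from collections import Counter

# Toy corpus standing in for MahaCorpus; English placeholders for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "a cat and a dog played in the sun",
    "the sun set over the park",
]

docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: number of sentences containing each word.
df = Counter()
for doc in docs:
    df.update(set(doc))

def avg_tfidf(word):
    """Average TF-IDF of a word over the documents that contain it.

    Words frequent across the whole corpus get a low IDF, hence a low
    score, and surface as stopword candidates.
    """
    idf = math.log(n_docs / df[word])
    tfs = [doc.count(word) / len(doc) for doc in docs if word in doc]
    return (sum(tfs) / len(tfs)) * idf

vocab = {w for doc in docs for w in doc}
ranked = sorted(vocab, key=avg_tfidf)  # lowest score first

# The lowest-scoring words are stopword candidates; the paper then applies
# human evaluation before finalizing the list.
candidates = ranked[:3]
print(candidates)
```

On this toy corpus the word "the" appears in every sentence, so its IDF (and score) is zero and it ranks first among candidates, which is the behavior the curation step relies on at scale.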
Related papers
- KSW: Khmer Stop Word based Dictionary for Keyword Extraction [0.0]
This paper introduces KSW, a Khmer-specific approach to keyword extraction that leverages a specialized stop word dictionary.
KSW addresses this by developing a tailored stop word dictionary and implementing a preprocessing methodology to remove stop words.
Experiments demonstrate that KSW achieves substantial improvements in accuracy and relevance compared to previous methods.
arXiv Detail & Related papers (2024-05-27T17:42:54Z)
- Text Categorization Can Enhance Domain-Agnostic Stopword Extraction [3.6048839315645442]
This paper investigates the role of text categorization in streamlining stopword extraction in natural language processing (NLP).
By leveraging the MasakhaNEWS, African Stopwords Project, and MasakhaPOS datasets, our findings emphasize that text categorization effectively identifies domain-agnostic stopwords with over 80% detection success rate for most examined languages.
arXiv Detail & Related papers (2024-01-24T11:52:05Z)
- Chain-of-Dictionary Prompting Elicits Translation in Large Language Models [100.47154959254937]
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT).
We present a novel method, CoD, which augments LLMs with prior knowledge from chains of multilingual dictionaries for a subset of input words to elicit translation abilities.
arXiv Detail & Related papers (2023-05-11T05:19:47Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Accuracy of the Uzbek stop words detection: a case study on "School corpus" [0.0]
We present a method to evaluate the quality of stop-word lists produced by automatic creation techniques.
The method was tested on an automatically-generated list of stop words for the Uzbek language.
arXiv Detail & Related papers (2022-09-15T05:14:31Z)
- Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Taking Notes on the Fly Helps BERT Pre-training [94.43953312613577]
Taking Notes on the Fly (TNF) takes notes for rare words on the fly during pre-training to help the model understand them when they occur next time.
TNF provides better data utilization since cross-sentence information is employed to cover the inadequate semantics caused by rare words in the sentences.
arXiv Detail & Related papers (2020-08-04T11:25:09Z)
- Soft Gazetteers for Low-Resource Named Entity Recognition [78.00856159473393]
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
arXiv Detail & Related papers (2020-05-04T21:58:02Z)
- Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction [12.376752724719005]
Language-independent tokenisation (LIT) methods do not require labelled language resources or lexicons.
Language-specific tokenisation (LST) methods have a long and established history, and are developed using carefully created lexicons and training resources.
We empirically compare the two approaches using semantic similarity measurement as an evaluation task across a diverse set of languages.
arXiv Detail & Related papers (2020-02-25T16:24:42Z)
- Unsupervised Separation of Native and Loanwords for Malayalam and Telugu [3.4925763160992402]
Words from one language are adopted within a different language without translation; these words appear in transliterated form in text written in the latter language.
This phenomenon is particularly widespread within Indian languages where many words are loaned from English.
We address the task of identifying loanwords automatically and in an unsupervised manner, from large datasets of words from agglutinative Dravidian languages.
arXiv Detail & Related papers (2020-02-12T04:01:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.