Curating Stopwords in Marathi: A TF-IDF Approach for Improved Text Analysis and Information Retrieval
- URL: http://arxiv.org/abs/2406.11029v1
- Date: Sun, 16 Jun 2024 17:59:05 GMT
- Title: Curating Stopwords in Marathi: A TF-IDF Approach for Improved Text Analysis and Information Retrieval
- Authors: Rohan Chavan, Gaurav Patil, Vishal Madle, Raviraj Joshi
- Abstract summary: Stopwords are commonly used words in a language that are considered to be of little value in determining the meaning or significance of a document.
Our work targets the curation of stopwords in the Marathi language using the MahaCorpus, with 24.8 million sentences.
- Score: 0.4499833362998489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stopwords are commonly used words in a language that are often considered to be of little value in determining the meaning or significance of a document. These words occur frequently in most texts and don't provide much useful information for tasks like sentiment analysis and text classification. English, a high-resource language, benefits from readily available, standardized stopword lists in common packages, whereas for low-resource Indian languages like Marathi only a few such lists exist, and the number of words they contain is low. Our work targets the curation of stopwords in the Marathi language using the MahaCorpus, with 24.8 million sentences. We make use of the TF-IDF approach coupled with human evaluation to curate a strong stopword list of 400 words. We apply stopword removal to the text classification task and show its efficacy. The work also presents a simple recipe for stopword curation in a low-resource language. The stopwords are integrated into the mahaNLP library and publicly available at https://github.com/l3cube-pune/MarathiNLP .
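The general TF-IDF recipe the abstract describes can be sketched in a few lines. The sketch below is a minimal illustration only: the toy English sentences stand in for the 24.8M-sentence Marathi MahaCorpus, and ranking words by lowest average TF-IDF is one plausible reading of the approach, not the authors' exact pipeline (which also includes a human-evaluation pass before fixing the final 400-word list).

```python
import math
from collections import Counter

# Toy corpus standing in for MahaCorpus; English placeholders for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "a cat and a dog played in the sun",
    "the sun set over the park",
]

docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: number of sentences containing each word.
df = Counter()
for doc in docs:
    df.update(set(doc))

def avg_tfidf(word):
    """Average TF-IDF of a word over the documents that contain it.

    Words frequent across the whole corpus get a low IDF, hence a low
    score, and surface as stopword candidates.
    """
    idf = math.log(n_docs / df[word])
    tfs = [doc.count(word) / len(doc) for doc in docs if word in doc]
    return (sum(tfs) / len(tfs)) * idf

vocab = {w for doc in docs for w in doc}
ranked = sorted(vocab, key=avg_tfidf)  # lowest score first

# The lowest-scoring words are stopword candidates; the paper then applies
# human evaluation before finalizing the list.
candidates = ranked[:3]
print(candidates)
```

On this toy corpus the word "the" appears in every sentence, so its IDF (and score) is zero and it ranks first among candidates, which is the behavior the curation step relies on at scale.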
Related papers
- KSW: Khmer Stop Word based Dictionary for Keyword Extraction [0.0]
This paper introduces KSW, a Khmer-specific approach to keyword extraction that leverages a specialized stop word dictionary.
KSW addresses this by developing a tailored stop word dictionary and implementing a preprocessing methodology to remove stop words.
Experiments demonstrate that KSW achieves substantial improvements in accuracy and relevance compared to previous methods.
arXiv Detail & Related papers (2024-05-27T17:42:54Z)
- Text Categorization Can Enhance Domain-Agnostic Stopword Extraction [3.6048839315645442]
This paper investigates the role of text categorization in streamlining stopword extraction in natural language processing (NLP).
By leveraging the MasakhaNEWS, African Stopwords Project, and MasakhaPOS datasets, our findings emphasize that text categorization effectively identifies domain-agnostic stopwords with over 80% detection success rate for most examined languages.
arXiv Detail & Related papers (2024-01-24T11:52:05Z)
- Chain-of-Dictionary Prompting Elicits Translation in Large Language Models [100.47154959254937]
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT).
We present a novel method, CoD, which augments LLMs with prior knowledge from chains of multilingual dictionaries for a subset of input words to elicit translation abilities.
arXiv Detail & Related papers (2023-05-11T05:19:47Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Accuracy of the Uzbek stop words detection: a case study on "School corpus" [0.0]
We present a method to evaluate the quality of stop-word lists produced by automatic creation techniques.
The method was tested on an automatically-generated list of stop words for the Uzbek language.
arXiv Detail & Related papers (2022-09-15T05:14:31Z)
- Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Taking Notes on the Fly Helps BERT Pre-training [94.43953312613577]
Taking Notes on the Fly (TNF) takes notes for rare words on the fly during pre-training to help the model understand them when they occur next time.
TNF provides better data utilization since cross-sentence information is employed to cover the inadequate semantics caused by rare words in the sentences.
arXiv Detail & Related papers (2020-08-04T11:25:09Z)
- Soft Gazetteers for Low-Resource Named Entity Recognition [78.00856159473393]
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
arXiv Detail & Related papers (2020-05-04T21:58:02Z)
- Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction [12.376752724719005]
Language-independent tokenisation (LIT) methods do not require labelled language resources or lexicons.
Language-specific tokenisation (LST) methods have a long and established history, and are developed using carefully created lexicons and training resources.
We empirically compare the two approaches using semantic similarity measurement as an evaluation task across a diverse set of languages.
arXiv Detail & Related papers (2020-02-25T16:24:42Z)
- Unsupervised Separation of Native and Loanwords for Malayalam and Telugu [3.4925763160992402]
Words from one language are adopted within a different language without translation; these words appear in transliterated form in text written in the latter language.
This phenomenon is particularly widespread within Indian languages where many words are loaned from English.
We address the task of identifying loanwords automatically and in an unsupervised manner, from large datasets of words from agglutinative Dravidian languages.
arXiv Detail & Related papers (2020-02-12T04:01:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.