Text Categorization Can Enhance Domain-Agnostic Stopword Extraction
- URL: http://arxiv.org/abs/2401.13398v1
- Date: Wed, 24 Jan 2024 11:52:05 GMT
- Title: Text Categorization Can Enhance Domain-Agnostic Stopword Extraction
- Authors: Houcemeddine Turki, Naome A. Etori, Mohamed Ali Hadj Taieb,
Abdul-Hakeem Omotayo, Chris Chinenye Emezue, Mohamed Ben Aouicha, Ayodele
Awokoya, Falalu Ibrahim Lawan, Doreen Nixdorf
- Abstract summary: This paper investigates the role of text categorization in streamlining stopword extraction in natural language processing (NLP).
By leveraging the MasakhaNEWS, African Stopwords Project, and MasakhaPOS datasets, our findings emphasize that text categorization effectively identifies domain-agnostic stopwords with over 80% detection success rate for most examined languages.
- Score: 3.6048839315645442
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the role of text categorization in streamlining
stopword extraction in natural language processing (NLP), specifically focusing
on nine African languages alongside French. By leveraging the MasakhaNEWS,
African Stopwords Project, and MasakhaPOS datasets, our findings emphasize that
text categorization effectively identifies domain-agnostic stopwords with over
80% detection success rate for most examined languages. Nevertheless,
linguistic variances result in lower detection rates for certain languages.
Interestingly, we find that while over 40% of stopwords are common across news
categories, fewer than 15% are unique to a single category. Uncommon stopwords
add depth to text, but their classification as stopwords depends on context.
Therefore, combining statistical and linguistic approaches creates comprehensive
stopword lists, highlighting the value of our hybrid method. This research
enhances NLP for African languages and underscores the importance of text
categorization in stopword extraction.
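The statistical side of the approach described above (words that are frequent across most news categories are likely domain-agnostic stopwords) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the `top_k` and `min_share` thresholds, the whitespace tokenization, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def domain_agnostic_stopwords(docs_by_category, top_k=50, min_share=0.8):
    """Flag words ranking among the top_k most frequent tokens in at least
    min_share of the categories as domain-agnostic stopword candidates.
    Tokenization here is naive lowercased whitespace splitting."""
    seen_in = defaultdict(set)  # word -> categories where it is top-k
    for category, docs in docs_by_category.items():
        counts = Counter(w for doc in docs for w in doc.lower().split())
        for word, _ in counts.most_common(top_k):
            seen_in[word].add(category)
    n_categories = len(docs_by_category)
    return {w for w, cats in seen_in.items()
            if len(cats) / n_categories >= min_share}

# Hypothetical three-category news corpus for illustration only.
corpus = {
    "sports": ["the team won the match", "the coach praised the players"],
    "politics": ["the minister addressed the press", "the vote was close"],
    "tech": ["the startup released the app", "the update fixed the bug"],
}
print(domain_agnostic_stopwords(corpus, top_k=3, min_share=1.0))  # {'the'}
```

Only "the" ranks among the top tokens in every category, so it is the lone domain-agnostic candidate; content words stay category-specific, mirroring the paper's finding that few stopwords are unique to a single category. A linguistic pass (e.g., part-of-speech filtering, as with MasakhaPOS) would then refine such a statistical list.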
Related papers
- Curating Stopwords in Marathi: A TF-IDF Approach for Improved Text Analysis and Information Retrieval [0.4499833362998489]
Stopwords are commonly used words in a language that are considered to be of little value in determining the meaning or significance of a document.
Our work targets the curation of stopwords in the Marathi language using the MahaCorpus, with 24.8 million sentences.
arXiv Detail & Related papers (2024-06-16T17:59:05Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Lexicon and Rule-based Word Lemmatization Approach for the Somali Language [0.0]
Lemmatization is a technique used to normalize text by changing morphological derivations of words to their root forms.
This paper pioneers the development of text lemmatization for the Somali language.
We have developed an initial lexicon of 1247 root words and 7173 derivationally related terms, enriched with rules for lemmatizing words not present in the lexicon.
arXiv Detail & Related papers (2023-08-03T14:31:57Z)
- Accuracy of the Uzbek stop words detection: a case study on "School corpus" [0.0]
We present a method to evaluate the quality of automatically generated stop word lists.
The method was tested on an automatically generated list of stop words for the Uzbek language.
arXiv Detail & Related papers (2022-09-15T05:14:31Z)
- To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies that perform changes on the syntax.
We compare them on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z)
- Disambiguatory Signals are Stronger in Word-initial Positions [48.18148856974974]
We point out the confounds in existing methods for comparing the informativeness of segments early in the word versus later in the word.
We find evidence across hundreds of languages that there is indeed a cross-linguistic tendency to front-load information in words.
arXiv Detail & Related papers (2021-02-03T18:19:16Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach for the automatic extraction of domain-specific words, called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- Novel Keyword Extraction and Language Detection Approaches [0.6445605125467573]
We propose a fast novel approach to string tokenisation for fuzzy language matching.
We experimentally demonstrate an 83.6% decrease in processing time.
We find the Accept-Language header is 14% more likely to match the classification than the IP address.
arXiv Detail & Related papers (2020-09-24T17:28:59Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.