Effects of term weighting approach with and without stop words removing
on Arabic text classification
- URL: http://arxiv.org/abs/2402.14867v1
- Date: Wed, 21 Feb 2024 11:31:04 GMT
- Title: Effects of term weighting approach with and without stop words removing
on Arabic text classification
- Authors: Esra'a Alhenawi, Ruba Abu Khurma, Pedro A. Castillo, Maribel G. Arenas
- Abstract summary: This study compares the effects of Binary and Term Frequency (TF) feature weighting methodologies on text classification, both with and without stop word removal.
For all metrics, the term frequency feature weighting approach with stop word removal outperforms the binary approach.
The data also show that, with the same term weighting approach, stop word removal increases classification accuracy.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classifying text is a method for categorizing documents into pre-established
groups. Before classification, text documents must be prepared and represented in a
way that is appropriate for the data mining algorithms. As a result, a number of
term weighting strategies have been created in the literature to enhance the
performance of text categorization algorithms. This study compares the effects of
Binary and Term Frequency (TF) feature weighting methodologies on text
classification, both when stop words are eliminated and when they are not. To
assess the effects of these feature weighting approaches on classification results
in terms of accuracy, recall, precision, and F-measure values, we used an Arabic
data set made up of 322 documents divided into six main topics (agriculture,
economy, health, politics, science, and sport), each of which contains 50
documents, with the exception of the health category, which contains 61 documents.
The results demonstrate that for all metrics, the term frequency feature weighting
approach with stop word removal outperforms the binary approach, while for
accuracy, recall, and F-measure, the binary approach outperforms the TF approach
without stop word removal. However, for precision, the two approaches produce very
similar results. Additionally, the data make clear that, with the same term
weighting approach, stop word removal increases classification accuracy.
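The two feature weighting schemes compared in the paper can be sketched in a few lines. The corpus, stop-word list, and function names below are illustrative placeholders, not the paper's materials (the study uses 322 Arabic documents and an Arabic stop-word list); the sketch only shows how binary weighting, term-frequency weighting, and stop word removal differ.

```python
from collections import Counter

# Hypothetical toy corpus and stop-word list for illustration only.
docs = [
    "the economy grows and the market grows",
    "the team wins the match",
]
stop_words = {"the", "and"}

def tokenize(text, remove_stops=False):
    """Split into tokens, optionally dropping stop words first."""
    tokens = text.split()
    if remove_stops:
        tokens = [t for t in tokens if t not in stop_words]
    return tokens

def term_frequency(tokens):
    # TF weighting: each term is weighted by its raw count in the document.
    return dict(Counter(tokens))

def binary_weight(tokens):
    # Binary weighting: a term gets weight 1 if present, regardless of count.
    return {t: 1 for t in sorted(set(tokens))}

tokens = tokenize(docs[0], remove_stops=True)
print(term_frequency(tokens))  # {'economy': 1, 'grows': 2, 'market': 1}
print(binary_weight(tokens))   # {'economy': 1, 'grows': 1, 'market': 1}
```

Note how the repeated term "grows" is the only place the two schemes disagree: TF preserves the count, binary collapses it to presence, which is the distinction the paper's experiments measure.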
Related papers
- Detection of tortured phrases in scientific literature [0.0]
This paper presents various automatic detection methods to extract so called tortured phrases from scientific papers.
With a recall value of .87 and a precision value of .61, it could retrieve new tortured phrases to be submitted to domain experts for validation.
arXiv Detail & Related papers (2024-02-02T08:15:43Z)
- Like a Good Nearest Neighbor: Practical Content Moderation and Text Classification [66.02091763340094]
Like a Good Nearest Neighbor (LaGoNN) is a modification to SetFit that introduces no learnable parameters but alters input text with information from its nearest neighbor.
LaGoNN is effective at flagging undesirable content and text classification, and improves the performance of SetFit.
arXiv Detail & Related papers (2023-02-17T15:43:29Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- Quantitative Stopword Generation for Sentiment Analysis via Recursive and Iterative Deletion [2.0305676256390934]
Stopwords carry little semantic information and are often removed from text data to reduce dataset size.
We present a novel approach to generate effective stopword sets for specific NLP tasks.
arXiv Detail & Related papers (2022-09-04T03:04:10Z)
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success rates and semantics rates by changing the smallest number of words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- Detect and Classify -- Joint Span Detection and Classification for Health Outcomes [15.496885113949252]
We propose a method that uses both word-level and sentence-level information to simultaneously perform outcome span detection and outcome type classification.
Experimental results on several benchmark datasets for health outcome detection show that our model consistently outperforms decoupled methods.
arXiv Detail & Related papers (2021-04-15T21:47:15Z)
- Unsupervised Label Refinement Improves Dataless Text Classification [48.031421660674745]
Dataless text classification is capable of classifying documents into previously unseen labels by assigning a score to any document paired with a label description.
While promising, it crucially relies on accurate descriptions of the label set for each downstream task.
This reliance causes dataless classifiers to be highly sensitive to the choice of label descriptions and hinders the broader application of dataless classification in practice.
arXiv Detail & Related papers (2020-12-08T03:37:50Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach for the automatic extraction of domain-specific words called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- Rank over Class: The Untapped Potential of Ranking in Natural Language Processing [8.637110868126546]
We argue that many tasks which are currently addressed using classification are in fact being shoehorned into a classification mould.
We propose a novel end-to-end ranking approach consisting of a Transformer network responsible for producing representations for a pair of text sequences.
In an experiment on a heavily-skewed sentiment analysis dataset, converting ranking results to classification labels yields an approximately 22% improvement over state-of-the-art text classification.
arXiv Detail & Related papers (2020-09-10T22:18:57Z)
- Research on Annotation Rules and Recognition Algorithm Based on Phrase Window [4.334276223622026]
We propose labeling rules based on phrase windows, and design corresponding phrase recognition algorithms.
The labeling rule uses phrases as the minimum unit, divides sentences into 7 types of nestable phrase types, and marks the grammatical dependencies between phrases.
The corresponding algorithm, drawing on the idea of identifying the target area in the image field, can find the start and end positions of various phrases in the sentence.
arXiv Detail & Related papers (2020-07-07T00:19:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.