Leveraging Large Language Models for Cybersecurity: Enhancing SMS Spam Detection with Robust and Context-Aware Text Classification
- URL: http://arxiv.org/abs/2502.11014v1
- Date: Sun, 16 Feb 2025 06:39:36 GMT
- Title: Leveraging Large Language Models for Cybersecurity: Enhancing SMS Spam Detection with Robust and Context-Aware Text Classification
- Authors: Mohsen Ahmadi, Matin Khajavi, Abbas Varmaghani, Ali Ala, Kasra Danesh, Danial Javaheri,
- Abstract summary: This study evaluates the effectiveness of different feature extraction techniques and classification algorithms in detecting spam messages within SMS data.
We found that TF-IDF, when paired with Naive Bayes, Support Vector Machines, or Deep Neural Networks, provides the most reliable performance.
- Score: 4.281580125566764
- License:
- Abstract: This study evaluates the effectiveness of different feature extraction techniques and classification algorithms in detecting spam messages within SMS data. We analyzed six classifiers Naive Bayes, K-Nearest Neighbors, Support Vector Machines, Linear Discriminant Analysis, Decision Trees, and Deep Neural Networks using two feature extraction methods: bag-of-words and TF-IDF. The primary objective was to determine the most effective classifier-feature combination for SMS spam detection. Our research offers two main contributions: first, by systematically examining various classifier and feature extraction pairings, and second, by empirically evaluating their ability to distinguish spam messages. Our results demonstrate that the TF-IDF method consistently outperforms the bag-of-words approach across all six classifiers. Specifically, Naive Bayes with TF-IDF achieved the highest accuracy of 96.2%, with a precision of 0.976 for non-spam and 0.754 for spam messages. Similarly, Support Vector Machines with TF-IDF exhibited an accuracy of 94.5%, with a precision of 0.926 for non-spam and 0.891 for spam. Deep Neural Networks using TF-IDF yielded an accuracy of 91.0%, with a recall of 0.991 for non-spam and 0.415 for spam messages. In contrast, classifiers such as K-Nearest Neighbors, Linear Discriminant Analysis, and Decision Trees showed weaker performance, regardless of the feature extraction method employed. Furthermore, we observed substantial variability in classifier effectiveness depending on the chosen feature extraction technique. Our findings emphasize the significance of feature selection in SMS spam detection and suggest that TF-IDF, when paired with Naive Bayes, Support Vector Machines, or Deep Neural Networks, provides the most reliable performance. These insights provide a foundation for improving SMS spam detection through optimized feature extraction and classification methods.
Related papers
- A Comparative Study on TF-IDF feature Weighting Method and its Analysis
using Unstructured Dataset [0.5156484100374058]
Term Frequency-Inverse Document Frequency (TF-IDF) and Natural Language Processing (NLP) are the most highly used information retrieval methods in text classification.
We have investigated and analyzed the feature weighting method for text classification on unstructured data.
The proposed model considered two features N-Grams and TF-IDF on IMDB movie reviews and Amazon Alexa reviews dataset for sentiment analysis.
arXiv Detail & Related papers (2023-08-08T04:27:34Z) - FDINet: Protecting against DNN Model Extraction via Feature Distortion Index [25.69643512837956]
FDINET is a novel defense mechanism that leverages the feature distribution of deep neural network (DNN) models.
It exploits FDI similarity to identify colluding adversaries from distributed extraction attacks.
FDINET exhibits the capability to identify colluding adversaries with an accuracy exceeding 91%.
arXiv Detail & Related papers (2023-06-20T07:14:37Z) - Detecting automatically the layout of clinical documents to enhance the
performances of downstream natural language processing [53.797797404164946]
We designed an algorithm to process clinical PDF documents and extract only clinically relevant text.
The algorithm consists of several steps: initial text extraction using a PDF, followed by classification into such categories as body text, left notes, and footers.
Medical performance was evaluated by examining the extraction of medical concepts of interest from the text in their respective sections.
arXiv Detail & Related papers (2023-05-23T08:38:33Z) - DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability
Curvature [143.5381108333212]
We show that text sampled from an large language model tends to occupy negative curvature regions of the model's log probability function.
We then define a new curvature-based criterion for judging if a passage is generated from a given LLM.
We find DetectGPT is more discriminative than existing zero-shot methods for model sample detection.
arXiv Detail & Related papers (2023-01-26T18:44:06Z) - Android Malware Detection using Feature Ranking of Permissions [0.0]
We use Android permissions as a vehicle to allow for quick and effective differentiation between benign and malware apps.
Our analysis indicates that this approach can result in better accuracy and F-score value than other reported approaches.
arXiv Detail & Related papers (2022-01-20T22:08:20Z) - Deep convolutional forest: a dynamic deep ensemble approach for spam
detection in text [219.15486286590016]
This paper introduces a dynamic deep ensemble model for spam detection that adjusts its complexity and extracts features automatically.
As a result, the model achieved high precision, recall, f1-score and accuracy of 98.38%.
arXiv Detail & Related papers (2021-10-10T17:19:37Z) - NLRG at SemEval-2021 Task 5: Toxic Spans Detection Leveraging BERT-based
Token Classification and Span Prediction Techniques [0.6850683267295249]
In this paper, we explore simple versions of Token Classification or Span Prediction approaches.
We use BERT-based models -- BERT, RoBERTa, and SpanBERT for both approaches.
To this end, we investigate results on four hybrid approaches -- Multi-Span, Span+Token, LSTM-CRF, and a combination of predicted offsets using union/intersection.
arXiv Detail & Related papers (2021-02-24T12:30:09Z) - Detection of Adversarial Supports in Few-shot Classifiers Using Feature
Preserving Autoencoders and Self-Similarity [89.26308254637702]
We propose a detection strategy to highlight adversarial support sets.
We make use of feature preserving autoencoder filtering and also the concept of self-similarity of a support set to perform this detection.
Our method is attack-agnostic and also the first to explore detection for few-shot classifiers to the best of our knowledge.
arXiv Detail & Related papers (2020-12-09T14:13:41Z) - Discriminative Nearest Neighbor Few-Shot Intent Detection by
Transferring Natural Language Inference [150.07326223077405]
Few-shot learning is attracting much attention to mitigate data scarcity.
We present a discriminative nearest neighbor classification with deep self-attention.
We propose to boost the discriminative ability by transferring a natural language inference (NLI) model.
arXiv Detail & Related papers (2020-10-25T00:39:32Z) - Bayesian Optimization with Machine Learning Algorithms Towards Anomaly
Detection [66.05992706105224]
In this paper, an effective anomaly detection framework is proposed utilizing Bayesian Optimization technique.
The performance of the considered algorithms is evaluated using the ISCX 2012 dataset.
Experimental results show the effectiveness of the proposed framework in term of accuracy rate, precision, low-false alarm rate, and recall.
arXiv Detail & Related papers (2020-08-05T19:29:35Z) - Pulsars Detection by Machine Learning with Very Few Features [5.598468451834693]
It is an active topic to investigate the schemes based on machine learning (ML) methods for detecting pulsars.
To improve the detection performance, input features into an ML model should be investigated specifically.
arXiv Detail & Related papers (2020-02-20T01:26:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.