Classifying spam emails using agglomerative hierarchical clustering and
a topic-based approach
- URL: http://arxiv.org/abs/2402.05296v1
- Date: Wed, 7 Feb 2024 22:19:08 GMT
- Title: Classifying spam emails using agglomerative hierarchical clustering and
a topic-based approach
- Authors: F. Janez-Martino, R. Alaiz-Rodriguez, V. Gonzalez-Castro, E. Fidalgo,
and E. Alegre
- Abstract summary: We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them using agglomerative hierarchical clustering into 11 classes.
We evaluate 16 pipelines, combining four text representation techniques, namely Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT, with four classifiers: Support Vector Machine, Naïve Bayes, Random Forest and Logistic Regression. Experimental results show that the highest performance is achieved with TF-IDF and LR for the English dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spam emails are unsolicited, annoying and sometimes harmful messages which
may contain malware, phishing or hoaxes. Unlike most studies that address the
design of efficient anti-spam filters, we approach the spam email problem from
a different and novel perspective. Focusing on the needs of cybersecurity
units, we follow a topic-based approach for addressing the classification of
spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S,
two novel datasets with approximately 15K emails each in English and Spanish,
respectively, and we label them using agglomerative hierarchical clustering
into 11 classes. We evaluate 16 pipelines, combining four text representation
techniques, namely Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words,
Word2Vec and BERT, with four classifiers: Support Vector Machine, Naïve Bayes,
Random Forest and Logistic Regression. Experimental results show that the
highest performance is achieved with TF-IDF and LR for the English dataset,
with an F1 score of 0.953 and an accuracy of 94.6%, while for the Spanish
dataset, TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy.
Regarding the processing time, TF-IDF with LR leads to the fastest
classification, processing an English and Spanish spam email in and on average,
respectively.
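The workflow described in the abstract (label emails by agglomerative hierarchical clustering, then train a TF-IDF plus Logistic Regression classifier and time its predictions) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: it assumes scikit-learn is installed, the six toy emails are invented, three clusters stand in for the paper's 11 classes, and the default Ward linkage is an assumption rather than the paper's setting.

```python
import time

from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, unlabeled spam emails invented purely for illustration.
emails = [
    "cheap pills online pharmacy discount",
    "discount pharmacy pills no prescription",
    "verify your bank account password now",
    "your bank account is suspended confirm password",
    "you won the lottery claim your prize",
    "claim your lottery prize money today",
]

# Step 1: group the emails with agglomerative hierarchical clustering.
# AgglomerativeClustering needs a dense array, hence .toarray().
features = TfidfVectorizer().fit_transform(emails).toarray()
cluster_ids = AgglomerativeClustering(n_clusters=3).fit_predict(features)

# Step 2: treat the cluster ids as class labels and train TF-IDF + LR.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(emails, cluster_ids)

# Step 3: classify unseen emails and measure the average time per email.
new_emails = ["confirm your bank password", "pharmacy pills at a discount"]
start = time.perf_counter()
predictions = classifier.predict(new_emails)
avg_ms = (time.perf_counter() - start) * 1000 / len(new_emails)

for text, label in zip(new_emails, predictions):
    print(f"cluster {label}: {text}")
print(f"average time per email: {avg_ms:.3f} ms")
```

In practice the cluster ids would still need to be inspected and mapped to named topics before being used as classes. Swapping `TfidfVectorizer` or `LogisticRegression` for another representation or classifier gives the other pipeline combinations.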
Related papers
- Zero-Shot Spam Email Classification Using Pre-trained Large Language Models [0.0]
This paper investigates the application of pre-trained large language models (LLMs) for spam email classification using zero-shot prompting.
We evaluate the performance of both open-source (Flan-T5) and proprietary LLMs (ChatGPT, GPT-4) on the well-known SpamAssassin dataset.
arXiv Detail & Related papers (2024-05-24T20:55:49Z)
- Evaluating the Performance of ChatGPT for Spam Email Detection [9.585304538597414]
This study attempts to evaluate ChatGPT's capabilities for spam identification in both English and Chinese email datasets.
We employ ChatGPT for spam email detection using in-context learning, which requires a prompt instruction and a few demonstrations.
We also investigate how the number of demonstrations in the prompt affects the performance of ChatGPT.
arXiv Detail & Related papers (2024-02-23T04:52:08Z)
- Prompted Contextual Vectors for Spear-Phishing Detection [45.07804966535239]
Spear-phishing attacks present a significant security challenge.
We propose a detection approach based on a novel document vectorization method.
Our method achieves a 91% F1 score in identifying LLM-generated spear-phishing emails.
arXiv Detail & Related papers (2024-02-13T09:12:55Z)
- Building an Effective Email Spam Classification Model with spaCy [0.0]
The author uses the spaCy natural language processing library and three machine learning (ML) algorithms, Naive Bayes (NB), Decision Tree C45 and Multilayer Perceptron (MLP), in Python to detect spam emails collected from the Gmail service.
arXiv Detail & Related papers (2023-03-15T17:41:11Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- Anomaly Detection in Emails using Machine Learning and Header Information [0.0]
Anomalies in emails such as phishing and spam present major security risks.
Previous studies on email anomaly detection relied on a single type of anomaly and the analysis of the email body and subject content.
This study conducted feature extraction and selection on email header datasets and leveraged both multi and one-class anomaly detection approaches.
arXiv Detail & Related papers (2022-03-19T23:31:23Z)
- Deep convolutional forest: a dynamic deep ensemble approach for spam detection in text [219.15486286590016]
This paper introduces a dynamic deep ensemble model for spam detection that adjusts its complexity and extracts features automatically.
As a result, the model achieved high precision, recall, f1-score and accuracy of 98.38%.
arXiv Detail & Related papers (2021-10-10T17:19:37Z)
- Detecting Handwritten Mathematical Terms with Sensor Based Data [71.84852429039881]
We propose a solution to the UbiComp 2021 Challenge by Stabilo in which handwritten mathematical terms are supposed to be automatically classified.
The input data set contains data of different writers, with label strings constructed from a total of 15 different possible characters.
arXiv Detail & Related papers (2021-09-12T19:33:34Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- Classification of Spam Emails through Hierarchical Clustering and Supervised Learning [1.8065361710947976]
We propose to classify spam email into categories to improve the handling of already detected spam emails.
For the task of multi-class spam classification, the results favor (i) TF-IDF combined with SVM for the best micro F1 score, 95.39%, and (ii) TF-IDF along with NB for the fastest spam classification, analyzing an email in 2.13 ms.
arXiv Detail & Related papers (2020-05-18T14:41:22Z)