Classification of Spam Emails through Hierarchical Clustering and
Supervised Learning
- URL: http://arxiv.org/abs/2005.08773v2
- Date: Thu, 28 May 2020 15:36:25 GMT
- Title: Classification of Spam Emails through Hierarchical Clustering and
Supervised Learning
- Authors: Francisco Jáñez-Martino, Eduardo Fidalgo, Santiago
González-Martínez, Javier Velasco-Mata
- Abstract summary: We propose classifying spam emails into categories to improve the handling of already detected spam emails.
For multi-class spam classification, we recommend (i) TF-IDF combined with SVM for the best micro F1 score, 95.39%, and (ii) TF-IDF along with NB for the fastest spam classification, analyzing an email in 2.13 ms.
- Score: 1.8065361710947976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spammers take advantage of the popularity of email to send unsolicited
emails indiscriminately. Although researchers and organizations continuously develop
anti-spam filters based on binary classification, spammers bypass them through
new strategies, like word obfuscation or image-based spam. For the first time
in the literature, we propose to classify spam emails into categories to improve the
handling of already detected spam emails, instead of just using a binary model.
First, we applied a hierarchical clustering algorithm to create SPEMC-11K
(SPam EMail Classification), the first multi-class dataset, which contains
three types of spam emails: Health and Technology, Personal Scams, and Sexual
Content. Then, we used SPEMC-11K to evaluate the combination of TF-IDF and
BOW encodings with Naïve Bayes, Decision Trees and SVM classifiers. Finally,
for the task of multi-class spam classification, we recommend the use of (i)
TF-IDF combined with SVM for the best micro F1 score performance, 95.39%,
and (ii) TF-IDF along with NB for the fastest spam classification, analyzing an
email in 2.13 ms.
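The TF-IDF plus Naive Bayes combination named in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the Laplace smoothing, and the six toy emails are all assumptions made for illustration, and only the three category names come from the abstract. The `micro_f1` helper pools true positives, false positives, and false negatives over all classes; for a single-label task like this one, that value equals plain accuracy.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency per term: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """Map a document to {term: tf * idf}, dropping out-of-vocabulary terms."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


class NaiveBayes:
    """Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing."""

    def fit(self, vectors, labels, vocab):
        self.classes = sorted(set(labels))
        self.log_prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        totals = {c: defaultdict(float) for c in self.classes}
        for vec, label in zip(vectors, labels):
            for term, weight in vec.items():
                totals[label][term] += weight
        self.log_like = {}
        for c in self.classes:
            denom = sum(totals[c].values()) + len(vocab)
            self.log_like[c] = {t: math.log((totals[c][t] + 1.0) / denom) for t in vocab}
        return self

    def predict(self, vec):
        def score(c):
            return self.log_prior[c] + sum(w * self.log_like[c][t] for t, w in vec.items())
        return max(self.classes, key=score)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    fp = fn = len(y_true) - tp  # single-label task: each error is one FP and one FN
    return 2 * tp / (2 * tp + fp + fn) if y_true else 0.0


# Hypothetical toy emails; only the three category names come from the abstract.
train = [
    ("cheap pills online pharmacy discount", "Health and Technology"),
    ("weight loss pills and cheap software licenses", "Health and Technology"),
    ("you won the lottery send bank details", "Personal Scams"),
    ("urgent inheritance transfer needs your bank account", "Personal Scams"),
    ("hot singles want to chat tonight", "Sexual Content"),
    ("adult videos and hot dating tonight", "Sexual Content"),
]
docs = [text for text, _ in train]
labels = [label for _, label in train]
idf = fit_idf(docs)
model = NaiveBayes().fit([tfidf(d, idf) for d in docs], labels, list(idf))

print(model.predict(tfidf("discount pharmacy pills", idf)))
```

Swapping the classifier for an SVM or the encoding for raw bag-of-words counts would follow the same fit-then-predict shape; those variants are left out here to keep the sketch short.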
Related papers
- Investigating the Effectiveness of Bayesian Spam Filters in Detecting LLM-modified Spam Mails [1.6298172960110866]
Spam and phishing remain critical threats in cybersecurity, responsible for nearly 90% of security incidents.
As these attacks grow in sophistication, the need for robust defensive mechanisms intensifies.
The emergence of large language models (LLMs) such as ChatGPT presents new challenges.
This work aims to evaluate the robustness and effectiveness of SpamAssassin against LLM-modified email content.
arXiv Detail & Related papers (2024-08-26T14:25:30Z)
- Federated Combinatorial Multi-Agent Multi-Armed Bandits [79.1700188160944]
This paper introduces a federated learning framework tailored for online optimization with bandit feedback.
In this setting, agents select subsets of arms, observe noisy rewards for these subsets without accessing individual arm information, and can cooperate and share information at specific intervals.
arXiv Detail & Related papers (2024-05-09T17:40:09Z)
- Prompted Contextual Vectors for Spear-Phishing Detection [45.07804966535239]
Spear-phishing attacks present a significant security challenge.
We propose a detection approach based on a novel document vectorization method.
Our method achieves a 91% F1 score in identifying LLM-generated spear-phishing emails.
arXiv Detail & Related papers (2024-02-13T09:12:55Z)
- Classifying spam emails using agglomerative hierarchical clustering and
a topic-based approach [0.0]
We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them using agglomerative hierarchical clustering into 11 classes.
We evaluate 16 pipelines, combining four text representation techniques (Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT) and four classifiers: Support Vector Machine, Naïve Bayes, Random Forest and Logistic Regression. Experimental results show that the highest performance is achieved with TF-IDF and LR for the English dataset.
arXiv Detail & Related papers (2024-02-07T22:19:08Z)
- Building an Effective Email Spam Classification Model with spaCy [0.0]
The author used the spaCy natural language processing library and three machine learning (ML) algorithms, Naive Bayes (NB), Decision Tree C45 and Multilayer Perceptron (MLP), in the Python programming language to detect spam emails collected from the Gmail service.
arXiv Detail & Related papers (2023-03-15T17:41:11Z)
- Anomaly Detection in Emails using Machine Learning and Header
Information [0.0]
Anomalies in emails such as phishing and spam present major security risks.
Previous studies on email anomaly detection relied on a single type of anomaly and the analysis of the email body and subject content.
This study conducted feature extraction and selection on email header datasets and leveraged both multi and one-class anomaly detection approaches.
arXiv Detail & Related papers (2022-03-19T23:31:23Z)
- Deep convolutional forest: a dynamic deep ensemble approach for spam
detection in text [219.15486286590016]
This paper introduces a dynamic deep ensemble model for spam detection that adjusts its complexity and extracts features automatically.
As a result, the model achieved high precision, recall, F1-score, and accuracy of 98.38%.
arXiv Detail & Related papers (2021-10-10T17:19:37Z)
- Rank-Consistency Deep Hashing for Scalable Multi-Label Image Search [90.30623718137244]
We propose a novel deep hashing method for scalable multi-label image search.
A new rank-consistency objective is applied to align the similarity orders from two spaces.
A powerful loss function is designed to penalize the samples whose semantic similarity and hamming distance are mismatched.
arXiv Detail & Related papers (2021-02-02T13:46:58Z)
- Privacy-Preserving Spam Filtering using Functional Encryption [1.0019926246026924]
We construct a spam classification framework that enables the classification of encrypted emails.
Our model is based on a neural network with a quadratic network part and a multi-layer perceptron network part.
The evaluation results on real-world spam datasets indicate that our proposed spam classification model achieves an accuracy of over 96%.
arXiv Detail & Related papers (2020-12-08T02:14:28Z)
- Robust and Verifiable Information Embedding Attacks to Deep Neural
Networks via Error-Correcting Codes [81.85509264573948]
In the era of deep learning, a user often leverages a third-party machine learning tool to train a deep neural network (DNN) classifier.
In an information embedding attack, an attacker is the provider of a malicious third-party machine learning tool.
In this work, we aim to design information embedding attacks that are verifiable and robust against popular post-processing methods.
arXiv Detail & Related papers (2020-10-26T17:42:42Z)
- Learning with Weak Supervision for Email Intent Detection [56.71599262462638]
We propose to leverage user actions as a source of weak supervision to detect intents in emails.
We develop an end-to-end robust deep neural network model for email intent identification.
arXiv Detail & Related papers (2020-05-26T23:41:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.