PhishKey: A Novel Centroid-Based Approach for Enhanced Phishing Detection Using Adaptive HTML Component Extraction
- URL: http://arxiv.org/abs/2506.21106v1
- Date: Thu, 26 Jun 2025 09:04:55 GMT
- Title: PhishKey: A Novel Centroid-Based Approach for Enhanced Phishing Detection Using Adaptive HTML Component Extraction
- Authors: Felipe Castaño, Eduardo Fidalgo, Enrique Alegre, Rocio Alaiz-Rodríguez, Raul Orduna, Francesco Zola
- Abstract summary: PhishKey is a novel phishing detection method using automatic feature extraction from hybrid sources. It combines character-level processing with CNN for URL classification, and a Centroid-Based Key Component Phishing Extractor (CAPE) for HTML content at the word level. It achieves up to 98.70% F1 Score and shows strong resistance to adversarial manipulations such as injection attacks with minimal performance degradation.
- Score: 6.514730930658212
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Phishing attacks pose a significant cybersecurity threat, evolving rapidly to bypass detection mechanisms and exploit human vulnerabilities. This paper introduces PhishKey to address the challenges of adaptability, robustness, and efficiency. PhishKey is a novel phishing detection method using automatic feature extraction from hybrid sources. PhishKey combines character-level processing with Convolutional Neural Networks (CNN) for URL classification, and a Centroid-Based Key Component Phishing Extractor (CAPE) for HTML content at the word level. CAPE reduces noise and ensures complete sample processing avoiding crop operations on the input data. The predictions from both modules are integrated using a soft-voting ensemble to achieve more accurate and reliable classifications. Experimental evaluations on four state-of-the-art datasets demonstrate the effectiveness of PhishKey. It achieves up to 98.70% F1 Score and shows strong resistance to adversarial manipulations such as injection attacks with minimal performance degradation.
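The soft-voting step described in the abstract can be illustrated with a minimal sketch. It assumes each module outputs a phishing probability in [0, 1] and that the two are combined by a weighted average; the equal weights, the 0.5 decision threshold, and the function names `soft_vote` and `classify` are illustrative assumptions, not details taken from the paper.

```python
def soft_vote(prob_url: float, prob_html: float, weight_url: float = 0.5) -> float:
    """Weighted average of two phishing probabilities.

    prob_url   -- probability from a URL classifier (hypothetical input)
    prob_html  -- probability from an HTML classifier (hypothetical input)
    weight_url -- weight given to the URL module; the HTML module gets the rest
    """
    if not 0.0 <= weight_url <= 1.0:
        raise ValueError("weight_url must lie in [0, 1]")
    return weight_url * prob_url + (1.0 - weight_url) * prob_html


def classify(prob_url: float, prob_html: float,
             threshold: float = 0.5, weight_url: float = 0.5) -> str:
    """Label a sample from the combined score (the threshold is an assumption)."""
    score = soft_vote(prob_url, prob_html, weight_url)
    return "phishing" if score >= threshold else "benign"
```

With equal weights, a confident URL prediction (0.9) can outvote an uncertain HTML prediction (0.3), because the averaged score of about 0.6 exceeds the threshold. Averaging probabilities rather than hard labels is what distinguishes soft voting from majority voting.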
Related papers
- Wavelet Scattering Transform and Fourier Representation for Offline Detection of Malicious Clients in Federated Learning [44.99833362998488]
Federated Learning (FL) enables the training of machine learning models across decentralized clients while preserving data privacy. The presence of anomalous or corrupted clients can significantly degrade model performance. We propose WAFFLE, a detection algorithm that labels malicious clients before training. A lightweight detector, trained on a distilled public dataset, performs the labeling with minimal communication and computational overhead.
arXiv Detail & Related papers (2025-06-11T12:48:00Z) - PhishVQC: Optimizing Phishing URL Detection with Correlation Based Feature Selection and Variational Quantum Classifier [0.0]
Motivated by quantum computing, this paper proposes using Variational Quantum Classifiers (VQC) to enhance phishing URL detection. We present PhishVQC, a quantum model that combines quantum maps and variational ansatzes such as RealAmplitude and EfficientSU2. This highlights the potential of quantum machine learning to improve phishing detection accuracy.
arXiv Detail & Related papers (2025-03-03T18:28:01Z) - PhishGuard: A Multi-Layered Ensemble Model for Optimal Phishing Website Detection [0.0]
Phishing attacks are a growing cybersecurity threat, leveraging deceptive techniques to steal sensitive information through malicious websites.
This paper introduces PhishGuard, an optimal custom ensemble model designed to improve phishing site detection.
The model combines multiple machine learning classifiers, including Random Forest, Gradient Boosting, CatBoost, and XGBoost, to enhance detection accuracy.
arXiv Detail & Related papers (2024-09-29T23:15:57Z) - Open-Set Deepfake Detection: A Parameter-Efficient Adaptation Method with Forgery Style Mixture [58.60915132222421]
We introduce an approach that is both general and parameter-efficient for face forgery detection.
We design a forgery-style mixture formulation that augments the diversity of forgery source domains.
We show that the designed model achieves state-of-the-art generalizability with significantly reduced trainable parameters.
arXiv Detail & Related papers (2024-08-23T01:53:36Z) - A Sophisticated Framework for the Accurate Detection of Phishing Websites [0.0]
Phishing is an increasingly sophisticated form of cyberattack that is inflicting huge financial damage to corporations throughout the globe.
This paper proposes a comprehensive methodology for detecting phishing websites.
A combination of feature selection, a greedy algorithm, cross-validation, and deep learning methods has been used to construct a sophisticated stacking ensemble.
arXiv Detail & Related papers (2024-03-13T14:26:25Z) - AntiPhishStack: LSTM-based Stacked Generalization Model for Optimized Phishing URL Detection [0.32141666878560626]
This paper introduces a two-phase stack generalized model named AntiPhishStack, designed to detect phishing sites.
The model leverages the learning of URLs and character-level TF-IDF features symmetrically, enhancing its ability to combat emerging phishing threats.
Experimental validation on two benchmark datasets, comprising benign and phishing or malicious URLs, demonstrates the model's exceptional performance, achieving a notable 96.04% accuracy compared to existing studies.
arXiv Detail & Related papers (2024-01-17T03:44:27Z) - IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks [45.81957796169348]
Backdoor attacks are an insidious security threat against machine learning models.
We introduce IMBERT, which uses either gradients or self-attention scores derived from victim models to self-defend against backdoor attacks.
Our empirical studies demonstrate that IMBERT can effectively identify up to 98.5% of inserted triggers.
arXiv Detail & Related papers (2023-05-25T22:08:57Z) - PhishSim: Aiding Phishing Website Detection with a Feature-Free Tool [12.468922937529966]
We propose a feature-free method for detecting phishing websites using the Normalized Compression Distance (NCD).
This measure computes the similarity of two websites by compressing them, thus eliminating the need to perform any feature extraction.
We use the Furthest Point First algorithm to perform phishing prototype extractions, in order to select instances that are representative of a cluster of phishing webpages.
arXiv Detail & Related papers (2022-07-13T20:44:03Z) - Deep convolutional forest: a dynamic deep ensemble approach for spam detection in text [219.15486286590016]
This paper introduces a dynamic deep ensemble model for spam detection that adjusts its complexity and extracts features automatically.
As a result, the model achieved high precision, recall, f1-score and accuracy of 98.38%.
arXiv Detail & Related papers (2021-10-10T17:19:37Z) - Spotting adversarial samples for speaker verification by neural vocoders [102.1486475058963]
We adopt neural vocoders to spot adversarial samples for automatic speaker verification (ASV).
We find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discriminating between genuine and adversarial samples.
Our code will be made open-source to support comparison in future work.
arXiv Detail & Related papers (2021-07-01T08:58:16Z) - How Robust are Randomized Smoothing based Defenses to Data Poisoning? [66.80663779176979]
We present a previously unrecognized threat to robust machine learning models that highlights the importance of training-data quality.
We propose a novel bilevel optimization-based data poisoning attack that degrades the robustness guarantees of certifiably robust classifiers.
Our attack is effective even when the victim trains the models from scratch using state-of-the-art robust training methods.
arXiv Detail & Related papers (2020-12-02T15:30:21Z) - Bayesian Optimization with Machine Learning Algorithms Towards Anomaly Detection [66.05992706105224]
In this paper, an effective anomaly detection framework is proposed using the Bayesian Optimization technique.
The performance of the considered algorithms is evaluated using the ISCX 2012 dataset.
Experimental results show the effectiveness of the proposed framework in terms of accuracy, precision, low false-alarm rate, and recall.
arXiv Detail & Related papers (2020-08-05T19:29:35Z)
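The Normalized Compression Distance mentioned in the PhishSim entry above has a standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The sketch below uses Python's `zlib` as the compressor; that choice is an assumption made for illustration, since the listing does not say which compressor PhishSim uses, and the helper names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (the compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they share little.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy inputs, not real webpages: one repetitive page and one unrelated byte pattern.
page_a = b"<html><body><form action='login'>user pass</form></body></html>" * 10
page_c = bytes((i * 37 + 11) % 256 for i in range(700))

print(ncd(page_a, page_a))  # a page compared with itself: small distance
print(ncd(page_a, page_c))  # unrelated content: larger distance
```

Because the distance is computed from compressed sizes alone, no hand-crafted feature extraction is needed, which is the property the PhishSim summary emphasizes.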
This list is automatically generated from the titles and abstracts of the papers in this site.