Can Features for Phishing URL Detection Be Trusted Across Diverse Datasets? A Case Study with Explainable AI
- URL: http://arxiv.org/abs/2411.09813v2
- Date: Fri, 22 Nov 2024 16:14:02 GMT
- Title: Can Features for Phishing URL Detection Be Trusted Across Diverse Datasets? A Case Study with Explainable AI
- Authors: Maraz Mia, Darius Derakhshan, Mir Mehedi A. Pritom
- Abstract summary: Phishing has been a prevalent cyber threat that manipulates users into revealing sensitive private information through deceptive tactics.
Proactive detection of phishing URLs (or websites) has been established as a widely accepted defense approach.
We analyze two publicly available phishing URL datasets, where each dataset has its own set of unique and overlapping features related to URL string and website contents.
- Abstract: Phishing has been a prevalent cyber threat that manipulates users into revealing sensitive private information through deceptive tactics designed to masquerade as trustworthy entities. Over the years, proactive detection of phishing URLs (or websites) has been established as a widely accepted defense approach. In the literature, we often find supervised Machine Learning (ML) models with highly competitive performance for detecting phishing websites based on features extracted from both phishing and benign (i.e., legitimate) websites. However, it is still unclear whether these features or indicators are dependent on a particular dataset or whether they generalize for overall phishing detection. In this paper, we delve deeper into this issue by analyzing two publicly available phishing URL datasets, where each dataset has its own set of unique and overlapping features related to the URL string and website contents. We investigate whether overlapping features are similar in nature across datasets and how a model performs when trained on one dataset and tested on the other. We conduct practical experiments and leverage explainable AI (XAI) methods, such as SHAP plots, to provide insights into different features' contributions to phishing detection, answering our primary question: "Can features for phishing URL detection be trusted across diverse datasets?". Our case study results show that features for phishing URL detection can often be dataset-dependent and thus may not be trusted across different datasets, even when they share the same set of feature behaviors.
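As context for the "features related to URL string" that the abstract refers to, a few of the commonly used URL-string signals (length, digit count, '@' symbols, IP-address hosts) can be sketched in a handful of lines. This is an illustrative subset only, not the actual feature set of either dataset in the paper:

```python
import re
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Illustrative subset of common URL-string features used in phishing detection."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "length": len(url),                       # phishing URLs tend to be longer
        "num_digits": sum(c.isdigit() for c in url),
        "num_dots": url.count("."),               # many subdomains is a common signal
        "has_at": int("@" in url),                # '@' can obscure the real destination
        "has_ip_host": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        "uses_https": int(parsed.scheme == "https"),
    }

print(url_features("http://192.168.0.1/login?user@paypal.com"))
```

Whether such features carry the same signal across two different datasets is exactly the question the paper probes with SHAP-based analysis.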
Related papers
- PhishNet: A Phishing Website Detection Tool using XGBoost [1.777434178384403]
PhishNet is a cutting-edge web application designed to detect phishing websites using advanced machine learning.
It aims to help individuals and organizations identify and prevent phishing attacks through a robust AI framework.
arXiv Detail & Related papers (2024-06-29T21:31:13Z) - PhishGuard: A Convolutional Neural Network Based Model for Detecting Phishing URLs with Explainability Analysis [1.102674168371806]
Phishing URL identification is the best way to address the problem.
Various machine learning and deep learning methods have been proposed to automate the detection of phishing URLs.
We propose a 1D Convolutional Neural Network (CNN) and train the model with an extensive feature set and a substantial amount of data.
arXiv Detail & Related papers (2024-04-27T17:13:49Z) - A Sophisticated Framework for the Accurate Detection of Phishing Websites [0.0]
Phishing is an increasingly sophisticated form of cyberattack that is inflicting huge financial damage to corporations throughout the globe.
This paper proposes a comprehensive methodology for detecting phishing websites.
A combination of feature selection, greedy algorithm, cross-validation, and deep learning methods have been utilized to construct a sophisticated stacking ensemble.
arXiv Detail & Related papers (2024-03-13T14:26:25Z) - Federated Causal Discovery from Heterogeneous Data [70.31070224690399]
We propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data.
These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy.
We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method.
arXiv Detail & Related papers (2024-02-20T18:53:53Z) - Mitigating Bias in Machine Learning Models for Phishing Webpage Detection [0.8050163120218178]
Phishing, a well-known cyberattack, revolves around the creation of phishing webpages and the dissemination of corresponding URLs.
Various techniques are available for preemptively categorizing zero-day phishing URLs by distilling unique attributes and constructing predictive models.
This proposal delves into persistent challenges within phishing detection solutions, particularly concentrated on the preliminary phase of assembling comprehensive datasets.
We propose a potential solution in the form of a tool engineered to alleviate bias in ML models.
arXiv Detail & Related papers (2024-01-16T13:45:54Z) - Client-specific Property Inference against Secure Aggregation in Federated Learning [52.8564467292226]
Federated learning has become a widely used paradigm for collaboratively training a common model among different participants.
Many attacks have shown that it is still possible to infer sensitive information such as membership, property, or outright reconstruction of participant data.
We show that simple linear models can effectively capture client-specific properties only from the aggregated model updates.
arXiv Detail & Related papers (2023-03-07T14:11:01Z) - CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning [63.72975421109622]
CleanCLIP is a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks.
CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.
arXiv Detail & Related papers (2023-03-06T17:48:32Z) - Robust Transferable Feature Extractors: Learning to Defend Pre-Trained Networks Against White Box Adversaries [69.53730499849023]
We show that adversarial examples can be successfully transferred to another independently trained model to induce prediction errors.
We propose a deep learning-based pre-processing mechanism, which we refer to as a robust transferable feature extractor (RTFE).
arXiv Detail & Related papers (2022-09-14T21:09:34Z) - Black-box Dataset Ownership Verification via Backdoor Watermarking [67.69308278379957]
We formulate the protection of released datasets as verifying whether they are adopted for training a (suspicious) third-party model.
We propose to embed external patterns via backdoor watermarking for the ownership verification to protect them.
Specifically, we exploit poison-only backdoor attacks (e.g., BadNets) for dataset watermarking and design a hypothesis-test-guided method for dataset verification.
arXiv Detail & Related papers (2022-08-04T05:32:20Z) - PhishSim: Aiding Phishing Website Detection with a Feature-Free Tool [12.468922937529966]
We propose a feature-free method for detecting phishing websites using the Normalized Compression Distance (NCD).
This measure computes the similarity of two websites by compressing them, thus eliminating the need to perform any feature extraction.
We use the Furthest Point First algorithm to perform phishing prototype extractions, in order to select instances that are representative of a cluster of phishing webpages.
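The NCD described above has a standard closed form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length of its argument. A minimal sketch using zlib as the compressor (a stand-in; PhishSim's actual compressor choice is not specified here):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for near-identical inputs,
    approaching 1 (or slightly above) for unrelated ones."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical page contents for illustration only.
page_a = b"<html><body>Please verify your account password</body></html>" * 20
page_b = b"<html><body>Please verify your account passw0rd</body></html>" * 20
page_c = bytes(range(256)) * 8  # dissimilar byte content

# Near-duplicate pages score much lower than unrelated ones.
print(ncd(page_a, page_b), ncd(page_a, page_c))
```

Because similarity comes entirely from how well the concatenation compresses, no hand-crafted feature extraction is needed, which is the point of the feature-free approach.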
arXiv Detail & Related papers (2022-07-13T20:44:03Z) - Finding Facial Forgery Artifacts with Parts-Based Detectors [73.08584805913813]
We design a series of forgery detection systems that each focus on one individual part of the face.
We use these detectors to perform detailed empirical analysis on the FaceForensics++, Celeb-DF, and Facebook Deepfake Detection Challenge datasets.
arXiv Detail & Related papers (2021-09-21T16:18:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.