A New Dataset and Methodology for Malicious URL Classification
- URL: http://arxiv.org/abs/2501.00356v1
- Date: Tue, 31 Dec 2024 09:10:38 GMT
- Title: A New Dataset and Methodology for Malicious URL Classification
- Authors: Ilan Schvartzman, Roei Sarussi, Maor Ashkenazi, Ido Kringel, Yaniv Tocker, Tal Furman Shohet
- Abstract summary: Malicious URL (Uniform Resource Locator) classification is a pivotal aspect of Cybersecurity, offering defense against web-based threats.
Despite deep learning's promise in this area, its advancement is hindered by two main challenges: the scarcity of comprehensive, open-source datasets and the limitations of existing models.
We introduce a novel, multi-class dataset for malicious URL classification, distinguishing between benign, phishing and malicious URLs, named DeepURLBench.
- Score: 2.835223467109843
- Abstract: Malicious URL (Uniform Resource Locator) classification is a pivotal aspect of Cybersecurity, offering defense against web-based threats. Despite deep learning's promise in this area, its advancement is hindered by two main challenges: the scarcity of comprehensive, open-source datasets and the limitations of existing models, which either lack real-time capabilities or exhibit suboptimal performance. To address these gaps, we introduce a novel, multi-class dataset for malicious URL classification, distinguishing between benign, phishing and malicious URLs, named DeepURLBench. The data has been rigorously cleansed and structured, providing a superior alternative to existing datasets. Notably, the multi-class approach enhances the performance of deep learning models compared to a standard binary classification approach. Additionally, we propose improvements to string-based URL classifiers, applying these enhancements to URLNet. Key among these is the integration of DNS-derived features, which enrich the model's capabilities and lead to notable performance gains while preserving real-time runtime efficiency, achieving an effective balance for cybersecurity applications.
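As a rough illustration of the two ideas in the abstract (three-class labels instead of binary ones, and DNS-derived features alongside the URL string), here is a minimal sketch. It is not the authors' implementation or URLNet itself: the `DnsFeatures` fields, the character encoding, and all function names are hypothetical, chosen only to show how string-based and DNS-based inputs could be combined into one feature vector.

```python
from dataclasses import dataclass
from typing import List

# The three classes named in the abstract.
LABELS = ["benign", "phishing", "malicious"]


@dataclass
class DnsFeatures:
    # Hypothetical DNS-derived features; the abstract does not list the real ones.
    num_a_records: int
    min_ttl_seconds: int
    has_mx_record: bool


def encode_url_chars(url: str, max_len: int = 16) -> List[int]:
    """Map each character to an integer id (0 is reserved for padding)."""
    ids = [min(ord(c), 255) + 1 for c in url[:max_len]]
    return ids + [0] * (max_len - len(ids))


def build_features(url: str, dns: DnsFeatures, max_len: int = 16) -> List[float]:
    """Concatenate the string-based encoding with the DNS-derived values."""
    char_ids = [float(i) for i in encode_url_chars(url, max_len)]
    dns_vec = [
        float(dns.num_a_records),
        float(dns.min_ttl_seconds),
        float(dns.has_mx_record),
    ]
    return char_ids + dns_vec


def to_binary(label: str) -> int:
    """Collapse the three-class label into the standard binary setup."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    return 0 if label == "benign" else 1
```

A classifier would consume the output of `build_features`; training against `LABELS` directly corresponds to the multi-class setup, while `to_binary` reproduces the binary baseline the abstract compares against.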
Related papers
- Efficient Phishing URL Detection Using Graph-based Machine Learning and Loopy Belief Propagation [12.89058029173131]
We propose a graph-based machine learning model for phishing URL detection.
We integrate URL structure and network-level features such as IP addresses and authoritative name servers.
Experiments on real-world datasets demonstrate our model's effectiveness by achieving an F1 score of up to 98.77%.
arXiv Detail & Related papers (2025-01-12T19:49:00Z)
- Enhancing web traffic attacks identification through ensemble methods and feature selection [1.3652530361013693]
This study aims to enhance the identification of web traffic attacks by leveraging machine learning techniques.
A methodology was proposed to extract relevant features from HTTP traces using the CSIC2010 v2 dataset.
Ensemble methods, such as Random Forest and Extreme Gradient Boosting, were employed and compared against baseline classifiers.
arXiv Detail & Related papers (2024-12-21T22:13:30Z)
- Boosting Alignment for Post-Unlearning Text-to-Image Generative Models [55.82190434534429]
Large-scale generative models have shown impressive image-generation capabilities, propelled by massive data.
This often inadvertently leads to the generation of harmful or inappropriate content and raises copyright concerns.
We propose a framework that seeks an optimal model update at each unlearning iteration, ensuring monotonic improvement on both objectives.
arXiv Detail & Related papers (2024-12-09T21:36:10Z)
- Few-Shot Class-Incremental Learning with Non-IID Decentralized Data [12.472285188772544]
Few-shot class-incremental learning is crucial for developing scalable and adaptive intelligent systems.
This paper introduces federated few-shot class-incremental learning, a decentralized machine learning paradigm.
We present a synthetic data-driven framework that leverages replay buffer data to maintain existing knowledge and facilitate the acquisition of new knowledge.
arXiv Detail & Related papers (2024-09-18T02:48:36Z)
- DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification [4.585051136007553]
We introduce DomURLs_BERT, a pre-trained BERT-based encoder for detecting and classifying suspicious/malicious domains and URLs.
The proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets.
arXiv Detail & Related papers (2024-09-13T18:59:13Z)
- Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face emerging challenges from the re-identification attack ability of Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
- URLBERT: A Contrastive and Adversarial Pre-trained Model for URL Classification [10.562100395816595]
URLs play a crucial role in understanding and categorizing web content.
This paper introduces URLBERT, the first pre-trained representation learning model applied to a variety of URL classification or detection tasks.
arXiv Detail & Related papers (2024-02-18T07:51:20Z)
- Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to the model stealing attack, a nefarious endeavor geared towards duplicating the target model via query permissions.
We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z)
- Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
- Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID [55.21702895051287]
Domain adaptive object re-ID aims to transfer the learned knowledge from the labeled source domain to the unlabeled target domain.
We propose a novel self-paced contrastive learning framework with hybrid memory.
Our method outperforms state-of-the-art methods on multiple domain adaptation tasks of object re-ID.
arXiv Detail & Related papers (2020-06-04T09:12:44Z)
- Contradictory Structure Learning for Semi-supervised Domain Adaptation [67.89665267469053]
Current adversarial adaptation methods attempt to align the cross-domain features.
Two challenges remain unsolved: 1) the conditional distribution mismatch and 2) the bias of the decision boundary towards the source domain.
We propose a novel framework for semi-supervised domain adaptation by unifying the learning of opposite structures.
arXiv Detail & Related papers (2020-02-06T22:58:20Z)