Related papers: DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification

URL: http://arxiv.org/abs/2409.09143v1
Date: Fri, 13 Sep 2024 18:59:13 GMT
Title: DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification
Authors: Abdelkader El Mahdaouy, Salima Lamsiyah, Meryem Janati Idrissi, Hamza Alami, Zakaria Yartaoui, Ismail Berrada
Abstract summary: We introduce DomURLs_BERT, a pre-trained BERT-based encoder for detecting and classifying suspicious/malicious domains and URLs. The proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets.
Score: 4.585051136007553
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Detecting and classifying suspicious or malicious domain names and URLs is a fundamental task in cybersecurity. To leverage such indicators of compromise, cybersecurity vendors and practitioners often maintain and update blacklists of known malicious domains and URLs. However, blacklists frequently fail to identify emerging and obfuscated threats. Over the past few decades, there has been significant interest in developing machine learning models that automatically detect malicious domains and URLs, addressing the limitations of blacklist maintenance and updates. In this paper, we introduce DomURLs_BERT, a pre-trained BERT-based encoder adapted for detecting and classifying suspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the Masked Language Modeling (MLM) objective on a large multilingual corpus of URLs, domain names, and Domain Generation Algorithms (DGA) dataset. In order to assess the performance of DomURLs_BERT, we have conducted experiments on several binary and multi-class classification tasks involving domain names and URLs, covering phishing, malware, DGA, and DNS tunneling. The evaluation results show that the proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets. The pre-training dataset, the pre-trained DomURLs_BERT encoder, and the experiments source code are publicly available.
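
The Masked Language Modeling objective named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the character-level tokenizer, the `mask_tokens` helper, the 15% default rate, and the rule of always substituting `[MASK]` are assumptions chosen for illustration (BERT-style training typically also keeps or randomly replaces some selected tokens, and the paper's actual tokenizer is not described here).

```python
import random

MASK = "[MASK]"  # placeholder token; the name follows BERT convention


def char_tokenize(text):
    """Hypothetical tokenizer: one token per character of the domain or URL."""
    return list(text)


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens for an MLM-style training example.

    Returns (masked, labels). labels[i] holds the original token where
    position i was masked and None elsewhere, so a model would only be
    scored on the hidden positions.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    tokens = char_tokenize("login.example.com")
    masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=42)
    print(masked)
    print(labels)
```

During pre-training, an encoder would receive the masked sequence and be trained to predict the entries of `labels` that are not `None`; the fine-tuning tasks listed in the abstract would then reuse that encoder with a classification head.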

Related papers

Training Large Language Models for Advanced Typosquatting Detection [0.0]
Typosquatting is a cyber threat that exploits human error in typing URLs to deceive users, distribute malware, and conduct phishing attacks. This study introduces a novel approach leveraging large language models (LLMs) to enhance typosquatting detection. Experimental results indicate that the Phi-4 14B model outperformed the other tested models when properly fine-tuned, achieving a 98% accuracy rate with only a few thousand training samples.
arXiv Detail & Related papers (2025-03-28T13:16:27Z)
Efficient Phishing URL Detection Using Graph-based Machine Learning and Loopy Belief Propagation [12.89058029173131]
We propose a graph-based machine learning model for phishing URL detection. We integrate URL structure and network-level features such as IP addresses and authoritative name servers. Experiments on real-world datasets demonstrate our model's effectiveness by achieving an F1 score of up to 98.77%.
arXiv Detail & Related papers (2025-01-12T19:49:00Z)
A New Dataset and Methodology for Malicious URL Classification [2.835223467109843]
Malicious URL (Uniform Resource Locator) classification is a pivotal aspect of Cybersecurity, offering defense against web-based threats. Despite deep learning's promise in this area, its advancement is hindered by two main challenges: the scarcity of comprehensive, open-source datasets and the limitations of existing models. We introduce a novel, multi-class dataset for malicious URL classification, distinguishing between benign, phishing and malicious URLs, named DeepURLBench.
arXiv Detail & Related papers (2024-12-31T09:10:38Z)
ID-centric Pre-training for Recommendation [51.72177873832969]
ID embeddings are challenging to transfer to new domains. Behavioral information in ID embeddings is still verified to be dominating in PLM-based recommendation models. We propose a novel ID-centric recommendation pre-training paradigm (IDP), which directly transfers informative ID embeddings learned in pre-training domains to item representations in new domains.
arXiv Detail & Related papers (2024-05-06T15:34:31Z)
The Performance of Sequential Deep Learning Models in Detecting Phishing Websites Using Contextual Features of URLs [0.0]
This study focuses on the detection of phishing websites using deep learning models such as Multi-Head Attention, Temporal Convolutional Network (TCN), BI-LSTM, and LSTM. Results demonstrate that the Multi-Head Attention and BI-LSTM models outperform other deep learning-based algorithms such as TCN and LSTM, producing better precision, recall, and F1-scores.
arXiv Detail & Related papers (2024-04-15T13:58:22Z)
URLBERT:A Contrastive and Adversarial Pre-trained Model for URL Classification [10.562100395816595]
URLs play a crucial role in understanding and categorizing web content. This paper introduces URLBERT, the first pre-trained representation learning model applied to a variety of URL classification or detection tasks.
arXiv Detail & Related papers (2024-02-18T07:51:20Z)
DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness [58.23214712926585]
We develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection. Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables. We are the first to offer certified robustness in the realm of static detection of malware executables.
arXiv Detail & Related papers (2023-03-20T17:25:22Z)
ProxyMix: Proxy-based Mixup Training with Label Refinery for Source-Free Domain Adaptation [73.14508297140652]
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. We propose an effective method named Proxy-based Mixup training with label refinery (ProxyMix). Experiments on three 2D image and one 3D point cloud object recognition benchmarks demonstrate that ProxyMix yields state-of-the-art performance for source-free UDA tasks.
arXiv Detail & Related papers (2022-05-29T03:45:00Z)
An Adversarial Attack Analysis on Malicious Advertisement URL Detection Framework [22.259444589459513]
Malicious advertisement URLs pose a security risk since they are the source of cyber-attacks. Existing malicious URL detection techniques are limited in their ability to handle unseen features and to generalize to test data. In this study, we extract a novel set of lexical and web-scraped features and employ machine learning techniques to set up a system for fraudulent advertisement URL detection.
arXiv Detail & Related papers (2022-04-27T20:06:22Z)
Curriculum Graph Co-Teaching for Multi-Target Domain Adaptation [78.28390172958643]
We identify two key aspects that can help to alleviate multiple domain-shifts in the multi-target domain adaptation (MTDA). We propose Curriculum Graph Co-Teaching (CGCT) that uses a dual classifier head, with one of them being a graph convolutional network (GCN) which aggregates features from similar samples across the domains. When the domain labels are available, we propose Domain-aware Curriculum Learning (DCL), a sequential adaptation strategy that first adapts on the easier target domains, followed by the harder ones.
arXiv Detail & Related papers (2021-04-01T23:41:41Z)
Improving DGA-Based Malicious Domain Classifiers for Malware Defense with Adversarial Machine Learning [0.9023847175654603]
Domain Generation Algorithms (DGAs) are used by adversaries to establish Command and Control (C&C) server communications during cyber attacks. Blacklists of known/identified C&C domains are often used as one of the defense mechanisms. We propose a new method using adversarial machine learning to generate never-before-seen malware-related domain families.
arXiv Detail & Related papers (2021-01-02T22:04:22Z)
Cassandra: Detecting Trojaned Networks from Adversarial Perturbations [92.43879594465422]
In many cases, pre-trained models are sourced from vendors who may have disrupted the training pipeline to insert Trojan behaviors into the models. We propose a method to verify if a pre-trained model is Trojaned or benign. Our method captures fingerprints of neural networks in the form of adversarial perturbations learned from the network gradients.
arXiv Detail & Related papers (2020-07-28T19:00:40Z)
Adversarial Machine Learning Attacks and Defense Methods in the Cyber Security Domain [58.30296637276011]
This paper summarizes the latest research on adversarial attacks against security solutions based on machine learning techniques. It is the first to discuss the unique challenges of implementing end-to-end adversarial attacks in the cyber security domain.
arXiv Detail & Related papers (2020-07-05T18:22:40Z)
Inline Detection of DGA Domains Using Side Information [5.253305460558346]
Domain Generation Algorithms (DGAs) are popular methods for generating pseudo-random domain names. In recent years, machine learning based systems have been widely used to detect DGAs. We train and evaluate state-of-the-art deep learning and random forest (RF) classifiers for DGA detection using side information that is harder for adversaries to manipulate than the domain name itself.
arXiv Detail & Related papers (2020-03-12T11:00:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.