Related papers: From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories

From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories

URL: http://arxiv.org/abs/2504.16449v2
Date: Sun, 01 Jun 2025 05:32:37 GMT
Title: From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories
Authors: Ye Tian, Yanqiu Yu, Jianguo Sun, Yanbin Wang,
Abstract summary: Malicious URLs persistently threaten the cybersecurity ecosystem, by either deceiving users into divulging private data or distributing harmful payloads to infiltrate host systems.<n>This review systematically analyzes methods from traditional blacklisting to advanced deep learning approaches.<n>Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities.
Score: 3.323388021979584
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Malicious URLs persistently threaten the cybersecurity ecosystem, by either deceiving users into divulging private data or distributing harmful payloads to infiltrate host systems. Gaining timely insights into the current state of this ongoing battle holds significant importance. However, existing reviews exhibit 4 critical gaps: 1) Their reliance on algorithm-centric taxonomies obscures understanding of how detection approaches exploit specific modal information channels; 2) They fail to incorporate pivotal LLM/Transformer-based defenses; 3) No open-source implementations are collected to facilitate benchmarking; 4) Insufficient dataset coverage.This paper presents a comprehensive review of malicious URL detection technologies, systematically analyzing methods from traditional blacklisting to advanced deep learning approaches (e.g. Transformer, GNNs, and LLMs). Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities (URL, HTML, Visual, etc.). This hierarchical classification enables both rigorous technical analysis and clear understanding of multimodal information utilization. Furthermore, to establish a profile of accessible datasets and address the lack of standardized benchmarking (where current studies often lack proper baseline comparisons), we curate and analyze: 1) publicly available datasets (2016-2024), and 2) open-source implementations from published works(2013-2025). Then, we outline essential design principles and architectural frameworks for product-level implementations. The review concludes by examining emerging challenges and proposing actionable directions for future research. We maintain a GitHub repository for ongoing curating datasets and open-source implementations: https://github.com/sevenolu7/Malicious-URL-Detection-Open-Source/tree/master.

Related papers

Adapting Vision-Language Models Without Labels: A Comprehensive Survey [74.17944178027015]
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks.<n>Recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data.<n>We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms.
arXiv Detail & Related papers (2025-08-07T16:27:37Z)
Threshold-Protected Searchable Sharing: Privacy Preserving Aggregated-ANN Search for Collaborative RAG [0.0]
Two key bottlenecks are private data repositories' locality constraints and the need to maintain compatibility with mainstream search techniques.<n>We develop a secure and privacy-preserving aggregated approximate nearest neighbor search (SP-A$2$NN) with HNSW compatibility.<n>We also explore a novel security analytical framework that incorporates privacy analysis via reductions.
arXiv Detail & Related papers (2025-07-23T04:45:01Z)
DATABench: Evaluating Dataset Auditing in Deep Learning from an Adversarial Perspective [59.66984417026933]
We introduce a novel taxonomy, classifying existing methods based on their reliance on internal features (IF) (inherent to the data) versus external features (EF) (artificially introduced for auditing)<n>We formulate two primary attack types: evasion attacks, designed to conceal the use of a dataset, and forgery attacks, intending to falsely implicate an unused dataset.<n>Building on the understanding of existing methods and attack objectives, we further propose systematic attack strategies: decoupling, removal, and detection for evasion; adversarial example-based methods for forgery.<n>Our benchmark, DATABench, comprises 17 evasion attacks, 5 forgery attacks, and 9
arXiv Detail & Related papers (2025-07-08T03:07:15Z)
How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities [62.474732677086855]
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance.<n>We propose the DSC benchmark: Diverse, Simple, and Categorized, an evaluation framework that categorizes router performance across a broad spectrum of query types.
arXiv Detail & Related papers (2025-03-20T19:52:30Z)
LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights [12.424610893030353]
Large Language Models (LLMs) are emerging as transformative tools for software vulnerability detection. This paper provides a detailed survey of LLMs in vulnerability detection. We address challenges such as cross-language vulnerability detection, multimodal data integration, and repository-level analysis.
arXiv Detail & Related papers (2025-02-10T21:33:38Z)
TrustRAG: Enhancing Robustness and Trustworthiness in RAG [31.231916859341865]
TrustRAG is a framework that systematically filters compromised and irrelevant contents before they are retrieved for generation.<n>TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance compared to existing approaches.
arXiv Detail & Related papers (2025-01-01T15:57:34Z)
Hidden Data Privacy Breaches in Federated Learning [24.47236055167954]
Federated Learning (FL) emerged as a paradigm for conducting machine learning across broad and decentralized datasets.<n>Recent studies show that attackers can steal private data through model manipulation or gradient analysis.<n>We propose a novel data-reconstruction attack leveraging malicious code injection, supported by two key techniques.
arXiv Detail & Related papers (2024-11-27T12:04:37Z)
Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks. We propose to automate dataset updating and provide systematic analysis regarding its effectiveness. There are two updating strategies: 1) mimicking strategy to generate similar samples based on original data, and 2) extending strategy that further expands existing samples.
arXiv Detail & Related papers (2024-02-19T07:15:59Z)
DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge. Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
Robust Recommender System: A Survey and Future Directions [58.87305602959857]
We first present a taxonomy to organize current techniques for withstanding malicious attacks and natural noise.<n>We then explore state-of-the-art methods in each category, including fraudster detection, adversarial training, certifiable robust training for defending against malicious attacks.<n>We discuss robustness across varying recommendation scenarios and its interplay with other properties like accuracy, interpretability, privacy, and fairness.
arXiv Detail & Related papers (2023-09-05T08:58:46Z)
Unsupervised Abnormal Traffic Detection through Topological Flow Analysis [1.933681537640272]
topological connectivity component of a malicious flow is less exploited. We present a simple method that facilitate the use of connectivity graph features in unsupervised anomaly detection algorithms.
arXiv Detail & Related papers (2022-05-14T18:52:49Z)
Machine Learning for Encrypted Malicious Traffic Detection: Approaches, Datasets and Comparative Study [6.267890584151111]
In post-COVID-19 environment, malicious traffic encryption is growing rapidly. We formulate a universal framework of machine learning based encrypted malicious traffic detection techniques. We implement and compare 10 encrypted malicious traffic detection algorithms.
arXiv Detail & Related papers (2022-03-17T14:00:55Z)
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
UMAD: Universal Model Adaptation under Domain and Category Shift [138.12678159620248]
Universal Model ADaptation (UMAD) framework handles both UDA scenarios without access to source data. We develop an informative consistency score to help distinguish unknown samples from known samples. Experiments on open-set and open-partial-set UDA scenarios demonstrate that UMAD exhibits comparable, if not superior, performance to state-of-the-art data-dependent methods.
arXiv Detail & Related papers (2021-12-16T01:22:59Z)
A2Log: Attentive Augmented Log Anomaly Detection [53.06341151551106]
Anomaly detection becomes increasingly important for the dependability and serviceability of IT services. Existing unsupervised methods need anomaly examples to obtain a suitable decision boundary. We develop A2Log, which is an unsupervised anomaly detection method consisting of two steps: Anomaly scoring and anomaly decision.
arXiv Detail & Related papers (2021-09-20T13:40:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.