From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories
- URL: http://arxiv.org/abs/2504.16449v1
- Date: Wed, 23 Apr 2025 06:23:18 GMT
- Title: From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories
- Authors: Ye Tian, Yanqiu Yu, Jianguo Sun, Yanbin Wang,
- Abstract summary: Malicious URLs persistently threaten the cybersecurity ecosystem, by either deceiving users into divulging private data or distributing harmful payloads to infiltrate host systems.<n>This review systematically analyzes methods from traditional blacklisting to advanced deep learning approaches.<n>Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities.
- Score: 3.323388021979584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Malicious URLs persistently threaten the cybersecurity ecosystem, by either deceiving users into divulging private data or distributing harmful payloads to infiltrate host systems. Gaining timely insights into the current state of this ongoing battle holds significant importance. However, existing reviews exhibit 4 critical gaps: 1) Their reliance on algorithm-centric taxonomies obscures understanding of how detection approaches exploit specific modal information channels; 2) They fail to incorporate pivotal LLM/Transformer-based defenses; 3) No open-source implementations are collected to facilitate benchmarking; 4) Insufficient dataset coverage.This paper presents a comprehensive review of malicious URL detection technologies, systematically analyzing methods from traditional blacklisting to advanced deep learning approaches (e.g. Transformer, GNNs, and LLMs). Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities (URL, HTML, Visual, etc.). This hierarchical classification enables both rigorous technical analysis and clear understanding of multimodal information utilization. Furthermore, to establish a profile of accessible datasets and address the lack of standardized benchmarking (where current studies often lack proper baseline comparisons), we curate and analyze: 1) publicly available datasets (2016-2024), and 2) open-source implementations from published works(2013-2025). Then, we outline essential design principles and architectural frameworks for product-level implementations. The review concludes by examining emerging challenges and proposing actionable directions for future research. We maintain a GitHub repository for ongoing curating datasets and open-source implementations: https://github.com/sevenolu7/Malicious-URL-Detection-Open-Source/tree/master.
Related papers
- How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities [62.474732677086855]
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance.<n>We propose the DSC benchmark: Diverse, Simple, and Categorized, an evaluation framework that categorizes router performance across a broad spectrum of query types.
arXiv Detail & Related papers (2025-03-20T19:52:30Z) - LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights [12.424610893030353]
Large Language Models (LLMs) are emerging as transformative tools for software vulnerability detection.
This paper provides a detailed survey of LLMs in vulnerability detection.
We address challenges such as cross-language vulnerability detection, multimodal data integration, and repository-level analysis.
arXiv Detail & Related papers (2025-02-10T21:33:38Z) - TrustRAG: Enhancing Robustness and Trustworthiness in RAG [31.231916859341865]
TrustRAG is a framework that systematically filters compromised and irrelevant contents before they are retrieved for generation.<n>TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance compared to existing approaches.
arXiv Detail & Related papers (2025-01-01T15:57:34Z) - Hidden Data Privacy Breaches in Federated Learning [24.47236055167954]
Federated Learning (FL) emerged as a paradigm for conducting machine learning across broad and decentralized datasets.<n>Recent studies show that attackers can steal private data through model manipulation or gradient analysis.<n>We propose a novel data-reconstruction attack leveraging malicious code injection, supported by two key techniques.
arXiv Detail & Related papers (2024-11-27T12:04:37Z) - Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks.
We propose to automate dataset updating and provide systematic analysis regarding its effectiveness.
There are two updating strategies: 1) mimicking strategy to generate similar samples based on original data, and 2) extending strategy that further expands existing samples.
arXiv Detail & Related papers (2024-02-19T07:15:59Z) - DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain
Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z) - Robust Recommender System: A Survey and Future Directions [58.87305602959857]
We first present a taxonomy to organize current techniques for withstanding malicious attacks and natural noise.<n>We then explore state-of-the-art methods in each category, including fraudster detection, adversarial training, certifiable robust training for defending against malicious attacks.<n>We discuss robustness across varying recommendation scenarios and its interplay with other properties like accuracy, interpretability, privacy, and fairness.
arXiv Detail & Related papers (2023-09-05T08:58:46Z) - Unsupervised Abnormal Traffic Detection through Topological Flow
Analysis [1.933681537640272]
topological connectivity component of a malicious flow is less exploited.
We present a simple method that facilitate the use of connectivity graph features in unsupervised anomaly detection algorithms.
arXiv Detail & Related papers (2022-05-14T18:52:49Z) - Machine Learning for Encrypted Malicious Traffic Detection: Approaches,
Datasets and Comparative Study [6.267890584151111]
In post-COVID-19 environment, malicious traffic encryption is growing rapidly.
We formulate a universal framework of machine learning based encrypted malicious traffic detection techniques.
We implement and compare 10 encrypted malicious traffic detection algorithms.
arXiv Detail & Related papers (2022-03-17T14:00:55Z) - VELVET: a noVel Ensemble Learning approach to automatically locate
VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z) - UMAD: Universal Model Adaptation under Domain and Category Shift [138.12678159620248]
Universal Model ADaptation (UMAD) framework handles both UDA scenarios without access to source data.
We develop an informative consistency score to help distinguish unknown samples from known samples.
Experiments on open-set and open-partial-set UDA scenarios demonstrate that UMAD exhibits comparable, if not superior, performance to state-of-the-art data-dependent methods.
arXiv Detail & Related papers (2021-12-16T01:22:59Z) - A2Log: Attentive Augmented Log Anomaly Detection [53.06341151551106]
Anomaly detection becomes increasingly important for the dependability and serviceability of IT services.
Existing unsupervised methods need anomaly examples to obtain a suitable decision boundary.
We develop A2Log, which is an unsupervised anomaly detection method consisting of two steps: Anomaly scoring and anomaly decision.
arXiv Detail & Related papers (2021-09-20T13:40:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.