URL2Graph++: Unified Semantic-Structural-Character Learning for Malicious URL Detection
- URL: http://arxiv.org/abs/2509.10287v1
- Date: Fri, 12 Sep 2025 14:27:04 GMT
- Title: URL2Graph++: Unified Semantic-Structural-Character Learning for Malicious URL Detection
- Authors: Ye Tian, Yifan Jia, Yanbin Wang, Jianguo Sun, Zhiquan Liu, Xiaowen Ling,
- Abstract summary: Malicious URL detection remains a major challenge in cybersecurity.<n>We propose a novel malicious URL detection method combining multi-granularity graph learning with semantic embedding.<n>Results show our method exceeds SOTA performance, including against large language models.
- Score: 11.415725075802344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Malicious URL detection remains a major challenge in cybersecurity, primarily due to two factors: (1) the exponential growth of the Internet has led to an immense diversity of URLs, making generalized detection increasingly difficult; and (2) attackers are increasingly employing sophisticated obfuscation techniques to evade detection. We advocate that addressing these challenges fundamentally requires: (1) obtaining semantic understanding to improve generalization across vast and diverse URL sets, and (2) accurately modeling contextual relationships within the structural composition of URLs. In this paper, we propose a novel malicious URL detection method combining multi-granularity graph learning with semantic embedding to jointly capture semantic, character-level, and structural features for robust URL analysis. To model internal dependencies within URLs, we first construct dual-granularity URL graphs at both subword and character levels, where nodes represent URL tokens/characters and edges encode co-occurrence relationships. To obtain fine-grained embeddings, we initialize node representations using a character-level convolutional network. The two graphs are then processed through jointly trained Graph Convolutional Networks to learn consistent graph-level representations, enabling the model to capture complementary structural features that reflect co-occurrence patterns and character-level dependencies. Furthermore, we employ BERT to derive semantic representations of URLs for semantically aware understanding. Finally, we introduce a gated dynamic fusion network to combine the semantically enriched BERT representations with the jointly optimized graph vectors, further enhancing detection performance. We extensively evaluate our method across multiple challenging dimensions. Results show our method exceeds SOTA performance, including against large language models.
Related papers
- OWLEYE: Zero-Shot Learner for Cross-Domain Graph Data Anomaly Detection [48.77471686671269]
OWLEYE is a novel framework that learns transferable patterns of normal behavior from multiple graphs.<n>We show that OWLEYE achieves superior performance and generalizability compared to state-of-the-art baselines.
arXiv Detail & Related papers (2026-01-27T02:08:18Z) - IP-Augmented Multi-Modal Malicious URL Detection Via Token-Contrastive Representation Enhancement and Multi-Granularity Fusion [11.704828859661879]
Malicious URL detection remains a critical cybersecurity challenge.<n>We propose CURL-IP, an advanced multi-modal detection framework incorporating three key innovations.<n>Our evaluation on large-scale real-world datasets shows the framework significantly outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-10-14T11:20:06Z) - WebGuard++:Interpretable Malicious URL Detection via Bidirectional Fusion of HTML Subgraphs and Multi-Scale Convolutional BERT [3.6220178465092503]
URL+ HTML feature fusion shows promise for robust malicious URL detection, since attacker artifacts persist in DOM structures.<n>We present WebGuard++, a detection framework with 4 novel components.<n> Experiments show WebGuard++ achieves significant improvements over state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-24T06:36:51Z) - Efficient Phishing URL Detection Using Graph-based Machine Learning and Loopy Belief Propagation [12.89058029173131]
We propose a graph-based machine learning model for phishing URL detection.<n>We integrate URL structure and network-level features such as IP addresses and authoritative name servers.<n>Experiments on real-world datasets demonstrate our model's effectiveness by achieving F1 score of up to 98.77%.
arXiv Detail & Related papers (2025-01-12T19:49:00Z) - GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric delicately encapsulates two formats of diagonal and block-diagonal terms.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
arXiv Detail & Related papers (2024-10-20T03:45:50Z) - Unleashing the Potential of Text-attributed Graphs: Automatic Relation Decomposition via Large Language Models [31.443478448031886]
RoSE (Relation-oriented Semantic Edge-decomposition) is a novel framework that decomposes the graph structure by analyzing raw text attributes.
Our framework significantly enhances node classification performance across various datasets, with improvements of up to 16% on the Wisconsin dataset.
arXiv Detail & Related papers (2024-05-28T20:54:47Z) - RAGFormer: Learning Semantic Attributes and Topological Structure for Fraud Detection [8.050935113945428]
We present a novel framework called Relation-Aware GNN with transFormer(RAGFormer)<n>RAGFormer embeds both semantic and topological features into a target node.<n>The simple yet effective network consists of a semantic encoder, a topology encoder, and an attention fusion module.
arXiv Detail & Related papers (2024-02-27T12:53:15Z) - Continuous Multi-Task Pre-training for Malicious URL Detection and Webpage Classification [6.8847203112253235]
Malicious URL detection and webpage classification are critical tasks in cybersecurity and information management.<n>We propose urlBERT, a pre-trained URL encoder leveraging Transformer to encode foundational knowledge from billions of unlabeled URLs.<n>We evaluate it on three downstream tasks: phishing URL detection, advertising URL detection, and webpage classification.
arXiv Detail & Related papers (2024-02-18T07:51:20Z) - T-GAE: Transferable Graph Autoencoder for Network Alignment [79.89704126746204]
T-GAE is a graph autoencoder framework that leverages transferability and stability of GNNs to achieve efficient network alignment without retraining.
Our experiments demonstrate that T-GAE outperforms the state-of-the-art optimization method and the best GNN approach by up to 38.7% and 50.8%, respectively.
arXiv Detail & Related papers (2023-10-05T02:58:29Z) - Software Vulnerability Detection via Deep Learning over Disaggregated
Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z) - Learning the Implicit Semantic Representation on Graph-Structured Data [57.670106959061634]
Existing representation learning methods in graph convolutional networks are mainly designed by describing the neighborhood of each node as a perceptual whole.
We propose a Semantic Graph Convolutional Networks (SGCN) that explores the implicit semantics by learning latent semantic-paths in graphs.
arXiv Detail & Related papers (2021-01-16T16:18:43Z) - Keyphrase Extraction with Dynamic Graph Convolutional Networks and
Diversified Inference [50.768682650658384]
Keyphrase extraction (KE) aims to summarize a set of phrases that accurately express a concept or a topic covered in a given document.
Recent Sequence-to-Sequence (Seq2Seq) based generative framework is widely used in KE task, and it has obtained competitive performance on various benchmarks.
In this paper, we propose to adopt the Dynamic Graph Convolutional Networks (DGCN) to solve the above two problems simultaneously.
arXiv Detail & Related papers (2020-10-24T08:11:23Z) - Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area.
Recent works on visual domain adaptation which leverage adversarial learning to unify the source and target video representations are not highly effective on the videos.
This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
arXiv Detail & Related papers (2020-07-31T03:48:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.