IP-Augmented Multi-Modal Malicious URL Detection Via Token-Contrastive Representation Enhancement and Multi-Granularity Fusion
- URL: http://arxiv.org/abs/2510.12395v1
- Date: Tue, 14 Oct 2025 11:20:06 GMT
- Title: IP-Augmented Multi-Modal Malicious URL Detection Via Token-Contrastive Representation Enhancement and Multi-Granularity Fusion
- Authors: Ye Tian, Yanqiu Yu, Liangliang Song, Zhiquan Liu, Yanbin Wang, Jianguo Sun,
- Abstract summary: Malicious URL detection remains a critical cybersecurity challenge.<n>We propose CURL-IP, an advanced multi-modal detection framework incorporating three key innovations.<n>Our evaluation on large-scale real-world datasets shows the framework significantly outperforms state-of-the-art baselines.
- Score: 11.704828859661879
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Malicious URL detection remains a critical cybersecurity challenge as adversaries increasingly employ sophisticated evasion techniques including obfuscation, character-level perturbations, and adversarial attacks. Although pre-trained language models (PLMs) like BERT have shown potential for URL analysis tasks, three limitations persist in current implementations: (1) inability to effectively model the non-natural hierarchical structure of URLs, (2) insufficient sensitivity to character-level obfuscation, and (3) lack of mechanisms to incorporate auxiliary network-level signals such as IP addresses-all essential for robust detection. To address these challenges, we propose CURL-IP, an advanced multi-modal detection framework incorporating three key innovations: (1) Token-Contrastive Representation Enhancer, which enhances subword token representations through token-aware contrastive learning to produce more discriminative and isotropic embeddings; (2) Cross-Layer Multi-Scale Aggregator, employing hierarchical aggregation of Transformer outputs via convolutional operations and gated MLPs to capture both local and global semantic patterns across layers; and (3) Blockwise Multi-Modal Coupler that decomposes URL-IP features into localized block units and computes cross-modal attention weights at the block level, enabling fine-grained inter-modal interaction. This architecture enables simultaneous preservation of fine-grained lexical cues, contextual semantics, and integration of network-level signals. Our evaluation on large-scale real-world datasets shows the framework significantly outperforms state-of-the-art baselines across binary and multi-class classification tasks.
Related papers
- STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification [14.549172375231729]
We propose STMI, a novel multi-modal learning framework consisting of three key components.<n>We demonstrate the effectiveness and robustness of our proposed STMI framework in multi-modal ReID scenarios.
arXiv Detail & Related papers (2026-02-28T15:07:10Z) - Statistically-Guided Dual-Domain Meta-Learning with Adaptive Multi-Prototype Aggregation for Distributed Fiber Optic Sensing [11.719957656139824]
We propose a novel meta-learning framework, DUPLE, for cross-deployment DFOS activity identification.<n>First, a dual-domain multi-prototype learner fuses temporal and frequency domain features, enhancing the model's generalization ability under signal distribution shifts.<n>Second, a Statistical Guided Network (SGN) infers domain importance and prototype sensitivity from raw statistical features, providing data-driven prior information for learning in unlabeled or unseen domains.<n>Third, a query-aware prototype aggregation module adaptively selects and combines relevant prototypes, thereby improving classification performance even with limited data.
arXiv Detail & Related papers (2025-11-22T03:39:13Z) - URL2Graph++: Unified Semantic-Structural-Character Learning for Malicious URL Detection [11.415725075802344]
Malicious URL detection remains a major challenge in cybersecurity.<n>We propose a novel malicious URL detection method combining multi-granularity graph learning with semantic embedding.<n>Results show our method exceeds SOTA performance, including against large language models.
arXiv Detail & Related papers (2025-09-12T14:27:04Z) - MIETT: Multi-Instance Encrypted Traffic Transformer for Encrypted Traffic Classification [59.96233305733875]
Classifying traffic is essential for detecting security threats and optimizing network management.<n>We propose a Multi-Instance Encrypted Traffic Transformer (MIETT) to capture both token-level and packet-level relationships.<n>MIETT achieves results across five datasets, demonstrating its effectiveness in classifying encrypted traffic and understanding complex network behaviors.
arXiv Detail & Related papers (2024-12-19T12:52:53Z) - GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric delicately encapsulates two formats of diagonal and block-diagonal terms.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
arXiv Detail & Related papers (2024-10-20T03:45:50Z) - EAGER: Two-Stream Generative Recommender with Behavior-Semantic Collaboration [63.112790050749695]
We introduce EAGER, a novel generative recommendation framework that seamlessly integrates both behavioral and semantic information.
We validate the effectiveness of EAGER on four public benchmarks, demonstrating its superior performance compared to existing methods.
arXiv Detail & Related papers (2024-06-20T06:21:56Z) - Multimodal Collaboration Networks for Geospatial Vehicle Detection in Dense, Occluded, and Large-Scale Events [29.86323896541765]
In large-scale disaster events, the planning of optimal rescue routes depends on the object detection ability at the disaster scene.
Existing methods, which are typically based on the RGB modality, struggle to distinguish targets with similar colors and textures in crowded environments.
We propose a multimodal collaboration network for dense and occluded vehicle detection, MuDet.
arXiv Detail & Related papers (2024-05-14T00:51:15Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - Hierarchical Disentanglement-Alignment Network for Robust SAR Vehicle
Recognition [18.38295403066007]
HDANet integrates feature disentanglement and alignment into a unified framework.
The proposed method demonstrates impressive robustness across nine operating conditions in the MSTAR dataset.
arXiv Detail & Related papers (2023-04-07T09:11:29Z) - Transformer Based Multi-Grained Features for Unsupervised Person
Re-Identification [9.874360118638918]
We build a dual-branch network architecture based upon a modified Vision Transformer (ViT)
Local tokens output in each branch are reshaped and then uniformly partitioned into multiple stripes to generate part-level features.
Global tokens of two branches are averaged to produce a global feature.
arXiv Detail & Related papers (2022-11-22T13:51:17Z) - Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z) - I^3Net: Implicit Instance-Invariant Network for Adapting One-Stage
Object Detectors [64.93963042395976]
Implicit Instance-Invariant Network (I3Net) is tailored for adapting one-stage detectors.
I3Net implicitly learns instance-invariant features via exploiting the natural characteristics of deep features in different layers.
Experiments reveal that I3Net exceeds the state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2021-03-25T11:14:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.