Continuous Multi-Task Pre-training for Malicious URL Detection and Webpage Classification
- URL: http://arxiv.org/abs/2402.11495v2
- Date: Sat, 24 May 2025 08:18:17 GMT
- Title: Continuous Multi-Task Pre-training for Malicious URL Detection and Webpage Classification
- Authors: Yujie Li, Yiwei Liu, Peiyue Li, Yifan Jia, Yanbin Wang
- Abstract summary: Malicious URL detection and webpage classification are critical tasks in cybersecurity and information management. We propose urlBERT, a pre-trained URL encoder leveraging Transformer to encode foundational knowledge from billions of unlabeled URLs. We evaluate it on three downstream tasks: phishing URL detection, advertising URL detection, and webpage classification.
- Score: 6.8847203112253235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Malicious URL detection and webpage classification are critical tasks in cybersecurity and information management. In recent years, extensive research has explored using BERT or similar language models to replace traditional machine learning methods for detecting malicious URLs and classifying webpages. While previous studies show promising results, they often apply existing language models to these tasks without accounting for the inherent differences in domain data (e.g., URLs being loosely structured and semantically sparse compared to natural-language text), leaving room for performance improvement. Furthermore, current approaches focus on single tasks and have not been tested in multi-task scenarios. To address these challenges, we propose urlBERT, a pre-trained URL encoder that leverages a Transformer to encode foundational knowledge from billions of unlabeled URLs. To achieve this, we use five unsupervised pre-training tasks to capture lexical, syntactic, and semantic information of URLs at multiple levels and to generate contrastive and adversarial representations. Furthermore, to avoid competition and interference among the pre-training tasks, we propose a grouped sequential learning method that ensures effective training across multiple tasks. Finally, we leverage a two-stage fine-tuning approach to improve the training stability and efficiency of the task model. To assess the multitasking potential of urlBERT, we fine-tune the task model in both single-task and multi-task modes: the former creates a classification model for a single task, while the latter builds a classification model capable of handling multiple tasks. We evaluate urlBERT on three downstream tasks: phishing URL detection, advertising URL detection, and webpage classification. The results demonstrate that urlBERT outperforms standard pre-trained models and that its multi-task mode can address the real-world demands of multitasking.
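To make the grouped sequential idea concrete, below is a minimal PyTorch sketch (our illustration, not the authors' released code): a small byte-level Transformer URL encoder is pre-trained on two assumed task groups, a token-level masked-modeling group and a sequence-level contrastive group, trained one group at a time rather than by summing all losses jointly. The two-group split, the byte-level tokenization, and all hyperparameters are assumptions for illustration; the paper itself uses five pre-training tasks over billions of URLs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, MAX_LEN, DIM = 128, 64, 256  # byte-level vocab and assumed sizes

class UrlEncoder(nn.Module):
    """Small Transformer encoder over byte-level URL tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, DIM)
        self.pos = nn.Embedding(MAX_LEN, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, ids):
        positions = torch.arange(ids.size(1), device=ids.device)
        return self.encoder(self.embed(ids) + self.pos(positions))

def mlm_loss(encoder, head, ids, mask_prob=0.15):
    """Token-level task: predict randomly masked URL bytes (lexical signal)."""
    labels = ids.clone()
    mask = torch.rand(ids.shape, device=ids.device) < mask_prob
    corrupted = ids.masked_fill(mask, 1)  # assume token id 1 is [MASK]
    logits = head(encoder(corrupted))
    return F.cross_entropy(logits[mask], labels[mask])

def contrastive_loss(encoder, ids, temperature=0.1):
    """Sequence-level task: SimCSE-style contrast of two dropout-noised views."""
    z1 = F.normalize(encoder(ids).mean(dim=1), dim=-1)
    z2 = F.normalize(encoder(ids).mean(dim=1), dim=-1)  # dropout differs per pass
    sims = z1 @ z2.t() / temperature
    return F.cross_entropy(sims, torch.arange(ids.size(0), device=ids.device))

encoder, mlm_head = UrlEncoder(), nn.Linear(DIM, VOCAB_SIZE)
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(mlm_head.parameters()), lr=1e-4)

# Grouped sequential learning (illustrative grouping): finish training one
# group of compatible objectives before starting the next, instead of
# summing all pre-training losses in every update.
task_groups = [
    [lambda ids: mlm_loss(encoder, mlm_head, ids)],  # group 1: token level
    [lambda ids: contrastive_loss(encoder, ids)],    # group 2: sequence level
]
batch = torch.randint(2, VOCAB_SIZE, (8, MAX_LEN))   # stand-in URL byte batch
for group in task_groups:
    for _ in range(2):  # a few steps per group, just for the sketch
        loss = sum(task(batch) for task in group)
        opt.zero_grad(); loss.backward(); opt.step()
```

One plausible reading of the grouped sequential method, which this sketch follows, is that gradients from dissimilar objectives never mix within a single update, which is how competition and interference between pre-training tasks would be avoided.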
Related papers
- A New Dataset and Methodology for Malicious URL Classification [2.835223467109843]
Malicious URL (Uniform Resource Locator) classification is a pivotal aspect of cybersecurity, offering defense against web-based threats.
Despite deep learning's promise in this area, its advancement is hindered by two main challenges: the scarcity of comprehensive, open-source datasets and the limitations of existing models.
We introduce DeepURLBench, a novel multi-class dataset for malicious URL classification that distinguishes between benign, phishing, and malicious URLs.
arXiv Detail & Related papers (2024-12-31T09:10:38Z) - Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z) - DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification [4.585051136007553]
We introduce DomURLs_BERT, a pre-trained BERT-based encoder for detecting and classifying suspicious/malicious domains and URLs.
The proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets.
arXiv Detail & Related papers (2024-09-13T18:59:13Z) - Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data [3.2771631221674333]
We model the detection of topic-related content as a binary classification task.
Using only a few hundred annotated data points per topic, we detect content related to three German policies.
arXiv Detail & Related papers (2024-07-23T14:31:59Z) - Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployments: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z) - PMANet: Malicious URL detection via post-trained language model guided multi-level feature attention network [16.73322002436809]
We propose PMANet, a pre-trained language model-guided multi-level feature attention network.
PMANet employs a post-training process with three self-supervised objectives: masked language modeling, noisy language modeling, and domain discrimination.
Experiments on diverse scenarios, including small-scale data, class imbalance, and adversarial attacks, demonstrate PMANet's superiority over state-of-the-art models.
arXiv Detail & Related papers (2023-11-21T06:23:08Z) - URL-BERT: Training Webpage Representations via Social Media Engagements [31.6455614291821]
We introduce a new pre-training objective that can be used to adapt LMs to understand URLs and webpages.
Our proposed framework consists of two steps: (1) scalable graph embeddings that learn shallow representations of URLs based on user engagement on social media, and (2) continued pre-training of the LM on these graph-based representations.
We experimentally demonstrate that our continued pre-training approach improves webpage understanding on a variety of tasks and on Twitter internal and external benchmarks.
arXiv Detail & Related papers (2023-10-25T02:22:50Z) - Learning from Limited Heterogeneous Training Data: Meta-Learning for Unsupervised Zero-Day Web Attack Detection across Web Domains [23.41494712616903]
We propose RETSINA, a novel meta-learning based framework that enables zero-day Web attack detection across different domains.
We conduct experiments using four real-world datasets on different domains with a total of 293M Web requests.
Over one month, RETSINA captures on average 126 and 218 zero-day attack requests per day in the two domains, respectively.
arXiv Detail & Related papers (2023-09-07T11:58:20Z) - A Shapelet-based Framework for Unsupervised Multivariate Time Series Representation Learning [29.511632089649552]
We propose a novel unsupervised representation learning (URL) framework for multivariate time series that learns time-series-specific shapelet-based representations.
To the best of our knowledge, this is the first work to explore shapelet-based embeddings for unsupervised general-purpose representation learning.
A unified shapelet-based encoder and a novel learning objective with multi-grained contrasting and multi-scale alignment are particularly designed to achieve our goal.
arXiv Detail & Related papers (2023-05-30T09:31:57Z) - Multi-Task Models Adversarial Attacks [25.834775498006657]
Multi-Task Learning involves developing a single model, known as a multi-task model, to concurrently perform multiple tasks.
The security of single-task models has been thoroughly studied, but multi-task models pose several critical security questions.
This paper addresses these queries through detailed analysis and rigorous experimentation.
arXiv Detail & Related papers (2023-05-20T03:07:43Z) - Algorithm Design for Online Meta-Learning with Task Boundary Detection [63.284263611646]
We propose a novel algorithm for task-agnostic online meta-learning in non-stationary environments.
We first propose two simple but effective mechanisms for detecting task switches and distribution shifts.
We show that a sublinear task-averaged regret can be achieved for our algorithm under mild conditions.
arXiv Detail & Related papers (2023-02-02T04:02:49Z) - ZhichunRoad at Amazon KDD Cup 2022: MultiTask Pre-Training for E-Commerce Product Search [4.220439000486713]
We propose a robust multilingual model to improve the quality of search results.
In the pre-training stage, we adopt an MLM task, a classification task, and a contrastive learning task.
In the fine-tuning stage, we use confident learning, an exponential moving average method (EMA), adversarial training (FGM), and a regularized dropout strategy (R-Drop); a minimal R-Drop sketch appears after this list.
arXiv Detail & Related papers (2023-01-31T07:31:34Z) - Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z) - Continual Object Detection via Prototypical Task Correlation Guided Gating Mechanism [120.1998866178014]
We present a flexible framework for continual object detection via pRotOtypical taSk corrElaTion guided gaTing mechAnism (ROSETTA).
Concretely, a unified framework is shared by all tasks while task-aware gates are introduced to automatically select sub-models for specific tasks.
Experiments on COCO-VOC, KITTI-Kitchen, class-incremental detection on VOC and sequential learning of four tasks show that ROSETTA yields state-of-the-art performance.
arXiv Detail & Related papers (2022-05-06T07:31:28Z) - Improving Multi-task Generalization Ability for Neural Text Matching via Prompt Learning [54.66399120084227]
Recent state-of-the-art neural text matching models based on pre-trained language models (PLMs) are hard to generalize to different tasks.
We adopt a specialization-generalization training strategy and refer to it as Match-Prompt.
In the specialization stage, descriptions of different matching tasks are mapped to only a few prompt tokens.
In the generalization stage, the text matching model explores the essential matching signals by being trained on diverse multiple matching tasks.
arXiv Detail & Related papers (2022-04-06T11:01:08Z) - One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z) - Learning to Match Jobs with Resumes from Sparse Interaction Data using Multi-View Co-Teaching Network [83.64416937454801]
Job-resume interaction data is sparse and noisy, which affects the performance of job-resume match algorithms.
We propose a novel multi-view co-teaching network from sparse interaction data for job-resume matching.
Our model is able to outperform state-of-the-art methods for job-resume matching.
arXiv Detail & Related papers (2020-09-25T03:09:54Z) - Low Resource Multi-Task Sequence Tagging -- Revisiting Dynamic Conditional Random Fields [67.51177964010967]
We compare different models for low resource multi-task sequence tagging that leverage dependencies between label sequences for different tasks.
We find that explicit modeling of inter-dependencies between task predictions outperforms single-task as well as standard multi-task models.
arXiv Detail & Related papers (2020-05-01T07:11:34Z)
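For reference, here is a minimal sketch of the R-Drop regularizer mentioned in the ZhichunRoad entry above (a generic illustration with a toy classifier, not that paper's code): the same batch is passed through the model twice, dropout makes the two predictive distributions differ, and a symmetric KL term pulls them together on top of the standard cross-entropy loss. The model, the alpha weight, and the data are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy classifier with dropout; R-Drop needs dropout active during training.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 2))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

def r_drop_loss(x, y, alpha=4.0):
    logits1, logits2 = model(x), model(x)  # two passes; dropout makes them differ
    ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)
    p1, p2 = F.log_softmax(logits1, -1), F.log_softmax(logits2, -1)
    # symmetric KL between the two predictive distributions
    kl = 0.5 * (F.kl_div(p1, p2, log_target=True, reduction="batchmean")
                + F.kl_div(p2, p1, log_target=True, reduction="batchmean"))
    return ce + alpha * kl

x, y = torch.randn(16, 32), torch.randint(0, 2, (16,))  # stand-in batch
loss = r_drop_loss(x, y)
opt.zero_grad(); loss.backward(); opt.step()
```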