URLBERT: A Contrastive and Adversarial Pre-trained Model for URL Classification
- URL: http://arxiv.org/abs/2402.11495v1
- Date: Sun, 18 Feb 2024 07:51:20 GMT
- Title: URLBERT: A Contrastive and Adversarial Pre-trained Model for URL Classification
- Authors: Yujie Li, Yanbin Wang, Haitao Xu, Zhenhao Guo, Zheng Cao, Lun Zhang
- Abstract summary: URLs play a crucial role in understanding and categorizing web content.
This paper introduces URLBERT, the first pre-trained representation learning model applied to a variety of URL classification or detection tasks.
- Score: 10.562100395816595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: URLs play a crucial role in understanding and categorizing web content,
particularly in tasks related to security control and online recommendations.
While pre-trained models are currently dominating various fields, the domain of
URL analysis still lacks specialized pre-trained models. To address this gap,
this paper introduces URLBERT, the first pre-trained representation learning
model applied to a variety of URL classification or detection tasks. We first
train a URL tokenizer on a corpus of billions of URLs to address URL data
tokenization. Additionally, we propose two novel pre-training tasks: (1)
self-supervised contrastive learning tasks, which strengthen the model's
understanding of URL structure and the capture of category differences by
distinguishing different variants of the same URL; (2) virtual adversarial
training, aimed at improving the model's robustness in extracting semantic
features from URLs. Finally, our proposed methods are evaluated on tasks
including phishing URL detection, web page classification, and ad filtering,
achieving state-of-the-art performance. Importantly, we also explore multi-task
learning with URLBERT, and experimental results demonstrate that a multi-task
learning model based on URLBERT matches the effectiveness of
independently fine-tuned models, showing the simplicity of URLBERT in handling
complex task requirements. The code for our work is available at
https://github.com/Davidup1/URLBERT.
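To make the two pre-training tasks concrete, here is a minimal PyTorch sketch of the contrastive objective: two surface variants of the same URL form a positive pair, and an NT-Xent loss pulls their embeddings together while pushing apart all other URLs in the batch. The `make_url_variant` augmentations, the temperature, and the `encoder` interface are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the self-supervised contrastive task on URL variants (assumptions noted above).
import random

import torch
import torch.nn.functional as F

def make_url_variant(url: str) -> str:
    """Generate a simple surface variant of a URL (hypothetical augmentation rules)."""
    augmentations = [
        lambda u: u.replace("http://", "https://", 1),  # scheme change
        lambda u: u.rstrip("/") + "/",                  # trailing-slash toggle
        lambda u: u.lower(),                            # case folding
    ]
    return random.choice(augmentations)(url)

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent loss: row i of z1 and row i of z2 embed variants of the same URL."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d) unit vectors
    sim = z @ z.t() / temperature                       # pairwise cosine logits
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Usage with any encoder that maps a batch of URLs to pooled embeddings:
#   z1 = encoder(urls)
#   z2 = encoder([make_url_variant(u) for u in urls])
#   loss = nt_xent_loss(z1, z2)
```

The second task, virtual adversarial training, can be sketched in the same spirit: find the small perturbation of the input embeddings that most changes the model's predictions, then penalize that change. The embedding-level `model` interface and the `xi`/`eps` values below are likewise assumptions.

```python
# Sketch of virtual adversarial training in embedding space (one power-iteration step).
import torch
import torch.nn.functional as F

def vat_loss(model, embeds: torch.Tensor, xi: float = 1e-6, eps: float = 1.0) -> torch.Tensor:
    """KL divergence between predictions on clean and adversarially perturbed embeddings."""
    with torch.no_grad():
        clean = F.softmax(model(embeds), dim=-1)        # reference distribution
    d = torch.randn_like(embeds, requires_grad=True)    # random probe direction
    kl = F.kl_div(F.log_softmax(model(embeds + xi * d), dim=-1), clean, reduction="batchmean")
    (grad,) = torch.autograd.grad(kl, d)                # direction of maximal sensitivity
    r_adv = eps * F.normalize(grad.flatten(1), dim=1).view_as(grad).detach()
    return F.kl_div(F.log_softmax(model(embeds + r_adv), dim=-1), clean, reduction="batchmean")
```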
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification [4.585051136007553]
We introduce DomURLs_BERT, a pre-trained BERT-based encoder for detecting and classifying suspicious/malicious domains and URLs.
The proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets.
arXiv Detail & Related papers (2024-09-13T18:59:13Z)
- Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data [3.2771631221674333]
We model the detection of topic-related content as a binary classification task.
Using only a few hundred annotated data points per topic, we detect content related to three German policies.
arXiv Detail & Related papers (2024-07-23T14:31:59Z)
- Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployment: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z)
- URL-BERT: Training Webpage Representations via Social Media Engagements [31.6455614291821]
We introduce a new pre-training objective that can be used to adapt LMs to understand URLs and webpages.
Our proposed framework consists of two steps: (1) scalable graph embeddings to learn shallow representations of URLs based on user engagement on social media, and (2) continued pre-training of the LM on these engagement-based URL representations.
We experimentally demonstrate that our continued pre-training approach improves webpage understanding on a variety of tasks and Twitter internal and external benchmarks.
arXiv Detail & Related papers (2023-10-25T02:22:50Z)
- Learning from Limited Heterogeneous Training Data: Meta-Learning for Unsupervised Zero-Day Web Attack Detection across Web Domains [23.41494712616903]
We propose RETSINA, a novel meta-learning based framework that enables zero-day Web attack detection across different domains.
We conduct experiments using four real-world datasets on different domains with a total of 293M Web requests.
Over one month, RETSINA captures an average of 126 and 218 zero-day attack requests per day in the two domains, respectively.
arXiv Detail & Related papers (2023-09-07T11:58:20Z)
- A Shapelet-based Framework for Unsupervised Multivariate Time Series Representation Learning [29.511632089649552]
We propose a novel unsupervised representation learning (URL) framework for multivariate time series by learning time-series-specific shapelet-based representations.
To the best of our knowledge, this is the first work to explore shapelet-based embeddings for unsupervised general-purpose representation learning.
A unified shapelet-based encoder and a novel learning objective with multi-grained contrasting and multi-scale alignment are particularly designed to achieve our goal.
arXiv Detail & Related papers (2023-05-30T09:31:57Z)
- Continual Object Detection via Prototypical Task Correlation Guided Gating Mechanism [120.1998866178014]
We present a flexible framework for continual object detection via pRotOtypical taSk corrElaTion guided gaTing mechAnism (ROSETTA).
Concretely, a unified framework is shared by all tasks while task-aware gates are introduced to automatically select sub-models for specific tasks.
Experiments on COCO-VOC, KITTI-Kitchen, class-incremental detection on VOC and sequential learning of four tasks show that ROSETTA yields state-of-the-art performance.
arXiv Detail & Related papers (2022-05-06T07:31:28Z)
- One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z)
- Learning to Match Jobs with Resumes from Sparse Interaction Data using Multi-View Co-Teaching Network [83.64416937454801]
Job-resume interaction data is sparse and noisy, which affects the performance of job-resume match algorithms.
We propose a novel multi-view co-teaching network from sparse interaction data for job-resume matching.
Our model is able to outperform state-of-the-art methods for job-resume matching.
arXiv Detail & Related papers (2020-09-25T03:09:54Z)
- Low Resource Multi-Task Sequence Tagging -- Revisiting Dynamic Conditional Random Fields [67.51177964010967]
We compare different models for low resource multi-task sequence tagging that leverage dependencies between label sequences for different tasks.
We find that explicit modeling of inter-dependencies between task predictions outperforms single-task as well as standard multi-task models.
arXiv Detail & Related papers (2020-05-01T07:11:34Z)