URLBERT: A Contrastive and Adversarial Pre-trained Model for URL Classification
- URL: http://arxiv.org/abs/2402.11495v1
- Date: Sun, 18 Feb 2024 07:51:20 GMT
- Title: URLBERT: A Contrastive and Adversarial Pre-trained Model for URL Classification
- Authors: Yujie Li, Yanbin Wang, Haitao Xu, Zhenhao Guo, Zheng Cao, Lun Zhang
- Abstract summary: URLs play a crucial role in understanding and categorizing web content.
This paper introduces URLBERT, the first pre-trained representation learning model applied to a variety of URL classification or detection tasks.
- Score: 10.562100395816595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: URLs play a crucial role in understanding and categorizing web content,
particularly in tasks related to security control and online recommendations.
While pre-trained models are currently dominating various fields, the domain of
URL analysis still lacks specialized pre-trained models. To address this gap,
this paper introduces URLBERT, the first pre-trained representation learning
model applied to a variety of URL classification or detection tasks. We first
train a URL tokenizer on a corpus of billions of URLs to address URL data
tokenization. Additionally, we propose two novel pre-training tasks: (1)
self-supervised contrastive learning tasks, which strengthen the model's
understanding of URL structure and the capture of category differences by
distinguishing different variants of the same URL; (2) virtual adversarial
training, aimed at improving the model's robustness in extracting semantic
features from URLs. Finally, our proposed methods are evaluated on tasks
including phishing URL detection, web page classification, and ad filtering,
achieving state-of-the-art performance. Importantly, we also explore multi-task
learning with URLBERT, and experimental results demonstrate that a multi-task
learning model based on URLBERT matches the effectiveness of
independently fine-tuned models, showing the simplicity of URLBERT in handling
complex task requirements. The code for our work is available at
https://github.com/Davidup1/URLBERT.
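To make the two pre-training tasks concrete, here is a minimal PyTorch sketch of the contrastive objective: two surface variants of the same URL form a positive pair, and an NT-Xent loss pulls their embeddings together while pushing apart all other URLs in the batch. The `make_url_variant` augmentations, the temperature, and the `encoder` interface are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the self-supervised contrastive task on URL variants (assumptions noted above).
import random

import torch
import torch.nn.functional as F

def make_url_variant(url: str) -> str:
    """Generate a simple surface variant of a URL (hypothetical augmentation rules)."""
    augmentations = [
        lambda u: u.replace("http://", "https://", 1),  # scheme change
        lambda u: u.rstrip("/") + "/",                  # trailing-slash toggle
        lambda u: u.lower(),                            # case folding
    ]
    return random.choice(augmentations)(url)

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent loss: row i of z1 and row i of z2 embed variants of the same URL."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d) unit vectors
    sim = z @ z.t() / temperature                       # pairwise cosine logits
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Usage with any encoder that maps a batch of URLs to pooled embeddings:
#   z1 = encoder(urls)
#   z2 = encoder([make_url_variant(u) for u in urls])
#   loss = nt_xent_loss(z1, z2)
```

The second task, virtual adversarial training, can be sketched in the same spirit: find the small perturbation of the input embeddings that most changes the model's predictions, then penalize that change. The embedding-level `model` interface and the `xi`/`eps` values below are likewise assumptions.

```python
# Sketch of virtual adversarial training in embedding space (one power-iteration step).
import torch
import torch.nn.functional as F

def vat_loss(model, embeds: torch.Tensor, xi: float = 1e-6, eps: float = 1.0) -> torch.Tensor:
    """KL divergence between predictions on clean and adversarially perturbed embeddings."""
    with torch.no_grad():
        clean = F.softmax(model(embeds), dim=-1)        # reference distribution
    d = torch.randn_like(embeds, requires_grad=True)    # random probe direction
    kl = F.kl_div(F.log_softmax(model(embeds + xi * d), dim=-1), clean, reduction="batchmean")
    (grad,) = torch.autograd.grad(kl, d)                # direction of maximal sensitivity
    r_adv = eps * F.normalize(grad.flatten(1), dim=1).view_as(grad).detach()
    return F.kl_div(F.log_softmax(model(embeds + r_adv), dim=-1), clean, reduction="batchmean")
```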
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification [4.585051136007553]
We introduce DomURLs_BERT, a pre-trained BERT-based encoder for detecting and classifying suspicious/malicious domains and URLs.
The proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets.
arXiv Detail & Related papers (2024-09-13T18:59:13Z)
- Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data [3.2771631221674333]
We model the detection of topic-related content as a binary classification task.
Using only a few hundred annotated data points per topic, we detect content related to three German policies.
arXiv Detail & Related papers (2024-07-23T14:31:59Z)
- Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployment: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z)
- URL-BERT: Training Webpage Representations via Social Media Engagements [31.6455614291821]
We introduce a new pre-training objective that can be used to adapt LMs to understand URLs and webpages.
Our proposed framework consists of two steps: (1) scalable graph embeddings to learn shallow representations of URLs based on user engagement on social media, and (2) continued pre-training of the LM on these engagement-based URL representations.
We experimentally demonstrate that our continued pre-training approach improves webpage understanding on a variety of tasks and Twitter internal and external benchmarks.
arXiv Detail & Related papers (2023-10-25T02:22:50Z)
- Learning from Limited Heterogeneous Training Data: Meta-Learning for Unsupervised Zero-Day Web Attack Detection across Web Domains [23.41494712616903]
We propose RETSINA, a novel meta-learning based framework that enables zero-day Web attack detection across different domains.
We conduct experiments using four real-world datasets on different domains with a total of 293M Web requests.
Over one month, RETSINA captures an average of 126 and 218 zero-day attack requests per day in the two domains, respectively.
arXiv Detail & Related papers (2023-09-07T11:58:20Z)
- A Shapelet-based Framework for Unsupervised Multivariate Time Series Representation Learning [29.511632089649552]
We propose a novel unsupervised representation learning (URL) framework for multivariate time series by learning time-series-specific shapelet-based representations.
To the best of our knowledge, this is the first work to explore shapelet-based embeddings for unsupervised general-purpose representation learning.
A unified shapelet-based encoder and a novel learning objective with multi-grained contrasting and multi-scale alignment are particularly designed to achieve our goal.
arXiv Detail & Related papers (2023-05-30T09:31:57Z)
- Continual Object Detection via Prototypical Task Correlation Guided Gating Mechanism [120.1998866178014]
We present a flexible framework for continual object detection via pRotOtypical taSk corrElaTion guided gaTing mechAnism (ROSETTA).
Concretely, a unified framework is shared by all tasks while task-aware gates are introduced to automatically select sub-models for specific tasks.
Experiments on COCO-VOC, KITTI-Kitchen, class-incremental detection on VOC and sequential learning of four tasks show that ROSETTA yields state-of-the-art performance.
arXiv Detail & Related papers (2022-05-06T07:31:28Z)
- One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z)
- Learning to Match Jobs with Resumes from Sparse Interaction Data using Multi-View Co-Teaching Network [83.64416937454801]
Job-resume interaction data is sparse and noisy, which affects the performance of job-resume match algorithms.
We propose a novel multi-view co-teaching network from sparse interaction data for job-resume matching.
Our model is able to outperform state-of-the-art methods for job-resume matching.
arXiv Detail & Related papers (2020-09-25T03:09:54Z)
- Low Resource Multi-Task Sequence Tagging -- Revisiting Dynamic Conditional Random Fields [67.51177964010967]
We compare different models for low resource multi-task sequence tagging that leverage dependencies between label sequences for different tasks.
We find that explicit modeling of inter-dependencies between task predictions outperforms single-task as well as standard multi-task models.
arXiv Detail & Related papers (2020-05-01T07:11:34Z)