Related papers: Classification of URL bitstreams using Bag of Bytes

Classification of URL bitstreams using Bag of Bytes

URL: http://arxiv.org/abs/2111.06087v1
Date: Thu, 11 Nov 2021 07:43:45 GMT
Title: Classification of URL bitstreams using Bag of Bytes
Authors: Keiichi Shima, Daisuke Miyamoto, Hiroshi Abe, Tomohiro Ishihara, Kazuya Okada, Yuji Sekiya, Hirochika Asai, Yusuke Doi
Abstract summary: In this paper, we apply a mechanical approach to generate feature vectors from URL strings. Our approach achieved 23% better accuracy compared to the existing DL-based approach.
Score: 3.2204506933585026
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Protecting users from accessing malicious web sites is one of the important management tasks for network operators. There are many open-source and commercial products to control web sites users can access. The most traditional approach is blacklist-based filtering. This mechanism is simple but not scalable, though there are some enhanced approaches utilizing fuzzy matching technologies. Other approaches try to use machine learning (ML) techniques by extracting features from URL strings. This approach can cover a wider area of Internet web sites, but finding good features requires deep knowledge of trends of web site design. Recently, another approach using deep learning (DL) has appeared. The DL approach will help to extract features automatically by investigating a lot of existing sample data. Using this technique, we can build a flexible filtering decision module by keep teaching the neural network module about recent trends, without any specific expert knowledge of the URL domain. In this paper, we apply a mechanical approach to generate feature vectors from URL strings. We implemented our approach and tested with realistic URL access history data taken from a research organization and data from the famous archive site of phishing site information, PhishTank.com. Our approach achieved 2~3% better accuracy compared to the existing DL-based approach.

Related papers

Client-Side Zero-Shot LLM Inference for Comprehensive In-Browser URL Analysis [0.0]
Malicious websites and phishing URLs pose an ever-increasing cybersecurity risk.<n>Traditional detection approaches rely on machine learning.<n>We propose a novel client-side framework for comprehensive URL analysis.
arXiv Detail & Related papers (2025-06-04T07:47:23Z)
From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories [3.323388021979584]
Malicious URLs persistently threaten the cybersecurity ecosystem, by either deceiving users into divulging private data or distributing harmful payloads to infiltrate host systems. This review systematically analyzes methods from traditional blacklisting to advanced deep learning approaches. Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities.
arXiv Detail & Related papers (2025-04-23T06:23:18Z)
Dynamic Analysis and Adaptive Discriminator for Fake News Detection [59.41431561403343]
We propose a Dynamic Analysis and Adaptive Discriminator (DAAD) approach for fake news detection. For knowledge-based methods, we introduce the Monte Carlo Tree Search algorithm to leverage the self-reflective capabilities of large language models. For semantic-based methods, we define four typical deceit patterns to reveal the mechanisms behind fake news creation.
arXiv Detail & Related papers (2024-08-20T14:13:54Z)
Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users. We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set. We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
CRATOR: a Dark Web Crawler [1.7224362150588657]
This study proposes a general dark web crawler designed to extract pages handling security protocols, such as captchas. Our approach uses a combination of seed URL lists, link analysis, and scanning to discover new content.
arXiv Detail & Related papers (2024-05-10T09:39:12Z)
AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Existing methods, wrappers-based methods suffer from limited adaptability and scalability when faced with a new website. We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
CDFSL-V: Cross-Domain Few-Shot Learning for Videos [58.37446811360741]
Few-shot video action recognition is an effective approach to recognizing new categories with only a few labeled examples. Existing methods in video action recognition rely on large labeled datasets from the same domain. We propose a novel cross-domain few-shot video action recognition method that leverages self-supervised learning and curriculum learning.
arXiv Detail & Related papers (2023-09-07T19:44:27Z)
Learning to Identify Critical States for Reinforcement Learning from Videos [55.75825780842156]
Algorithmic information about good policies can be extracted from offline data which lack explicit information about executed actions. For example, videos of humans or robots may convey a lot of implicit information about rewarding action sequences. A DRL machine that wants to profit from watching such videos must first learn by itself to identify and recognize relevant states/actions/rewards.
arXiv Detail & Related papers (2023-08-15T14:21:24Z)
Many or Few Samples? Comparing Transfer, Contrastive and Meta-Learning in Encrypted Traffic Classification [68.19713459228369]
We compare transfer learning, meta-learning and contrastive learning against reference Machine Learning (ML) tree-based and monolithic DL models. We show that (i) using large datasets we can obtain more general representations, (ii) contrastive learning is the best methodology. While ML tree-based cannot handle large tasks but fits well small tasks, by means of reusing learned representations, DL methods are reaching tree-based models performance also for small tasks.
arXiv Detail & Related papers (2023-05-21T11:20:49Z)
Web Content Filtering through knowledge distillation of Large Language Models [1.7446104539598901]
We introduce a state-of-the-art approach for URL categorization that leverages the power of Large Language Models (LLMs) Our method utilizes LLMs to generate accurate classifications and then employs established knowledge distillation techniques to create smaller, more specialized student models tailored for web content filtering. Our student model matches the performance of the teacher LLM with 175 times less parameters, allowing the model to be used for in-line scanning of large volumes of URLs.
arXiv Detail & Related papers (2023-05-08T20:09:27Z)
An Adversarial Attack Analysis on Malicious Advertisement URL Detection Framework [22.259444589459513]
Malicious advertisement URLs pose a security risk since they are the source of cyber-attacks. Existing malicious URL detection techniques are limited and to handle unseen features as well as generalize to test data. In this study, we extract a novel set of lexical and web-scrapped features and employ machine learning technique to set up system for fraudulent advertisement URLs detection.
arXiv Detail & Related papers (2022-04-27T20:06:22Z)
PhishMatch: A Layered Approach for Effective Detection of Phishing URLs [8.658596218544774]
We present a layered anti-phishing defense, PhishMatch, which is robust, accurate, inexpensive, and client-side. A prototype plugin of PhishMatch, developed for the Chrome browser, was found to be fast and lightweight.
arXiv Detail & Related papers (2021-12-04T03:21:29Z)
Masked LARk: Masked Learning, Aggregation and Reporting worKflow [6.484847460164177]
Many web advertising data flows involve passive cross-site tracking of users. Most browsers are moving towards removal of 3PC in subsequent browser iterations. We propose a new proposal, called Masked LARk, for aggregation of user engagement measurement and model training.
arXiv Detail & Related papers (2021-10-27T21:59:37Z)
Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content. We extract and analyze the similarity between the two audio and visual modalities from within the same video. We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.