Classification of URL bitstreams using Bag of Bytes
- URL: http://arxiv.org/abs/2111.06087v1
- Date: Thu, 11 Nov 2021 07:43:45 GMT
- Title: Classification of URL bitstreams using Bag of Bytes
- Authors: Keiichi Shima, Daisuke Miyamoto, Hiroshi Abe, Tomohiro Ishihara,
Kazuya Okada, Yuji Sekiya, Hirochika Asai, Yusuke Doi
- Abstract summary: In this paper, we apply a mechanical approach to generate feature vectors from URL strings.
Our approach achieved 2~3% better accuracy compared to the existing DL-based approach.
- Score: 3.2204506933585026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Protecting users from accessing malicious web sites is one of the important
management tasks for network operators. There are many open-source and
commercial products to control web sites users can access. The most traditional
approach is blacklist-based filtering. This mechanism is simple but not
scalable, though there are some enhanced approaches utilizing fuzzy matching
technologies. Other approaches try to use machine learning (ML) techniques by
extracting features from URL strings. This approach can cover a wider area of
Internet web sites, but finding good features requires deep knowledge of trends
of web site design. Recently, another approach using deep learning (DL) has
appeared. The DL approach will help to extract features automatically by
investigating a lot of existing sample data. Using this technique, we can build
a flexible filtering decision module by continually teaching the neural network module
about recent trends, without any specific expert knowledge of the URL domain.
In this paper, we apply a mechanical approach to generate feature vectors from
URL strings. We implemented our approach and tested it with realistic URL access
history data taken from a research organization and data from the famous
archive site of phishing site information, PhishTank.com. Our approach achieved
2~3% better accuracy compared to the existing DL-based approach.
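
The abstract does not spell out how the feature vectors are built, so the following is only a minimal sketch of one plausible "bag of bytes" encoding, not the authors' implementation. It assumes the URL is UTF-8 encoded and summarized as a 256-bin, length-normalized byte histogram; the function name `bag_of_bytes` and the normalization choice are hypothetical.

```python
from collections import Counter


def bag_of_bytes(url: str) -> list[float]:
    """Return a 256-dimensional, length-normalized byte histogram of a URL.

    Assumption: one bin per possible byte value (0-255). The paper's actual
    tokenization may differ; this only illustrates the general idea of
    deriving features mechanically, without hand-crafted URL knowledge.
    """
    data = url.encode("utf-8")
    counts = Counter(data)      # iterating over bytes yields ints in 0..255
    total = len(data) or 1      # avoid division by zero for an empty URL
    return [counts.get(b, 0) / total for b in range(256)]


# Hypothetical usage: the vector could be fed to any standard classifier.
vec = bag_of_bytes("http://example.com/a")
```

Because the vector has a fixed length regardless of URL length, it can be passed directly to a conventional classifier; the order of bytes is discarded, which is the usual trade-off of any bag-style representation.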
Related papers
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
- CRATOR: a Dark Web Crawler [1.7224362150588657]
This study proposes a general dark web crawler designed to extract pages handling security protocols, such as captchas.
Our approach uses a combination of seed URL lists, link analysis, and scanning to discover new content.
arXiv Detail & Related papers (2024-05-10T09:39:12Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- CDFSL-V: Cross-Domain Few-Shot Learning for Videos [58.37446811360741]
Few-shot video action recognition is an effective approach to recognizing new categories with only a few labeled examples.
Existing methods in video action recognition rely on large labeled datasets from the same domain.
We propose a novel cross-domain few-shot video action recognition method that leverages self-supervised learning and curriculum learning.
arXiv Detail & Related papers (2023-09-07T19:44:27Z)
- Learning to Identify Critical States for Reinforcement Learning from Videos [55.75825780842156]
Algorithmic information about good policies can be extracted from offline data which lack explicit information about executed actions.
For example, videos of humans or robots may convey a lot of implicit information about rewarding action sequences.
A DRL machine that wants to profit from watching such videos must first learn by itself to identify and recognize relevant states/actions/rewards.
arXiv Detail & Related papers (2023-08-15T14:21:24Z)
- Many or Few Samples? Comparing Transfer, Contrastive and Meta-Learning in Encrypted Traffic Classification [68.19713459228369]
We compare transfer learning, meta-learning and contrastive learning against reference Machine Learning (ML) tree-based and monolithic DL models.
We show that (i) using large datasets we can obtain more general representations, (ii) contrastive learning is the best methodology.
While tree-based ML cannot handle large tasks but fits small tasks well, DL methods, by reusing learned representations, are reaching tree-based model performance on small tasks as well.
arXiv Detail & Related papers (2023-05-21T11:20:49Z)
- Web Content Filtering through knowledge distillation of Large Language Models [1.7446104539598901]
We introduce a state-of-the-art approach for URL categorization that leverages the power of Large Language Models (LLMs).
Our method utilizes LLMs to generate accurate classifications and then employs established knowledge distillation techniques to create smaller, more specialized student models tailored for web content filtering.
Our student model matches the performance of the teacher LLM with 175 times fewer parameters, allowing the model to be used for in-line scanning of large volumes of URLs.
arXiv Detail & Related papers (2023-05-08T20:09:27Z)
- An Adversarial Attack Analysis on Malicious Advertisement URL Detection Framework [22.259444589459513]
Malicious advertisement URLs pose a security risk since they are the source of cyber-attacks.
Existing malicious URL detection techniques are limited in their ability to handle unseen features and to generalize to test data.
In this study, we extract a novel set of lexical and web-scraped features and employ machine learning techniques to set up a system for detecting fraudulent advertisement URLs.
arXiv Detail & Related papers (2022-04-27T20:06:22Z)
- PhishMatch: A Layered Approach for Effective Detection of Phishing URLs [8.658596218544774]
We present a layered anti-phishing defense, PhishMatch, which is robust, accurate, inexpensive, and client-side.
A prototype plugin of PhishMatch, developed for the Chrome browser, was found to be fast and lightweight.
arXiv Detail & Related papers (2021-12-04T03:21:29Z)
- Masked LARk: Masked Learning, Aggregation and Reporting worKflow [6.484847460164177]
Many web advertising data flows involve passive cross-site tracking of users.
Most browsers are moving towards removal of 3PC in subsequent browser iterations.
We present a new proposal, called Masked LARk, for aggregation of user engagement measurement and model training.
arXiv Detail & Related papers (2021-10-27T21:59:37Z)
- Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content.
We extract and analyze the similarity between the two audio and visual modalities from within the same video.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.