Related papers: Fingerprinting web servers through Transformer-encoded HTTP response headers

Fingerprinting web servers through Transformer-encoded HTTP response headers

URL: http://arxiv.org/abs/2404.00056v1
Date: Tue, 26 Mar 2024 17:24:28 GMT
Title: Fingerprinting web servers through Transformer-encoded HTTP response headers
Authors: Patrick Darwinkel,
Abstract summary: We leverage state-of-the-art deep learning, big data, and natural language processing to enhance the detection of vulnerable web server versions. We conducted experiments by sending various ambiguous and non-standard HTTP requests to 4.77 million domains.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We explored leveraging state-of-the-art deep learning, big data, and natural language processing to enhance the detection of vulnerable web server versions. Focusing on improving accuracy and specificity over rule-based systems, we conducted experiments by sending various ambiguous and non-standard HTTP requests to 4.77 million domains and capturing HTTP response status lines. We represented these status lines through training a BPE tokenizer and RoBERTa encoder for unsupervised masked language modeling. We then dimensionality reduced and concatenated encoded response lines to represent each domain's web server. A Random Forest and multilayer perceptron (MLP) classified these web servers, and achieved 0.94 and 0.96 macro F1-score, respectively, on detecting the five most popular origin web servers. The MLP achieved a weighted F1-score of 0.55 on classifying 347 major type and minor version pairs. Analysis indicates that our test cases are meaningful discriminants of web server types. Our approach demonstrates promise as a powerful and flexible alternative to rule-based systems.

Related papers

Training Large Language Models for Advanced Typosquatting Detection [0.0]
Typosquatting is a cyber threat that exploits human error in typing URLs to deceive users, distribute malware, and conduct phishing attacks. This study introduces a novel approach leveraging large language models (LLMs) to enhance typosquatting detection. Experimental results indicate that the Phi-4 14B model outperformed other tested models when properly fine tuned achieving a 98% accuracy rate with only a few thousand training samples.
arXiv Detail & Related papers (2025-03-28T13:16:27Z)
ChatHTTPFuzz: Large Language Model-Assisted IoT HTTP Fuzzing [18.095573835226787]
Internet of Things (IoT) devices offer convenience through web interfaces, web VPNs, and other web-based services, all relying on the HTTP protocol. Most state-of-the-art tools still rely on random mutation trategies, leading to difficulties in accurately understanding the HTTP protocol's structure and generating many invalid test cases. We propose a novel LLM-guided IoT HTTP fuzzing method, ChatHTTPFuzz, which automatically parses protocol fields and analyzes service code logic to generate protocol-compliant test cases.
arXiv Detail & Related papers (2024-11-18T10:48:53Z)
AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Existing methods, wrappers-based methods suffer from limited adaptability and scalability when faced with a new website. We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
Beyond the Request: Harnessing HTTP Response Headers for Cross-Browser Web Tracker Classification in an Imbalanced Setting [0.0]
This study endeavors to design effective machine learning classifiers for web tracker detection using binarized HTTP response headers. Ten supervised models were trained on Chrome data and tested across all browsers, including a Chrome dataset from a year later. Results demonstrated high accuracy, F1-score, precision, recall, and minimal log-loss error for Chrome and Firefox, but subpar performance on Brave.
arXiv Detail & Related papers (2024-02-02T09:07:09Z)
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots. We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z)
PMANet: Malicious URL detection via post-trained language model guided multi-level feature attention network [16.73322002436809]
We propose PMANet, a pre-trained Language Model-Guided multi-level feature attention network. PMANet employs a post-training process with three self-supervised objectives: masked language modeling, noisy language modeling, and domain discrimination. Experiments on diverse scenarios, including small-scale data, class imbalance, and adversarial attacks, demonstrate PMANet's superiority over state-of-the-art models.
arXiv Detail & Related papers (2023-11-21T06:23:08Z)
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z)
Multimodal Web Navigation with Instruction-Finetuned Foundation Models [99.14209521903854]
We study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages. We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning.
arXiv Detail & Related papers (2023-05-19T17:44:34Z)
Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks. We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw. At the heart of the approach is a single multilingual token-free Charformer model. We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
HTTP2vec: Embedding of HTTP Requests for Detection of Anomalous Traffic [0.0]
We propose an unsupervised language representation model for embedding HTTP requests and then using it to classify anomalies in the traffic. The solution is motivated by methods used in Natural Language Processing (NLP) such as Doc2Vec. To verify how the solution would work in real word conditions, we train the model using only legitimate traffic.
arXiv Detail & Related papers (2021-08-03T21:53:31Z)
A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging [10.609715843964263]
We study how a state-of-the-art tagging model trained on different genres performs on Web content from unfiltered Reddit forum discussions. Our results show that even small amounts of in-domain data can outperform the contribution of data from other Web domains.
arXiv Detail & Related papers (2020-04-29T16:36:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.