Beyond the Request: Harnessing HTTP Response Headers for Cross-Browser Web Tracker Classification in an Imbalanced Setting
- URL: http://arxiv.org/abs/2402.01240v3
- Date: Mon, 23 Sep 2024 11:33:15 GMT
- Title: Beyond the Request: Harnessing HTTP Response Headers for Cross-Browser Web Tracker Classification in an Imbalanced Setting
- Authors: Wolf Rieder, Philip Raschke, Thomas Cory
- Abstract summary: This study endeavors to design effective machine learning classifiers for web tracker detection using binarized HTTP response headers.
Ten supervised models were trained on Chrome data and tested across all browsers, including a Chrome dataset from a year later.
Results demonstrated high accuracy, F1-score, precision, recall, and minimal log-loss error for Chrome and Firefox, but subpar performance on Brave.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The World Wide Web's connectivity is greatly attributed to the HTTP protocol, with HTTP messages offering informative header fields that appeal to disciplines like web security and privacy, especially concerning web tracking. Despite existing research employing HTTP request messages to identify web trackers, HTTP response headers are often overlooked. This study endeavors to design effective machine learning classifiers for web tracker detection using binarized HTTP response headers. Data from the Chrome, Firefox, and Brave browsers, obtained through the traffic monitoring browser extension T.EX, serves as our dataset. Ten supervised models were trained on Chrome data and tested across all browsers, including a Chrome dataset from a year later. The results demonstrated high accuracy, F1-score, precision, recall, and minimal log-loss error for Chrome and Firefox, but subpar performance on Brave, potentially due to its distinct data distribution and feature set. The research suggests that these classifiers are viable for web tracker detection. However, real-world application testing remains pending, and the distinction between tracker types and broader label sources could be explored in future studies.
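For illustration, here is a minimal sketch (not the authors' code) of the binarization idea described in the abstract: each distinct response header name becomes a 0/1 presence feature, and a supervised classifier is trained and evaluated on those features. The toy data, header names, and the choice of a random forest are assumptions for demonstration; the paper trains ten different models on Chrome data and evaluates them on Firefox, Brave, and a later Chrome crawl.

```python
# Minimal sketch (not the authors' code): binarize HTTP response header names
# into 0/1 presence features and train a supervised tracker classifier.
# Toy data, header names, and the random forest are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, log_loss,
                             precision_score, recall_score)
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical input: one row per HTTP response with its header names and a
# tracker / non-tracker label (e.g. derived from a filter list).
responses = pd.DataFrame({
    "headers": [
        ["content-type", "set-cookie", "p3p"],
        ["content-type", "cache-control"],
        ["content-type", "set-cookie", "timing-allow-origin"],
        ["content-type", "etag"],
    ],
    "is_tracker": [1, 0, 1, 0],
})

# Binarization: each distinct header name becomes a binary feature column.
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(responses["headers"])
y = responses["is_tracker"].values

# class_weight="balanced" as a simple nod to the imbalanced setting.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X, y)

# Toy in-sample evaluation; the paper instead trains on Chrome data and tests
# on Firefox, Brave, and a Chrome crawl recorded a year later.
proba = clf.predict_proba(X)[:, 1]
pred = (proba >= 0.5).astype(int)
print("accuracy :", accuracy_score(y, pred))
print("precision:", precision_score(y, pred, zero_division=0))
print("recall   :", recall_score(y, pred, zero_division=0))
print("F1       :", f1_score(y, pred, zero_division=0))
print("log-loss :", log_loss(y, proba, labels=[0, 1]))
```

In a real setup the same binarized feature space would be built from the T.EX crawl data, and differences in which headers appear at all across browsers (e.g. in Brave) are one plausible reason for the cross-browser performance drop the authors report.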
Related papers
- ChatHTTPFuzz: Large Language Model-Assisted IoT HTTP Fuzzing [18.095573835226787]
Internet of Things (IoT) devices offer convenience through web interfaces, web VPNs, and other web-based services, all relying on the HTTP protocol.
Most state-of-the-art tools still rely on random mutation strategies, so they struggle to accurately model the HTTP protocol's structure and generate many invalid test cases.
We propose a novel LLM-guided IoT HTTP fuzzing method, ChatHTTPFuzz, which automatically parses protocol fields and analyzes service code logic to generate protocol-compliant test cases.
arXiv Detail & Related papers (2024-11-18T10:48:53Z) - Beyond Browsing: API-Based Web Agents [58.39129004543844]
API-based agents outperform web browsing agents in experiments on WebArena.
Hybrid agents outperform both nearly uniformly across tasks.
Results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.
arXiv Detail & Related papers (2024-10-21T19:46:06Z) - How Unique is Whose Web Browser? The role of demographics in browser fingerprinting among US users [50.699390248359265]
Browser fingerprinting can be used to identify and track users across the Web, even without cookies.
This technique and resulting privacy risks have been studied for over a decade.
We provide a first-of-its-kind dataset to enable further research.
arXiv Detail & Related papers (2024-10-09T14:51:58Z) - The HTTP Garden: Discovering Parsing Vulnerabilities in HTTP/1.1 Implementations by Differential Fuzzing of Request Streams [7.012240324005978]
HTTP/1.1 parsing discrepancies have been the basis for numerous classes of attacks against web servers.
Our system, the HTTP Garden, examines both origin servers' interpretations and gateway servers' transformations of HTTP requests.
Using our tool, we have discovered and reported over 100 HTTP parsing bugs in popular web servers, of which 68 have been fixed following our reports.
arXiv Detail & Related papers (2024-05-28T01:48:05Z) - AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z) - Fingerprinting web servers through Transformer-encoded HTTP response headers [0.0]
We leverage state-of-the-art deep learning, big data, and natural language processing to enhance the detection of vulnerable web server versions.
We conducted experiments by sending various ambiguous and non-standard HTTP requests to 4.77 million domains.
arXiv Detail & Related papers (2024-03-26T17:24:28Z) - Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection.
To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs.
We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z) - HTTP2vec: Embedding of HTTP Requests for Detection of Anomalous Traffic [0.0]
We propose an unsupervised language representation model for embedding HTTP requests and then use it to classify anomalies in the traffic.
The solution is motivated by methods used in Natural Language Processing (NLP) such as Doc2Vec.
To verify how the solution would work in real-world conditions, we train the model using only legitimate traffic; a minimal sketch of this Doc2Vec-style approach appears after this list.
arXiv Detail & Related papers (2021-08-03T21:53:31Z) - A machine learning approach for detecting CNAME cloaking-based tracking on the Web [2.7267622401439255]
We propose a supervised learning-based method to detect CNAME cloaking-based tracking without the on-demand DNS lookup API.
Our goal is to detect both sites and requests linked to cloaking-related tracking.
Our evaluation shows that the proposed approach outperforms well-known tracking filter lists.
arXiv Detail & Related papers (2020-09-29T22:33:19Z) - High-Performance Long-Term Tracking with Meta-Updater [75.80564183653274]
Long-term visual tracking has drawn increasing attention because it is much closer to practical applications than short-term tracking.
Most top-ranked long-term trackers adopt offline-trained Siamese architectures, so they cannot benefit from the great progress of short-term trackers with online updates.
We propose a novel offline-trained Meta-Updater to address an important but unsolved problem: Is the tracker ready for updating in the current frame?
arXiv Detail & Related papers (2020-04-01T09:29:23Z) - PyODDS: An End-to-end Outlier Detection System with Automated Machine Learning [55.32009000204512]
We present PyODDS, an automated end-to-end Python system for Outlier Detection with Database Support.
Specifically, we define the search space in the outlier detection pipeline, and produce a search strategy within the given search space.
It also provides unified interfaces and visualizations for users with or without data science or machine learning background.
arXiv Detail & Related papers (2020-03-12T03:30:30Z)
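As referenced in the HTTP2vec entry above, here is a minimal sketch of that idea (an assumption, not the paper's implementation): learn Doc2Vec embeddings of raw HTTP requests from legitimate traffic only, then flag requests whose embeddings look anomalous. The tokenization, hyperparameters, and the one-class SVM detector are illustrative choices.

```python
# Minimal sketch (an assumption, not the paper's implementation) of the
# HTTP2vec-style idea: learn Doc2Vec embeddings of HTTP requests from
# legitimate traffic only, then score new requests for anomalousness.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import OneClassSVM

def tokenize(request: str) -> list[str]:
    # Naive whitespace tokenization of the request line and headers;
    # the paper's preprocessing may differ.
    return request.lower().split()

# Hypothetical legitimate training traffic.
legit_requests = [
    "GET /index.html HTTP/1.1 Host: example.org Accept: text/html",
    "GET /style.css HTTP/1.1 Host: example.org Accept: text/css",
    "POST /login HTTP/1.1 Host: example.org Content-Type: application/x-www-form-urlencoded",
]
corpus = [TaggedDocument(tokenize(r), [i]) for i, r in enumerate(legit_requests)]

# Doc2Vec embedding model trained on legitimate requests only.
model = Doc2Vec(vector_size=32, min_count=1, epochs=40, seed=0)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# One-class detector fitted on the embeddings of legitimate traffic.
train_vecs = [model.infer_vector(doc.words) for doc in corpus]
detector = OneClassSVM(nu=0.1, gamma="scale").fit(train_vecs)

# Score an unseen, suspicious-looking request: negative scores indicate outliers.
suspicious = "GET /index.html?id=1%27%20OR%20%271%27=%271 HTTP/1.1 Host: example.org"
score = detector.decision_function([model.infer_vector(tokenize(suspicious))])[0]
print("anomaly score (negative = outlier):", score)
```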