Cookiescanner: An Automated Tool for Detecting and Evaluating GDPR
Consent Notices on Websites
- URL: http://arxiv.org/abs/2309.06196v1
- Date: Tue, 12 Sep 2023 13:04:00 GMT
- Title: Cookiescanner: An Automated Tool for Detecting and Evaluating GDPR
Consent Notices on Websites
- Authors: Ralf Gundelach and Dominik Herrmann
- Abstract summary: We present cookiescanner, an automated scanning tool that detects and extracts consent notices.
We found that manually curated filter lists have the highest precision but recall fewer consent notices than our keyword-based methods.
Our BERT model achieves high precision for English notices, which is in line with previous work, but suffers from low recall due to insufficient candidate extraction.
- Score: 1.3416250383686867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The enforcement of the GDPR led to the widespread adoption of consent
notices, colloquially known as cookie banners. Studies have shown that many
website operators do not comply with the law and track users prior to any
interaction with the consent notice, or attempt to trick users into giving
consent through dark patterns. Previous research has relied on manually curated
filter lists or automated detection methods limited to a subset of websites,
making research on GDPR compliance of consent notices tedious or limited. We
present cookiescanner, an automated scanning tool that detects and
extracts consent notices via various methods and checks if they offer a decline
option or use color diversion. We evaluated cookiescanner on a random sample of
the top 10,000 websites listed by Tranco. We found that manually curated filter
lists have the highest precision but recall fewer consent notices than our
keyword-based methods. Our BERT model achieves high precision for English
notices, which is in line with previous work, but suffers from low recall due
to insufficient candidate extraction. While the automated detection of decline
options proved to be challenging due to the dynamic nature of many sites,
detecting instances of different colors of the buttons was successful in most
cases. Besides systematically evaluating our various detection techniques, we
have manually annotated 1,000 websites to provide a ground-truth baseline,
which has not existed previously. Furthermore, we release our code and the
annotated dataset in the interest of reproducibility and repeatability.
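The three checks named in the abstract (keyword-based detection, presence of a decline option, and differing button colors) can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword lists, the function names, the hit threshold, and the color-distance threshold are all assumptions chosen for illustration, and a real scanner would work on a rendered page rather than on plain strings.

```python
# Illustrative sketch only; keyword lists and thresholds are assumptions,
# not values taken from the cookiescanner paper or code.
CONSENT_KEYWORDS = ("cookie", "consent", "privacy", "gdpr", "accept", "agree")
DECLINE_KEYWORDS = ("decline", "reject", "deny", "refuse")


def keyword_score(text: str) -> int:
    """Count how many distinct consent keywords occur in the text."""
    lowered = text.lower()
    return sum(1 for keyword in CONSENT_KEYWORDS if keyword in lowered)


def looks_like_consent_notice(text: str, min_hits: int = 2) -> bool:
    """Flag a text block as a candidate notice if enough keywords match."""
    return keyword_score(text) >= min_hits


def has_decline_option(button_labels: list[str]) -> bool:
    """Check whether any button label contains a decline keyword."""
    return any(
        keyword in label.lower()
        for label in button_labels
        for keyword in DECLINE_KEYWORDS
    )


def parse_hex_color(value: str) -> tuple[int, int, int]:
    """Convert a '#rrggbb' string into an (r, g, b) tuple."""
    digits = value.lstrip("#")
    return (int(digits[0:2], 16), int(digits[2:4], 16), int(digits[4:6], 16))


def color_distance(first: str, second: str) -> float:
    """Euclidean distance between two colors in RGB space."""
    a, b = parse_hex_color(first), parse_hex_color(second)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def uses_color_diversion(accept_color: str, decline_color: str,
                         threshold: float = 50.0) -> bool:
    """Treat clearly different button colors as a sign of color diversion."""
    return color_distance(accept_color, decline_color) > threshold


if __name__ == "__main__":
    banner = "We use cookies. Click accept to give your consent."
    print(looks_like_consent_notice(banner))
    print(has_decline_option(["Accept all", "Settings"]))
    print(uses_color_diversion("#00aa00", "#ffffff"))
```

A plain keyword count favors recall over precision, which matches the trade-off the abstract reports for keyword-based methods compared with curated filter lists.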
Related papers
- Enhanced Review Detection and Recognition: A Platform-Agnostic Approach with Application to Online Commerce [0.46040036610482665]
We present a machine learning methodology for review detection and extraction.
We demonstrate that it generalises for use across websites that were not contained in the training data.
This method promises to drive applications for automatic detection and evaluation of reviews, regardless of their source.
arXiv Detail & Related papers (2024-05-09T00:32:22Z)
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- Neural Embeddings for Web Testing [49.66745368789056]
Existing crawlers rely on app-specific, threshold-based algorithms to assess state equivalence.
We propose WEBEMBED, a novel abstraction function based on neural network embeddings and threshold-free classifiers.
Our evaluation on nine web apps shows that WEBEMBED outperforms state-of-the-art techniques by detecting near-duplicates more accurately.
arXiv Detail & Related papers (2023-06-12T19:59:36Z)
- Can AI-Generated Text be Reliably Detected? [54.670136179857344]
Unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc.
Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques.
In this paper, we show that these detectors are not reliable in practical scenarios.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
- Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases insignificant changes in input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Augmenting Rule-based DNS Censorship Detection at Scale with Machine Learning [38.00013408742201]
Censorship of the domain name system (DNS) is a key mechanism used across different countries.
In this paper, we explore how machine learning (ML) models can help streamline the detection process.
We find that unsupervised models, trained solely on uncensored instances, can identify new instances and variations of censorship missed by existing probes.
arXiv Detail & Related papers (2023-02-03T23:36:30Z)
- An End-to-End Set Transformer for User-Level Classification of Depression and Gambling Disorder [24.776445591293186]
This work proposes a transformer architecture for user-level classification of gambling addiction and depression.
We process a set of social media posts from a particular individual, to make use of the interactions between posts and eliminate label noise at the post level.
Our architecture is interpretable with modern feature attribution methods and allows for automatic dataset creation.
arXiv Detail & Related papers (2022-07-02T06:40:56Z)
- An Adversarial Attack Analysis on Malicious Advertisement URL Detection Framework [22.259444589459513]
Malicious advertisement URLs pose a security risk since they are the source of cyber-attacks.
Existing malicious URL detection techniques are limited in their ability to handle unseen features and to generalize to test data.
In this study, we extract a novel set of lexical and web-scraped features and employ machine learning techniques to set up a system for detecting fraudulent advertisement URLs.
arXiv Detail & Related papers (2022-04-27T20:06:22Z)
- Automated detection of dark patterns in cookie banners: how to do it poorly and why it is hard to do it any other way [7.2834950390171205]
A dataset of cookie banners from 300 news websites was used to train a prediction model that detects dark patterns in cookie banners.
The accuracy of the trained model is promising, but allows a lot of room for improvement.
We provide an in-depth analysis of the interdisciplinary challenges that automated dark pattern detection poses to artificial intelligence.
arXiv Detail & Related papers (2022-04-21T12:10:27Z)
- Spotting adversarial samples for speaker verification by neural vocoders [102.1486475058963]
We adopt neural vocoders to spot adversarial samples for automatic speaker verification (ASV).
We find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discriminating between genuine and adversarial samples.
Our code will be made open source so that future work can compare against it.
arXiv Detail & Related papers (2021-07-01T08:58:16Z)
- Detection of Novel Social Bots by Ensembles of Specialized Classifiers [60.63582690037839]
Malicious actors create inauthentic social media accounts controlled in part by algorithms, known as social bots, to disseminate misinformation and agitate online discussion.
We show that different types of bots are characterized by different behavioral features.
We propose a new supervised learning method that trains classifiers specialized for each class of bots and combines their decisions through the maximum rule.
arXiv Detail & Related papers (2020-06-11T22:59:59Z)