Privacy Policies over Time: Curation and Analysis of a Million-Document
Dataset
- URL: http://arxiv.org/abs/2008.09159v4
- Date: Tue, 20 Jul 2021 19:09:53 GMT
- Title: Privacy Policies over Time: Curation and Analysis of a Million-Document
Dataset
- Authors: Ryan Amos, Gunes Acar, Eli Lucherini, Mihir Kshirsagar, Arvind
Narayanan, Jonathan Mayer
- Abstract summary: We develop a crawler that discovers, downloads, and extracts archived privacy policies from the Internet Archive's Wayback Machine.
We curated a dataset of 1,071,488 English language privacy policies, spanning over two decades and over 130,000 distinct websites.
Our data indicate that self-regulation for first-party websites has stagnated, while self-regulation for third parties has increased but is dominated by online advertising trade associations.
- Score: 6.060757543617328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated analysis of privacy policies has proved a fruitful research
direction, with developments such as automated policy summarization, question
answering systems, and compliance detection. Prior research has been limited to
analysis of privacy policies from a single point in time or from short spans of
time, as researchers did not have access to a large-scale, longitudinal,
curated dataset. To address this gap, we developed a crawler that discovers,
downloads, and extracts archived privacy policies from the Internet Archive's
Wayback Machine. Using the crawler and following a series of validation and
quality control steps, we curated a dataset of 1,071,488 English language
privacy policies, spanning over two decades and over 130,000 distinct websites.
Our analyses of the data paint a troubling picture of the transparency and
accessibility of privacy policies. By comparing the occurrence of
tracking-related terminology in our dataset to prior web privacy measurements,
we find that privacy policies have consistently failed to disclose the presence
of common tracking technologies and third parties. We also find that over the
last twenty years privacy policies have become even more difficult to read,
doubling in length and increasing a full grade in the median reading level. Our
data indicate that self-regulation for first-party websites has stagnated,
while self-regulation for third parties has increased but is dominated by
online advertising trade associations. Finally, we contribute to the literature
on privacy regulation by demonstrating the historic impact of the GDPR on
privacy policies.
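To make the data-collection approach concrete, the following is a minimal Python sketch, not the authors' crawler: it pulls yearly Wayback Machine captures of a known privacy-policy URL through the public CDX API, extracts the visible text, and reports a word count plus a rough Flesch-Kincaid grade for each capture. The requests and beautifulsoup4 packages, the placeholder URL example.com/privacy, and the use of FK grade as a stand-in readability measure are all assumptions; the paper's actual pipeline also discovers policy links and applies extensive validation and quality control.

```python
# Minimal sketch (not the authors' crawler): pull yearly Wayback Machine
# captures of a known privacy-policy URL from the public CDX API, extract
# the visible text, and report word count plus a rough Flesch-Kincaid grade.
# Assumptions: the policy URL is already known (the paper's crawler also
# discovers policy links), and FK grade stands in for the paper's metric.
import re
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

CDX_API = "http://web.archive.org/cdx/search/cdx"


def yearly_snapshots(url, start=1998, end=2020):
    """Return at most one (timestamp, original_url) capture per year."""
    params = {
        "url": url,
        "output": "json",
        "from": str(start),
        "to": str(end),
        "filter": "statuscode:200",
        "collapse": "timestamp:4",  # collapse on the year digits
    }
    rows = requests.get(CDX_API, params=params, timeout=30).json()
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [(row[ts], row[orig]) for row in captures]


def archived_text(timestamp, original_url):
    """Download one archived capture and strip it down to visible text."""
    # The 'id_' modifier asks for the raw capture without the Wayback toolbar.
    snapshot = f"http://web.archive.org/web/{timestamp}id_/{original_url}"
    html = requests.get(snapshot, timeout=30).text
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)


def fk_grade(text):
    """Flesch-Kincaid grade level with a crude vowel-group syllable count."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * n_words / sentences + 11.8 * syllables / n_words - 15.59


if __name__ == "__main__":
    # Hypothetical example URL; substitute any real privacy-policy address.
    for ts, orig in yearly_snapshots("example.com/privacy"):
        text = archived_text(ts, orig)
        print(ts[:4], len(text.split()), f"{fk_grade(text):.1f}")
```

The collapse=timestamp:4 parameter asks the CDX index for at most one capture per calendar year, which keeps the longitudinal sampling roughly uniform across the study period.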
Related papers
- PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action [54.11479432110771]
PrivacyLens is a novel framework designed to extend privacy-sensitive seeds into expressive vignettes and further into agent trajectories.
We instantiate PrivacyLens with a collection of privacy norms grounded in privacy literature and crowdsourced seeds.
State-of-the-art LMs, like GPT-4 and Llama-3-70B, leak sensitive information in 25.68% and 38.69% of cases, even when prompted with privacy-enhancing instructions.
arXiv Detail & Related papers (2024-08-29T17:58:38Z) - A Survey of Privacy-Preserving Model Explanations: Privacy Risks, Attacks, and Countermeasures [50.987594546912725]
Despite a growing corpus of research in AI privacy and explainability, little attention has been paid to privacy-preserving model explanations.
This article presents the first thorough survey about privacy attacks on model explanations and their countermeasures.
arXiv Detail & Related papers (2024-03-31T12:44:48Z) - A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z) - A Survey on Privacy in Graph Neural Networks: Attacks, Preservation, and
Applications [76.88662943995641]
Graph Neural Networks (GNNs) have gained significant attention owing to their ability to handle graph-structured data, but training them on graphs that contain sensitive information raises privacy concerns.
To address this issue, researchers have started to develop privacy-preserving GNNs.
Despite this progress, there is a lack of a comprehensive overview of the attacks and the techniques for preserving privacy in the graph domain.
arXiv Detail & Related papers (2023-08-31T00:31:08Z) - More Data Types More Problems: A Temporal Analysis of Complexity,
Stability, and Sensitivity in Privacy Policies [0.0]
Data brokers and data processors are part of a multi-billion-dollar industry that profits from collecting, buying, and selling consumer data.
Yet there is little transparency in the data collection industry, which makes it difficult to understand what types of data are being collected, used, and sold.
arXiv Detail & Related papers (2023-02-17T15:21:24Z) - How Do Input Attributes Impact the Privacy Loss in Differential Privacy? [55.492422758737575]
We study the connection between the per-subject norm in DP neural networks and individual privacy loss.
We introduce a novel metric termed the Privacy Loss-Input Susceptibility (PLIS), which allows one to apportion a subject's privacy loss to their input attributes.
arXiv Detail & Related papers (2022-11-18T11:39:03Z) - How to keep text private? A systematic review of deep learning methods
for privacy-preserving natural language processing [0.38073142980732994]
This article systematically reviews over sixty methods for privacy-preserving NLP published between 2016 and 2020.
We introduce a novel taxonomy for classifying the existing methods into three categories: data safeguarding methods, trusted methods, and verification methods.
We discuss open challenges in privacy-preserving NLP regarding data traceability, computational overhead, dataset size, and the prevalence of human biases in embeddings.
arXiv Detail & Related papers (2022-05-20T11:29:44Z) - Privacy Policies Across the Ages: Content and Readability of Privacy
Policies 1996--2021 [1.5229257192293197]
We analyze the 25-year history of privacy policies using methods from transparency research, machine learning, and natural language processing.
We collect a large-scale longitudinal corpus of privacy policies from 1996 to 2021.
Our results show that policies are getting longer and harder to read, especially after new regulations take effect.
arXiv Detail & Related papers (2022-01-21T15:13:02Z) - AI-enabled Automation for Completeness Checking of Privacy Policies [7.707284039078785]
In Europe, privacy policies must comply with the General Data Protection Regulation (GDPR).
In this paper, we propose AI-based automation for completeness checking of privacy policies.
arXiv Detail & Related papers (2021-06-10T12:10:51Z) - PolicyQA: A Reading Comprehension Dataset for Privacy Policies [77.79102359580702]
We present PolicyQA, a dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies.
We evaluate two existing neural QA models and perform rigorous analysis to reveal the advantages and challenges offered by PolicyQA.
arXiv Detail & Related papers (2020-10-06T09:04:58Z) - Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies [13.09699710197036]
We create PrivaSeer, a corpus of over one million English language website privacy policies.
We report results from readability tests, document similarity, and keyphrase extraction, and explore the corpus through topic modeling.
arXiv Detail & Related papers (2020-04-23T13:21:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.