Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies
- URL: http://arxiv.org/abs/2004.11131v2
- Date: Sat, 30 Mar 2024 12:21:59 GMT
- Title: Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies
- Authors: Mukund Srinath, Shomir Wilson, C. Lee Giles
- Abstract summary: We create PrivaSeer, a corpus of over one million English language website privacy policies.
We show results from readability tests, document similarity, and keyphrase extraction, and explore the corpus through topic modeling.
- Score: 13.09699710197036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Organisations disclose their privacy practices by posting privacy policies on their websites. Even though users often care about their digital privacy, they rarely read privacy policies, since doing so requires a significant investment of time and effort. Although natural language processing can help in privacy policy understanding, there has been a lack of large-scale privacy policy corpora that could be used to analyse, understand, and simplify privacy policies. Thus, we create PrivaSeer, a corpus of over one million English language website privacy policies, which is significantly larger than any previously available corpus. We design a corpus creation pipeline that consists of crawling the web followed by filtering documents using language detection, document classification, duplicate and near-duplicate removal, and content extraction. We investigate the composition of the corpus, show results from readability tests, document similarity, and keyphrase extraction, and explore the corpus through topic modeling.
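The filtering stages named in the abstract can be illustrated with a minimal sketch. The code below is an assumption for illustration only, not the authors' pipeline: it uses the langdetect package for language identification, a keyword heuristic in place of the paper's trained document classifier, and a bottom-k shingle sketch as a stand-in for near-duplicate removal.

```python
# Minimal sketch of a PrivaSeer-style filtering pipeline (illustration only,
# not the authors' code). Assumptions: langdetect for language identification,
# a keyword heuristic standing in for the privacy-policy classifier, and a
# bottom-k shingle sketch approximating near-duplicate detection.
import hashlib

from langdetect import detect


def is_english(text: str) -> bool:
    """Keep only documents detected as English."""
    try:
        return detect(text) == "en"
    except Exception:  # langdetect raises on empty or undetectable input
        return False


def looks_like_privacy_policy(text: str) -> bool:
    """Crude stand-in for the document classifier described in the paper."""
    t = text.lower()
    return "privacy" in t and any(
        kw in t for kw in ("personal information", "cookies", "third parties")
    )


def shingle_sketch(text: str, k: int = 5, sketch_size: int = 128) -> frozenset:
    """Bottom-k sketch of hashed word k-shingles, used to estimate Jaccard similarity."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
    return frozenset(hashes[:sketch_size])


def _jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two sketches."""
    return len(a & b) / max(len(a | b), 1)


def filter_corpus(raw_docs: list[str], dup_threshold: float = 0.9) -> list[str]:
    """Apply language detection, classification, and near-duplicate removal in order."""
    kept, sketches = [], []
    for doc in raw_docs:
        if not is_english(doc) or not looks_like_privacy_policy(doc):
            continue
        sketch = shingle_sketch(doc)
        if any(_jaccard(sketch, s) >= dup_threshold for s in sketches):
            continue  # near-duplicate of an already kept document
        sketches.append(sketch)
        kept.append(doc)
    return kept
```

This only mirrors the order of the stages described in the abstract; the corpus itself relies on a trained classifier and more scalable deduplication than this toy version.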
Related papers
- Differential Privacy Overview and Fundamental Techniques [63.0409690498569]
This chapter is meant to be part of the book "Differential Privacy in Artificial Intelligence: From Theory to Practice".
It starts by illustrating various attempts to protect data privacy, emphasizing where and why they failed.
It then defines the key actors, tasks, and scopes that make up the domain of privacy-preserving data analysis.
arXiv Detail & Related papers (2024-11-07T13:52:11Z)
- PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action [54.11479432110771]
PrivacyLens is a novel framework designed to extend privacy-sensitive seeds into expressive vignettes and further into agent trajectories.
We instantiate PrivacyLens with a collection of privacy norms grounded in privacy literature and crowdsourced seeds.
State-of-the-art LMs, like GPT-4 and Llama-3-70B, leak sensitive information in 25.68% and 38.69% of cases, respectively, even when prompted with privacy-enhancing instructions.
arXiv Detail & Related papers (2024-08-29T17:58:38Z)
- Privacy Checklist: Privacy Violation Detection Grounding on Contextual Integrity Theory [43.12744258781724]
We formulate the privacy issue as a reasoning problem rather than simple pattern matching.
We develop the first comprehensive checklist that covers social identities, private attributes, and existing privacy regulations.
arXiv Detail & Related papers (2024-08-19T14:48:04Z)
- Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory [82.7042006247124]
We show that even the most capable AI models reveal private information in contexts that humans would not, 39% and 57% of the time depending on the model.
Our work underscores the immediate need to explore novel inference-time privacy-preserving approaches, based on reasoning and theory of mind.
arXiv Detail & Related papers (2023-10-27T04:15:30Z)
- PLUE: Language Understanding Evaluation Benchmark for Privacy Policies in English [77.79102359580702]
We introduce the Privacy Policy Language Understanding Evaluation (PLUE) benchmark, a multi-task benchmark for evaluating privacy policy language understanding.
We also collect a large corpus of privacy policies to enable privacy policy domain-specific language model pre-training.
We demonstrate that domain-specific continual pre-training offers performance improvements across all tasks.
arXiv Detail & Related papers (2022-12-20T05:58:32Z)
- Algorithms with More Granular Differential Privacy Guarantees [65.3684804101664]
We consider partial differential privacy (DP), which allows quantifying the privacy guarantee on a per-attribute basis.
In this work, we study several basic data analysis and learning tasks and design algorithms whose per-attribute privacy parameter is smaller than the best possible privacy parameter for the entire record of a person.
arXiv Detail & Related papers (2022-09-08T22:43:50Z)
- Privacy Policies Across the Ages: Content and Readability of Privacy Policies 1996–2021 [1.5229257192293197]
We analyze the 25-year history of privacy policies using methods from transparency research, machine learning, and natural language processing.
We collect a large-scale longitudinal corpus of privacy policies from 1996 to 2021.
Our results show that policies are getting longer and harder to read, especially after new regulations take effect (a readability sketch follows this entry).
arXiv Detail & Related papers (2022-01-21T15:13:02Z)
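Longitudinal readability findings like the one above, as well as the readability tests on PrivaSeer itself, rest on standard readability formulas. The snippet below is an illustrative sketch using the textstat package; the package and function choices are assumptions, not the tooling used in either paper.

```python
# Illustrative sketch of per-policy length and readability measurement;
# textstat is an assumption, not the papers' actual tooling.
import textstat


def readability_stats(policy_text: str) -> dict:
    """Word count plus two standard readability scores for a single privacy policy."""
    return {
        "word_count": textstat.lexicon_count(policy_text),
        "flesch_reading_ease": textstat.flesch_reading_ease(policy_text),    # lower = harder to read
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(policy_text),  # approximate US grade level
    }
```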
- Private Reinforcement Learning with PAC and Regret Guarantees [69.4202374491817]
We design privacy-preserving exploration policies for episodic reinforcement learning (RL).
We first provide a meaningful privacy formulation using the notion of joint differential privacy (JDP).
We then develop a private optimism-based learning algorithm that simultaneously achieves strong PAC and regret bounds, and enjoys a JDP guarantee.
arXiv Detail & Related papers (2020-09-18T20:18:35Z)
- Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset [6.060757543617328]
We develop a crawler that discovers, downloads, and extracts archived privacy policies from the Internet Archive's Wayback Machine (a minimal query sketch follows this entry).
We curated a dataset of 1,071,488 English language privacy policies, spanning over two decades and over 130,000 distinct websites.
Our data indicate that self-regulation for first-party websites has stagnated, while self-regulation for third parties has increased but is dominated by online advertising trade associations.
arXiv Detail & Related papers (2020-08-20T19:00:37Z)
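A crawler over archived policies typically starts from the Wayback Machine's public CDX index. The sketch below is a minimal, hypothetical query against that API; the privacy* path pattern and the helper names are assumptions for illustration, and this is not the dataset's actual crawler.

```python
# Hypothetical sketch of listing archived snapshots of a site's privacy-policy
# pages via the Internet Archive's public CDX API; not the paper's crawler.
import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"


def archived_policy_snapshots(domain: str, limit: int = 50) -> list[dict]:
    """Return CDX rows (timestamp, original URL, ...) for privacy-policy-like paths on a domain."""
    params = {
        "url": f"{domain}/privacy*",  # trailing * requests a prefix match; the path pattern is an assumption
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "digest",         # drop consecutive snapshots whose content hash is unchanged
        "limit": str(limit),
    }
    rows = requests.get(CDX_API, params=params, timeout=30).json()
    if not rows:
        return []
    header, entries = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in entries]


def snapshot_url(entry: dict) -> str:
    """Replay URL for a CDX entry; the id_ flag requests the raw archived page."""
    return f"http://web.archive.org/web/{entry['timestamp']}id_/{entry['original']}"
```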
- APPCorp: A Corpus for Android Privacy Policy Document Structure Analysis [16.618995752616296]
In this work we create a manually labelled corpus containing 167 privacy policies.
We report the annotation process and details of the annotated corpus.
We benchmark our data corpus with 4 document classification models, thoroughly analyze the results, and discuss challenges and opportunities for the research community to use the corpus.
arXiv Detail & Related papers (2020-05-14T13:25:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.