A Comprehensive Study on GDPR-Oriented Analysis of Privacy Policies: Taxonomy, Corpus and GDPR Concept Classifiers
- URL: http://arxiv.org/abs/2410.04754v1
- Date: Mon, 7 Oct 2024 05:19:12 GMT
- Title: A Comprehensive Study on GDPR-Oriented Analysis of Privacy Policies: Taxonomy, Corpus and GDPR Concept Classifiers
- Authors: Peng Tang, Xin Li, Yuxin Chen, Weidong Qiu, Haochen Mei, Allison Holmes, Fenghua Li, Shujun Li,
- Abstract summary: We develop a more complete taxonomy, created the first corpus of labeled privacy policies with hierarchical information, and conducted the most comprehensive performance evaluation of concept classifiers for privacy policies.
Our work leads to multiple novel findings, including the confirmed inappropriateness of splitting training and test sets at the segment level, the benefits of considering hierarchical information, and the limitations of the "one size fits all" approach, and the significance of testing cross-corpus generalizability.
- Score: 18.770985160731122
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Machine learning based classifiers that take a privacy policy as the input and predict relevant concepts are useful in different applications such as (semi-)automated compliance analysis against requirements of the EU GDPR. In all past studies, such classifiers produce a concept label per segment (e.g., sentence or paragraph) and their performances were evaluated by using a dataset of labeled segments without considering the privacy policy they belong to. However, such an approach could overestimate the performance in real-world settings, where all segments in a new privacy policy are supposed to be unseen. Additionally, we also observed other research gaps, including the lack of a more complete GDPR taxonomy and the less consideration of hierarchical information in privacy policies. To fill such research gaps, we developed a more complete GDPR taxonomy, created the first corpus of labeled privacy policies with hierarchical information, and conducted the most comprehensive performance evaluation of GDPR concept classifiers for privacy policies. Our work leads to multiple novel findings, including the confirmed inappropriateness of splitting training and test sets at the segment level, the benefits of considering hierarchical information, and the limitations of the "one size fits all" approach, and the significance of testing cross-corpus generalizability.
Related papers
- Optimal Federated Learning for Nonparametric Regression with Heterogeneous Distributed Differential Privacy Constraints [5.3595271893779906]
We study federated learning for nonparametric regression in the context of distributed samples across different servers.
Findings shed light on the tradeoff between statistical accuracy and privacy preservation.
arXiv Detail & Related papers (2024-06-10T19:34:07Z) - Extractive text summarisation of Privacy Policy documents using machine learning approaches [0.0]
This work demonstrates two Privacy Policy (PP) summarisation models based on two different clustering algorithms.
Kmeans is be used for the first model after an extensive evaluation of ten commonly used clustering algorithms.
The summariser model based on the PDC-clustering algorithm summarises PP documents by segregating individual sentences by distance each sentence to the pre-defined cluster centres.
arXiv Detail & Related papers (2024-04-09T04:54:08Z) - When is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective [64.73162159837956]
evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging.
We propose DataCOPE, a data-centric framework for evaluating a target policy given a dataset.
Our empirical analysis of DataCOPE in the logged contextual bandit settings using healthcare datasets confirms its ability to evaluate both machine-learning and human expert policies.
arXiv Detail & Related papers (2023-11-23T17:13:37Z) - Expert opinions on making GDPR usable [0.0]
We use as respondents experts working across fields of relevance to four concepts, including law and data protection/privacy, certifications and standardization, and usability.
We employ theory triangulation to analyze the data representing three groups of experts, categorized as 'criterias', 'law', and 'usability', coming both from industry and academia.
arXiv Detail & Related papers (2023-08-16T11:20:16Z) - Relational Proxies: Emergent Relationships as Fine-Grained
Discriminators [52.17542855760418]
We propose a novel approach that leverages information between the global and local part of an object for encoding its label.
We design Proxies based on our theoretical findings and evaluate it on seven challenging fine-grained benchmark datasets.
We also experimentally validate our theory and obtain consistent results across multiple benchmarks.
arXiv Detail & Related papers (2022-10-05T11:08:04Z) - Is Vertical Logistic Regression Privacy-Preserving? A Comprehensive
Privacy Analysis and Beyond [57.10914865054868]
We consider vertical logistic regression (VLR) trained with mini-batch descent gradient.
We provide a comprehensive and rigorous privacy analysis of VLR in a class of open-source Federated Learning frameworks.
arXiv Detail & Related papers (2022-07-19T05:47:30Z) - Retrieval Enhanced Data Augmentation for Question Answering on Privacy
Policies [74.01792675564218]
We develop a data augmentation framework based on ensembling retriever models that captures relevant text segments from unlabeled policy documents.
To improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models.
Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%.
arXiv Detail & Related papers (2022-04-19T15:45:23Z) - Reinforcement Learning with Heterogeneous Data: Estimation and Inference [84.72174994749305]
We introduce the K-Heterogeneous Markov Decision Process (K-Hetero MDP) to address sequential decision problems with population heterogeneity.
We propose the Auto-Clustered Policy Evaluation (ACPE) for estimating the value of a given policy, and the Auto-Clustered Policy Iteration (ACPI) for estimating the optimal policy in a given policy class.
We present simulations to support our theoretical findings, and we conduct an empirical study on the standard MIMIC-III dataset.
arXiv Detail & Related papers (2022-01-31T20:58:47Z) - Partial sensitivity analysis in differential privacy [58.730520380312676]
We investigate the impact of each input feature on the individual's privacy loss.
We experimentally evaluate our approach on queries over private databases.
We also explore our findings in the context of neural network training on synthetic data.
arXiv Detail & Related papers (2021-09-22T08:29:16Z) - Antipodes of Label Differential Privacy: PATE and ALIBI [2.2761657094500682]
We consider the privacy-preserving machine learning (ML) setting where the trained model must satisfy differential privacy (DP)
We propose two novel approaches based on, respectively, the Laplace mechanism and the PATE framework.
We show how to achieve very strong privacy levels in some regimes, with our adaptation of the PATE framework.
arXiv Detail & Related papers (2021-06-07T08:14:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.