How to Drill Into Silos: Creating a Free-to-Use Dataset of Data Subject Access Packages
- URL: http://arxiv.org/abs/2407.04470v1
- Date: Fri, 5 Jul 2024 12:39:51 GMT
- Title: How to Drill Into Silos: Creating a Free-to-Use Dataset of Data Subject Access Packages
- Authors: Nicola Leschke, Daniela Pöhn, Frank Pallas,
- Abstract summary: European Union's General Data Protection Regulation strengthened data subjects' right to access personal data.
Subjects' possibilities for actually using controller-provided subject access request packages (SARPs) are severely limited so far.
This dataset is publicly provided and shall, in the future, serve as a starting point for researching and comparing novel approaches for practically viable use of SARPs.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The European Union's General Data Protection Regulation (GDPR) strengthened several rights for individuals (data subjects). One of these is the data subjects' right to access their personal data being collected by services (data controllers), complemented with a new right to data portability. Based on these, data controllers are obliged to provide respective data and allow data subjects to use them at their own discretion. However, the subjects' possibilities for actually using and harnessing said data are severely limited so far. Among other reasons, this can be attributed to a lack of research dedicated to the actual use of controller-provided subject access request packages (SARPs). To open up and facilitate such research, we outline a general, high-level method for generating, pre-processing, publishing, and finally using SARPs of different providers. Furthermore, we establish a realistic dataset comprising two users' SARPs from five services. This dataset is publicly provided and shall, in the future, serve as a starting and reference point for researching and comparing novel approaches for the practically viable use of SARPs.
Related papers
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z) - CaPS: Collaborative and Private Synthetic Data Generation from Distributed Sources [5.898893619901382]
We propose a framework for the collaborative and private generation of synthetic data from distributed data holders.
We replace the trusted aggregator with secure multi-party computation protocols and output privacy via differential privacy (DP)
We demonstrate the applicability and scalability of our approach for the state-of-the-art select-measure-generate algorithms MWEM+PGM and AIM.
arXiv Detail & Related papers (2024-02-13T17:26:32Z) - Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z) - Synthetic is all you need: removing the auxiliary data assumption for
membership inference attacks against synthetic data [9.061271587514215]
We show how this assumption can be removed, allowing for MIAs to be performed using only the synthetic data.
Our results show that MIAs are still successful, across two real-world datasets and two synthetic data generators.
arXiv Detail & Related papers (2023-07-04T13:16:03Z) - LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting [65.71129509623587]
Road traffic forecasting plays a critical role in smart city initiatives and has experienced significant advancements thanks to the power of deep learning.
However, the promising results achieved on current public datasets may not be applicable to practical scenarios.
We introduce the LargeST benchmark dataset, which includes a total of 8,600 sensors in California with a 5-year time coverage.
arXiv Detail & Related papers (2023-06-14T05:48:36Z) - Towards Generalizable Data Protection With Transferable Unlearnable
Examples [50.628011208660645]
We present a novel, generalizable data protection method by generating transferable unlearnable examples.
To the best of our knowledge, this is the first solution that examines data privacy from the perspective of data distribution.
arXiv Detail & Related papers (2023-05-18T04:17:01Z) - Membership Inference Attacks against Synthetic Data through Overfitting
Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z) - Don't Look at the Data! How Differential Privacy Reconfigures the
Practices of Data Science [0.0]
Differential privacy (DP) is one promising way to offer privacy along with open access.
We conduct interviews with 19 data practitioners who are non-experts in DP.
We find that while DP is promising for providing wider access to sensitive datasets, it also introduces challenges into every stage of the data science workflow.
arXiv Detail & Related papers (2023-02-23T04:28:14Z) - Secure Multiparty Computation for Synthetic Data Generation from
Distributed Data [7.370727048591523]
Legal and ethical restrictions on accessing relevant data inhibit data science research in critical domains such as health, finance, and education.
Existing approaches assume that the data holders supply their raw data to a trusted curator, who uses it as fuel for synthetic data generation.
We propose the first solution in which data holders only share encrypted data for differentially private synthetic data generation.
arXiv Detail & Related papers (2022-10-13T20:09:17Z) - Protecting Privacy and Transforming COVID-19 Case Surveillance Datasets
for Public Use [0.4462475518267084]
CDC has collected person-level, de-identified data from jurisdictions and currently has over 8 million records.
Data elements were included based on the usefulness, public request, and privacy implications.
Specific field values were suppressed to reduce risk of reidentification and exposure of confidential information.
arXiv Detail & Related papers (2021-01-13T14:24:20Z) - GDPR: When the Right to Access Personal Data Becomes a Threat [63.732639864601914]
We examine more than 300 data controllers performing for each of them a request to access personal data.
We find that 50.4% of the data controllers that handled the request, have flaws in the procedure of identifying the users.
With the undesired and surprising result that, in its present deployment, has actually decreased the privacy of the users of web services.
arXiv Detail & Related papers (2020-05-04T22:01:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.