Delete My Account: Impact of Data Deletion on Machine Learning
Classifiers
- URL: http://arxiv.org/abs/2311.10385v1
- Date: Fri, 17 Nov 2023 08:23:17 GMT
- Title: Delete My Account: Impact of Data Deletion on Machine Learning
Classifiers
- Authors: Tobias Dam and Maximilian Henzl and Lukas Daniel Klausner
- Abstract summary: The right to erasure has potential implications for a number of different fields, such as big data and machine learning.
Our paper presents an in-depth analysis of the impact of exercising the right to erasure on the performance of machine learning models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Users are more aware than ever of the importance of their own data, thanks to
reports about security breaches and leaks of private, often sensitive data in
recent years. Additionally, the GDPR has been in effect in the European Union
for over three years and many people have encountered its effects in one way or
another. Consequently, more and more users are actively protecting their
personal data. One way to do this is to make use of the right to erasure guaranteed
in the GDPR, which has potential implications for a number of different fields,
such as big data and machine learning.
Our paper presents an in-depth analysis of the impact of exercising the
right to erasure on the performance of machine learning models on
classification tasks. We conduct various experiments utilising different
datasets as well as different machine learning algorithms to analyse a variety
of deletion behaviour scenarios. Due to the lack of credible data on actual
user behaviour, we make reasonable assumptions for various deletion modes and
biases and provide insight into the effects of different plausible scenarios
for right-to-erasure usage on the data quality of machine learning. Our results
show that the impact depends strongly on the amount of data deleted, the
particular characteristics of the dataset, the bias chosen for deletion, and
the assumptions made about user behaviour.
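As a rough illustration of the kind of experiment described (the paper's own code, datasets and deletion modes are not reproduced here), the sketch below deletes a growing fraction of a training set, either uniformly or with a class-correlated bias, and reports test accuracy. The dataset, classifier and bias weights are arbitrary choices for the example.

```python
# Minimal sketch (not the paper's code): simulate right-to-erasure deletion
# and measure classifier accuracy. The class-biased deletion mode and the
# deletion fractions are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def accuracy_after_deletion(frac, biased=False):
    """Delete `frac` of the training set, uniformly or biased toward class 1."""
    n = len(y_tr)
    if biased:
        # Biased mode: members of class 1 are twice as likely to delete.
        w = np.where(y_tr == 1, 2.0, 1.0)
        p = w / w.sum()
    else:
        p = np.full(n, 1.0 / n)
    deleted = rng.choice(n, size=int(frac * n), replace=False, p=p)
    keep = np.setdiff1d(np.arange(n), deleted)
    model = LogisticRegression(max_iter=5000).fit(X_tr[keep], y_tr[keep])
    return model.score(X_te, y_te)

for frac in (0.0, 0.2, 0.5, 0.8):
    print(f"deleted {frac:.0%}: uniform={accuracy_after_deletion(frac):.3f}, "
          f"biased={accuracy_after_deletion(frac, biased=True):.3f}")
```

A biased mode like this one typically degrades accuracy faster than uniform deletion at the same volume, which matches the abstract's point that the chosen bias matters alongside the amount deleted.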
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
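The summary names the idea but not the algorithm, so the sketch below shows only a generic stand-in for embedding-based representative selection: given precomputed CLIP-style features (faked with random vectors here), cluster them and keep the sample nearest each centroid. The paper's actual framework is more elaborate.

```python
# Illustrative stand-in, not the paper's framework: representative-sample
# selection over precomputed CLIP-style embeddings via k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))          # placeholder CLIP features
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

k = 50                                             # selection budget (assumed)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

# Per cluster, keep the sample closest to the centroid as its representative.
selected = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    d = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
    selected.append(members[np.argmin(d)])
print(sorted(selected)[:10])
```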
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
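A hedged sketch of the core mechanism named in the summary, low-rank gradient similarity search, under strong simplifications: per-example gradients are stand-in random vectors, the low-rank step is a plain random projection, and selection is a cosine-similarity top-5% cut, echoing the 5% figure above.

```python
# Simplified sketch of low-rank gradient similarity search, not the LESS
# implementation: project per-example gradients to a low dimension and rank
# them by cosine similarity to a target-task gradient.
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 2000, 300, 16                # pool size, grad dim, projection rank
pool_grads = rng.normal(size=(n, d))   # stand-ins for per-example gradients
target_grad = rng.normal(size=d)       # mean gradient on the target task

proj = rng.normal(size=(d, r)) / np.sqrt(r)   # random projection
low_pool = pool_grads @ proj
low_target = target_grad @ proj

def cosine(a, b):
    return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-12)

scores = cosine(low_pool, low_target)
top5pct = np.argsort(scores)[::-1][: n // 20]   # select the top 5%
print(top5pct[:10], scores[top5pct[:3]])
```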
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving Training Data Release for Machine Learning [3.29354893777827]
We introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning.
We present experimental evidence showing minimal discrepancy between performance metrics of models trained on real versus privatized datasets.
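The evaluation described (models trained on real versus privatized data) can be mimicked with a deliberately crude privatization step; the Gaussian feature noise below is an assumption standing in for 3A's actual approximate/adapt/anonymize pipeline, and the noise scale is made up.

```python
# Crude proxy sketch, not the 3A pipeline: train identical models on real
# and on "privatized" (Gaussian-noised) features and compare test accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

X_priv = X_tr + rng.normal(scale=0.5, size=X_tr.shape)  # assumed noise scale

real = LogisticRegression(max_iter=2000).fit(X_tr, y_tr).score(X_te, y_te)
priv = LogisticRegression(max_iter=2000).fit(X_priv, y_tr).score(X_te, y_te)
print(f"real: {real:.3f}  privatized: {priv:.3f}")
```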
arXiv Detail & Related papers (2023-07-04T18:37:11Z)
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv Detail & Related papers (2023-05-22T15:57:53Z)
- Utility Assessment of Synthetic Data Generation Methods [0.0]
We investigate whether different methods of generating fully synthetic data vary in their utility a priori.
We find some methods to perform better than others across the board.
We do get promising findings for classification tasks when using synthetic data for training machine learning models.
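A common way to operationalize this utility question is train-on-synthetic, test-on-real (TSTR). The sketch below uses a per-class Gaussian sampler as a placeholder generator, far simpler than the methods the paper compares.

```python
# Minimal train-synthetic/test-real (TSTR) utility check; the per-class
# Gaussian sampler is a deliberately simple stand-in for a real synthesizer.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# "Synthesize" data: sample from a Gaussian fitted to each class.
Xs, ys = [], []
for c in np.unique(y_tr):
    Xc = X_tr[y_tr == c]
    Xs.append(rng.multivariate_normal(Xc.mean(0), np.cov(Xc.T), size=len(Xc)))
    ys.append(np.full(len(Xc), c))
X_syn, y_syn = np.vstack(Xs), np.concatenate(ys)

trtr = LogisticRegression(max_iter=2000).fit(X_tr, y_tr).score(X_te, y_te)
tstr = LogisticRegression(max_iter=2000).fit(X_syn, y_syn).score(X_te, y_te)
print(f"train-real/test-real: {trtr:.3f}  train-synth/test-real: {tstr:.3f}")
```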
arXiv Detail & Related papers (2022-11-23T11:09:52Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
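The measurement itself is easy to sketch: train on one group's data plus growing amounts of data from a distribution-shifted second source, and track the first group's test accuracy. The two synthetic sources below are contrived so the externality can appear; the paper's analysis uses real datasets.

```python
# Sketch of the measurement, not the paper's method: per-group test accuracy
# as training data from a second, conflicting source is added.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_source(n, w):
    X = rng.normal(size=(n, 5))
    y = (X @ w + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

w_a = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
w_b = np.array([-1.0, 1.0, 1.0, 0.0, 0.0])      # conflicting relation
Xa, ya = make_source(500, w_a)                   # group A training data
Xb, yb = make_source(2000, w_b)                  # source B, to be added
Xt, yt = make_source(1000, w_a)                  # held-out test set, group A

for n_b in (0, 500, 1000, 2000):
    X = np.vstack([Xa, Xb[:n_b]])
    y = np.concatenate([ya, yb[:n_b]])
    acc = LogisticRegression(max_iter=2000).fit(X, y).score(Xt, yt)
    print(f"added {n_b:4d} source-B points -> group-A accuracy {acc:.3f}")
```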
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- MaSS: Multi-attribute Selective Suppression [8.337285030303285]
We propose Multi-attribute Selective Suppression, or MaSS, a framework for performing precisely targeted data surgery.
MaSS learns a data modifier through adversarial games between two sets of networks, where one is aimed at suppressing selected attributes.
We carried out an extensive evaluation of our proposed method using multiple datasets from different domains.
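MaSS trains adversarial networks; the sketch below substitutes a much simpler linear analogue (iterative nullspace projection: fit a probe for the attribute to suppress, remove its direction, repeat) to convey the suppression objective without the adversarial machinery.

```python
# Much-simplified linear stand-in for MaSS's adversarial suppression game:
# iteratively fit a probe for the sensitive attribute and project its
# direction out of the data, instead of training adversarial networks.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
attr = (X[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(int)  # attribute to hide

Z = X.copy()
for _ in range(5):
    probe = LogisticRegression(max_iter=2000).fit(Z, attr)
    if probe.score(Z, attr) < 0.55:              # near-chance probe: stop
        break
    w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    Z = Z - np.outer(Z @ w, w)                   # remove the probe direction

print("probe accuracy after suppression:",
      LogisticRegression(max_iter=2000).fit(Z, attr).score(Z, attr))
```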
arXiv Detail & Related papers (2022-10-18T14:44:08Z)
- Multi-Source Causal Inference Using Control Variates [81.57072928775509]
We propose a general algorithm to estimate causal effects from multiple data sources.
We show theoretically that using control variates reduces the variance of the average treatment effect (ATE) estimate.
We apply this framework to inference from observational data under an outcome selection bias.
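The variance-reduction claim is easy to see in a toy Monte Carlo: an ATE estimate from a small randomized sample is recentred with a correlated statistic of known mean (here, covariate imbalance between arms). The coefficient 2.0 below is the known simulation slope; in practice it would be estimated.

```python
# Generic control-variate illustration (all numbers invented): reduce the
# variance of an ATE estimate using an auxiliary statistic with known mean.
import numpy as np

rng = np.random.default_rng(0)
true_ate, reps = 2.0, 5000
plain, cv = [], []
for _ in range(reps):
    n = 100
    t = rng.integers(0, 2, n)                    # randomized treatment
    x = rng.normal(size=n)                       # pre-treatment covariate
    y = true_ate * t + 2.0 * x + rng.normal(size=n)
    ate_hat = y[t == 1].mean() - y[t == 0].mean()
    # Control variate: covariate imbalance between arms, known to have mean 0.
    w = x[t == 1].mean() - x[t == 0].mean()
    plain.append(ate_hat)
    cv.append(ate_hat - 2.0 * w)                 # coefficient = known slope
print("var plain:", np.var(plain), " var with control variate:", np.var(cv))
```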
arXiv Detail & Related papers (2021-03-30T21:20:51Z)
- Correlated Differential Privacy: Feature Selection in Machine Learning [13.477069421691562]
The proposed scheme involves five steps with the goal of managing the extent of data correlation, preserving privacy, and supporting accuracy in the prediction results.
Experiments show that the proposed scheme can produce better prediction results with machine learning tasks and fewer mean square errors for data queries compared to existing schemes.
arXiv Detail & Related papers (2020-10-07T00:33:24Z)
- Neither Private Nor Fair: Impact of Data Imbalance on Utility and Fairness in Differential Privacy [5.416049433853457]
We study how different levels of imbalance in the data affect the accuracy and the fairness of the decisions made by the model.
We demonstrate that even small imbalances and loose privacy guarantees can cause disparate impacts.
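To make the accuracy/fairness interaction concrete, the sketch below trains a logistic model with a simplified DP-SGD loop (per-example gradient clipping plus Gaussian noise, not calibrated to any formal (epsilon, delta)) on a 20:1 imbalanced toy dataset and reports per-class accuracy; all parameters are arbitrary.

```python
# Toy illustration (not the paper's experiments): simplified DP-SGD on an
# imbalanced dataset, reporting per-class accuracy.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_major, n_minor):
    X0 = rng.normal(loc=-1, size=(n_major, 2))
    X1 = rng.normal(loc=+1, size=(n_minor, 2))
    return np.vstack([X0, X1]), np.concatenate([np.zeros(n_major),
                                                np.ones(n_minor)])

X, y = make_data(2000, 100)                      # 20:1 imbalance
w = np.zeros(2)
clip, sigma, lr, batch = 1.0, 1.0, 0.1, 64       # uncalibrated DP parameters
for _ in range(200):
    idx = rng.choice(len(y), size=batch, replace=False)
    p = 1 / (1 + np.exp(-(X[idx] @ w)))
    g = X[idx] * (p - y[idx])[:, None]           # per-example gradients
    norms = np.maximum(1.0, np.linalg.norm(g, axis=1) / clip)
    g = g / norms[:, None]                       # clip each example's gradient
    noisy = g.sum(0) + rng.normal(scale=sigma * clip, size=2)
    w -= lr * noisy / batch
pred = (X @ w > 0).astype(int)
for c in (0, 1):
    print(f"class {c} accuracy: {(pred[y == c] == c).mean():.3f}")
```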
arXiv Detail & Related papers (2020-09-10T18:35:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.