Differentially Private ERM Based on Data Perturbation
- URL: http://arxiv.org/abs/2002.08578v1
- Date: Thu, 20 Feb 2020 06:05:34 GMT
- Title: Differentially Private ERM Based on Data Perturbation
- Authors: Yilin Kang, Yong Liu, Lizhong Ding, Xinwang Liu, Xinyi Tong and
Weiping Wang
- Abstract summary: We measure the contribution of each training data instance to the final machine learning model.
Since the key to our method is measuring each data instance separately, we propose a new `Data perturbation'-based (DB) paradigm for DP-ERM.
- Score: 41.37436071802578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, after observing that different training data instances affect
the machine learning model to different extents, we attempt to improve the
performance of differentially private empirical risk minimization (DP-ERM) from
a new perspective. Specifically, we measure the contribution of each
training data instance to the final machine learning model, and select some of
them to receive random noise. Since the key to our method is measuring
each data instance separately, we propose a new `Data perturbation'-based (DB)
paradigm for DP-ERM: adding random noise to the original training data and
achieving ($\epsilon,\delta$)-differential privacy on the final machine
learning model, along with privacy preservation for the original data. By
introducing the Influence Function (IF), we quantitatively measure the impact
of the training data on the final model. Theoretical and experimental results
show that our proposed DBDP-ERM paradigm significantly improves model
performance.
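A minimal sketch of the DB idea described above: score each training instance with an influence-function approximation, then add Gaussian noise to the most influential instances before ordinary training. The `influence_norms` and `db_perturb` helpers, the selection fraction `frac`, and the noise scale `sigma` are illustrative assumptions; the paper calibrates its noise so the final model satisfies ($\epsilon,\delta$)-DP, which this toy fixed `sigma` does not.

```python
# Illustrative sketch only: influence-guided data perturbation for
# L2-regularized logistic regression. Not the paper's calibrated mechanism.
import numpy as np

def influence_norms(X, y, theta, reg=1e-3):
    """Per-instance influence magnitude ||H^{-1} grad L(z_i, theta)||,
    with H the (regularized) empirical Hessian of the logistic loss."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))           # predicted probabilities
    grads = (p - y)[:, None] * X                   # per-instance gradients
    W = p * (1 - p)                                # Hessian weights
    H = (X * W[:, None]).T @ X / len(y) + reg * np.eye(X.shape[1])
    return np.linalg.norm(np.linalg.solve(H, grads.T), axis=0)

def db_perturb(X, y, theta, frac=0.5, sigma=0.1, rng=None):
    """Add Gaussian noise to the `frac` most influential instances."""
    rng = np.random.default_rng(rng)
    scores = influence_norms(X, y, theta)
    k = int(frac * len(y))
    idx = np.argsort(scores)[-k:]                  # most influential instances
    X_noisy = X.copy()
    X_noisy[idx] += rng.normal(0.0, sigma, size=(k, X.shape[1]))
    return X_noisy
```

The perturbed `X_noisy` would then be fed to an ordinary (non-private) ERM solver, which is what distinguishes the DB paradigm from output- or gradient-perturbation approaches.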
Related papers
- Influence Functions for Scalable Data Attribution in Diffusion Models [52.92223039302037]
Diffusion models have led to significant advancements in generative modelling.
Yet their widespread adoption poses challenges regarding data attribution and interpretability.
In this paper, we aim to help address such challenges by developing an influence functions framework.
arXiv Detail & Related papers (2024-10-17T17:59:02Z)
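For reference, the classical influence function that both the main paper and the entry above build on (in its standard ERM form, following Koh & Liang) is:

```latex
% Classical influence function for ERM: the effect of upweighting a
% training point z on the empirical risk minimizer \hat{\theta}.
\[
  \mathcal{I}(z)
  = \left.\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0}
  = -H_{\hat{\theta}}^{-1}\,\nabla_{\theta} L(z,\hat{\theta}),
  \qquad
  H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n}\nabla_{\theta}^{2} L(z_i,\hat{\theta}).
\]
```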
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
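For contrast with the in-run approach above, a minimal sketch of classic retraining-based Data Shapley, the computationally intensive baseline the entry says it avoids. `train_and_score` is a hypothetical callback that fits a model on a subset of training indices and returns a validation score.

```python
# Monte-Carlo estimate of classic (retraining-based) Data Shapley values.
import numpy as np

def data_shapley(n, train_and_score, num_perms=100, rng=None):
    """Estimate each of n points' Shapley value over random permutations."""
    rng = np.random.default_rng(rng)
    values = np.zeros(n)
    for _ in range(num_perms):
        perm = rng.permutation(n)
        prev = train_and_score([])                 # score of the empty subset
        for j, i in enumerate(perm):
            cur = train_and_score(perm[: j + 1])   # add point i to the subset
            values[i] += cur - prev                # marginal contribution
            prev = cur
    return values / num_perms
```

Each permutation requires n retrainings, which is exactly the cost In-Run Data Shapley is designed to avoid.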
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
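A toy reproduction of the self-consuming setting the entry analyzes, using kernel density estimation: each generation is fit on samples drawn from the previous one, and the kernel's smoothing compounds (the sample standard deviation inflates by roughly the bandwidth each round). The sample sizes and scipy's default bandwidth are arbitrary choices.

```python
# Self-consuming KDE loop: fit, sample, refit, and watch the estimate drift.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=1000)             # real data, N(0, 1)
for gen in range(5):
    kde = gaussian_kde(data)                       # fit the current generation
    data = kde.resample(1000, seed=gen)[0]         # train the next on samples
    print(f"gen {gen}: std = {data.std():.3f}")    # variance inflates over time
```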
- Private Synthetic Data Meets Ensemble Learning [15.425653946755025]
When machine learning models are trained on synthetic data and then deployed on real data, there is often a performance drop.
We introduce a new ensemble strategy for training downstream models, with the goal of enhancing their performance when used on real data.
arXiv Detail & Related papers (2023-10-15T04:24:42Z)
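The entry does not spell out its ensemble strategy, so the following is only a generic illustration of ensembling over synthetic data: train one classifier per synthetic draw from a DP generator and aggregate by majority vote on real inputs. `synthetic_sets` is a hypothetical list of (features, binary labels) pairs; the paper's actual strategy may differ.

```python
# Generic bagging over DP-synthetic datasets; illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ensemble_predict(synthetic_sets, X_real):
    """synthetic_sets: list of (X_syn, y_syn) pairs from a DP generator."""
    models = [LogisticRegression(max_iter=1000).fit(Xs, ys)
              for Xs, ys in synthetic_sets]
    votes = np.stack([m.predict(X_real) for m in models])
    # Majority vote across the ensemble, per real example (0/1 labels assumed).
    return (votes.mean(axis=0) >= 0.5).astype(int)
```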
- Arbitrary Decisions are a Hidden Cost of Differentially Private Training [7.560688419767116]
Mechanisms used in machine learning often aim to guarantee differential privacy (DP) during model training.
Practical DP-ensuring training methods use randomization when fitting model parameters to privacy-sensitive data.
For a given input example, the output predicted by equally-private models depends on the randomness used in training.
arXiv Detail & Related papers (2023-02-28T12:13:43Z)
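A toy illustration of the phenomenon above: two equally private runs of the same randomized training procedure can disagree on individual predictions. Output perturbation with a fixed `sigma` stands in here for a real DP training method such as DP-SGD; the noise scale is not calibrated to any actual privacy budget.

```python
# Same DP mechanism, different seeds: individual predictions can flip.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
base = LogisticRegression(max_iter=1000).fit(X, y)

def private_predict(seed, sigma=0.5):
    rng = np.random.default_rng(seed)
    noisy = base.coef_ + rng.normal(0.0, sigma, base.coef_.shape)  # perturb weights
    return ((X @ noisy.T).ravel() + base.intercept_ > 0).astype(int)

disagree = np.mean(private_predict(1) != private_predict(2))
print(f"fraction of predictions that flip between runs: {disagree:.2%}")
```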
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Reconstructing Training Data from Diverse ML Models by Ensemble Inversion [8.414622657659168]
Model Inversion (MI), in which an adversary abuses access to a trained Machine Learning (ML) model, has attracted increasing research attention.
We propose an ensemble inversion technique that estimates the distribution of original training data by training a generator constrained by an ensemble of trained models.
We achieve high-quality results without any dataset and show how using an auxiliary dataset similar to the presumed training data improves the results.
arXiv Detail & Related papers (2021-11-05T18:59:01Z)
- An Analysis of the Deployment of Models Trained on Private Tabular Synthetic Data: Unexpected Surprises [4.129847064263057]
Differentially private (DP) synthetic datasets are a powerful approach for training machine learning models.
We study the effects of differentially private synthetic data generation on classification.
arXiv Detail & Related papers (2021-06-15T21:00:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.