Learning from Mixtures of Private and Public Populations
- URL: http://arxiv.org/abs/2008.00331v1
- Date: Sat, 1 Aug 2020 20:11:50 GMT
- Title: Learning from Mixtures of Private and Public Populations
- Authors: Raef Bassily, Shay Moran and Anupama Nandi
- Abstract summary: We study a new model of supervised learning under privacy constraints.
The goal is to design a learning algorithm that satisfies differential privacy only with respect to private examples.
- Score: 25.365515662502784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We initiate the study of a new model of supervised learning under privacy
constraints. Imagine a medical study where a dataset is sampled from a
population of both healthy and unhealthy individuals. Suppose healthy
individuals have no privacy concerns (in such a case, we call their data
"public") while unhealthy individuals desire stringent privacy protection
for their data. In this example, the population (data distribution) is a
mixture of private (unhealthy) and public (healthy) sub-populations that could
be very different.
Inspired by the above example, we consider a model in which the population
$\mathcal{D}$ is a mixture of two sub-populations: a private sub-population
$\mathcal{D}_{\sf priv}$ of private and sensitive data, and a public
sub-population $\mathcal{D}_{\sf pub}$ of data with no privacy concerns. Each
example drawn from $\mathcal{D}$ is assumed to contain a privacy-status bit
that indicates whether the example is private or public. The goal is to design
a learning algorithm that satisfies differential privacy only with respect to
the private examples.
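One standard way to make this requirement precise (a hedged sketch following the usual $(\epsilon, \delta)$-differential-privacy template, not necessarily the paper's exact formulation) is to restrict the neighboring-dataset relation to private examples: a randomized algorithm $A$ is differentially private with respect to the private examples if, for every pair of datasets $S, S'$ that differ in a single example whose privacy-status bit marks it as private, and every event $E$,
$$\Pr[A(S) \in E] \le e^{\epsilon} \Pr[A(S') \in E] + \delta,$$
while no such constraint is imposed on changes to public examples.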
Prior works in this context assumed a homogeneous population where private
and public data arise from the same distribution, and in particular designed
solutions which exploit this assumption. We demonstrate how to circumvent this
assumption by considering, as a case study, the problem of learning linear
classifiers in $\mathbb{R}^d$. We show that in the case where the privacy
status is correlated with the target label (as in the above example), linear
classifiers in $\mathbb{R}^d$ can be learned, in the agnostic as well as the
realizable setting, with sample complexity comparable to that of
classical (non-private) PAC learning. It is known that this task is impossible
if all the data is considered private.
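To make the model concrete, here is a minimal toy sketch (illustrative code, not from the paper; the function name `sample_mixture` and the specific feature distributions are assumptions) of drawing examples carrying a privacy-status bit that is perfectly correlated with the label, as in the medical example above:

```python
import numpy as np

def sample_mixture(n, d, mix=0.5, rng=None):
    """Draw n examples (x, y, is_private) from a toy mixture D.

    D_priv: 'unhealthy' sub-population -> label 1, privacy bit 1.
    D_pub:  'healthy'   sub-population -> label 0, privacy bit 0.
    The two sub-populations may have very different feature distributions.
    """
    rng = rng or np.random.default_rng(0)
    is_private = rng.random(n) < mix            # privacy-status bit per example
    x = np.where(is_private[:, None],
                 rng.normal(loc=+1.0, size=(n, d)),   # features under D_priv
                 rng.normal(loc=-1.0, size=(n, d)))   # features under D_pub
    y = is_private.astype(int)                  # privacy status = label here
    return x, y, is_private

x, y, priv = sample_mixture(1000, d=5)
# A learner may use the public rows in the clear, but its output must be
# differentially private with respect to the private rows.
print(f"{priv.sum()} private / {(~priv).sum()} public examples")
```

Note that in this fully correlated toy, the public examples all share one label, so training on the public sub-sample alone cannot separate the two classes; this is part of what makes the setting nontrivial.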
Related papers
- Differentially Private Computation of Basic Reproduction Numbers in Networked Epidemic Models [3.1966459264817875]
We develop a framework to compute and release the reproduction number of a networked epidemic model in a differentially private way.
We show that, under real-world conditions, we can compute $R_0$ in a differentially private way while incurring errors as low as $7.6\%$ on average.
arXiv Detail & Related papers (2023-09-29T14:38:02Z) - Private Distribution Learning with Public Data: The View from Sample Compression [15.626115475759713]
We study the problem of private distribution learning with access to public data.
We show that the public-private learnability of a class $\mathcal{Q}$ is connected to the existence of a sample compression scheme.
arXiv Detail & Related papers (2023-08-11T17:15:12Z) - Probing the Transition to Dataset-Level Privacy in ML Models Using an Output-Specific and Data-Resolved Privacy Profile [23.05994842923702]
We study a privacy metric that quantifies the extent to which a model trained on a dataset using a Differential Privacy mechanism is "covered" by each of the distributions resulting from training on neighboring datasets.
We show that the privacy profile can be used to probe an observed transition to indistinguishability that takes place in the neighboring distributions as $\epsilon$ decreases.
arXiv Detail & Related papers (2023-06-27T20:39:07Z) - Learning across Data Owners with Joint Differential Privacy [13.531808240117645]
We study the setting in which data owners train machine learning models collaboratively under a privacy notion called joint differential privacy.
In this setting, the model trained for each data owner $j$ uses $j$'s data without privacy consideration and other owners' data with differential privacy guarantees.
We present an algorithm that is a variant of DP-SGD and provides theoretical bounds on its population loss; a generic DP-SGD step is sketched after this list.
arXiv Detail & Related papers (2023-05-25T05:11:40Z) - Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z) - Analyzing Privacy Leakage in Machine Learning via Multiple Hypothesis Testing: A Lesson From Fano [83.5933307263932]
We study data reconstruction attacks for discrete data and analyze them under the framework of hypothesis testing.
We show that if the underlying private data takes values from a set of size $M$, then the target privacy parameter $\epsilon$ can be $O(\log M)$ before the adversary gains significant inferential power.
arXiv Detail & Related papers (2022-10-24T23:50:12Z) - Smooth Anonymity for Sparse Graphs [69.1048938123063]
Differential privacy has emerged as the gold standard of privacy; however, it can fall short when it comes to sharing sparse datasets.
In this work, we consider a variation of $k$-anonymity, which we call smooth-$k$-anonymity, and design simple large-scale algorithms that efficiently provide smooth-$k$-anonymity.
arXiv Detail & Related papers (2022-07-13T17:09:25Z) - Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent [69.14164921515949]
We characterize privacy guarantees for individual examples when releasing models trained by DP-SGD.
We find that most examples enjoy stronger privacy guarantees than the worst-case bound.
This implies that groups which are underserved in terms of model utility simultaneously experience weaker privacy guarantees.
arXiv Detail & Related papers (2022-06-06T13:49:37Z) - Mixed Differential Privacy in Computer Vision [133.68363478737058]
AdaMix is an adaptive differentially private algorithm for training deep neural network classifiers using both private and public image data.
A few-shot or even zero-shot learning baseline that ignores private data can outperform fine-tuning on a large private dataset.
arXiv Detail & Related papers (2022-03-22T06:15:43Z) - On the Intrinsic Differential Privacy of Bagging [69.70602220716718]
Our experimental results demonstrate that Bagging achieves significantly higher accuracies than state-of-the-art differentially private machine learning methods with the same privacy budgets.
arXiv Detail & Related papers (2020-08-22T14:17:55Z)
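Two of the entries above (the joint-DP work and the individual privacy accounting work) build on DP-SGD. For context, here is a minimal sketch of one generic DP-SGD step, per-example gradient clipping followed by Gaussian noise; this is the standard template rather than the specific variant of any paper above, and the names are illustrative:

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, lr=0.1, clip=1.0,
                noise_multiplier=1.0, rng=None):
    """One generic DP-SGD update: clip each per-example gradient to L2
    norm `clip`, sum, add Gaussian noise calibrated to the clipping
    bound, then average and take a gradient step."""
    rng = rng or np.random.default_rng(0)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    clipped_sum = (per_example_grads * scale).sum(axis=0)
    noise = rng.normal(scale=noise_multiplier * clip, size=w.shape)
    return w - lr * (clipped_sum + noise) / len(per_example_grads)
```

Per-example clipping bounds each example's contribution to the update, which is what makes the Gaussian noise calibration (and the per-example accounting studied above) possible.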
This list is automatically generated from the titles and abstracts of the papers in this site.