Mean Estimation with User-level Privacy under Data Heterogeneity
- URL: http://arxiv.org/abs/2307.15835v1
- Date: Fri, 28 Jul 2023 23:02:39 GMT
- Title: Mean Estimation with User-level Privacy under Data Heterogeneity
- Authors: Rachel Cummings and Vitaly Feldman and Audra McMillan and Kunal Talwar
- Abstract summary: Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows user data to differ in both distribution and quantity of data.
- Score: 54.07947274508013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key challenge in many modern data analysis tasks is that user data are
heterogeneous. Different users may possess vastly different numbers of data
points. More importantly, it cannot be assumed that all users sample from the
same underlying distribution. This is true, for example, in language data, where
different speech styles result in data heterogeneity. In this work we propose a
simple model of heterogeneous user data that allows user data to differ in both
distribution and quantity of data, and provide a method for estimating the
population-level mean while preserving user-level differential privacy. We
demonstrate asymptotic optimality of our estimator and also prove general lower
bounds on the error achievable in the setting we introduce.
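The abstract does not spell out the estimator, so the paper's method is not reproduced here; as a point of reference, below is a minimal sketch of a standard user-level DP mean estimator (clip each user's local mean, average across users, add Laplace noise calibrated to the resulting sensitivity). The function name, clipping range, and toy data are assumptions for illustration only.
```python
import numpy as np

def user_level_dp_mean(user_samples, clip_low, clip_high, epsilon, rng=None):
    """Generic user-level DP mean estimator (not the paper's algorithm).

    Each user contributes one clipped local mean, so replacing one user's
    entire dataset shifts the average by at most (clip_high - clip_low) / n,
    which is the sensitivity used to calibrate the Laplace noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Per-user local means; users may hold very different numbers of points.
    local_means = np.array([np.mean(x) for x in user_samples])
    clipped = np.clip(local_means, clip_low, clip_high)
    n = len(clipped)
    sensitivity = (clip_high - clip_low) / n
    noise = rng.laplace(scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Toy usage: users with heterogeneous sample counts and shifted distributions.
rng = np.random.default_rng(0)
users = [rng.normal(loc=rng.normal(0.5, 0.1), scale=1.0, size=rng.integers(1, 200))
         for _ in range(1000)]
print(user_level_dp_mean(users, clip_low=-2.0, clip_high=3.0, epsilon=1.0, rng=rng))
```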
Related papers
- Empirical Mean and Frequency Estimation Under Heterogeneous Privacy: A Worst-Case Analysis [5.755004576310333]
Differential Privacy (DP) is the current gold-standard for measuring privacy.
We consider the problems of empirical mean estimation for univariate data and frequency estimation for categorical data, subject to heterogeneous privacy constraints.
We prove some optimality results, under both PAC error and mean-squared error, for our proposed algorithms and demonstrate superior performance over other baseline techniques experimentally.
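The paper's algorithms are not reproduced here; the sketch below shows one natural baseline for univariate mean estimation under heterogeneous privacy demands, where each user adds Laplace noise scaled to their own epsilon and the server combines reports with inverse-variance weights. The [0, 1] data range and all names are illustrative assumptions.
```python
import numpy as np

def heterogeneous_dp_mean(values, epsilons, rng=None):
    """Weighted mean from locally privatized reports with per-user epsilons.

    Each value is assumed to lie in [0, 1], so per-user sensitivity is 1 and
    Laplace(1 / eps_i) noise gives eps_i-DP locally. The server down-weights
    noisier (smaller-epsilon) reports via inverse-variance weighting.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.clip(np.asarray(values, float), 0.0, 1.0)
    epsilons = np.asarray(epsilons, float)
    reports = values + rng.laplace(scale=1.0 / epsilons)
    variances = 2.0 / epsilons**2          # variance of Laplace(1/eps) noise
    weights = (1.0 / variances) / np.sum(1.0 / variances)
    return float(np.dot(weights, reports))

rng = np.random.default_rng(1)
true_values = rng.uniform(0.3, 0.7, size=500)
eps = rng.choice([0.5, 1.0, 4.0], size=500)   # heterogeneous privacy demands
print(heterogeneous_dp_mean(true_values, eps, rng=rng))
```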
arXiv Detail & Related papers (2024-07-15T22:46:02Z)
- Scaling Laws for the Value of Individual Data Points in Machine Learning [55.596413470429475]
We introduce a new perspective by investigating scaling behavior for the value of individual data points.
We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes.
Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
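The abstract does not state the functional form of its scaling law; as a hypothetical illustration, the snippet below fits a power law value(k) ≈ c·k^(−α) to per-point value measurements by least squares in log-log space. The synthetic data and exponent are assumptions, not results from the paper.
```python
import numpy as np

def fit_power_law(dataset_sizes, point_values):
    """Fit value(k) ~ c * k**(-alpha) by least squares in log-log space."""
    log_k = np.log(np.asarray(dataset_sizes, float))
    log_v = np.log(np.asarray(point_values, float))
    slope, intercept = np.polyfit(log_k, log_v, deg=1)
    return np.exp(intercept), -slope          # (c, alpha)

# Synthetic measurements: the marginal value of a point shrinks as datasets grow.
sizes = np.array([100, 300, 1000, 3000, 10000])
values = 0.8 * sizes**-0.5 * (1 + 0.05 * np.random.default_rng(2).normal(size=5))
c, alpha = fit_power_law(sizes, values)
print(f"c={c:.3f}, alpha={alpha:.3f}")
```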
arXiv Detail & Related papers (2024-05-30T20:10:24Z)
- Estimating Unknown Population Sizes Using the Hypergeometric Distribution [1.03590082373586]
We tackle the challenge of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown.
We develop our approach to account for a data generating process where the ground-truth is a mixture of distributions conditional on a continuous latent variable.
Empirical data simulation demonstrates that our method outperforms other likelihood functions used to model count data.
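As a simplified illustration of hypergeometric population-size estimation (not the paper's latent-variable mixture model), the sketch below grid-searches for the total population size that maximizes a hypergeometric likelihood in a capture-recapture-style setting; all numbers are made up.
```python
import numpy as np
from scipy.stats import hypergeom

def mle_population_size(marked, sampled, observed_marked, max_size=100_000):
    """Grid-search MLE of total population size M under a hypergeometric model.

    Assumes `marked` individuals are known to exist in the population, a sample
    of `sampled` individuals is drawn without replacement, and `observed_marked`
    of them turn out to be marked. scipy's hypergeom.logpmf(k, M, n, N) uses
    M = population size, n = marked count, N = sample size.
    """
    lo = marked + sampled - observed_marked      # smallest feasible population
    candidates = np.arange(lo, max_size + 1)
    log_lik = hypergeom.logpmf(observed_marked, candidates, marked, sampled)
    return int(candidates[np.argmax(log_lik)])

# Toy usage: 200 marked, 150 sampled, 30 of the sample were marked.
print(mle_population_size(marked=200, sampled=150, observed_marked=30))
```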
arXiv Detail & Related papers (2024-02-22T01:53:56Z)
- GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using Macro Data Sources [21.32471030724983]
Individual-level data (microdata) that characterizes a population is essential for studying many real-world problems.
In this study, we examine synthetic data generation as a tool to extrapolate difficult-to-obtain high-resolution data.
arXiv Detail & Related papers (2022-12-08T01:22:12Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that the issue of data heterogeneity in current setups is not necessarily a problem and, in fact, can be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
- Differentially Private Multi-Party Data Release for Linear Regression [40.66319371232736]
Differentially Private (DP) data release is a promising technique to disseminate data without compromising the privacy of data subjects.
In this paper we focus on the multi-party setting, where different stakeholders own disjoint sets of attributes belonging to the same group of data subjects.
We propose a novel method and prove that it converges to the optimal (non-private) solution as the dataset size increases.
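The multi-party protocol itself is not reproduced here; the sketch below shows a simpler single-party building block, DP linear regression via perturbed sufficient statistics (noisy X^T X and X^T y), with an uncalibrated noise_scale standing in for a properly tuned Gaussian mechanism. Names and data are illustrative assumptions.
```python
import numpy as np

def dp_linear_regression(X, y, noise_scale, rng=None):
    """Single-party sketch of DP linear regression via perturbed sufficient
    statistics: release noisy X^T X and X^T y, then solve the normal equations.

    `noise_scale` stands in for a properly calibrated Gaussian-mechanism scale
    (which depends on feature/label bounds and the target (epsilon, delta));
    calibration is omitted here for brevity.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    xtx = X.T @ X + rng.normal(scale=noise_scale, size=(d, d))
    xtx = (xtx + xtx.T) / 2                     # keep the released matrix symmetric
    xty = X.T @ y + rng.normal(scale=noise_scale, size=d)
    # Small ridge term keeps the noisy system well-posed.
    return np.linalg.solve(xtx + 1e-3 * np.eye(d), xty)

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=5000)
print(dp_linear_regression(X, y, noise_scale=5.0, rng=rng))
```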
arXiv Detail & Related papers (2022-06-16T08:32:17Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing together recent results on equivariant representation learning, instantiated on structured spaces, with a simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
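As a minimal illustration of the oversampling strategy mentioned above (dedicated libraries such as imbalanced-learn offer this and richer variants like SMOTE), the sketch below randomly duplicates minority-class rows until the two classes are balanced; the toy data are assumptions.
```python
import numpy as np

def random_oversample(X, y, rng=None):
    """Duplicate minority-class rows at random until both classes are the
    same size (binary case only)."""
    rng = np.random.default_rng() if rng is None else rng
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    extra = rng.choice(np.where(y == minority)[0], size=deficit, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = (rng.uniform(size=1000) < 0.05).astype(int)   # roughly 5% minority class
X_bal, y_bal = random_oversample(X, y, rng=rng)
print(np.bincount(y), "->", np.bincount(y_bal))
```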
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Adversarial Deep Feature Extraction Network for User Independent Human Activity Recognition [4.988898367111902]
We present an adversarial subject-independent feature extraction method with the maximum mean discrepancy (MMD) regularization for human activity recognition.
We evaluate the method on well-known public data sets showing that it significantly improves user-independent performance and reduces variance in results.
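The paper's exact regularizer is not reproduced here; the sketch below computes a biased estimate of the squared MMD with an RBF kernel between two users' feature batches, i.e. the quantity such a regularizer would drive toward zero. The kernel bandwidth and data are illustrative assumptions.
```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X and Y with an RBF kernel
    k(a, b) = exp(-gamma * ||a - b||^2). Minimizing this over a feature
    extractor pushes the two users' feature distributions together."""
    def kernel(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

rng = np.random.default_rng(5)
feats_user_a = rng.normal(loc=0.0, size=(128, 16))
feats_user_b = rng.normal(loc=0.5, size=(128, 16))
print(mmd2_rbf(feats_user_a, feats_user_b, gamma=0.1))
```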
arXiv Detail & Related papers (2021-10-23T07:50:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.