Lower Bounds for Public-Private Learning under Distribution Shift
- URL: http://arxiv.org/abs/2507.17895v1
- Date: Wed, 23 Jul 2025 19:46:08 GMT
- Title: Lower Bounds for Public-Private Learning under Distribution Shift
- Authors: Amrith Setlur, Pratiksha Thaker, Jonathan Ullman
- Abstract summary: Most effective differentially private machine learning algorithms rely on an additional source of purportedly public data. We extend the known lower bounds for public-private learning to the setting where the two data sources exhibit significant distribution shift.
- Score: 5.801359003170208
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The most effective differentially private machine learning algorithms in practice rely on an additional source of purportedly public data. This paradigm is most interesting when the two sources combine to be more than the sum of their parts. However, there are settings such as mean estimation where we have strong lower bounds, showing that when the two data sources have the same distribution, there is no complementary value to combining the two data sources. In this work we extend the known lower bounds for public-private learning to the setting where the two data sources exhibit significant distribution shift. Our results apply both to Gaussian mean estimation, where the two distributions have different means, and to Gaussian linear regression, where the two distributions exhibit parameter shift. We find that when the shift is small (relative to the desired accuracy), either public or private data must be sufficiently abundant to estimate the private parameter. Conversely, when the shift is large, public data provides no benefit.
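To make the abstract's two regimes concrete, here is a toy simulation. It is not the paper's construction: the estimator, the precision-weighting rule, and all parameters (`eps`, `delta`, `clip`, `shift_bound`) are illustrative assumptions, using a textbook Gaussian mechanism for the private mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_mean(x, eps=1.0, delta=1e-5, clip=3.0):
    """Illustrative (eps, delta)-DP mean: clip, then add Gaussian noise
    calibrated as sigma = sqrt(2 ln(1.25/delta)) * sensitivity / eps."""
    x = np.clip(x, -clip, clip)
    sens = 2 * clip / len(x)  # L2 sensitivity of the clipped mean
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * sens / eps
    return x.mean() + rng.normal(0.0, sigma), sigma

def public_private_mean(priv, pub, eps, shift_bound):
    """Toy estimator: precision-weight the DP private mean against the
    public mean, charging the public side shift_bound**2 as bias."""
    mu_priv, sigma = dp_mean(priv, eps=eps)
    var_priv = 1.0 / len(priv) + sigma ** 2      # sampling + privacy noise
    var_pub = 1.0 / len(pub) + shift_bound ** 2  # sampling + worst-case shift
    w = var_pub / (var_pub + var_priv)           # weight on the private mean
    return w * mu_priv + (1.0 - w) * pub.mean()

# Private mean is 0; public mean is shifted by gamma.
for gamma in (0.01, 1.0):
    priv = rng.normal(0.0, 1.0, size=50)
    pub = rng.normal(gamma, 1.0, size=5000)
    est = public_private_mean(priv, pub, eps=1.0, shift_bound=gamma)
    print(f"shift={gamma:<5}  |error|={abs(est):.3f}")
```

Consistent with the abstract: when the shift is small the weight falls on the abundant public data, and when the shift is large the estimator must rely almost entirely on the noisy private data.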
Related papers
- Private Model Personalization Revisited [13.4143747448136]
We study model personalization under user-level differential privacy (DP) in the shared representation framework. Our goal is to privately recover the shared embedding and the local low-dimensional representations with small excess risk. We present an information-theoretic construction to privately learn the shared embedding and derive a margin-based accuracy guarantee.
arXiv Detail & Related papers (2025-06-24T00:57:17Z)
- Mutual Information Multinomial Estimation [53.58005108981247]
Estimating mutual information (MI) is a fundamental yet challenging task in data science and machine learning.
Our main discovery is that a preliminary estimate of the data distribution can dramatically help the estimation of mutual information.
Experiments on diverse tasks including non-Gaussian synthetic problems with known ground-truth and real-world applications demonstrate the advantages of our method.
arXiv Detail & Related papers (2024-08-18T06:27:30Z)
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z)
- On the Benefits of Public Representations for Private Transfer Learning under Distribution Shift [40.553022057469285]
We show that public pretraining can improve private training accuracy by up to 67% over private training from scratch.
We provide a theoretical explanation for this phenomenon, showing that if the public and private data share a low-dimensional representation, public representations can improve the sample complexity of private training.
arXiv Detail & Related papers (2023-12-24T21:46:14Z)
- General Gaussian Noise Mechanisms and Their Optimality for Unbiased Mean Estimation [58.03500081540042]
A classical approach to private mean estimation is to compute the true mean and add unbiased, but possibly correlated, Gaussian noise to it.
We show that for every input dataset, an unbiased mean estimator satisfying concentrated differential privacy introduces approximately at least as much error as the best such Gaussian noise mechanism.
arXiv Detail & Related papers (2023-01-31T18:47:42Z)
- Private Estimation with Public Data [10.176795938619417]
We study differentially private (DP) estimation with access to a small amount of public data.
We show that under the constraints of pure or concentrated DP, d+1 public data samples are sufficient to remove any dependence on the range parameters of the private data distribution; a toy sketch of this idea follows below.
arXiv Detail & Related papers (2022-08-16T22:46:44Z)
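One folklore instantiation of this idea, sketched below under assumptions (the function name, the median/std range rule, and the parameters are hypothetical, not the paper's estimator): a few public samples fix a clipping range for the private data, after which a standard Gaussian mechanism needs no a-priori range bound.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_mean_with_public_range(priv, pub, eps=1.0, delta=1e-5, slack=3.0):
    """Toy sketch: estimate location/scale from a few *public* samples,
    clip the *private* data to that data-driven range, then apply the
    Gaussian mechanism. No a-priori bound on the range is needed."""
    center, scale = np.median(pub), np.std(pub) + 1e-12
    lo, hi = center - slack * scale, center + slack * scale
    x = np.clip(priv, lo, hi)
    sens = (hi - lo) / len(priv)  # sensitivity of the clipped mean
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * sens / eps
    return x.mean() + rng.normal(0.0, sigma)

pub = rng.normal(100.0, 5.0, size=10)     # a few public samples
priv = rng.normal(100.0, 5.0, size=2000)  # private data, unknown range a priori
print(dp_mean_with_public_range(priv, pub))
```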
- The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift [127.21287240963859]
We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data.
For a large class of linear regression instances, transfer learning with $O(N^2)$ source data is as effective as supervised learning with $N$ target data; a toy pretrain-then-finetune sketch follows below.
arXiv Detail & Related papers (2022-08-03T05:59:49Z)
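A generic pretrain-then-finetune sketch for linear regression under covariate shift (illustrative only; all sizes and the learning rate are assumptions, and this toy does not reproduce the paper's $O(N^2)$-vs-$N$ analysis):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_src, n_tgt = 10, 200, 100
w_true = rng.normal(size=d)

# Source and target share w_true but have different covariate distributions.
X_src = rng.normal(0.0, 1.0, size=(n_src, d))
X_tgt = rng.normal(0.0, 2.0, size=(n_tgt, d))
y_src = X_src @ w_true + 0.1 * rng.normal(size=n_src)
y_tgt = X_tgt @ w_true + 0.1 * rng.normal(size=n_tgt)

# Pretrain: least squares on the source data.
w0, *_ = np.linalg.lstsq(X_src, y_src, rcond=None)

# Finetune: a few gradient steps on the scarce target data, starting at w0.
w, lr = w0.copy(), 1e-2
for _ in range(100):
    w -= lr * 2.0 / n_tgt * X_tgt.T @ (X_tgt @ w - y_tgt)

print("pretrained-only target MSE:", np.mean((X_tgt @ w0 - y_tgt) ** 2))
print("finetuned target MSE:      ", np.mean((X_tgt @ w - y_tgt) ** 2))
```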
- Collaborative Learning of Distributions under Heterogeneity and Communication Constraints [35.82172666266493]
In machine learning, users often have to collaborate to learn distributions that generate the data.
We propose a novel two-stage method named SHIFT: First, the users collaborate by communicating with the server to learn a central distribution.
Then, the learned central distribution is fine-tuned to estimate the individual distributions of users; a minimal two-stage sketch follows below.
arXiv Detail & Related papers (2022-06-01T18:43:06Z)
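A minimal two-stage sketch in the spirit of the entry above, for the simplest case of heterogeneous Gaussian means (the shrinkage weight and all constants are illustrative assumptions; the actual SHIFT method and its communication protocol are richer):

```python
import numpy as np

rng = np.random.default_rng(3)
n_users, n_local = 20, 10
theta = 5.0 + 0.2 * rng.normal(size=n_users)  # heterogeneous user means
data = [rng.normal(theta[i], 1.0, size=n_local) for i in range(n_users)]

# Stage 1: collaborate -- each user sends a local mean, the server averages.
local_means = np.array([x.mean() for x in data])
central = local_means.mean()

# Stage 2: fine-tune -- shrink each local mean toward the central estimate,
# trading local sampling noise (1/n_local) against heterogeneity (tau2).
tau2 = 0.2 ** 2
w = tau2 / (tau2 + 1.0 / n_local)
personalized = w * local_means + (1 - w) * central

print("avg error, local only :", np.mean((local_means - theta) ** 2))
print("avg error, two-stage  :", np.mean((personalized - theta) ** 2))
```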
- Equivariance Discovery by Learned Parameter-Sharing [153.41877129746223]
We study how to discover interpretable equivariances from data.
Specifically, we formulate this discovery process as an optimization problem over a model's parameter-sharing schemes.
Also, we theoretically analyze the method for Gaussian data and provide a bound on the mean squared gap between the studied discovery scheme and the oracle scheme.
arXiv Detail & Related papers (2022-04-07T17:59:19Z)
- Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations [76.82124752950148]
We develop a convenient gradient-based method for selecting the data augmentation.
We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective.
arXiv Detail & Related papers (2022-02-22T02:51:11Z)
- Adapting deep generative approaches for getting synthetic data with realistic marginal distributions [0.0]
Deep generative models, such as variational autoencoders (VAEs), are a popular approach for creating such synthetic datasets from original data.
We propose a novel method, pre-transformation variational autoencoders (PTVAEs), to specifically address bimodal and skewed data.
The results show that the PTVAE approach can outperform others in both bimodal and skewed data generation; a minimal pre-transformation sketch follows below.
arXiv Detail & Related papers (2021-05-14T15:47:20Z)
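The pre-transformation idea can be sketched in a few lines (a stand-in sketch, not the PTVAE architecture: scikit-learn's PowerTransformer plays the role of the learned pre-transformation, and a single Gaussian stands in for the VAE):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(5)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(5000, 1))  # skewed marginal

# Pre-transform toward Gaussian, fit a simple generative model in the
# transformed space (a stand-in for the VAE), then map samples back.
pt = PowerTransformer(method="box-cox")  # data is strictly positive
z = pt.fit_transform(skewed)
z_samples = rng.normal(z.mean(), z.std(), size=(5000, 1))
synthetic = pt.inverse_transform(z_samples)

skew = lambda a: float(((a - a.mean()) ** 3).mean() / a.std() ** 3)
print("real  skewness ~", skew(skewed))
print("synth skewness ~", skew(synthetic))
```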
- Fair Densities via Boosting the Sufficient Statistics of Exponential Families [72.34223801798422]
We introduce a boosting algorithm to pre-process data for fairness.
Our approach shifts towards better data fitting while still ensuring a minimal fairness guarantee.
Empirical results are presented to demonstrate the quality of the results on real-world data.
arXiv Detail & Related papers (2020-12-01T00:49:17Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
- Distributionally-Robust Machine Learning Using Locally Differentially-Private Data [14.095523601311374]
We consider machine learning, particularly regression, using locally-differentially private datasets.
We show that machine learning with locally-differentially private datasets can be rewritten as a distributionally-robust optimization problem; a minimal sketch of the local-DP data release follows below.
arXiv Detail & Related papers (2020-06-24T05:12:10Z)
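A minimal sketch of only the local-DP data release that the entry assumes (the regression and its distributionally-robust reformulation are the paper's contribution and are not reproduced here; `ldp_release` and its parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def ldp_release(x, eps=1.0, clip=1.0):
    """Each user clips their own scalar to [-clip, clip] and adds Laplace
    noise with scale 2*clip/eps, giving eps-local differential privacy."""
    x = np.clip(x, -clip, clip)
    return x + rng.laplace(0.0, 2.0 * clip / eps, size=x.shape)

# Toy regression y = 0.5*x + noise on LDP-perturbed covariates and labels.
x = rng.uniform(-1.0, 1.0, size=5000)
y = 0.5 * x + 0.05 * rng.normal(size=x.size)
x_priv, y_priv = ldp_release(x), ldp_release(y)
slope = np.sum(x_priv * y_priv) / np.sum(x_priv ** 2)  # naive least squares
# The estimate is strongly attenuated toward 0 by the input noise, which is
# why a noise-aware (e.g. distributionally-robust) fit is needed.
print("naive slope on LDP data:", slope)
```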
This list is automatically generated from the titles and abstracts of the papers on this site.