On the Benefits of Public Representations for Private Transfer Learning under Distribution Shift
- URL: http://arxiv.org/abs/2312.15551v4
- Date: Mon, 2 Sep 2024 03:26:58 GMT
- Title: On the Benefits of Public Representations for Private Transfer Learning under Distribution Shift
- Authors: Pratiksha Thaker, Amrith Setlur, Zhiwei Steven Wu, Virginia Smith
- Abstract summary: We show that public pretraining can improve private training accuracy by up to 67% over private training from scratch.
We provide a theoretical explanation for this phenomenon, showing that if the public and private data share a low-dimensional representation, public representations can improve the sample complexity of private training.
- Score: 40.553022057469285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Public pretraining is a promising approach to improve differentially private model training. However, recent work has noted that many positive research results studying this paradigm only consider in-distribution tasks, and may not apply to settings where there is distribution shift between the pretraining and finetuning data -- a scenario that is likely when finetuning on private tasks due to the sensitive nature of the data. In this work, we show empirically across three tasks that even in settings with large distribution shift, where both zero-shot performance from public data and training from scratch with private data give unusably weak results, public features can in fact improve private training accuracy by up to 67% over private training from scratch. We provide a theoretical explanation for this phenomenon, showing that if the public and private data share a low-dimensional representation, public representations can improve the sample complexity of private training even if it is impossible to learn the private task from the public data alone. Altogether, our results provide evidence that public data can indeed make private training practical in realistic settings of extreme distribution shift.
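To make the abstract's recipe concrete, here is a minimal NumPy sketch (synthetic data and placeholder names, not the authors' code) of the setting the paper studies: freeze a publicly pretrained encoder and train only a small linear head on private data with DP-SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

def public_encoder(x):
    # Stand-in for a frozen, publicly pretrained encoder: it maps
    # d-dimensional inputs into a k-dimensional representation (k << d).
    d, k = x.shape[1], 16
    proj = np.random.default_rng(42).standard_normal((d, k)) / np.sqrt(d)
    return x @ proj

# Toy "private" dataset whose labels are linear in the shared representation.
n, d = 512, 256
X_priv = rng.standard_normal((n, d))
Z = public_encoder(X_priv)            # frozen features; only the head trains
w_true = rng.standard_normal(Z.shape[1])
y_priv = np.sign(Z @ w_true)          # +/-1 labels

# DP-SGD on the linear head: clip each per-example gradient, add noise.
w = np.zeros(Z.shape[1])
clip, noise_mult, lr, steps, batch = 1.0, 1.1, 0.1, 300, 64
for _ in range(steps):
    idx = rng.choice(n, batch, replace=False)
    margins = y_priv[idx] * (Z[idx] @ w)
    # Hinge-loss gradient per example: -y*z wherever the margin is violated.
    grads = np.where(margins[:, None] < 1, -y_priv[idx][:, None] * Z[idx], 0.0)
    scale = np.maximum(np.linalg.norm(grads, axis=1, keepdims=True) / clip, 1.0)
    g = (grads / scale).sum(axis=0)
    g += noise_mult * clip * rng.standard_normal(g.shape)  # Gaussian mechanism
    w -= lr * g / batch

print("DP linear-probe train accuracy:", ((Z @ w) * y_priv > 0).mean())
```

The point of the sketch is the dimension argument from the theory: the privacy noise is added in k dimensions rather than d, so the private sample complexity shrinks with the representation size even when the public data alone cannot solve the private task.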
Related papers
- Training generative models from privatized data [9.584000954415476]
Local differential privacy is a powerful method for privacy-preserving data collection.
We develop a framework for training Generative Adversarial Networks (GANs) on differentially privatized data.
arXiv Detail & Related papers (2023-06-15T23:28:45Z)
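To ground the entry above: under local differential privacy, each record is privatized on the client before it is ever collected, and the generative model only sees the noised records. A minimal sketch, with assumed clipping and noise parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def privatize_record(x, clip=1.0, sigma=2.0):
    # Gaussian mechanism applied locally: bound each record's norm, then
    # add noise calibrated to that bound before the record leaves the client.
    x = x * min(1.0, clip / (np.linalg.norm(x) + 1e-12))
    return x + sigma * clip * rng.standard_normal(x.shape)

records = rng.standard_normal((1000, 32))            # raw client-side data
private_view = np.stack([privatize_record(r) for r in records])
# Downstream GAN training would consume `private_view`, never `records`.
```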
- PILLAR: How to make semi-private learning more effective [12.292092677396347]
In Semi-Supervised Semi-Private (SP) learning, the learner has access to both public unlabelled and private labelled data.
We propose a computationally efficient algorithm that achieves significantly lower private labelled sample complexity and can be efficiently run on real-world datasets.
arXiv Detail & Related papers (2023-06-06T18:45:05Z)
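One hypothetical way to instantiate the semi-private setup above (illustrative, not PILLAR's actual algorithm): use the public unlabelled pool to estimate a low-dimensional subspace at no privacy cost, then pay the privacy budget only in that smaller space.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 100, 5

X_pub = rng.standard_normal((5000, d))   # public, unlabelled pool
X_priv = rng.standard_normal((200, d))   # private, labelled sample
y_priv = rng.integers(0, 2, 200) * 2 - 1

# Top-k principal directions estimated from public data: no privacy cost.
_, _, Vt = np.linalg.svd(X_pub - X_pub.mean(axis=0), full_matrices=False)
P = Vt[:k].T                             # d x k projection matrix

Z = X_priv @ P                           # private data, now k-dimensional
# A DP learner run on (Z, y_priv) adds noise in k dimensions, not d,
# which is where the lower private labelled sample complexity comes from.
```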
- Can Public Large Language Models Help Private Cross-device Federated Learning? [58.05449579773249]
We study (differentially) private federated learning (FL) of language models.
Public data has been used to improve privacy-utility trade-offs for both large and small language models.
We propose a novel distribution matching algorithm with theoretical grounding to sample public data close to private data distribution.
arXiv Detail & Related papers (2023-05-20T07:55:58Z)
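A rough sketch of the distribution-matching idea above (assumed mechanics, not the paper's algorithm): release a single DP-protected summary of the private data, then rank public examples by closeness to that summary.

```python
import numpy as np

rng = np.random.default_rng(0)

E_priv = rng.standard_normal((300, 64))    # embeddings of private examples
E_pub = rng.standard_normal((10000, 64))   # embeddings of a public pool

# One-shot DP summary: clip each private embedding, average, add noise.
clip, sigma = 1.0, 0.5
norms = np.linalg.norm(E_priv, axis=1, keepdims=True)
mu = (E_priv * np.minimum(1.0, clip / norms)).mean(axis=0)
mu += (sigma * clip / len(E_priv)) * rng.standard_normal(mu.shape)

# Rank public examples by cosine similarity to the private summary.
scores = (E_pub @ mu) / (np.linalg.norm(E_pub, axis=1) * np.linalg.norm(mu))
selected = np.argsort(-scores)[:1000]      # public examples to train on
```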
- Why Is Public Pretraining Necessary for Private Model Training? [50.054565310457306]
We show that pretraining on publicly available data yields gains in private model training that are distinct from those observed in nonprivate settings.
We argue that the explanation may lie in a deeper model of the loss landscape that requires training to go through two phases.
Guided by this intuition, we provide theoretical constructions that provably demonstrate the separation between private training with and without public pretraining.
arXiv Detail & Related papers (2023-02-19T05:32:20Z)
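The two-phase intuition above corresponds to the now-standard pipeline of public pretraining followed by private finetuning. A schematic NumPy sketch under toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
X_pub, y_pub = rng.standard_normal((2000, d)), rng.integers(0, 2, 2000).astype(float)
X_priv, y_priv = rng.standard_normal((200, d)), rng.integers(0, 2, 200).astype(float)

def per_example_grads(w, X, y):
    # Logistic-regression gradients, one row per example.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X * (p - y)[:, None]

# Phase 1: non-private SGD on public data finds a good region of the landscape.
w = np.zeros(d)
for _ in range(100):
    w -= 0.5 * per_example_grads(w, X_pub, y_pub).mean(axis=0)

# Phase 2: DP-SGD fine-tuning on private data (clip per example, add noise).
clip, sigma, lr = 1.0, 1.0, 0.1
for _ in range(50):
    g = per_example_grads(w, X_priv, y_priv)
    scale = np.maximum(np.linalg.norm(g, axis=1, keepdims=True) / clip, 1.0)
    noisy = (g / scale).sum(axis=0) + sigma * clip * rng.standard_normal(d)
    w -= lr * noisy / len(y_priv)
```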
- Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining [75.25943383604266]
We question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving.
We caution that publicizing these models pretrained on Web data as "private" could lead to harm and erode the public's trust in differential privacy as a meaningful definition of privacy.
We conclude by discussing potential paths forward for the field of private learning, as public pretraining becomes more popular and powerful.
arXiv Detail & Related papers (2022-12-13T10:41:12Z)
- Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models struggle to produce synthetic samples with high utility.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
- Mixed Differential Privacy in Computer Vision [133.68363478737058]
AdaMix is an adaptive differentially private algorithm for training deep neural network classifiers using both private and public image data.
A few-shot or even zero-shot learning baseline that ignores private data can outperform fine-tuning on a large private dataset.
arXiv Detail & Related papers (2022-03-22T06:15:43Z)
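To illustrate the mixed public/private idea behind the AdaMix entry above (an illustrative mixing rule, not AdaMix's actual update): each step combines a noise-free gradient from public data with a clipped, noised gradient from private data.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
X_pub, y_pub = rng.standard_normal((500, d)), rng.integers(0, 2, 500).astype(float)
X_priv, y_priv = rng.standard_normal((300, d)), rng.integers(0, 2, 300).astype(float)

def per_example_grads(w, X, y):
    # Logistic-regression gradients, one row per example.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X * (p - y)[:, None]

w, clip, sigma, lr, lam = np.zeros(d), 1.0, 1.0, 0.2, 0.5
for _ in range(100):
    g_pub = per_example_grads(w, X_pub, y_pub).mean(axis=0)   # no noise needed
    g = per_example_grads(w, X_priv, y_priv)
    scale = np.maximum(np.linalg.norm(g, axis=1, keepdims=True) / clip, 1.0)
    g_priv = ((g / scale).sum(axis=0)
              + sigma * clip * rng.standard_normal(d)) / len(y_priv)
    w -= lr * (lam * g_pub + (1 - lam) * g_priv)              # mixed update
```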