Harnessing large-language models to generate private synthetic text
- URL: http://arxiv.org/abs/2306.01684v2
- Date: Thu, 11 Jan 2024 00:17:43 GMT
- Title: Harnessing large-language models to generate private synthetic text
- Authors: Alexey Kurakin, Natalia Ponomareva, Umar Syed, Liam MacDermed, Andreas Terzis
- Abstract summary: Differentially private training algorithms like DP-SGD protect sensitive training data by ensuring that trained models do not reveal private information.
This paper studies an alternative approach: generating synthetic data that is differentially private with respect to the original data, and then non-privately training a model on the synthetic data.
However, generating private synthetic data is much harder than training a private model.
- Score: 18.863579044812703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Differentially private training algorithms like DP-SGD protect sensitive
training data by ensuring that trained models do not reveal private
information. An alternative approach, which this paper studies, is to use a
sensitive dataset to generate synthetic data that is differentially private
with respect to the original data, and then non-privately train a model on
the synthetic data. Doing so has several advantages: synthetic data can be
reused for other tasks (including hyperparameter tuning), retained
indefinitely, and shared with third parties without sacrificing privacy.
However, generating private synthetic data is much harder than training a
private model. To improve performance on text data, recent work has utilized
public data by starting with a pre-trained generative language model and
privately fine-tuning it on sensitive data. This model can be used to sample a
DP synthetic dataset. While this strategy seems straightforward, executing it
has proven problematic. Previous approaches either show significant performance
loss, or have, as we show, critical design flaws. In this paper we demonstrate
that a proper training objective along with tuning fewer parameters results in
excellent DP synthetic data quality. Our approach is competitive with direct
DP-training of downstream classifiers in terms of performance on downstream
tasks. Further, we demonstrate that our DP synthetic data is useful not only
for training downstream classifiers, but also for tuning those same models.
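The abstract refers to DP-SGD without describing it. As a rough illustration only, the following is a minimal sketch of one generic DP-SGD update (clip each example's gradient, sum, add Gaussian noise, average). It is not the paper's fine-tuning procedure; the function names `clip_gradient` and `dp_sgd_step` and all parameter values are hypothetical, and privacy accounting (computing the resulting epsilon) is deliberately omitted.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale one per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One generic DP-SGD update (hypothetical helper, no privacy accounting).

    Clipping bounds each example's influence on the summed gradient;
    Gaussian noise with std noise_multiplier * clip_norm then masks it.
    """
    dim = len(params)
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


# Toy usage with made-up gradients for a two-parameter model.
rng = random.Random(0)
params = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]  # the first gradient has norm 5 and gets clipped
params = dp_sgd_step(params, grads, clip_norm=1.0,
                     noise_multiplier=1.0, lr=0.1, rng=rng)
```

In the setting the abstract describes, an update of this kind would be applied while fine-tuning a pre-trained generative language model on the sensitive data; samples drawn from that model afterwards inherit the DP guarantee.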
Related papers
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls, yet outperforms the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Pre-training Differentially Private Models with Limited Public Data [54.943023722114134]
Differential privacy (DP) is a prominent method to gauge the degree of security provided to the models.
DP is not yet capable of protecting a substantial portion of the data used during the initial pre-training stage.
We develop a novel DP continual pre-training strategy using only 10% of public data.
Our strategy can achieve DP accuracy of 41.5% on ImageNet-21k, as well as non-DP accuracy of 55.7% and 60.0% on downstream tasks Places365 and iNaturalist-2021.
arXiv Detail & Related papers (2024-02-28T23:26:27Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- PrivImage: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining [13.823621924706348]
Differential Privacy (DP) image data synthesis allows organizations to share and utilize synthetic images without privacy concerns.
Previous methods incorporate the advanced techniques of generative models and pre-training on a public dataset to produce exceptional DP image data.
This paper proposes a novel DP image synthesis method, termed PRIVIMAGE, which meticulously selects pre-training data.
arXiv Detail & Related papers (2023-10-19T14:04:53Z)
- Private Synthetic Data Meets Ensemble Learning [15.425653946755025]
When machine learning models are trained on synthetic data and then deployed on real data, there is often a performance drop.
We introduce a new ensemble strategy for training downstream models, with the goal of enhancing their performance when used on real data.
arXiv Detail & Related papers (2023-10-15T04:24:42Z)
- Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving Training Data Release for Machine Learning [3.29354893777827]
We introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning.
We present experimental evidence showing minimal discrepancy between performance metrics of models trained on real versus privatized datasets.
arXiv Detail & Related papers (2023-07-04T18:37:11Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- An Analysis of the Deployment of Models Trained on Private Tabular Synthetic Data: Unexpected Surprises [4.129847064263057]
Differentially private (DP) synthetic datasets are a powerful approach for training machine learning models.
We study the effects of differentially private synthetic data generation on classification.
arXiv Detail & Related papers (2021-06-15T21:00:57Z)
- Differentially Private Synthetic Medical Data Generation using Convolutional GANs [7.2372051099165065]
We develop a differentially private framework for synthetic data generation using Rényi differential privacy.
Our approach builds on convolutional autoencoders and convolutional generative adversarial networks to preserve some of the critical characteristics of the generated synthetic data.
We demonstrate that our model outperforms existing state-of-the-art models under the same privacy budget.
arXiv Detail & Related papers (2020-12-22T01:03:49Z)