DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators
- URL: http://arxiv.org/abs/2412.02467v1
- Date: Tue, 03 Dec 2024 14:10:09 GMT
- Title: DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators
- Authors: Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, Mario Fritz
- Abstract summary: We propose a two-stage fine-tuning framework for differentially private data generation.
The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset.
Our results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts.
- Score: 47.86275136491794
- Abstract: Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs) -- even those at the scale of GPT-2 -- have demonstrated great potential in synthesizing tabular data. However, their applications under DP constraints remain largely unexplored. In this work, we address this gap by applying DP techniques to the generation of synthetic tabular data. Our findings show that LLMs face difficulties in generating coherent text when fine-tuned with DP, as privacy budgets are inefficiently allocated to non-private elements like table structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation. The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset. Our empirical results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts. We release our code and setup at https://github.com/tejuafonja/DP-2Stage.
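The abstract describes the two-stage schedule only at a high level. Below is a minimal sketch of the idea, assuming a toy PyTorch classifier standing in for the GPT-2-scale LLM and a hand-rolled DP-SGD step (per-example gradient clipping plus Gaussian noise) in place of a production library such as Opacus; every dataset and variable name here is a placeholder rather than the authors' code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the language model fine-tuned in the paper.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()

# Placeholder data: "pseudo" rows are non-sensitive, "private" rows are sensitive.
pseudo_loader = DataLoader(TensorDataset(torch.randn(256, 16),
                                         torch.randint(0, 4, (256,))), batch_size=32)
private_loader = DataLoader(TensorDataset(torch.randn(256, 16),
                                          torch.randint(0, 4, (256,))), batch_size=32)

# Stage 1: ordinary (non-private) fine-tuning on the pseudo dataset.
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for x, y in pseudo_loader:
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Stage 2: DP fine-tuning on the private dataset via per-example
# gradient clipping and Gaussian noise (the DP-SGD recipe).
clip_norm, noise_mult, lr = 1.0, 1.0, 1e-2
for x, y in private_loader:
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for xi, yi in zip(x, y):                          # per-example gradients
        model.zero_grad()
        loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
        scale = min(1.0, (clip_norm / (norm + 1e-6)).item())
        for s, p in zip(summed, model.parameters()):
            s += p.grad * scale                       # clip, then accumulate
    with torch.no_grad():
        for s, p in zip(summed, model.parameters()):
            noise = torch.randn_like(s) * noise_mult * clip_norm
            p -= lr * (s + noise) / len(x)            # noisy averaged update
```

Only the second stage touches sensitive records, so only it spends the (epsilon, delta) budget; the intent of the pseudo-data stage is to learn shared structure such as the table format before any budget is consumed.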
Related papers
- Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-Tuning [62.224804688233]
Differential privacy (DP) offers a promising solution by ensuring models are 'almost indistinguishable' with or without any particular privacy unit.
We study user-level DP motivated by applications where it is necessary to ensure uniform privacy protection across users.
arXiv Detail & Related papers (2024-06-20T13:54:32Z)
- Differentially Private Tabular Data Synthesis using Large Language Models [6.6376578496141585]
This paper introduces DP-LLMTGen -- a novel framework for differentially private tabular data synthesis.
DP-LLMTGen models sensitive datasets using a two-stage fine-tuning procedure.
It then generates synthetic data by sampling from the fine-tuned LLM.
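As context for that sampling step, here is a minimal sketch assuming a GReaT-style "column is value" text serialization and an off-the-shelf GPT-2 checkpoint as a stand-in for the privately fine-tuned model; the actual prompt format and parsing used by DP-LLMTGen may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-ins: in practice the model would be the (DP-)fine-tuned LLM.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "age is"  # hypothetical first column of a tabular row
ids = tok(prompt, return_tensors="pt").input_ids
out = lm.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                  pad_token_id=tok.eos_token_id)
text = tok.decode(out[0], skip_special_tokens=True)

# Parse "col is val, col is val, ..." back into a record.
row = {}
for field in text.split(","):
    if " is " in field:
        col, val = field.split(" is ", 1)
        row[col.strip()] = val.strip()
print(row)
```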
arXiv Detail & Related papers (2024-06-03T15:43:57Z)
- Enhancing Scalability of Metric Differential Privacy via Secret Dataset Partitioning and Benders Decomposition [1.283608820493284]
Metric Differential Privacy (mDP) extends the concept of Differential Privacy (DP) to serve as a new paradigm of data perturbation.
It is designed to protect secret data represented in general metric space, such as text data encoded as word embeddings or geo-location data on the road network or grid maps.
arXiv Detail & Related papers (2024-05-07T14:19:09Z)
- DP-TabICL: In-Context Learning with Differentially Private Tabular Data [12.814878223075437]
In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks.
LLMs can leak information contained in prompts.
This work serves as an initial investigation into how to use differential privacy (DP) to protect tabular data used in ICL.
We formulate two private ICL frameworks with provable privacy guarantees in both the local (LDP-TabICL) and global (GDP-TabICL) DP settings.
arXiv Detail & Related papers (2024-03-08T21:19:01Z)
- Privacy Amplification for the Gaussian Mechanism via Bounded Support [64.86780616066575]
Data-dependent privacy accounting frameworks such as per-instance differential privacy (pDP) and Fisher information loss (FIL) confer fine-grained privacy guarantees for individuals in a fixed training dataset.
We propose simple modifications of the Gaussian mechanism with bounded support, showing that they amplify privacy guarantees under data-dependent accounting.
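A sketch under the assumption that "bounded support" means clamping the standard Gaussian mechanism's output to a fixed interval (a rectified Gaussian); the paper's exact constructions and their data-dependent accounting are in the cited work.

```python
import numpy as np

def rectified_gaussian(value, sigma, lower, upper, rng=np.random.default_rng(0)):
    """Gaussian mechanism whose noisy output is clamped to [lower, upper]."""
    return float(np.clip(value + rng.normal(0.0, sigma), lower, upper))

# Hypothetical use: privatize a statistic known to lie in [0, 1].
print(rectified_gaussian(0.7, sigma=1.0, lower=0.0, upper=1.0))
```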
arXiv Detail & Related papers (2024-03-07T21:22:07Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- Just Fine-tune Twice: Selective Differential Privacy for Large Language Models [69.66654761324702]
We propose a simple yet effective just-fine-tune-twice privacy mechanism to achieve SDP for large Transformer-based language models.
Experiments show that our models achieve strong performance while staying robust to the canary insertion attack.
arXiv Detail & Related papers (2022-04-15T22:36:55Z)
- DP-SGD vs PATE: Which Has Less Disparate Impact on GANs? [0.0]
We compare GANs trained with the two best-known DP frameworks for deep learning, DP-SGD, and PATE, in different data imbalance settings.
Our experiments consistently show that for PATE, unlike DP-SGD, the privacy-utility trade-off is not monotonically decreasing.
arXiv Detail & Related papers (2021-11-26T17:25:46Z)
- Don't Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence [73.14373832423156]
We propose DP-Sinkhorn, a novel optimal transport-based generative method for learning data distributions from private data with differential privacy.
Unlike existing approaches for training differentially private generative models, we do not rely on adversarial objectives.
arXiv Detail & Related papers (2021-11-01T18:10:21Z)
- DTGAN: Differential Private Training for Tabular GANs [6.174448419090292]
We propose DTGAN, a novel conditional Wasserstein GAN that comes in two variants DTGAN_G and DTGAN_D.
We rigorously evaluate the theoretical privacy guarantees offered by DP empirically against membership and attribute inference attacks.
Our results on 3 datasets show that the DP-SGD framework is superior to PATE and that applying DP to the discriminator is preferable for training convergence.
arXiv Detail & Related papers (2021-07-06T10:28:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.