DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators
- URL: http://arxiv.org/abs/2412.02467v1
- Date: Tue, 03 Dec 2024 14:10:09 GMT
- Title: DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators
- Authors: Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, Mario Fritz
- Abstract summary: We propose a two-stage fine-tuning framework for differentially private data generation.
The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset.
Our results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts.
- Score: 47.86275136491794
- Abstract: Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs) -- even those at the scale of GPT-2 -- have demonstrated great potential in synthesizing tabular data. However, their applications under DP constraints remain largely unexplored. In this work, we address this gap by applying DP techniques to the generation of synthetic tabular data. Our findings show that LLMs face difficulties in generating coherent text when fine-tuned with DP, as privacy budgets are inefficiently allocated to non-private elements like table structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation. The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset. Our empirical results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts. We release our code and setup at https://github.com/tejuafonja/DP-2Stage.
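The abstract describes the two-stage schedule only at a high level. Below is a minimal sketch of the idea, assuming a toy PyTorch classifier standing in for the GPT-2-scale LLM and a hand-rolled DP-SGD step (per-example gradient clipping plus Gaussian noise) in place of a production library such as Opacus; every dataset and variable name here is a placeholder rather than the authors' code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the language model fine-tuned in the paper.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()

# Placeholder data: "pseudo" rows are non-sensitive, "private" rows are sensitive.
pseudo_loader = DataLoader(TensorDataset(torch.randn(256, 16),
                                         torch.randint(0, 4, (256,))), batch_size=32)
private_loader = DataLoader(TensorDataset(torch.randn(256, 16),
                                          torch.randint(0, 4, (256,))), batch_size=32)

# Stage 1: ordinary (non-private) fine-tuning on the pseudo dataset.
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for x, y in pseudo_loader:
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Stage 2: DP fine-tuning on the private dataset via per-example
# gradient clipping and Gaussian noise (the DP-SGD recipe).
clip_norm, noise_mult, lr = 1.0, 1.0, 1e-2
for x, y in private_loader:
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for xi, yi in zip(x, y):                          # per-example gradients
        model.zero_grad()
        loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
        scale = min(1.0, (clip_norm / (norm + 1e-6)).item())
        for s, p in zip(summed, model.parameters()):
            s += p.grad * scale                       # clip, then accumulate
    with torch.no_grad():
        for s, p in zip(summed, model.parameters()):
            noise = torch.randn_like(s) * noise_mult * clip_norm
            p -= lr * (s + noise) / len(x)            # noisy averaged update
```

Only the second stage touches sensitive records, so only it spends the (epsilon, delta) budget; the intent of the pseudo-data stage is to learn shared structure such as the table format before any budget is consumed.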
Related papers
- Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-Tuning [62.224804688233]
Differential privacy (DP) offers a promising solution by ensuring models are 'almost indistinguishable' with or without any particular privacy unit.
We study user-level DP motivated by applications where it is necessary to ensure uniform privacy protection across users.
arXiv Detail & Related papers (2024-06-20T13:54:32Z)
- Differentially Private Tabular Data Synthesis using Large Language Models [6.6376578496141585]
This paper introduces DP-LLMTGen -- a novel framework for differentially private tabular data synthesis.
DP-LLMTGen models sensitive datasets using a two-stage fine-tuning procedure.
It then generates synthetic data by sampling from the fine-tuned LLM.
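As context for that sampling step, here is a minimal sketch assuming a GReaT-style "column is value" text serialization and an off-the-shelf GPT-2 checkpoint as a stand-in for the privately fine-tuned model; the actual prompt format and parsing used by DP-LLMTGen may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-ins: in practice the model would be the (DP-)fine-tuned LLM.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "age is"  # hypothetical first column of a tabular row
ids = tok(prompt, return_tensors="pt").input_ids
out = lm.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                  pad_token_id=tok.eos_token_id)
text = tok.decode(out[0], skip_special_tokens=True)

# Parse "col is val, col is val, ..." back into a record.
row = {}
for field in text.split(","):
    if " is " in field:
        col, val = field.split(" is ", 1)
        row[col.strip()] = val.strip()
print(row)
```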
arXiv Detail & Related papers (2024-06-03T15:43:57Z)
- Enhancing Scalability of Metric Differential Privacy via Secret Dataset Partitioning and Benders Decomposition [1.283608820493284]
Metric Differential Privacy (mDP) extends the concept of Differential Privacy (DP) to serve as a new paradigm of data perturbation.
It is designed to protect secret data represented in general metric space, such as text data encoded as word embeddings or geo-location data on the road network or grid maps.
arXiv Detail & Related papers (2024-05-07T14:19:09Z)
- DP-TabICL: In-Context Learning with Differentially Private Tabular Data [12.814878223075437]
In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks.
LLMs can leak information contained in prompts.
This work serves as an initial investigation into how to use differential privacy (DP) to protect tabular data used in ICL.
We formulate two private ICL frameworks with provable privacy guarantees in both the local (LDP-TabICL) and global (GDP-TabICL) DP settings.
arXiv Detail & Related papers (2024-03-08T21:19:01Z)
- Privacy Amplification for the Gaussian Mechanism via Bounded Support [64.86780616066575]
Data-dependent privacy accounting frameworks such as per-instance differential privacy (pDP) and Fisher information loss (FIL) confer fine-grained privacy guarantees for individuals in a fixed training dataset.
We propose simple modifications of the Gaussian mechanism with bounded support, showing that they amplify privacy guarantees under data-dependent accounting.
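A sketch under the assumption that "bounded support" means clamping the standard Gaussian mechanism's output to a fixed interval (a rectified Gaussian); the paper's exact constructions and their data-dependent accounting are in the cited work.

```python
import numpy as np

def rectified_gaussian(value, sigma, lower, upper, rng=np.random.default_rng(0)):
    """Gaussian mechanism whose noisy output is clamped to [lower, upper]."""
    return float(np.clip(value + rng.normal(0.0, sigma), lower, upper))

# Hypothetical use: privatize a statistic known to lie in [0, 1].
print(rectified_gaussian(0.7, sigma=1.0, lower=0.0, upper=1.0))
```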
arXiv Detail & Related papers (2024-03-07T21:22:07Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- Just Fine-tune Twice: Selective Differential Privacy for Large Language Models [69.66654761324702]
We propose a simple yet effective just-fine-tune-twice privacy mechanism to achieve SDP for large Transformer-based language models.
Experiments show that our models achieve strong performance while staying robust to the canary insertion attack.
arXiv Detail & Related papers (2022-04-15T22:36:55Z)
- DP-SGD vs PATE: Which Has Less Disparate Impact on GANs? [0.0]
We compare GANs trained with the two best-known DP frameworks for deep learning, DP-SGD, and PATE, in different data imbalance settings.
Our experiments consistently show that for PATE, unlike DP-SGD, the privacy-utility trade-off is not monotonically decreasing.
arXiv Detail & Related papers (2021-11-26T17:25:46Z)
- Don't Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence [73.14373832423156]
We propose DP-Sinkhorn, a novel optimal transport-based generative method for learning data distributions from private data with differential privacy.
Unlike existing approaches for training differentially private generative models, we do not rely on adversarial objectives.
arXiv Detail & Related papers (2021-11-01T18:10:21Z)
- DTGAN: Differential Private Training for Tabular GANs [6.174448419090292]
We propose DTGAN, a novel conditional Wasserstein GAN that comes in two variants DTGAN_G and DTGAN_D.
We rigorously evaluate the theoretical privacy guarantees offered by DP empirically against membership and attribute inference attacks.
Our results on 3 datasets show that the DP-SGD framework is superior to PATE and that applying DP to the discriminator is preferable for training convergence.
arXiv Detail & Related papers (2021-07-06T10:28:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.