Related papers: Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs

Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs

URL: http://arxiv.org/abs/2503.12347v1
Date: Sun, 16 Mar 2025 04:00:32 GMT
Title: Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs
Authors: Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu,
Abstract summary: We propose a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale finetuning.<n> CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data.<n>To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram.
Score: 20.774525687291167
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generator is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution, depend heavily on the manual prompts, and ineffectively use private information in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.

Related papers

DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning [51.35628297101575]
Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data.<n>We introduce Differentially Private Reinforcement Fine-Tuning (DP-RFT), an online reinforcement learning algorithm for synthetic data generation with LLMs.<n>We evaluate DP-RFT for long-form and domain-specific synthetic data generation, such as news articles, meeting transcripts, and medical article abstracts.
arXiv Detail & Related papers (2026-02-20T22:03:56Z)
Improving Noise Efficiency in Privacy-preserving Dataset Distillation [59.57846442477106]
We introduce a novel framework that decouples sampling from optimization for better convergence and improves signal quality.<n>On CIFAR-10, our method achieves a textbf10.0% improvement with 50 images per class and textbf8.3% increase with just textbfone-fifth the distilled set size of previous state-of-the-art methods.
arXiv Detail & Related papers (2025-08-03T13:15:52Z)
Private Training & Data Generation by Clustering Embeddings [74.00687214400021]
Differential privacy (DP) provides a robust framework for protecting individual data.<n>We introduce a novel principled method for DP synthetic image embedding generation.<n> Empirically, a simple two-layer neural network trained on synthetically generated embeddings achieves state-of-the-art (SOTA) classification accuracy.
arXiv Detail & Related papers (2025-06-20T00:17:14Z)
DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators [47.86275136491794]
We propose a two-stage fine-tuning framework for differentially private data generation.<n>The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset.<n>Our results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts.
arXiv Detail & Related papers (2024-12-03T14:10:09Z)
Differentially Private Non Parametric Copulas: Generating synthetic data with non parametric copulas under privacy guarantees [0.0]
This work focuses on enhancing a non-parametric co-pula-based synthetic data generation model, DPNPC, by incorporating Differential Privacy. We compare DPNPC with three other models (PrivBayes, DP-Copula, and DP-Histogram) across three public datasets, evaluating privacy, utility, and execution time.
arXiv Detail & Related papers (2024-09-27T10:18:14Z)
Private prediction for large-scale synthetic text generation [28.488459921169905]
We present an approach for generating differentially private synthetic text using large language models (LLMs) In the private prediction framework, we only require the output synthetic data to satisfy differential privacy guarantees.
arXiv Detail & Related papers (2024-07-16T18:28:40Z)
Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy. Existing techniques face the emerging challenges of re-identification attack ability of Large Language Models. This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
Differentially Private Tabular Data Synthesis using Large Language Models [6.6376578496141585]
This paper introduces DP-LLMTGen -- a novel framework for differentially private tabular data synthesis. DP-LLMTGen models sensitive datasets using a two-stage fine-tuning procedure. It generates synthetic data through sampling the fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-03T15:43:57Z)
Quantifying and Mitigating Privacy Risks for Tabular Generative Models [13.153278585144355]
Synthetic data from generative models emerges as the privacy-preserving data-sharing solution. We propose DP-TLDM, Differentially Private Tabular Latent Diffusion Model. We show that DP-TLDM improves the synthetic quality by an average of 35% in data resemblance, 15% in the utility for downstream tasks, and 50% in data discriminability.
arXiv Detail & Related papers (2024-03-12T17:27:49Z)
PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind) Our work offers a theoretical analysis for model design and benchmarks various techniques. In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge. Existing private generative models are struggling with the utility of synthetic samples. We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
Just Fine-tune Twice: Selective Differential Privacy for Large Language Models [69.66654761324702]
We propose a simple yet effective just-fine-tune-twice privacy mechanism to achieve SDP for large Transformer-based language models. Experiments show that our models achieve strong performance while staying robust to the canary insertion attack.
arXiv Detail & Related papers (2022-04-15T22:36:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.