Differentially Private Synthetic Data via Foundation Model APIs 2: Text
- URL: http://arxiv.org/abs/2403.01749v2
- Date: Tue, 23 Jul 2024 19:19:02 GMT
- Title: Differentially Private Synthetic Data via Foundation Model APIs 2: Text
- Authors: Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin A Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin,
- Abstract summary: A lot of high-quality text data generated in the real world is private and cannot be shared or used freely due to privacy concerns.
We propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text.
Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines.
- Score: 56.13240830670327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.
Related papers
- Private Text Generation by Seeding Large Language Model Prompts [13.407214545457778]
We propose Differentially Private Keyphrase Prompt Seeding (DP-KPS), a method that generates a private synthetic text corpus from a sensitive input corpus.
We evaluate DP-KPS on downstream ML text classification tasks, and show that the corpora it generates preserve much of the predictive power of the original ones.
arXiv Detail & Related papers (2025-02-18T16:50:38Z) - Is API Access to LLMs Useful for Generating Private Synthetic Tabular Data? [19.72500788849435]
Differentially private (DP) synthetic data is a versatile tool for enabling the analysis of private data.
Recent advancements in large language models (LLMs) have inspired a number of algorithm techniques for improving DP synthetic data generation.
One family of approaches uses DP finetuning on the foundation model weights; however, the model weights for state-of-the-art models may not be public.
arXiv Detail & Related papers (2025-02-10T15:23:52Z) - Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Model [13.28430346661924]
Differentially private (DP) synthetic data has become a key tool for unlocking the value of private data without compromising privacy.
Private Evolution (PE) has emerged as a promising method for generating DP synthetic data.
We show that simulators -- computer graphics-based image synthesis tools -- can also serve as effective APIs within the PE framework.
arXiv Detail & Related papers (2025-02-08T09:50:30Z) - Differentially Private Steering for Large Language Model Alignment [55.30573701583768]
We present the first study of aligning Large Language Models with private datasets.
Our work proposes the textitunderlinePrivate underlineSteering for LLM underlineAment (PSA) algorithm.
Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance.
arXiv Detail & Related papers (2025-01-30T17:58:36Z) - Cool-Fusion: Fuse Large Language Models without Training [73.17551121242602]
emphCool-Fusion is a method that does not require any type of training like the ensemble approaches.
emphCool-Fusion increases accuracy from three strong source LLMs by a significant 8%-17.8%.
arXiv Detail & Related papers (2024-07-29T09:02:19Z) - Private prediction for large-scale synthetic text generation [28.488459921169905]
We present an approach for generating differentially private synthetic text using large language models (LLMs)
In the private prediction framework, we only require the output synthetic data to satisfy differential privacy guarantees.
arXiv Detail & Related papers (2024-07-16T18:28:40Z) - Differentially Private Knowledge Distillation via Synthetic Text Generation [5.201318326501886]
We propose DistilDP: a novel differentially private knowledge distillation algorithm.
DistilDP exploits synthetic data generated by a differentially private teacher LLM.
Our experimental results demonstrate that DistilDP can substantially improve the utility over existing baselines.
arXiv Detail & Related papers (2024-03-01T19:22:24Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - Source Attribution for Large Language Model-Generated Data [57.85840382230037]
It is imperative to be able to perform source attribution by identifying the data provider who contributed to the generation of a synthetic text.
We show that this problem can be tackled by watermarking.
We propose a source attribution framework that satisfies these key properties due to our algorithmic designs.
arXiv Detail & Related papers (2023-10-01T12:02:57Z) - A Survey of Pretrained Language Models Based Text Generation [97.64625999380425]
Text Generation aims to produce plausible and readable text in human language from input data.
Deep learning has greatly advanced this field by neural generation models, especially the paradigm of pretrained language models (PLMs)
Grounding text generation on PLMs is seen as a promising direction in both academia and industry.
arXiv Detail & Related papers (2022-01-14T01:44:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.