Related papers: Differentially Private Synthetic Data via Foundation Model APIs 2: Text

URL: http://arxiv.org/abs/2403.01749v2
Date: Tue, 23 Jul 2024 19:19:02 GMT
Title: Differentially Private Synthetic Data via Foundation Model APIs 2: Text
Authors: Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin A Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin
Abstract summary: A lot of high-quality text data generated in the real world is private and cannot be shared or used freely due to privacy concerns. We propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines.
Score: 56.13240830670327
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.

Related papers

PCEvolve: Private Contrastive Evolution for Synthetic Dataset Generation via Few-Shot Private Data and Generative APIs [39.108700932535754]
Private Evolution (PE) algorithm generates Differential Privacy (DP) synthetic images using diffusion model APIs. In practice, the few-shot private data challenge is particularly prevalent in specialized domains like healthcare and industry. We propose a novel API-assisted algorithm, Private Contrastive Evolution (PCEvolve), which iteratively mines inherent inter-class contrastive relationships in few-shot private data.
arXiv Detail & Related papers (2025-06-04T13:33:06Z)
Private Text Generation by Seeding Large Language Model Prompts [13.407214545457778]
We propose Differentially Private Keyphrase Prompt Seeding (DP-KPS), a method that generates a private synthetic text corpus from a sensitive input corpus. We evaluate DP-KPS on downstream ML text classification tasks, and show that the corpora it generates preserve much of the predictive power of the original ones.
arXiv Detail & Related papers (2025-02-18T16:50:38Z)
Is API Access to LLMs Useful for Generating Private Synthetic Tabular Data? [19.72500788849435]
Differentially private (DP) synthetic data is a versatile tool for enabling the analysis of private data. Recent advancements in large language models (LLMs) have inspired a number of algorithm techniques for improving DP synthetic data generation. One family of approaches uses DP finetuning on the foundation model weights; however, the model weights for state-of-the-art models may not be public.
arXiv Detail & Related papers (2025-02-10T15:23:52Z)
Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Models [13.28430346661924]
Differentially private (DP) synthetic data has become a key tool for unlocking the value of private data without compromising privacy. Private Evolution (PE) has emerged as a promising method for generating DP synthetic data. We show that simulators -- computer graphics-based image synthesis tools -- can also serve as effective APIs within the PE framework.
arXiv Detail & Related papers (2025-02-08T09:50:30Z)
Private Synthetic Text Generation with Diffusion Models [13.240347195231305]
We show that fully open-source LLMs outperform diffusion models in the privacy regime. Our complete source codes, datasets, and experimental setup are publicly available to foster future research.
arXiv Detail & Related papers (2024-10-30T12:38:49Z)
Cool-Fusion: Fuse Large Language Models without Training [73.17551121242602]
Cool-Fusion is a method that does not require any type of training like the ensemble approaches. Cool-Fusion increases accuracy from three strong source LLMs by a significant 8%-17.8%.
arXiv Detail & Related papers (2024-07-29T09:02:19Z)
Private prediction for large-scale synthetic text generation [28.488459921169905]
We present an approach for generating differentially private synthetic text using large language models (LLMs). In the private prediction framework, we only require the output synthetic data to satisfy differential privacy guarantees.
arXiv Detail & Related papers (2024-07-16T18:28:40Z)
Data Augmentation for Text-based Person Retrieval Using Large Language Models [16.120524750964016]
Text-based Person Retrieval (TPR) aims to retrieve person images that match the description given a text query. It is difficult to construct a large-scale, high-quality TPR dataset due to expensive annotation and privacy protection. This paper proposes an LLM-based Data Augmentation (LLM-DA) method for TPR.
arXiv Detail & Related papers (2024-05-20T11:57:50Z)
Differentially Private Knowledge Distillation via Synthetic Text Generation [5.201318326501886]
We propose DistilDP: a novel differentially private knowledge distillation algorithm. DistilDP exploits synthetic data generated by a differentially private teacher LLM. Our experimental results demonstrate that DistilDP can substantially improve the utility over existing baselines.
arXiv Detail & Related papers (2024-03-01T19:22:24Z)
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN). At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery. Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data. In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
Source Attribution for Large Language Model-Generated Data [57.85840382230037]
It is imperative to be able to perform source attribution by identifying the data provider who contributed to the generation of a synthetic text. We show that this problem can be tackled by watermarking. We propose a source attribution framework that satisfies these key properties due to our algorithmic designs.
arXiv Detail & Related papers (2023-10-01T12:02:57Z)
LLMDet: A Third Party Large Language Models Generated Text Detection Tool [119.0952092533317]
Text generated by large language models (LLMs) is remarkably close to high-quality human-authored text. Existing detection tools can only differentiate between machine-generated and human-authored text. We propose LLMDet, a model-specific, secure, efficient, and extendable detection tool.
arXiv Detail & Related papers (2023-05-24T10:45:16Z)
A Survey of Pretrained Language Models Based Text Generation [97.64625999380425]
Text Generation aims to produce plausible and readable text in human language from input data. Deep learning has greatly advanced this field by neural generation models, especially the paradigm of pretrained language models (PLMs). Grounding text generation on PLMs is seen as a promising direction in both academia and industry.
arXiv Detail & Related papers (2022-01-14T01:44:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.