Differentially Private Language Models for Secure Data Sharing
- URL: http://arxiv.org/abs/2210.13918v2
- Date: Wed, 26 Oct 2022 12:55:53 GMT
- Title: Differentially Private Language Models for Secure Data Sharing
- Authors: Justus Mattern, Zhijing Jin, Benjamin Weggenmann, Bernhard Schoelkopf,
Mrinmaya Sachan
- Abstract summary: In this paper, we show how to train a generative language model in a differentially private manner and subsequently sample data from it.
Using natural language prompts and a new prompt-mismatch loss, we are able to create highly accurate and fluent textual datasets.
We perform thorough experiments indicating that our synthetic datasets do not leak information from our original data and are of high language quality.
- Score: 19.918137395199224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To protect the privacy of individuals whose data is being shared, it is of
high importance to develop methods allowing researchers and companies to
release textual data while providing formal privacy guarantees to its
originators. In the field of NLP, substantial efforts have been directed at
building mechanisms following the framework of local differential privacy,
thereby anonymizing individual text samples before releasing them. In practice,
these approaches are often unsatisfactory in terms of the quality of their
output language due to the strong noise required for local differential
privacy. In this paper, we approach the problem at hand using global
differential privacy, particularly by training a generative language model in a
differentially private manner and subsequently sampling data from it. Using
natural language prompts and a new prompt-mismatch loss, we are able to create
highly accurate and fluent textual datasets taking on specific desired
attributes such as sentiment or topic and resembling statistical properties of
the training data. We perform thorough experiments indicating that our
synthetic datasets do not leak information from our original data and are of
high language quality and highly suitable for training models for further
analysis on real-world data. Notably, we also demonstrate that training
classifiers on private synthetic data outperforms directly training classifiers
on real data with DP-SGD.
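As a rough illustration of the global-DP pipeline described above, the sketch
below fine-tunes a generative language model with a hand-rolled DP-SGD step
(per-example gradient clipping plus Gaussian noise) on prompt-prefixed records
and then samples synthetic text from prompts alone. It assumes a GPT-2 base
model from Hugging Face transformers; the prompt format, hyperparameters, and
clipping loop are illustrative only, and the paper's prompt-mismatch loss is
not reproduced here.
```python
# Sketch: DP-SGD fine-tuning of a generative LM on prompt-prefixed records,
# followed by prompt-conditional sampling of synthetic data.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

max_grad_norm = 1.0     # per-example clipping bound C (illustrative)
noise_multiplier = 0.6  # sigma; with the sampling rate and step count this fixes (epsilon, delta)

def dp_sgd_step(batch_texts):
    """One DP-SGD step: clip each per-example gradient, sum, add Gaussian noise."""
    optimizer.zero_grad()
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for text in batch_texts:  # microbatches of size 1 give per-example gradients
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        loss = model(**enc, labels=enc["input_ids"]).loss
        grads = torch.autograd.grad(loss, list(model.parameters()))
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        clip_coef = min(1.0, (max_grad_norm / (norm + 1e-6)).item())
        for s, g in zip(summed, grads):
            s.add_(g, alpha=clip_coef)
    for p, s in zip(model.parameters(), summed):
        noise = torch.normal(0.0, noise_multiplier * max_grad_norm, size=s.shape)
        p.grad = (s + noise) / len(batch_texts)
    optimizer.step()

# Training records are prefixed with a natural-language prompt encoding the attribute.
records = [("positive", "great movie, loved every minute"),
           ("negative", "dull plot and wooden acting")]
dp_sgd_step([f"Sentiment: {label}. Review: {text}" for label, text in records])

# Synthetic data is then sampled from prompts only, never from the real records.
prompt = tokenizer("Sentiment: positive. Review:", return_tensors="pt")
sample = model.generate(**prompt, do_sample=True, top_p=0.9, max_length=60,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(sample[0], skip_special_tokens=True))
```
In practice a privacy accountant tracks the cumulative (epsilon, delta) over
all steps, and a library such as Opacus would typically replace the
hand-rolled clipping loop shown here.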
Related papers
- Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains [9.123834467375532]
We explore the feasibility of using synthetic data generated from differentially private language models in place of real data to facilitate the development of NLP in high-stakes domains.
Our results show that prior simplistic evaluations have failed to highlight utility, privacy, and fairness issues in the synthetic data.
arXiv Detail & Related papers (2024-10-10T19:31:02Z)
- FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning [54.26614091429253]
Federated instruction tuning (FedIT) is a promising solution that consolidates collaborative training across multiple data owners.
FedIT encounters limitations such as the scarcity of instruction data and the risk of exposure to training data extraction attacks.
We propose FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning.
arXiv Detail & Related papers (2024-03-10T08:41:22Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- Recovering from Privacy-Preserving Masking with Large Language Models [14.828717714653779]
We use large language models (LLMs) to suggest substitutes for masked tokens (a generic illustration of such masked-token substitution appears after this list).
We show that models trained on the obfuscated corpora achieve performance comparable to models trained on the original data.
arXiv Detail & Related papers (2023-09-12T16:39:41Z)
- Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving Training Data Release for Machine Learning [3.29354893777827]
We introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning.
We present experimental evidence showing minimal discrepancy between performance metrics of models trained on real versus privatized datasets.
arXiv Detail & Related papers (2023-07-04T18:37:11Z)
- Collaborative Chinese Text Recognition with Personalized Federated Learning [61.34060587461462]
In Chinese text recognition, it is often necessary for one organization to collect a large amount of data from similar organizations.
Due to the natural presence of private information in text data, such as addresses and phone numbers, different organizations are unwilling to share private data.
We introduce personalized federated learning (pFL) into the Chinese text recognition task and propose the pFedCR algorithm.
arXiv Detail & Related papers (2023-05-09T16:51:00Z)
- Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models struggle with the utility of their synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
- Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe [32.63295550058343]
We show that a simple and practical recipe in the text domain is effective in generating useful synthetic text with strong privacy protection.
Our method produces synthetic text that is competitive in terms of utility with its non-private counterpart.
arXiv Detail & Related papers (2022-10-25T21:21:17Z)
- Personalization Improves Privacy-Accuracy Tradeoffs in Federated Optimization [57.98426940386627]
We show that coordinating local learning with private centralized learning yields a generically useful and improved tradeoff between accuracy and privacy.
We illustrate our theoretical results with experiments on synthetic and real-world datasets.
arXiv Detail & Related papers (2022-02-10T20:44:44Z)
- TIPRDC: Task-Independent Privacy-Respecting Data Crowdsourcing Framework for Deep Learning with Anonymized Intermediate Representations [49.20701800683092]
We present TIPRDC, a task-independent privacy-respecting data crowdsourcing framework with anonymized intermediate representation.
The goal of this framework is to learn a feature extractor that hides private information in the intermediate representations, while maximally retaining the original information embedded in the raw data so that the data collector can accomplish unknown learning tasks.
arXiv Detail & Related papers (2020-05-23T06:21:26Z)
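The masked-token recovery idea summarized in the list above can be illustrated
in a generic form, unrelated to any specific paper's method, with an
off-the-shelf masked language model that proposes substitutes for redacted
tokens; the model choice and the example sentence below are purely
illustrative.
```python
# Generic illustration: use a masked language model to propose substitutes
# for tokens that were masked out of a text for privacy reasons.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# A record whose sensitive token (e.g. a city name) was replaced by the mask token.
masked_record = f"The patient was treated at a hospital in {fill.tokenizer.mask_token}."

# Each candidate carries a score; a top suggestion can stand in for the removed token.
for candidate in fill(masked_record, top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```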
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences.