Differentially Private Language Models Benefit from Public Pre-training
- URL: http://arxiv.org/abs/2009.05886v2
- Date: Mon, 26 Oct 2020 16:04:43 GMT
- Title: Differentially Private Language Models Benefit from Public Pre-training
- Authors: Gavin Kerrigan and Dylan Slack and Jens Tuyls
- Abstract summary: We study the feasibility of learning a language model which is simultaneously high-quality and privacy preserving.
We find that DP fine-tuning boosts the performance of language models in the private domain.
- Score: 1.2676356746752895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language modeling is a keystone task in natural language processing. When
training a language model on sensitive information, differential privacy (DP)
allows us to quantify the degree to which our private data is protected.
However, training algorithms which enforce differential privacy often lead to
degradation in model quality. We study the feasibility of learning a language
model which is simultaneously high-quality and privacy preserving by tuning a
public base model on a private corpus. We find that DP fine-tuning boosts the
performance of language models in the private domain, making the training of
such models possible.
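The recipe described in the abstract (ordinary pre-training on public text, followed by DP-SGD fine-tuning on the private corpus) can be sketched in a few lines. The sketch below is a minimal illustration assuming PyTorch and the Opacus library; the TinyLM model, the random placeholder corpora, and all hyperparameters are illustrative stand-ins, not the paper's actual setup.

```python
# Minimal sketch: public pre-training, then DP-SGD fine-tuning with Opacus.
# Model, data, and hyperparameters are placeholders, not the paper's setup.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine
from opacus.layers import DPLSTM

VOCAB, SEQ_LEN = 1000, 32

class TinyLM(nn.Module):
    """A small LSTM language model standing in for the public base model."""
    def __init__(self, vocab=VOCAB, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # DPLSTM is Opacus's drop-in LSTM variant that supports per-sample gradients.
        self.lstm = DPLSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h)

def next_token_loss(model, batch):
    # Predict token t+1 from tokens up to t.
    logits = model(batch[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1)
    )

# Placeholder corpora: random token ids stand in for public and private text.
public = TensorDataset(torch.randint(0, VOCAB, (512, SEQ_LEN)))
private = TensorDataset(torch.randint(0, VOCAB, (512, SEQ_LEN)))

model = TinyLM()

# Step 1: ordinary (non-private) pre-training on the public corpus.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for (batch,) in DataLoader(public, batch_size=64, shuffle=True):
    opt.zero_grad()
    next_token_loss(model, batch).backward()
    opt.step()

# Step 2: DP-SGD fine-tuning on the private corpus via Opacus.
dp_opt = torch.optim.SGD(model.parameters(), lr=1e-2)
private_loader = DataLoader(private, batch_size=64)
engine = PrivacyEngine()
model, dp_opt, private_loader = engine.make_private(
    module=model,
    optimizer=dp_opt,
    data_loader=private_loader,
    noise_multiplier=1.0,  # illustrative; tune for the target epsilon
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)
for (batch,) in private_loader:
    dp_opt.zero_grad()
    next_token_loss(model, batch).backward()
    dp_opt.step()

print("epsilon at delta=1e-5:", engine.get_epsilon(delta=1e-5))
```

In practice the noise multiplier, clipping bound, and number of epochs are chosen to meet a target (epsilon, delta) budget, which the accountant reports for the fine-tuning phase only; the public pre-training phase consumes no privacy budget because it never touches private data.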
Related papers
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - Selective Pre-training for Private Fine-tuning [33.55628974557588]
We show that careful pre-training on a public dataset is crucial for training small language models with differential privacy.
Results demonstrate that smaller models, through careful pre-training and private fine-tuning, can match the performance of much larger models that do not have access to private data.
arXiv Detail & Related papers (2023-05-23T09:36:58Z) - Can Public Large Language Models Help Private Cross-device Federated Learning? [58.05449579773249]
We study (differentially) private federated learning (FL) of language models.
Public data has been used to improve privacy-utility trade-offs for both large and small language models.
We propose a novel distribution matching algorithm with theoretical grounding to sample public data close to the private data distribution; a simplified illustration of this general idea appears after this list.
arXiv Detail & Related papers (2023-05-20T07:55:58Z) - Differentially Private Language Models for Secure Data Sharing [19.918137395199224]
In this paper, we show how to train a generative language model in a differentially private manner and consequently sample data from it.
Using natural language prompts and a new prompt-mismatch loss, we are able to create highly accurate and fluent textual datasets.
We perform thorough experiments indicating that our synthetic datasets do not leak information from our original data and are of high language quality.
arXiv Detail & Related papers (2022-10-25T11:12:56Z) - Q-LSTM Language Model -- Decentralized Quantum Multilingual Pre-Trained
Language Model for Privacy Protection [6.0038761646405225]
Large-scale language models are trained on massive amounts of natural language data that might encode or reflect our private information.
Malicious agents can reverse-engineer the training data even if data sanitization and differential privacy algorithms were involved in the pre-training process.
We propose a decentralized training framework to address privacy concerns in training large-scale language models.
arXiv Detail & Related papers (2022-10-06T21:29:17Z) - Lifting the Curse of Multilinguality by Pre-training Modular
Transformers [72.46919537293068]
Multilingual pre-trained models suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages.
We introduce language-specific modules, which allow us to grow the total capacity of the model while keeping the total number of trainable parameters per language constant.
Our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting model usage to the set of pre-trained languages.
arXiv Detail & Related papers (2022-05-12T17:59:56Z) - Just Fine-tune Twice: Selective Differential Privacy for Large Language
Models [69.66654761324702]
We propose a simple yet effective just-fine-tune-twice privacy mechanism to achieve selective differential privacy (SDP) for large Transformer-based language models.
Experiments show that our models achieve strong performance while staying robust to the canary insertion attack.
arXiv Detail & Related papers (2022-04-15T22:36:55Z) - Selective Differential Privacy for Language Modeling [36.64464956102432]
Previous work has attempted to tackle the challenge of private language modeling by training RNN-based language models with differential privacy guarantees.
We propose a new privacy notion, selective differential privacy, to provide rigorous privacy guarantees on the sensitive portion of the data.
Experiments on both language modeling and dialog system building show that the proposed privacy-preserving mechanism achieves better utility.
arXiv Detail & Related papers (2021-08-30T01:11:10Z) - Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
Surprisingly, we find that pre-training on certain non-human language data yields GLUE performance close to that of models pre-trained on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
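As a side note on the distribution-matching idea mentioned for the federated-learning paper above, the snippet below is a deliberately simplified illustration of selecting public examples that lie close to the private data, not that paper's algorithm. The hashed bag-of-words encoder, the Euclidean distance to the private centroid, and the top-k selection rule are all placeholder choices, and a real system would also have to account for the privacy cost of touching the private data during selection.

```python
# Simplified illustration of public-data selection by closeness to the
# private data; NOT the referenced paper's distribution matching algorithm.
import numpy as np

def embed(texts, dim=64):
    """Placeholder encoder: hashed bag-of-words. A real system would use a
    sentence encoder, and any use of private text would be privacy-accounted."""
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-8)  # length-normalize

# Placeholder corpora.
public_texts = [f"public sentence number {i}" for i in range(1000)]
private_texts = [f"private sentence number {i}" for i in range(100)]

pub, priv = embed(public_texts), embed(private_texts)

# Rank public examples by distance to the private centroid and keep the
# closest ones as auxiliary data for the public phase of training.
centroid = priv.mean(axis=0)
dists = np.linalg.norm(pub - centroid, axis=1)
selected = [public_texts[i] for i in np.argsort(dists)[:200]]
print(f"kept {len(selected)} of {len(public_texts)} public examples")
```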