Controlling the Extraction of Memorized Data from Large Language Models
via Prompt-Tuning
- URL: http://arxiv.org/abs/2305.11759v1
- Date: Fri, 19 May 2023 15:45:29 GMT
- Title: Controlling the Extraction of Memorized Data from Large Language Models
via Prompt-Tuning
- Authors: Mustafa Safa Ozdayi and Charith Peris and Jack FitzGerald and
Christophe Dupuy and Jimit Majmudar and Haidar Khan and Rahil Parikh and
Rahul Gupta
- Abstract summary: Large Language Models (LLMs) are known to memorize significant portions of their training data.
We present a novel approach which uses prompt-tuning to control the extraction rates of memorized content in LLMs.
- Score: 14.228909822681373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are known to memorize significant portions of
their training data. Parts of this memorized content have been shown to be
extractable by simply querying the model, which poses a privacy risk. We
present a novel approach which uses prompt-tuning to control the extraction
rates of memorized content in LLMs. We present two prompt training strategies
to increase and decrease extraction rates, which correspond to an attack and a
defense, respectively. We demonstrate the effectiveness of our techniques by
using models from the GPT-Neo family on a public benchmark. For the 1.3B
parameter GPT-Neo model, our attack yields a 9.3 percentage point increase in
extraction rate compared to our baseline. Our defense can be tuned to achieve
different privacy-utility trade-offs by a user-specified hyperparameter. We
achieve an extraction rate reduction of up to 97.7% relative to our baseline,
with a perplexity increase of 16.9%.
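The mechanism behind both the attack and the defense is standard soft-prompt tuning: the model weights stay frozen and only a short sequence of continuous prompt embeddings is optimized against known prefix-suffix pairs. The following is a minimal sketch of that setup, assuming GPT-Neo via Hugging Face transformers, a benchmark of memorized prefix-suffix pairs, and a simple sign-flipped loss for the defense; the paper's exact objectives, hyperparameters, and evaluation protocol differ in detail.

# Minimal sketch of soft-prompt tuning to raise (attack) or lower (defense)
# extraction rates. Assumptions not taken from the abstract: GPT-Neo via
# Hugging Face transformers, a set of (prefix, suffix) pairs known to be
# memorized, and a simple sign-flipped loss for the defense.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-1.3B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)  # base model stays frozen; only the prompt trains

n_prompt_tokens = 20
soft_prompt = torch.nn.Parameter(
    torch.randn(n_prompt_tokens, model.config.hidden_size) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def step(prefix: str, suffix: str, mode: str = "attack") -> float:
    """One update on a single prefix-suffix pair."""
    ids = tok(prefix + suffix, return_tensors="pt").input_ids
    n_suffix = len(tok(suffix).input_ids)  # approximate; BPE may merge at the boundary
    tok_emb = model.get_input_embeddings()(ids)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), tok_emb], dim=1)
    labels = torch.full(inputs_embeds.shape[:2], -100)  # -100 = ignored position
    labels[0, -n_suffix:] = ids[0, -n_suffix:]          # score only the suffix
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss = loss if mode == "attack" else -loss  # attack lowers suffix loss, defense raises it
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

In the paper, the defense additionally exposes a user-specified hyperparameter that trades extraction rate against perplexity; in a sketch like this, that would correspond to weighting or thresholding the flipped loss rather than negating it outright.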
Related papers
- ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data [95.69966871257381]
Language models (LMs) can memorize and reproduce segments verbatim even in non-adversarial settings.
We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation.
We develop a variant of ParaPO that uses system prompts to control regurgitation behavior.
arXiv Detail & Related papers (2025-04-20T01:59:46Z) - Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in Pre-trained Vision-Language Models [42.81731204702258]
Class-wise Backdoor Prompt Tuning (CBPT) is an efficient and effective method that operates on the text prompts to indirectly purify poisoned Vision-Language Models (VLMs).
CBPT significantly mitigates backdoor threats while preserving model utility, e.g., an average Clean Accuracy (CA) of 58.86% and an Attack Success Rate (ASR) of 0.39% across seven mainstream backdoor attacks.
arXiv Detail & Related papers (2025-02-26T16:25:15Z) - Pseudo-Probability Unlearning: Towards Efficient and Privacy-Preserving Machine Unlearning [59.29849532966454]
We propose Pseudo-Probability Unlearning (PPU), a novel method that enables models to forget data in a privacy-preserving manner.
Our method achieves over 20% improvements in forgetting error compared to the state-of-the-art.
arXiv Detail & Related papers (2024-11-04T21:27:06Z) - PII-Compass: Guiding LLM training data extraction prompts towards the target PII via grounding [8.98944128441731]
We show that it is possible to improve the extractability of personal identifiable information (PII) by over ten-fold by grounding the manually constructed extraction prompt with in-domain data.
Our approach achieves PII phone number extraction rates of 0.92%, 3.9%, and 6.86% with 1, 128, and 2308 queries, respectively, i.e., the phone number of 1 person in 15 is extractable.
arXiv Detail & Related papers (2024-07-03T09:20:04Z) - Beyond Slow Signs in High-fidelity Model Extraction [18.330719989672442]
Deep neural networks, costly to train and rich in intellectual property value, are increasingly threatened by model extraction attacks.
Previous attacks have succeeded in reverse-engineering model parameters up to a precision of float64 for models trained on random data with at most three hidden layers.
We introduce a unified optimisation that integrates previous methods and reveal that computational tools can significantly influence performance.
arXiv Detail & Related papers (2024-06-14T13:24:07Z) - Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs [61.04246774006429]
We introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent.
We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements.
Our findings show that instruction-tuned models can expose pre-training data as much as, if not more than, their base models, and that using instructions proposed by other LLMs can open a new avenue of automated attacks.
arXiv Detail & Related papers (2024-03-05T19:32:01Z) - Scalable Extraction of Training Data from (Production) Language Models [93.7746567808049]
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset.
We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT.
arXiv Detail & Related papers (2023-11-28T18:47:03Z) - Locally Differentially Private Document Generation Using Zero Shot
Prompting [61.20953109732442]
We propose a locally differentially private mechanism called DP-Prompt to counter author de-anonymization attacks.
When DP-Prompt is used with a powerful language model like ChatGPT (gpt-3.5), we observe a notable reduction in the success rate of de-anonymization attacks.
arXiv Detail & Related papers (2023-10-24T18:25:13Z) - Model Leeching: An Extraction Attack Targeting LLMs [4.533013952442819]
Model Leeching is a novel extraction attack targeting Large Language Models (LLMs).
We demonstrate the effectiveness of our attack by extracting task capability from ChatGPT-3.5-Turbo, achieving 73% Exact Match (EM) similarity, and SQuAD EM and F1 accuracy scores of 75% and 87%, respectively, for only $50 in API cost.
arXiv Detail & Related papers (2023-09-19T11:45:29Z) - Targeted Attack on GPT-Neo for the SATML Language Model Data Extraction
Challenge [4.438873396405334]
We apply a targeted data extraction attack to the SATML2023 Language Model Training Data Extraction Challenge.
We maximise the recall of the model and are able to extract the suffix for 69% of the samples.
Our approach reaches a score of 0.405 recall at a 10% false positive rate, which is an improvement of 34% over the baseline of 0.301.
arXiv Detail & Related papers (2023-02-13T18:00:44Z) - DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with
Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We show that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z) - How Does Data Augmentation Affect Privacy in Machine Learning? [94.52721115660626]
We propose new membership inference (MI) attacks that utilize the information of augmented data.
We establish the optimal membership inference when the model is trained with augmented data.
arXiv Detail & Related papers (2020-07-21T02:21:10Z)
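As a generic illustration of the last entry's idea (not the authors' construction), an augmentation-aware membership inference attack can score a candidate sample by aggregating the model's loss over several augmented copies and thresholding the result; the augmentation set, aggregation rule, and threshold below are assumptions.

# Generic sketch of augmentation-aware membership inference for a classifier:
# if a model was trained on augmented copies of a sample, its loss tends to be
# low on those copies as well. Augmentations, aggregation, and the threshold
# are illustrative assumptions.
import torch
import torch.nn.functional as F

def membership_score(model, x, y, augmentations):
    """Average cross-entropy over a sample's augmented copies (negated)."""
    losses = []
    with torch.no_grad():
        for aug in augmentations:  # e.g. [lambda t: t, horizontal_flip, ...]
            logits = model(aug(x).unsqueeze(0))
            losses.append(F.cross_entropy(logits, y.unsqueeze(0)).item())
    return -sum(losses) / len(losses)  # higher score = more likely a training member

def is_member(score: float, threshold: float) -> bool:
    # The threshold is calibrated on known non-member data in practice.
    return score > threshold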