Teach LLMs to Phish: Stealing Private Information from Language Models
- URL: http://arxiv.org/abs/2403.00871v1
- Date: Fri, 1 Mar 2024 06:15:07 GMT
- Title: Teach LLMs to Phish: Stealing Private Information from Language Models
- Authors: Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang,
Yaoqing Yang, Prateek Mittal
- Abstract summary: We propose a new practical data extraction attack that we call "neural phishing"
Our attack assumes only that an adversary can insert as few as 10s of benign-appearing sentences into the training dataset using only vague priors on the structure of the user data.
- Score: 41.24348056248685
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: When large language models are trained on private data, it can be a
significant privacy risk for them to memorize and regurgitate sensitive
information. In this work, we propose a new practical data extraction attack
that we call "neural phishing". This attack enables an adversary to target and
extract sensitive or personally identifiable information (PII), e.g., credit
card numbers, from a model trained on user data with upwards of 10% attack
success rates, at times, as high as 50%. Our attack assumes only that an
adversary can insert as few as 10s of benign-appearing sentences into the
training dataset using only vague priors on the structure of the user data.
Related papers
- Persistent Pre-Training Poisoning of LLMs [71.53046642099142]
Our work evaluates for the first time whether language models can also be compromised during pre-training.
We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary.
Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to persist through post-training.
arXiv Detail & Related papers (2024-10-17T16:27:13Z) - Evaluating Large Language Model based Personal Information Extraction and Countermeasures [63.91918057570824]
Large language model (LLM) can be misused by attackers to accurately extract various personal information from personal profiles.
LLM outperforms conventional methods at such extraction.
prompt injection can mitigate such risk to a large extent and outperforms conventional countermeasures.
arXiv Detail & Related papers (2024-08-14T04:49:30Z) - FLTrojan: Privacy Leakage Attacks against Federated Language Models Through Selective Weight Tampering [2.2194815687410627]
We show how a malicious client can leak the privacy-sensitive data of some other users in FL even without any cooperation from the server.
Our best-performing method improves the membership inference recall by 29% and achieves up to 71% private data reconstruction.
arXiv Detail & Related papers (2023-10-24T19:50:01Z) - Defending Our Privacy With Backdoors [29.722113621868978]
We propose an easy yet effective defense based on backdoor attacks to remove private information from vision-language models.
Specifically, by strategically inserting backdoors into text encoders, we align the embeddings of sensitive phrases with those of neutral terms-"a person" instead of the person's actual name.
Our approach provides a new "dual-use" perspective on backdoor attacks and presents a promising avenue to enhance the privacy of individuals within models trained on uncurated web-scraped data.
arXiv Detail & Related papers (2023-10-12T13:33:04Z) - Can Sensitive Information Be Deleted From LLMs? Objectives for Defending
Against Extraction Attacks [73.53327403684676]
We propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights.
We study direct edits to model weights because this approach should guarantee that particular deleted information is never extracted by future prompt attacks.
We show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time.
arXiv Detail & Related papers (2023-09-29T17:12:43Z) - Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets [53.866927712193416]
We show that an adversary who can poison a training dataset can cause models trained on this dataset to leak private details belonging to other parties.
Our attacks are effective across membership inference, attribute inference, and data extraction.
Our results cast doubts on the relevance of cryptographic privacy guarantees in multiparty protocols for machine learning.
arXiv Detail & Related papers (2022-03-31T18:06:28Z) - Machine unlearning via GAN [2.406359246841227]
Machine learning models, especially deep models, may unintentionally remember information about their training data.
We present a GAN-based algorithm to delete data in deep models, which significantly improves deleting speed compared to retraining from scratch.
arXiv Detail & Related papers (2021-11-22T05:28:57Z) - Extracting Training Data from Large Language Models [78.3839333127544]
This paper demonstrates that an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.
We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data.
arXiv Detail & Related papers (2020-12-14T18:39:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.