Can Sensitive Information Be Deleted From LLMs? Objectives for Defending
Against Extraction Attacks
- URL: http://arxiv.org/abs/2309.17410v1
- Date: Fri, 29 Sep 2023 17:12:43 GMT
- Title: Can Sensitive Information Be Deleted From LLMs? Objectives for Defending
Against Extraction Attacks
- Authors: Vaidehi Patil, Peter Hase, Mohit Bansal
- Abstract summary: We propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights.
We study direct edits to model weights because this approach should guarantee that particular deleted information is never extracted by future prompt attacks.
We show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time.
- Score: 73.53327403684676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained language models sometimes possess knowledge that we do not wish
them to, including memorized personal information and knowledge that could be
used to harm people. They can also output toxic or harmful text. To mitigate
these safety and informational issues, we propose an attack-and-defense
framework for studying the task of deleting sensitive information directly from
model weights. We study direct edits to model weights because (1) this approach
should guarantee that particular deleted information is never extracted by
future prompt attacks, and (2) it should protect against whitebox attacks,
which is necessary for making claims about safety/privacy in a setting where
publicly available model weights could be used to elicit sensitive information.
Our threat model assumes that an attack succeeds if the answer to a sensitive
question is located among a set of B generated candidates, based on scenarios
where the information would be insecure if the answer is among B candidates.
Experimentally, we show that even state-of-the-art model editing methods such
as ROME struggle to truly delete factual information from models like GPT-J, as
our whitebox and blackbox attacks can recover "deleted" information from an
edited model 38% of the time. These attacks leverage two key observations: (1)
that traces of deleted information can be found in intermediate model hidden
states, and (2) that applying an editing method for one question may not delete
information across rephrased versions of the question. Finally, we provide new
defense methods that protect against some extraction attacks, but we do not
find a single universally effective defense method. Our results suggest that
truly deleting sensitive information is a tractable but difficult problem,
since even relatively low attack success rates have potentially severe societal
implications for real-world deployment of language models.
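To make the abstract's threat model concrete, the sketch below (a minimal illustration, not the paper's released attack code) shows how such candidate-set attacks might be evaluated: a blackbox attacker samples short completions from rephrased questions, a whitebox attacker reads additional candidates out of intermediate hidden states through the unembedding matrix (a logit-lens-style probe), and an attack counts as successful if the "deleted" answer appears among at most B candidates. The model name, sampling settings, and helper names are illustrative assumptions.

```python
# Minimal sketch of the candidate-set threat model described above; it is not
# the paper's released attack code. Model name, sampling settings, and helper
# names are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-j-6B"  # an edited ("deleted") checkpoint would be loaded here
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def blackbox_candidates(paraphrases, samples_per_prompt=5, max_new_tokens=8):
    """Blackbox attack: sample short completions from rephrased questions;
    the pooled completions form the candidate set."""
    candidates = []
    for prompt in paraphrases:
        inputs = tok(prompt, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            do_sample=True,
            top_p=0.9,
            num_return_sequences=samples_per_prompt,
            max_new_tokens=max_new_tokens,
            pad_token_id=tok.eos_token_id,
        )
        for seq in outputs:
            completion = tok.decode(seq[inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)
            candidates.append(completion.strip().lower())
    return candidates

def whitebox_candidates(prompt, k_per_layer=5):
    """Whitebox attack: project each layer's last-token hidden state through
    the unembedding matrix (a logit-lens-style probe) and keep the top-k
    tokens, exploiting traces of the deleted answer in intermediate states."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # embeddings + every layer
    unembed = model.get_output_embeddings().weight     # (vocab_size, hidden_size)
    candidates = []
    for h in hidden_states:
        normed = model.transformer.ln_f(h[0, -1])      # GPT-J's final layer norm
        top = torch.topk(normed @ unembed.T, k_per_layer).indices
        candidates.extend(tok.decode(int(t)).strip().lower() for t in top)
    return candidates

def attack_succeeds(candidates, deleted_answer, B):
    """The paper's success criterion: the deleted answer lies within a budget
    of at most B distinct candidates."""
    distinct = list(dict.fromkeys(candidates))  # dedupe, keep generation order
    return deleted_answer.strip().lower() in distinct[:B]
```

In this framing, B reflects how much manual checking or downstream filtering an attacker can afford, which is why the abstract notes that even relatively low attack success rates carry real deployment risk.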
Related papers
- Do Unlearning Methods Remove Information from Language Model Weights? [0.0]
Large Language Models' knowledge of how to perform cyber-security attacks, create bioweapons, and manipulate humans poses risks of misuse.
We propose an adversarial evaluation method to test for the removal of information from model weights.
We show that fine-tuning on still-accessible facts recovers 88% of the pre-unlearning accuracy when applied to models treated with current unlearning methods (a minimal sketch of this style of recovery evaluation follows the related-papers list).
arXiv Detail & Related papers (2024-10-11T14:06:58Z)
- REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space [35.61862064581971]
Large language models (LLMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data.
We propose REVS, a novel model editing method for unlearning sensitive information from LLMs.
arXiv Detail & Related papers (2024-06-13T17:02:32Z)
- Representation Noising: A Defence Mechanism Against Harmful Finetuning [28.451676139178687]
Releasing open-source large language models (LLMs) presents a dual-use risk since bad actors can easily fine-tune these models for harmful purposes.
We propose Representation Noising (RepNoise), a defence mechanism that operates even when attackers have access to the weights.
arXiv Detail & Related papers (2024-05-23T13:51:55Z)
- Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models [112.48136829374741]
In this paper, we unveil a new vulnerability: the privacy backdoor attack.
When a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model.
Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
arXiv Detail & Related papers (2024-04-01T16:50:54Z)
- Attention-Enhancing Backdoor Attacks Against BERT-based Models [54.070555070629105]
Investigating the strategies behind backdoor attacks helps in understanding models' vulnerabilities.
We propose a novel Trojan Attention Loss (TAL) which enhances the Trojan behavior by directly manipulating the attention patterns.
arXiv Detail & Related papers (2023-10-23T01:24:56Z)
- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models [102.63973600144308]
Open-source large language models can be easily subverted to generate harmful content.
Experiments across 8 models released by 5 different organizations demonstrate the effectiveness of the shadow alignment attack.
This study serves as a clarion call for a collective effort to overhaul and fortify the safety of open-source LLMs against malicious attackers.
arXiv Detail & Related papers (2023-10-04T16:39:31Z)
- FedDefender: Client-Side Attack-Tolerant Federated Learning [60.576073964874]
Federated learning enables learning from decentralized data sources without compromising privacy.
It is vulnerable to model poisoning attacks, where malicious clients interfere with the training process.
We propose a new defense mechanism that focuses on the client-side, called FedDefender, to help benign clients train robust local models.
arXiv Detail & Related papers (2023-07-18T08:00:41Z)
- MOVE: Effective and Harmless Ownership Verification via Embedded External Features [109.19238806106426]
We propose MOVE, an effective and harmless model ownership verification method, to defend against different types of model stealing simultaneously.
Ownership is verified by checking whether a suspicious model contains the knowledge of defender-specified external features.
In particular, we develop our MOVE method under both white-box and black-box settings to provide comprehensive model protection.
arXiv Detail & Related papers (2022-08-04T02:22:29Z)
- Deletion Inference, Reconstruction, and Compliance in Machine (Un)Learning [21.404426803200796]
Privacy attacks on machine learning models aim to identify the data that is used to train such models.
Many machine learning methods have recently been extended to support machine unlearning.
arXiv Detail & Related papers (2022-02-07T19:02:58Z)
- Amnesiac Machine Learning [15.680008735220785]
The recently enacted General Data Protection Regulation affects any data holder with data on European Union residents.
Models are vulnerable to information leakage attacks such as model inversion attacks.
We present two data removal methods, namely Unlearning and Amnesiac Unlearning, that enable model owners to protect themselves against such attacks while being compliant with regulations.
arXiv Detail & Related papers (2020-10-21T13:14:17Z)
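The first related paper above reports that fine-tuning on still-accessible facts recovers much of the supposedly removed knowledge. As a rough illustration of that style of evaluation (a minimal sketch under assumed checkpoint paths and data formats, not the cited paper's code), one could fine-tune an unlearned model on facts that were never targeted and re-measure accuracy on the facts that were supposedly erased:

```python
# Minimal sketch of a fine-tuning-based recovery check, in the spirit of
# "Do Unlearning Methods Remove Information from Language Model Weights?";
# it is not that paper's code. The checkpoint path, data format, and helper
# names are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "path/to/unlearned-checkpoint"  # hypothetical unlearned model directory
tok = AutoTokenizer.from_pretrained(CKPT)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(CKPT)

def answer_accuracy(model, qa_pairs, max_new_tokens=8):
    """Fraction of questions whose greedy completion contains the reference answer."""
    model.eval()
    hits = 0
    for question, answer in qa_pairs:
        ids = tok(question, return_tensors="pt")
        out = model.generate(**ids, do_sample=False, max_new_tokens=max_new_tokens,
                             pad_token_id=tok.eos_token_id)
        completion = tok.decode(out[0, ids["input_ids"].shape[1]:],
                                skip_special_tokens=True)
        hits += int(answer.lower() in completion.lower())
    return hits / len(qa_pairs)

def finetune_on_accessible_facts(model, accessible_texts, epochs=1, lr=1e-5):
    """Plain causal-LM fine-tuning on facts the unlearning never targeted."""
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for text in accessible_texts:
            batch = tok(text, return_tensors="pt", truncation=True, max_length=128)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

# removed_qa: (question, answer) pairs the unlearning method was meant to erase.
# accessible_texts: related facts that were never targeted for removal.
# before = answer_accuracy(model, removed_qa)
# finetune_on_accessible_facts(model, accessible_texts)
# after = answer_accuracy(model, removed_qa)  # large recovery suggests the info was hidden, not deleted
```

A large gap between the before and after accuracies would indicate that the unlearning hid the information rather than removing it from the weights.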