Jogging the Memory of Unlearned LLMs Through Targeted Relearning Attacks
- URL: http://arxiv.org/abs/2406.13356v3
- Date: Tue, 08 Oct 2024 08:35:13 GMT
- Title: Jogging the Memory of Unlearned LLMs Through Targeted Relearning Attacks
- Authors: Shengyuan Hu, Yiwei Fu, Zhiwei Steven Wu, Virginia Smith
- Abstract summary: We show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of targeted relearning attacks.
With access to only a small and potentially loosely related set of data, we find that we can "jog" the memory of unlearned models to reverse the effects of unlearning.
- Abstract: Machine unlearning is a promising approach to mitigate undesirable memorization of training data in LLMs. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of targeted relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can "jog" the memory of unlearned models to reverse the effects of unlearning. For example, we show that relearning on public medical articles can lead an unlearned LLM to output harmful knowledge about bioweapons, and relearning general wiki information about the book series Harry Potter can force the model to output verbatim memorized text. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study.
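The unlearning-relearning pipeline described in the abstract can be caricatured with a toy counting model. This is a sketch for intuition only: the `BigramLM` class, its count-based "unlearning", and the example strings are all invented here; the paper's attack fine-tunes real unlearned LLMs on loosely related data.

```python
from collections import defaultdict

class BigramLM:
    """Toy bigram 'language model' (a hypothetical stand-in for an LLM)."""

    def __init__(self):
        # counts[a][b] = how often word b followed word a in training data
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        words = text.split()
        for a, b in zip(words, words[1:]):
            self.counts[a][b] += 1

    def unlearn(self, text):
        # Caricature of approximate unlearning: subtract the forget set's counts.
        words = text.split()
        for a, b in zip(words, words[1:]):
            self.counts[a][b] = max(0, self.counts[a][b] - 1)

    def predict(self, word):
        nxt = {w: c for w, c in self.counts[word].items() if c > 0}
        return max(nxt, key=nxt.get) if nxt else None

lm = BigramLM()
# 1. Pretrain on data containing a "sensitive" association.
lm.train("the horcrux is hidden deep underground")
assert lm.predict("horcrux") == "is"

# 2. Unlearn the sensitive phrase: the completion is gone.
lm.unlearn("horcrux is hidden")
assert lm.predict("horcrux") is None

# 3. Relearn on loosely related public text that overlaps the forgotten span.
lm.train("fans note the horcrux is a fictional object")
# The "forgotten" completion resurfaces.
assert lm.predict("horcrux") == "is"
```

Step 3 is the key point the paper makes: the attacker does not need the original forget set, only benign text that overlaps it, to jog the model's memory.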
Related papers
- Extracting Unlearned Information from LLMs with Activation Steering
Unlearning has emerged as a solution to remove sensitive knowledge from models after training.
We propose activation steering as a method for exact information retrieval from unlearned models.
Our results demonstrate that exact information retrieval from unlearned models is possible, highlighting a severe vulnerability of current unlearning techniques.
arXiv Detail & Related papers (2024-11-04T21:42:56Z)
- Does your LLM truly unlearn? An embarrassingly simple approach to recover unlearned knowledge
We show that applying quantization to models that have undergone unlearning can restore the "forgotten" information.
We find that for unlearning methods with utility constraints, the unlearned model retains an average of 21% of the intended forgotten knowledge in full precision.
arXiv Detail & Related papers (2024-10-21T19:28:37Z)
- A Closer Look at Machine Unlearning for Large Language Models
Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns.
We discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches.
arXiv Detail & Related papers (2024-10-10T16:56:05Z)
- Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective
We introduce a new task of LLM targeted unlearning, where given an unlearning target and some unlearning documents, we aim to unlearn only the information about the target.
We argue that a successful unlearning should satisfy criteria such as not outputting gibberish, not fabricating facts about the unlearning target, and not releasing factual information under jailbreak attacks.
This framework justifies and extends WHP, deriving a simple unlearning algorithm that includes WHP as a special case.
arXiv Detail & Related papers (2024-07-24T04:39:24Z)
- UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
We revisit the paradigm in which unlearning is used for Large Language Models (LLMs).
We introduce a concept of ununlearning, where unlearned knowledge gets reintroduced in-context.
We argue that content filtering for impermissible knowledge will be required and even exact unlearning schemes are not enough for effective content regulation.
arXiv Detail & Related papers (2024-06-27T10:24:35Z)
- Textual Unlearning Gives a False Sense of Unlearning
Language models (LMs) are susceptible to "memorizing" training data, including a large amount of private or copyright-protected content.
We propose the Textual Unlearning Leakage Attack (TULA), where an adversary can infer information about unlearned data only by accessing the models before and after unlearning.
Our work is the first to reveal that machine unlearning in LMs can inversely create greater knowledge risks and inspire the development of more secure unlearning mechanisms.
arXiv Detail & Related papers (2024-06-19T08:51:54Z)
- Offset Unlearning for Large Language Models
Unlearning has emerged as a potential remedy for Large Language Models affected by problematic training data.
We propose δ-unlearning, an offset unlearning framework for black-box LLMs.
Experiments demonstrate that δ-unlearning can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks.
arXiv Detail & Related papers (2024-04-17T03:39:51Z)
- Rethinking Machine Unlearning for Large Language Models
We explore machine unlearning in the domain of large language models (LLMs).
This initiative aims to eliminate undesirable data influence (e.g., sensitive or illegal information) and the associated model capabilities.
arXiv Detail & Related papers (2024-02-13T20:51:58Z)
- Unlearn What You Want to Forget: Efficient Unlearning for LLMs
Large language models (LLMs) have achieved significant progress by pre-training on, and memorizing, a wide range of textual data.
This process might suffer from privacy issues and violations of data protection regulations.
We propose an unlearning framework that efficiently updates LLMs without retraining the whole model after data removals.
arXiv Detail & Related papers (2023-10-31T03:35:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.