An Adversarial Perspective on Machine Unlearning for AI Safety
- URL: http://arxiv.org/abs/2409.18025v3
- Date: Fri, 08 Nov 2024 22:06:41 GMT
- Title: An Adversarial Perspective on Machine Unlearning for AI Safety
- Authors: Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando
- Abstract summary: This work challenges the fundamental differences between unlearning and traditional safety post-training.
We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully.
For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU.
- Score: 22.639683142004372
- Abstract: Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
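To make the activation-space attack concrete, here is a minimal toy sketch (hypothetical model and names, not the authors' code): estimate a direction that separates hazardous from benign activations, then project it out of the hidden states with a forward hook.

```python
# Minimal sketch of directional ablation in activation space.
# A toy model and random inputs stand in for a real LLM's hidden states.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

with torch.no_grad():
    # Hidden states on "hazardous" vs. "benign" inputs (random placeholders).
    h_hazard = model(torch.randn(32, 64))
    h_benign = model(torch.randn(32, 64))
    # Difference-of-means direction, normalized to unit length.
    v = h_hazard.mean(0) - h_benign.mean(0)
    v = v / v.norm()

def ablate_direction(module, inputs, output):
    # Project out the component along v: h <- h - (h . v) v
    return output - (output @ v).unsqueeze(-1) * v

# Edit the last layer's activations at inference time.
handle = model[2].register_forward_hook(ablate_direction)
with torch.no_grad():
    h_edited = model(torch.randn(4, 64))
print("max |h . v| after ablation:", (h_edited @ v).abs().max().item())
handle.remove()
```

The same projection can also be folded into the weight matrices once, so no runtime hook is needed.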
Related papers
- Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models [19.015202590038996]
We design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack unlearned models.
We propose Latent Adversarial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearning process.
We demonstrate that LAU improves unlearning effectiveness by over $53.5\%$, causes less than an $11.6\%$ reduction in neighboring knowledge, and has almost no impact on the model's general capabilities.
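The latent-adversarial step can be sketched roughly as follows (toy model and hypothetical setup, not the authors' code): perturb hidden states to resurface the forgotten mapping, then train the model to stay unlearned even under that perturbation.

```python
# Minimal sketch of latent adversarial unlearning on a toy classifier.
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Linear(32, 64)    # toy latent encoder
head = nn.Linear(64, 2)        # predicts forget-set labels
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 32)          # forget-set inputs (random placeholders)
y = torch.randint(0, 2, (16,))   # forget-set labels
eps = 0.1                        # latent perturbation budget

for step in range(100):
    # Inner step: a one-step latent attack that tries to recover
    # the forgotten input-label mapping.
    h = encoder(x).detach()
    delta = torch.zeros_like(h, requires_grad=True)
    loss_fn(head(h + delta), y).backward()
    delta = -eps * delta.grad.detach().sign()  # descend the attacker's loss
    # Outer step: gradient ascent on the forget loss under the adversarial
    # perturbation (a retain-set loss would be added in practice).
    opt.zero_grad()
    (-loss_fn(head(encoder(x) + delta), y)).backward()
    opt.step()
```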
arXiv Detail & Related papers (2024-08-20T09:36:04Z)
- UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI [50.61495097098296]
We revisit the paradigm in which unlearning is used for Large Language Models (LLMs).
We introduce the concept of ununlearning, where unlearned knowledge gets reintroduced in-context.
We argue that content filtering for impermissible knowledge will be required and that even exact unlearning schemes are not enough for effective content regulation.
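The in-context reintroduction is easy to picture (placeholder strings, purely illustrative): even a model that has genuinely unlearned a fact regains the capability once the fact appears in its prompt.

```python
# Toy illustration of "ununlearning": supplying unlearned knowledge
# in-context reintroduces the capability. Placeholder strings only.
unlearned_fact = "Compound X is made by combining A and B."  # hypothetical
prompt = (
    f"Reference: {unlearned_fact}\n"
    "Question: How is compound X made?\n"
    "Answer:"
)
print(prompt)  # any capable model can now answer from context alone
```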
arXiv Detail & Related papers (2024-06-27T10:24:35Z)
- Improving Alignment and Robustness with Circuit Breakers [40.4558948850276]
We present an approach that uses "circuit breakers" to interrupt models as they respond with harmful outputs.
As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs.
We extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack.
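A minimal sketch of the representation-rerouting idea (the toy model, frozen reference copy, and loss are assumptions, not the authors' implementation): drive harmful-input representations away from their original values while pinning benign ones in place.

```python
# Minimal sketch of representation-level "circuit breaking".
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(32, 64)
frozen = nn.Linear(32, 64)                  # frozen copy of the original model
frozen.load_state_dict(model.state_dict())
for p in frozen.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_harm = torch.randn(16, 32)    # harmful-prompt inputs (random placeholders)
x_benign = torch.randn(16, 32)  # benign-prompt inputs

for step in range(200):
    # Reroute: push cosine similarity with the original harmful
    # representations toward zero, making them unusable downstream.
    reroute = F.relu(F.cosine_similarity(model(x_harm), frozen(x_harm), dim=-1)).mean()
    # Retain: keep benign representations close to the original model's.
    retain = F.mse_loss(model(x_benign), frozen(x_benign))
    loss = reroute + retain
    opt.zero_grad()
    loss.backward()
    opt.step()
```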
arXiv Detail & Related papers (2024-06-06T17:57:04Z)
- Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models [64.5204594279587]
A model that prioritizes safety can leave users feeling less engaged and assisted, while one that prioritizes helpfulness can potentially cause harm.
We propose to balance safety and helpfulness in diverse use cases by controlling both attributes in large language models.
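One plausible way to expose such a control knob (the tag scheme below is a hypothetical sketch, not this paper's method) is to condition finetuning data on explicit attribute tags that can then be set at inference time.

```python
# Hypothetical attribute-conditioned training examples: the model learns to
# trade off safety vs. helpfulness according to control tags in the prompt.
def make_example(prompt: str, response: str, safety: int, helpfulness: int) -> str:
    # The tag format is arbitrary; any consistent scheme works for conditioning.
    return f"<safety={safety}><helpfulness={helpfulness}> {prompt}\n{response}"

print(make_example("How do I pick a lock?", "I can't help with that.", safety=5, helpfulness=1))
print(make_example("How do pin tumbler locks work?", "They use spring-loaded pins...", safety=3, helpfulness=5))
```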
arXiv Detail & Related papers (2024-04-01T17:59:06Z)
- Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning [49.242828934501986]
Multimodal contrastive learning has emerged as a powerful paradigm for building high-quality features.
However, backdoor attacks can subtly embed malicious behaviors within the model during training.
We introduce an innovative token-based localized forgetting training regime.
arXiv Detail & Related papers (2024-03-24T18:33:15Z)
- Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
Existing alignment methods usually steer LLMs toward favorable outcomes using human-annotated, flawless instruction-response pairs.
This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
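A rough sketch of what such mistake-analysis training data could look like (the format and helper below are hypothetical, not the paper's):

```python
# Hypothetical mistake-analysis training example: pair a flawed response
# with an analysis of why it fails and a corrected response.
def mistake_analysis_example(instruction: str, flawed: str, analysis: str, improved: str) -> str:
    return (
        f"Instruction: {instruction}\n"
        f"Flawed response: {flawed}\n"
        f"Analysis: {analysis}\n"
        f"Improved response: {improved}"
    )

print(mistake_analysis_example(
    "Summarize the article neutrally.",
    "This terrible article claims...",
    "The response injects opinion, violating the neutrality requirement.",
    "The article reports that...",
))
```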
arXiv Detail & Related papers (2023-10-16T14:59:10Z)
- Inspect, Understand, Overcome: A Survey of Practical Methods for AI Safety [54.478842696269304]
The use of deep neural networks (DNNs) in safety-critical applications is challenging due to numerous model-inherent shortcomings.
In recent years, a zoo of state-of-the-art techniques aiming to address these safety concerns has emerged.
Our paper addresses both machine learning experts and safety engineers.
arXiv Detail & Related papers (2021-04-29T09:54:54Z)
- Adversarial Training is Not Ready for Robot Learning [55.493354071227174]
Adversarial training is an effective method to train deep learning models that are resilient to norm-bounded perturbations.
We show theoretically and experimentally that neural controllers obtained via adversarial training are subject to three types of defects.
Our results suggest that adversarial training is not yet ready for robot learning.
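For reference, the norm-bounded adversarial training the paper evaluates follows the standard PGD recipe, sketched here on a toy classifier (generic recipe, not the paper's robot-learning setup):

```python
# Standard PGD adversarial training on a toy classifier.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
eps, alpha, pgd_steps = 0.1, 0.02, 5   # L-inf budget, step size, attack steps

x = torch.randn(64, 10)                # toy data
y = torch.randint(0, 2, (64,))

for epoch in range(20):
    # Inner maximization: PGD searches the eps-ball for a perturbation
    # that maximizes the loss.
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(pgd_steps):
        loss_fn(model(x + delta), y).backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    # Outer minimization: update the model on the adversarial examples.
    opt.zero_grad()
    loss_fn(model(x + delta.detach()), y).backward()
    opt.step()
```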
arXiv Detail & Related papers (2021-03-15T07:51:31Z)
- Modeling Penetration Testing with Reinforcement Learning Using Capture-the-Flag Challenges: Trade-offs between Model-free Learning and A Priori Knowledge [0.0]
Penetration testing is a security exercise aimed at assessing the security of a system by simulating attacks against it.
This paper focuses on simplified penetration testing problems expressed in the form of capture the flag hacking challenges.
We show how these challenges may be eased by providing the agent with different forms of prior knowledge.
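As a toy illustration of the model-free end of that trade-off, a capture-the-flag chain can be cast as tabular Q-learning (the environment below is a hypothetical stand-in, not the paper's):

```python
# Toy CTF-style attack chain solved with tabular Q-learning.
import random

random.seed(0)
N_STATES, N_ACTIONS = 4, 3     # stages of compromise x exploit choices
GOAL = N_STATES - 1            # "flag captured"
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def step(state, action):
    # Hypothetical dynamics: the "right" exploit advances the attack chain,
    # anything else is a wasted probe.
    if action == state % N_ACTIONS:
        nxt = state + 1
        return nxt, (10.0 if nxt == GOAL else -1.0)
    return state, -1.0

for episode in range(500):
    s = 0
    while s != GOAL:
        if random.random() < epsilon:            # explore
            a = random.randrange(N_ACTIONS)
        else:                                    # exploit current estimate
            a = max(range(N_ACTIONS), key=lambda i: Q[s][i])
        s2, r = step(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print("learned exploit per stage:",
      [max(range(N_ACTIONS), key=lambda i: Q[s][i]) for s in range(GOAL)])
```

Prior knowledge, such as which exploits are plausible at each stage, would shrink the action set and speed up learning, which is the trade-off the paper studies.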
arXiv Detail & Related papers (2020-05-26T11:23:10Z)