An Adversarial Perspective on Machine Unlearning for AI Safety
- URL: http://arxiv.org/abs/2409.18025v2
- Date: Sun, 6 Oct 2024 23:30:44 GMT
- Title: An Adversarial Perspective on Machine Unlearning for AI Safety
- Authors: Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando
- Abstract summary: This work challenges the fundamental differences between unlearning and traditional safety post-training.
We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully.
For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU.
- Score: 22.639683142004372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
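The abstract mentions "removing specific directions in the activation space" without giving details. One common way to realise that idea is to project an activation vector onto the orthogonal complement of a chosen direction. The sketch below illustrates only that generic projection; it is not the authors' implementation, and the function name `ablate_direction` and the toy vectors are hypothetical.

```python
import math


def ablate_direction(activation, direction):
    """Return `activation` with its component along `direction` removed.

    Assumption: "removing a direction" is modelled as an orthogonal
    projection, a' = a - (a . u) u, where u is the unit vector of `direction`.
    """
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    # Subtract the projected component from every coordinate.
    return [a - coeff * u for a, u in zip(activation, unit)]


# Toy example: remove the second coordinate axis from a 3-d activation.
activation = [3.0, 4.0, 1.0]
direction = [0.0, 2.0, 0.0]
ablated = ablate_direction(activation, direction)
```

After the projection, the dot product of `ablated` with `direction` is zero, so nothing of the activation remains along that direction, while the other coordinates are unchanged.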
Related papers
- Open Problems in Machine Unlearning for AI Safety [61.43515658834902]
Machine unlearning -- the ability to selectively forget or suppress specific types of knowledge -- has shown promise for privacy and data removal tasks.
In this paper, we identify key limitations that prevent unlearning from serving as a comprehensive solution for AI safety.
arXiv Detail & Related papers (2025-01-09T03:59:10Z) - OpenAI o1 System Card [274.83891368890977]
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought.
This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
arXiv Detail & Related papers (2024-12-21T18:04:31Z) - Mitigating Unsafe Feedback with Learning Constraints [26.872318173182414]
We show that safety-aligned Large Language Models can explore unsafe action spaces through generating harmful text.
We adapt a number of both "implicit" and "explicit" harmful fine-tuning defences to evaluate whether they are effective as learning constraints.
arXiv Detail & Related papers (2024-09-19T17:10:34Z) - Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models [19.015202590038996]
We design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack unlearned models.
We propose Latent Adversarial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearning process.
We demonstrate that LAU improves unlearning effectiveness by over 53.5%, causes less than an 11.6% reduction in neighboring knowledge, and has almost no impact on the model's general capabilities.
arXiv Detail & Related papers (2024-08-20T09:36:04Z) - UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI [50.61495097098296]
We revisit the paradigm in which unlearning is used for Large Language Models (LLMs).
We introduce a concept of ununlearning, where unlearned knowledge gets reintroduced in-context.
We argue that content filtering for impermissible knowledge will be required and even exact unlearning schemes are not enough for effective content regulation.
arXiv Detail & Related papers (2024-06-27T10:24:35Z) - Improving Alignment and Robustness with Circuit Breakers [40.4558948850276]
We present an approach that uses "circuit breakers" to interrupt models as they respond with harmful outputs.
As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs.
We extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack.
arXiv Detail & Related papers (2024-06-06T17:57:04Z) - Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models [64.5204594279587]
A model that prioritizes safety will cause users to feel less engaged and assisted while prioritizing helpfulness will potentially cause harm.
We propose to balance safety and helpfulness in diverse use cases by controlling both attributes in large language models.
arXiv Detail & Related papers (2024-04-01T17:59:06Z) - Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake
Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
Existing alignment methods usually direct LLMs toward the favorable outcomes by utilizing human-annotated, flawless instruction-response pairs.
This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
arXiv Detail & Related papers (2023-10-16T14:59:10Z) - Inspect, Understand, Overcome: A Survey of Practical Methods for AI
Safety [54.478842696269304]
The use of deep neural networks (DNNs) in safety-critical applications is challenging due to numerous model-inherent shortcomings.
In recent years, a zoo of state-of-the-art techniques aiming to address these safety concerns has emerged.
Our paper addresses both machine learning experts and safety engineers.
arXiv Detail & Related papers (2021-04-29T09:54:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.