An Adversarial Perspective on Machine Unlearning for AI Safety
- URL: http://arxiv.org/abs/2409.18025v2
- Date: Sun, 6 Oct 2024 23:30:44 GMT
- Title: An Adversarial Perspective on Machine Unlearning for AI Safety
- Authors: Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando
- Abstract summary: This work challenges the fundamental differences between unlearning and traditional safety post-training.
We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully.
For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU.
- Score: 22.639683142004372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and making them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
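One of the adaptive attacks above removes a specific direction from the model's activations. As a rough illustration of what such directional ablation looks like, the sketch below projects a unit direction out of a toy layer's outputs via a forward hook. The direction here is random and therefore hypothetical; the paper's actual attack operates on the residual stream of an RMU-edited LLM and estimates the direction from data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Toy stand-in for one transformer block; a real attack would hook the residual
# stream of the unlearned LLM instead.
block = nn.Linear(d_model, d_model)

# Hypothetical "unlearning direction"; random here, but in practice it would be
# estimated from the model's activations (e.g. mean differences between
# hazardous and benign prompts).
direction = torch.randn(d_model)
direction = direction / direction.norm()

def ablate_direction(module, inputs, output):
    # Remove the component of every activation vector along `direction`:
    # h' = h - (h . d) d
    coeff = output @ direction                    # per-example dot product with d
    return output - coeff.unsqueeze(-1) * direction

handle = block.register_forward_hook(ablate_direction)

h = torch.randn(4, d_model)                       # fake batch of hidden states
out = block(h)
print("max |component along direction|:", (out @ direction).abs().max().item())
handle.remove()
```

On a real model the same hook would be attached to the relevant decoder layers, and the projection leaves all other components of the activation untouched, which is why it can be hard for an unlearning method to defend against.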
Related papers
- Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models [19.015202590038996]
We design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack unlearned models.
We propose Latent Adversarial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearning process.
We demonstrate that LAU improves unlearning effectiveness by over 53.5%, causes less than an 11.6% reduction in neighboring knowledge, and has almost no impact on the model's general capabilities.
arXiv Detail & Related papers (2024-08-20T09:36:04Z)
- UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI [50.61495097098296]
We revisit the paradigm in which unlearning is used for Large Language Models (LLMs).
We introduce the concept of ununlearning, in which unlearned knowledge is reintroduced in-context.
We argue that content filtering for impermissible knowledge will be required and that even exact unlearning schemes are not enough for effective content regulation.
arXiv Detail & Related papers (2024-06-27T10:24:35Z)
- Improving Alignment and Robustness with Circuit Breakers [40.4558948850276]
We present an approach that uses "circuit breakers" to interrupt models as they respond with harmful outputs.
As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs.
We extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack.
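As a rough illustration of representation-level control, the snippet below sketches a rerouting-style objective on placeholder hidden states: it penalizes remaining similarity between the trained model's representations of harmful prompts and a frozen reference model's, while keeping benign representations close. This is a simplified stand-in with made-up tensors and coefficients, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, batch = 16, 8

# Placeholder hidden states; in a real setup these would come from a frozen
# reference copy of the model and from the model being trained, on harmful and
# benign prompts respectively.
h_harm_frozen = torch.randn(batch, d_model)
h_harm_cur = h_harm_frozen + 0.1 * torch.randn(batch, d_model)
h_benign_frozen = torch.randn(batch, d_model)
h_benign_cur = h_benign_frozen + 0.1 * torch.randn(batch, d_model)

# "Reroute" term: penalize any remaining alignment between the trained model's
# harmful-prompt representations and the frozen model's.
reroute = F.relu(F.cosine_similarity(h_harm_cur, h_harm_frozen, dim=-1)).mean()

# Retain term: keep representations of benign prompts close to the frozen model's.
retain = (h_benign_cur - h_benign_frozen).norm(dim=-1).mean()

loss = 1.0 * reroute + 1.0 * retain      # coefficients here are arbitrary
print(f"reroute={reroute.item():.3f} retain={retain.item():.3f} loss={loss.item():.3f}")
```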
arXiv Detail & Related papers (2024-06-06T17:57:04Z)
- Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models [64.5204594279587]
A model that prioritizes safety will cause users to feel less engaged and assisted, while prioritizing helpfulness can potentially cause harm.
We propose to balance safety and helpfulness in diverse use cases by controlling both attributes in large language models.
arXiv Detail & Related papers (2024-04-01T17:59:06Z)
- Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning [49.242828934501986]
Multimodal contrastive learning has emerged as a powerful paradigm for building high-quality features.
However, backdoor attacks can subtly embed malicious behaviors within the model during training.
We introduce an innovative token-based localized forgetting training regime.
arXiv Detail & Related papers (2024-03-24T18:33:15Z)
- Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
Existing alignment methods usually direct LLMs toward favorable outcomes by utilizing human-annotated, flawless instruction-response pairs.
This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
arXiv Detail & Related papers (2023-10-16T14:59:10Z)
- Inspect, Understand, Overcome: A Survey of Practical Methods for AI Safety [54.478842696269304]
The use of deep neural networks (DNNs) in safety-critical applications is challenging due to numerous model-inherent shortcomings.
In recent years, a zoo of state-of-the-art techniques aiming to address these safety concerns has emerged.
Our paper addresses both machine learning experts and safety engineers.
arXiv Detail & Related papers (2021-04-29T09:54:54Z)
- Adversarial Training is Not Ready for Robot Learning [55.493354071227174]
Adversarial training is an effective method to train deep learning models that are resilient to norm-bounded perturbations.
We show theoretically and experimentally that neural controllers obtained via adversarial training are subject to three types of defects.
Our results suggest that adversarial training is not yet ready for robot learning.
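For context, adversarial training of this kind alternates an inner maximization, which searches for a norm-bounded perturbation that increases the loss (e.g. projected gradient descent in an L-infinity ball), with an ordinary training step on the perturbed inputs. The sketch below shows that loop on a toy classifier; it is a generic PGD-style recipe under assumed hyperparameters, not the robot-learning controllers studied in the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
eps, alpha, pgd_steps = 0.1, 0.02, 5     # L-inf budget, inner step size, inner steps

x = torch.randn(64, 10)                  # toy inputs
y = (x.sum(dim=1) > 0).long()            # toy labels

for _ in range(100):                     # outer training loop
    # Inner maximization: projected gradient descent to find a perturbation
    # inside the eps-ball that increases the loss.
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(pgd_steps):
        adv_loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(adv_loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)

    # Outer minimization: ordinary gradient step on the perturbed inputs.
    opt.zero_grad()
    loss_fn(model(x + delta.detach()), y).backward()
    opt.step()

print("loss on adversarial inputs:", loss_fn(model(x + delta.detach()), y).item())
```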
arXiv Detail & Related papers (2021-03-15T07:51:31Z)
- Modeling Penetration Testing with Reinforcement Learning Using Capture-the-Flag Challenges: Trade-offs between Model-free Learning and A Priori Knowledge [0.0]
Penetration testing is a security exercise aimed at assessing the security of a system by simulating attacks against it.
This paper focuses on simplified penetration testing problems expressed in the form of capture the flag hacking challenges.
We show how this challenge may be eased by relying on different forms of prior knowledge that may be provided to the agent.
arXiv Detail & Related papers (2020-05-26T11:23:10Z)
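To make the capture-the-flag framing concrete, the toy sketch below casts a CTF episode as a tiny Markov decision process (probe actions advance reconnaissance, a well-timed exploit captures the flag) and solves it with tabular Q-learning. The environment, rewards, and hyperparameters are invented for illustration and are unrelated to the paper's benchmarks.

```python
import random

random.seed(0)

# Hypothetical toy CTF: states 0..3 are reconnaissance stages, state 4 means "flag captured".
# Action 0 = probe (advances one stage), action 1 = exploit (only succeeds at the last stage).
N_STATES, N_ACTIONS, FLAG = 5, 2, 4
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, eps = 0.5, 0.9, 0.2        # learning rate, discount, exploration rate

def step(state, action):
    if action == 0 and state < FLAG - 1:
        return state + 1, -1.0, False     # useful probe: advance at a small cost
    if action == 1 and state == FLAG - 1:
        return FLAG, 10.0, True           # successful exploit: flag captured
    return state, -1.0, False             # wasted action

for _ in range(500):                      # training episodes
    s, done = 0, False
    while not done:
        if random.random() < eps:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda i: Q[s][i])
        s2, r, done = step(s, a)
        target = r + gamma * max(Q[s2]) * (not done)
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

print("greedy policy per stage:", [max(range(N_ACTIONS), key=lambda i: Q[s][i]) for s in range(FLAG)])
```

Prior knowledge of the kind the paper discusses would correspond to starting the agent with a structured or partially filled value function rather than the all-zero table used here.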
This list is automatically generated from the titles and abstracts of the papers on this site.