Related papers: Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization

URL: http://arxiv.org/abs/2506.12484v3
Date: Mon, 30 Jun 2025 09:37:43 GMT
Title: Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization
Authors: Filip Sondej, Yushi Yang, Mikołaj Kniejski, Marcel Windys,
Abstract summary: Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning. Recent studies show that even specialized unlearning methods can be easily reversed. We introduce Disruption Masking, a technique in which we only allow updating weights where the signs of the unlearning gradient and the retaining gradient are the same.
Score: 0.562479170374811
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights, where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and also confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new state-of-the-art for robust unlearning.
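The sign-agreement rule described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `disruption_mask` and `masked_update`, the use of a global L2 norm for the normalization step, the order of normalizing and masking, and the plain gradient-descent update are all assumptions made for illustration. The abstract states only that updates are restricted to weights where the two gradients share a sign and that the unlearning gradients are normalized.

```python
import numpy as np


def disruption_mask(unlearn_grad, retain_grad):
    """Zero out unlearning-gradient entries whose sign differs from the retaining gradient.

    Hypothetical helper: keeps an entry only where both gradients have the same sign.
    """
    same_sign = np.sign(unlearn_grad) == np.sign(retain_grad)
    return np.where(same_sign, unlearn_grad, 0.0)


def masked_update(weights, unlearn_grad, retain_grad, lr=0.1, eps=1e-8):
    """Apply one masked, normalized unlearning step (illustrative only).

    Assumption: normalization divides by the global L2 norm of the unlearning
    gradient; the abstract does not specify the exact normalization.
    """
    normalized = unlearn_grad / (np.linalg.norm(unlearn_grad) + eps)
    masked = disruption_mask(normalized, retain_grad)
    return weights - lr * masked


if __name__ == "__main__":
    # Toy gradients: entries 0 and 3 agree in sign, entries 1 and 2 disagree.
    u = np.array([1.0, -2.0, 3.0, -4.0])
    r = np.array([0.5, 1.0, -1.0, -0.1])
    w = np.zeros(4)
    print(disruption_mask(u, r))
    print(masked_update(w, u, r, lr=1.0))
```

In this toy case only the first and last weights move, because those are the only positions where the two gradients agree in sign.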

Related papers

MLLMEraser: Achieving Test-Time Unlearning in Multimodal Large Language Models through Activation Steering [36.80441487363007]
MLLMEraser is an input-aware, training-free framework for test-time unlearning. We construct a multimodal erasure direction by contrasting adversarially perturbed, knowledge-recall image-text pairs. Experiments on LLaVA-1.5 and Qwen-2.5-VL demonstrate that MLLMEraser consistently outperforms state-of-the-art MLLM unlearning baselines.
arXiv Detail & Related papers (2025-10-05T14:20:17Z)
Reliable Unlearning Harmful Information in LLMs with Metamorphosis Representation Projection [17.369869625390894]
We propose a Metamorphosis Representation Projection (MRP) approach to machine unlearning. By implementing projective transformations in the hidden state space of specific network layers, our method effectively eliminates harmful information while preserving useful knowledge. Experimental results demonstrate that our approach enables effective continuous unlearning and successfully defends against relearning attacks.
arXiv Detail & Related papers (2025-08-21T11:12:09Z)
UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models [54.75551043657238]
We introduce UniErase, a novel unlearning paradigm that employs a learnable parametric suffix (unlearning token) to steer language models toward targeted forgetting behaviors. UniErase achieves state-of-the-art (SOTA) performance across batch, sequential, and precise unlearning under fictitious and real-world knowledge settings.
arXiv Detail & Related papers (2025-05-21T15:53:28Z)
Verifying Robust Unlearning: Probing Residual Knowledge in Unlearned Models [10.041289551532804]
We introduce the concept of Robust Unlearning, ensuring models are indistinguishable from retraining and resistant to adversarial recovery. To empirically evaluate whether unlearning techniques meet this security standard, we propose the Unlearning Mapping Attack (UMA). UMA actively probes models for forgotten traces using adversarial queries.
arXiv Detail & Related papers (2025-04-21T01:56:15Z)
Catastrophic Failure of LLM Unlearning via Quantization [36.524827594501495]
We show that applying quantization to models that have undergone unlearning can restore the "forgotten" information. We find that for unlearning methods with utility constraints, the unlearned model retains an average of 21% of the intended forgotten knowledge in full precision.
arXiv Detail & Related papers (2024-10-21T19:28:37Z)
An Adversarial Perspective on Machine Unlearning for AI Safety [22.639683142004372]
This work challenges the fundamental differences between unlearning and traditional safety post-training. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU.
arXiv Detail & Related papers (2024-09-26T16:32:19Z)
Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models [19.015202590038996]
We design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack unlearned models. We propose Latent Adversarial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearning process. We demonstrate that LAU improves unlearning effectiveness by over 53.5%, causes less than an 11.6% reduction in neighboring knowledge, and has almost no impact on the model's general capabilities.
arXiv Detail & Related papers (2024-08-20T09:36:04Z)
Learn while Unlearn: An Iterative Unlearning Framework for Generative Language Models [52.03511469562013]
We introduce the Iterative Contrastive Unlearning (ICU) framework, which consists of three core components. A Knowledge Unlearning Induction module targets specific knowledge for removal using an unlearning loss. A Contrastive Learning Enhancement module preserves the model's expressive capabilities against the pure unlearning goal. An Iterative Unlearning Refinement module dynamically adjusts the unlearning process through ongoing evaluation and updates.
arXiv Detail & Related papers (2024-07-25T07:09:35Z)
UNDIAL: Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models [12.45822383965784]
We introduce UnDIAL (Unlearning via Self-Distillation on Adjusted Logits), a novel and robust unlearning method. Our approach leverages self-distillation to adjust logits and selectively reduce the influence of targeted tokens.
arXiv Detail & Related papers (2024-02-15T16:21:14Z)
Unlearn What You Want to Forget: Efficient Unlearning for LLMs [92.51670143929056]
Large language models (LLMs) have achieved significant progress from pre-training on and memorizing a wide range of textual data. This process might suffer from privacy issues and violations of data protection regulations. We propose an efficient unlearning framework that could efficiently update LLMs without having to retrain the whole model after data removals.
arXiv Detail & Related papers (2023-10-31T03:35:59Z)
On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning [100.14809391594109]
Model-agnostic meta-learning (MAML) has emerged as one of the most successful meta-learning techniques in few-shot learning. Despite the generalization power of the meta-model, it remains unclear how adversarial robustness can be maintained by MAML in few-shot learning. We propose a general but easily-optimized robustness-regularized meta-learning framework, which allows the use of unlabeled data augmentation, fast adversarial attack generation, and computationally-light fine-tuning.
arXiv Detail & Related papers (2021-02-20T22:03:04Z)
Incremental Object Detection via Meta-Learning [77.55310507917012]
We propose a meta-learning approach that learns to reshape model gradients, such that information across incremental tasks is optimally shared. In comparison to existing meta-learning methods, our approach is task-agnostic, allows incremental addition of new classes, and scales to high-capacity models for object detection.
arXiv Detail & Related papers (2020-03-17T13:40:00Z)
Online Fast Adaptation and Knowledge Accumulation: a New Approach to Continual Learning [74.07455280246212]
Continual learning studies agents that learn from streams of tasks without forgetting previous ones while adapting to new ones. We show that current continual learning, meta-learning, meta-continual learning, and continual-meta learning techniques fail in this new scenario. We propose Continual-MAML, an online extension of the popular MAML algorithm as a strong baseline for this scenario.
arXiv Detail & Related papers (2020-03-12T15:47:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.