Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization
- URL: http://arxiv.org/abs/2509.20230v3
- Date: Tue, 30 Sep 2025 13:04:46 GMT
- Title: Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization
- Authors: Wenhan Wu, Zheyuan Liu, Chongyang Gao, Ren Wang, Kaize Ding,
- Abstract summary: We propose a bi-level feedback-guided optimization framework that explicitly seeks more stable parameter regions. Experiments on WMDP and MUSE benchmarks demonstrate that our method is significantly more robust against both relearning and jailbreaking attacks.
- Score: 37.965539404740774
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Current LLM unlearning methods face a critical security vulnerability that undermines their fundamental purpose: while they appear to successfully remove sensitive or harmful knowledge, this "forgotten" information remains precariously recoverable through relearning attacks. We identify the root cause: conventional methods that optimize the forgetting loss at individual data points drive model parameters toward sharp minima in the loss landscape. In these unstable regions, even minimal parameter perturbations can drastically alter the model's behavior. Consequently, relearning attacks exploit this vulnerability by using just a few fine-tuning samples to navigate the steep gradients surrounding these unstable regions, thereby rapidly recovering knowledge that was supposedly erased. This exposes a critical robustness gap between apparent unlearning and actual knowledge removal. To address this issue, we propose StableUN, a bi-level feedback-guided optimization framework that explicitly seeks more stable parameter regions via neighborhood-aware optimization. It integrates forgetting feedback, which uses adversarial perturbations to probe parameter neighborhoods, with remembering feedback that preserves model utility, aligning the two objectives through gradient projection. Experiments on WMDP and MUSE benchmarks demonstrate that our method is significantly more robust against both relearning and jailbreaking attacks while maintaining competitive utility performance.
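Based only on this abstract, the core update might look like the following minimal PyTorch sketch. The helper names (`forget_loss`, `retain_loss`), the single-step SAM-style neighborhood probe, and the PCGrad-style projection rule are illustrative assumptions, not the authors' released code.

```python
# Sketch of the ingredients named in the abstract: (1) probe the parameter
# neighborhood with an adversarial perturbation of the forgetting objective,
# (2) compute a remembering gradient on retain data, (3) reconcile the two
# via gradient projection. All helper names are placeholders.
import torch

def stableun_style_step(model, forget_batch, retain_batch,
                        forget_loss, retain_loss, rho=0.05, lr=1e-5):
    params = [p for p in model.parameters() if p.requires_grad]

    # Forgetting feedback: `forget_loss` is an objective whose minimization
    # achieves forgetting. Climb to the worst point in an rho-ball (the
    # direction a relearning attack pushes), then take the gradient there,
    # so the forgetting objective holds across the whole neighborhood.
    loss_f = forget_loss(model, forget_batch)
    grads = torch.autograd.grad(loss_f, params)
    norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
    eps = [rho * g / norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                       # move to the perturbed point
    g_forget = torch.autograd.grad(forget_loss(model, forget_batch), params)
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                       # restore the original weights

    # Remembering feedback: plain utility gradient on retain data.
    g_retain = torch.autograd.grad(retain_loss(model, retain_batch), params)

    # Gradient projection: if the objectives conflict (negative inner
    # product), drop the component of the forgetting gradient that
    # opposes the retain gradient before descending on both.
    dot = sum((gf * gr).sum() for gf, gr in zip(g_forget, g_retain))
    if dot < 0:
        sq = sum((gr * gr).sum() for gr in g_retain) + 1e-12
        g_forget = [gf - (dot / sq) * gr for gf, gr in zip(g_forget, g_retain)]

    with torch.no_grad():
        for p, gf, gr in zip(params, g_forget, g_retain):
            p.sub_(lr * (gf + gr))          # combined descent step
```

Evaluating the forgetting gradient at the perturbed point, rather than at the current weights, is what steers the solution away from sharp minima; the projection step is one standard way to keep that pressure from eroding utility.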
Related papers
- Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z) - Curvature-Aware Safety Restoration In LLMs Fine-Tuning [25.423475514922725]
Fine-tuning Large Language Models (LLMs) for downstream tasks often compromises safety alignment. We propose a curvature-aware alignment restoration method that leverages influence functions and second-order optimization. Our approach efficiently reduces harmful responses while maintaining or even improving utility and few-shot learning performance.
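The influence-function and second-order machinery this summary mentions typically reduces to Hessian-vector products. Below is a hedged, generic sketch of that building block, not this paper's specific solver; `loss_fn` and the Neumann-series damping are illustrative assumptions.

```python
# Generic building block behind influence functions and second-order
# restoration: a Hessian-vector product via double backprop, plus a
# Neumann-series approximation of H^{-1} g. Illustrative only.
import torch

def hvp(loss_fn, params, vec):
    """Return H @ vec, where H is the Hessian of loss_fn() w.r.t. params."""
    grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params)

def inverse_hvp(loss_fn, params, vec, steps=20, scale=0.01):
    """Approximate H^{-1} @ vec via the Neumann series
    H^{-1} v = scale * sum_k (I - scale*H)^k v  (assumes scale*H < I)."""
    out = [v.clone() for v in vec]      # running (I - scale*H)^k v
    acc = [v.clone() for v in vec]      # partial sum of the series
    for _ in range(steps):
        hv = hvp(loss_fn, params, out)
        out = [o - scale * h for o, h in zip(out, hv)]
        acc = [a + o for a, o in zip(acc, out)]
    return [scale * a for a in acc]
```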
arXiv Detail & Related papers (2025-11-22T12:33:31Z) - Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning [25.53799024782883]
Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model. Recent findings reveal that post-unlearning manipulations such as weight quantization or fine-tuning can quickly neutralize the intended forgetting.
arXiv Detail & Related papers (2025-10-01T10:50:14Z) - Stable Forgetting: Bounded Parameter-Efficient Unlearning in LLMs [30.089412595436585]
We provide a theoretical framework that explains how gradient ascent on the forget set destabilizes optimization in the feedforward layers of large language models (LLMs). We propose Bounded Parameter-Efficient Unlearning, an approach that stabilizes fine-tuning by applying bounded functions to adapters. Our method achieves substantial improvements in forgetting while preserving retention, establishing a theoretically grounded and practically scalable framework for unlearning in LLMs.
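One plausible reading of "applying bounded functions to adapters" is a LoRA-style update squashed through a saturating nonlinearity. The sketch below makes that assumption explicit; `BoundedAdapter`, the tanh squashing, and the `bound` scale are all hypothetical, not the paper's actual parameterization.

```python
# Hypothetical bounded adapter: a low-rank update whose contribution is
# passed through tanh, so gradient ascent on the forget set cannot push
# the effective weight delta arbitrarily far from the pretrained model.
import torch
import torch.nn as nn

class BoundedAdapter(nn.Module):
    def __init__(self, base_linear: nn.Linear, rank=8, bound=0.1):
        super().__init__()
        self.base = base_linear                 # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_out, d_in = base_linear.weight.shape  # nn.Linear stores (out, in)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no delta at start
        self.bound = bound                      # hard cap on the adapter's output

    def forward(self, x):
        delta = torch.tanh(x @ self.A.T @ self.B.T)  # each coordinate in (-1, 1)
        return self.base(x) + self.bound * delta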
arXiv Detail & Related papers (2025-09-29T01:30:15Z) - Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance. Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
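A minimal sketch of the parameter-space EMA idea, assuming a standard decay-and-blend update applied alongside ordinary fine-tuning; the decay value and loop structure are illustrative, not the paper's exact recipe.

```python
# Keep an exponential moving average of the weights during fine-tuning
# and deploy the averaged model; the paper reports this preserves safety.
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage inside a fine-tuning loop (illustrative):
#   ema_model = copy.deepcopy(model)
#   for batch in loader:
#       loss = compute_loss(model, batch)
#       loss.backward(); optimizer.step(); optimizer.zero_grad()
#       update_ema(ema_model, model)
#   # evaluate / deploy ema_model rather than model
```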
arXiv Detail & Related papers (2025-08-17T23:46:36Z) - Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond [41.3029262040131]
We investigate how to make unlearned models robust against relearning attacks. Our analysis reveals that smoothness optimization plays a pivotal role in mitigating relearning attacks.
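For context on the "smoothness optimization" finding, a generic sharpness-aware minimization (SAM) step is sketched below; this is the standard two-pass algorithm, not this paper's full unlearning method.

```python
# Generic SAM step: climb to the approximate worst case in an rho-ball,
# take the gradient there, restore the weights, then step the optimizer.
import torch

def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    # First pass: gradient at the current weights.
    loss_fn(model, batch).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
    eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                    # move to the perturbed point
    model.zero_grad()
    # Second pass: gradient at the perturbed point drives the real update.
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                    # restore the original weights
    optimizer.step()
    optimizer.zero_grad()
```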
arXiv Detail & Related papers (2025-02-07T23:03:55Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - A Universal Class of Sharpness-Aware Minimization Algorithms [57.29207151446387]
We introduce a new class of sharpness measures, leading to new sharpness-aware objective functions.
We prove that these measures are universally expressive, allowing any function of the training loss Hessian matrix to be represented by appropriate hyperparameters.
arXiv Detail & Related papers (2024-06-06T01:52:09Z) - Improving Data-aware and Parameter-aware Robustness for Continual Learning [3.480626767752489]
This paper argues that the insufficient robustness of existing continual learning methods arises from their ineffective handling of outliers.
We propose a Robust Continual Learning (RCL) method to address this issue.
The proposed method effectively maintains robustness and achieves new state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2024-05-27T11:21:26Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of extrapolation variants can be covered by a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)