LLM Unlearning Should Be Form-Independent
- URL: http://arxiv.org/abs/2506.07795v1
- Date: Mon, 09 Jun 2025 14:21:25 GMT
- Title: LLM Unlearning Should Be Form-Independent
- Authors: Xiaotian Ye, Mengqi Zhang, Shu Wu,
- Abstract summary: Large Language Model (LLM) unlearning aims to erase or suppress undesirable knowledge within the model.<n>We identify a pervasive issue underlying many downstream failures: the effectiveness of existing unlearning methods heavily depends on the form of training samples.<n>We introduce Rank-one Concept Redirection (ROCR), a novel training-free method, as a promising solution path.
- Score: 14.222205207889543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM) unlearning aims to erase or suppress undesirable knowledge within the model, offering promise for controlling harmful or private information to prevent misuse. However, recent studies highlight its limited efficacy in real-world scenarios, hindering practical adoption. In this study, we identify a pervasive issue underlying many downstream failures: the effectiveness of existing unlearning methods heavily depends on the form of training samples and frequently fails to generalize to alternate expressions of the same knowledge. We formally characterize this problem as Form-Dependent Bias and systematically investigate its specific manifestation patterns across various downstream tasks. To quantify its prevalence and support future research, we introduce ORT, a novel benchmark designed to evaluate the robustness of unlearning methods against variations in knowledge expression. Results reveal that Form-Dependent Bias is both widespread and severe among current techniques. We argue that LLM unlearning should be form-independent to address the endless forms of downstream tasks encountered in real-world security-critical scenarios. Towards this goal, we introduce Rank-one Concept Redirection (ROCR), a novel training-free method, as a promising solution path. ROCR performs unlearning by targeting the invariants in downstream tasks, specifically the activated dangerous concepts. It is capable of modifying model parameters within seconds to redirect the model's perception of a specific unlearning target concept to another harmless concept. Extensive experiments demonstrate that ROCR significantly improves unlearning effectiveness compared to traditional methods while generating highly natural outputs.
Related papers
- Sparsity-Aware Unlearning for Large Language Models [20.699929903336113]
Large Language Models (LLMs) inevitably memorize sensitive information during training, posing significant privacy risks.<n>Machine unlearning has emerged as a promising solution to selectively remove such information without full retraining.<n>We find that unlearning effectiveness degrades substantially on sparse models.<n>We propose Sparsity-Aware Unlearning (SAU), which decouples unlearning from sparsification objectives through gradient masking.
arXiv Detail & Related papers (2026-01-31T07:45:30Z) - From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces.<n>By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process.<n>We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z) - Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check [60.77691669644931]
We propose Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models.<n>We show that FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning.<n>These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
arXiv Detail & Related papers (2025-10-14T20:50:30Z) - LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics [10.638045151201084]
We present a principled taxonomy of twelve recent stateful unlearning methods.<n>We revisit the evaluation of unlearning effectiveness (UE), utility retention (UT), and robustness (Rob)<n>Our analysis shows that current evaluations, dominated by multiple-choice question (MCQ) accuracy, offer only a narrow perspective.
arXiv Detail & Related papers (2025-10-08T23:47:05Z) - Unlearning That Lasts: Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs [31.768387661474904]
Unlearning in large language models (LLMs) involves precisely removing specific information from a pre-trained model.<n>This is crucial to ensure safety of LLMs by deleting private data or harmful knowledge acquired during pre-training.<n>We introduce JensUn, where we leverage the Jensen-Shannon Divergence as the training objective for both forget and retain sets.<n>In extensive experiments, JensUn achieves better forget-utility trade-off than competing methods, and even demonstrates strong resilience to benign relearning.
arXiv Detail & Related papers (2025-09-02T20:38:53Z) - Reliable Unlearning Harmful Information in LLMs with Metamorphosis Representation Projection [17.369869625390894]
We propose a Metamorphosis Representation Projection (MRP) approach to machine unlearning.<n>By implementing projective transformations in the hidden state space of specific network layers, our method effectively eliminates harmful information while preserving useful knowledge.<n> Experimental results demonstrate that our approach enables effective continuous unlearning and successfully defends against relearning attacks.
arXiv Detail & Related papers (2025-08-21T11:12:09Z) - UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models [54.75551043657238]
We introduce UniErase, a novel unlearning paradigm that employs learnable parametric suffix (unlearning token) to steer language models toward targeted forgetting behaviors.<n>UniErase achieves state-of-the-art (SOTA) performance across batch, sequential, and precise unlearning under fictitious and real-world knowledge settings.
arXiv Detail & Related papers (2025-05-21T15:53:28Z) - SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning [62.18315467642528]
Multimodal Continual Instruction Tuning (MCIT) aims to enable Multimodal Large Language Models (MLLMs) to incrementally learn new tasks without catastrophic forgetting.<n>Superficial forgetting refers to cases where the model's knowledge may not be genuinely lost, but its responses to previous tasks deviate from expected formats.<n>By contrast, essential forgetting refers to situations where the model provides correctly formatted but factually inaccurate answers, indicating a true loss of knowledge.
arXiv Detail & Related papers (2025-05-05T09:09:41Z) - Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning [93.58897637077001]
This paper tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints.<n>We pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos.<n>For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model.
arXiv Detail & Related papers (2025-03-11T13:50:22Z) - Sustainable Self-evolution Adversarial Training [51.25767996364584]
We propose a Sustainable Self-Evolution Adversarial Training (SSEAT) framework for adversarial training defense models.<n>We introduce a continual adversarial defense pipeline to realize learning from various kinds of adversarial examples.<n>We also propose an adversarial data replay module to better select more diverse and key relearning data.
arXiv Detail & Related papers (2024-12-03T08:41:11Z) - Diffusing States and Matching Scores: A New Framework for Imitation Learning [16.941612670582522]
Adversarial Imitation Learning is traditionally framed as a two-player zero-sum game between a learner and an adversarially chosen cost function.<n> diffusion models have emerged as a non-adversarial alternative to GANs that merely require training a score function via regression.<n>We show our approach outperforms both GAN-style imitation learning baselines and discriminator-free imitation learning baselines across various continuous control problems.
arXiv Detail & Related papers (2024-10-17T17:59:25Z) - Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning [8.831339626121848]
Concept unlearning is a promising solution to unethical or harmful use of text-to-image diffusion models.<n>Our benchmark covers 33 target concepts, including 16,000 prompts per concept, spanning four categories: Celebrity, Style, Intellectual Property, and NSFW.<n>Our investigation reveals that no single method excels across all evaluation criteria.
arXiv Detail & Related papers (2024-10-08T03:30:39Z) - Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models [19.015202590038996]
We design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack unlearned models.
We propose Latent Adrial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearned process.
We demonstrate that LAU improves unlearning effectiveness by over $53.5%$, cause only less than a $11.6%$ reduction in neighboring knowledge, and have almost no impact on the model's general capabilities.
arXiv Detail & Related papers (2024-08-20T09:36:04Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.<n>It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - Unlearning Backdoor Attacks through Gradient-Based Model Pruning [10.801476967873173]
We propose a novel approach to counter backdoor attacks by treating their mitigation as an unlearning task.
Our approach offers simplicity and effectiveness, rendering it well-suited for scenarios with limited data availability.
arXiv Detail & Related papers (2024-05-07T00:36:56Z) - Model Will Tell: Training Membership Inference for Diffusion Models [15.16244745642374]
Training Membership Inference (TMI) task aims to determine whether a specific sample has been used in the training process of a target model.
In this paper, we explore a novel perspective for the TMI task by leveraging the intrinsic generative priors within the diffusion model.
arXiv Detail & Related papers (2024-03-13T12:52:37Z) - Imitating, Fast and Slow: Robust learning from demonstrations via
decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.