Understanding the Dilemma of Unlearning for Large Language Models
- URL: http://arxiv.org/abs/2509.24675v1
- Date: Mon, 29 Sep 2025 12:15:19 GMT
- Title: Understanding the Dilemma of Unlearning for Large Language Models
- Authors: Qingjie Zhang, Haoting Qian, Zhicong Huang, Cheng Hong, Minlie Huang, Ke Xu, Chao Zhang, Han Qiu
- Abstract summary: Unlearning seeks to remove specific knowledge from large language models (LLMs). We propose unPact, an interpretable framework for unlearning via prompt attribution and contribution tracking.
- Score: 50.54260066313032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unlearning seeks to remove specific knowledge from large language models (LLMs), but its effectiveness remains contested. On one side, "forgotten" knowledge can often be recovered through interventions such as light fine-tuning; on the other side, unlearning may induce catastrophic forgetting that degrades general capabilities. Despite active exploration of unlearning methods, interpretability analyses of the mechanism are scarce due to the difficulty of tracing knowledge in LLMs' complex architectures. We address this gap by proposing unPact, an interpretable framework for unlearning via prompt attribution and contribution tracking. Concretely, it quantifies each prompt token's influence on outputs, enabling pre- and post-unlearning comparisons to reveal what changes. Across six mainstream unlearning methods, three LLMs, and three benchmarks, we find that: (1) Unlearning appears effective mainly by disrupting focus on keywords in the prompt; (2) Much of the knowledge is not truly erased and can be recovered simply by emphasizing these keywords in prompts, without modifying the model's weights; (3) Catastrophic forgetting arises from indiscriminate penalization of all tokens. Taken together, our results suggest an unlearning dilemma: existing methods tend to be either insufficient (knowledge remains recoverable by keyword emphasis) or overly destructive (general performance collapses due to catastrophic forgetting), still leaving a gap to reliable unlearning.
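The abstract's core mechanism, quantifying each prompt token's influence on the output, can be illustrated with a leave-one-out ablation sketch. This is a hypothetical, minimal stand-in for the idea, not the paper's actual unPact implementation: the `toy_score` function below is an invented placeholder for a real model's answer confidence.

```python
# Minimal sketch of prompt-token attribution in the spirit of unPact
# (hypothetical illustration; the paper's actual method may differ).
# Idea: a token's contribution is the drop in the model's answer score
# when that token is deleted from the prompt (leave-one-out ablation).

from typing import Callable, List, Tuple


def token_contributions(
    tokens: List[str],
    score: Callable[[List[str]], float],
) -> List[Tuple[str, float]]:
    """Return (token, contribution) pairs, where contribution is the
    score drop caused by removing that token from the prompt."""
    base = score(tokens)
    contribs = []
    for i in range(len(tokens)):
        ablated = tokens[:i] + tokens[i + 1:]
        contribs.append((tokens[i], base - score(ablated)))
    return contribs


# Invented stand-in for an LLM's answer confidence: keyword-driven score.
def toy_score(tokens: List[str]) -> float:
    keywords = {"capital": 0.5, "France": 0.4}
    return 0.1 + sum(keywords.get(t, 0.0) for t in tokens)


prompt = ["What", "is", "the", "capital", "of", "France", "?"]
for tok, contrib in token_contributions(prompt, toy_score):
    # "capital" and "France" receive the largest contributions
    print(f"{tok:>8}: {contrib:+.2f}")
```

Under this framing, the paper's findings read naturally: unlearning flattens the contributions of keyword tokens, while re-emphasizing those keywords in the prompt restores them, recovering the "forgotten" answer without touching the weights.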
Related papers
- Understanding Empirical Unlearning with Combinatorial Interpretability [11.245092170419227]
A recently developed framework of combinatorial interpretability enables direct inspection of knowledge encoded in model weights. We reproduce baseline unlearning methods within this interpretable setting and examine their behavior along two dimensions. Our results shed light, within a fully interpretable setting, on how knowledge can persist despite unlearning and when it might resurface.
arXiv Detail & Related papers (2026-02-22T14:51:48Z) - Learning-Time Encoding Shapes Unlearning in LLMs [25.48677069802298]
We investigate how learning-time choices in knowledge encoding impact the effectiveness of unlearning factual knowledge. Our results suggest that learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning.
arXiv Detail & Related papers (2025-06-18T02:42:02Z) - Step-by-Step Reasoning Attack: Revealing 'Erased' Knowledge in Large Language Models [9.719371187651591]
Unlearning techniques often merely suppress knowledge, leaving it beneath the surface and retrievable with the right prompts. We introduce Sleek, a step-by-step reasoning-based black-box attack that systematically exposes unlearning failures. Of the generated adversarial prompts, 62.5% successfully retrieved forgotten Harry Potter facts from the WHP-unlearned Llama model, while 50% exposed unfair suppression of retained knowledge.
arXiv Detail & Related papers (2025-06-14T04:22:17Z) - Unlearning through Knowledge Overwriting: Reversible Federated Unlearning via Selective Sparse Adapter [35.65566527544619]
Federated learning is a promising paradigm for privacy-preserving collaborative model training. We propose FUSED, which first identifies critical layers by analyzing each layer's sensitivity to knowledge. Adapters are then trained without altering the original parameters, overwriting the unlearned knowledge with the remaining knowledge.
arXiv Detail & Related papers (2025-02-28T04:35:26Z) - Disentangling Memory and Reasoning Ability in Large Language Models [97.26827060106581]
We propose a new inference paradigm that decomposes the complex inference process into two distinct and clear actions. Our experimental results show that this decomposition improves model performance and enhances the interpretability of the inference process.
arXiv Detail & Related papers (2024-11-20T17:55:38Z) - A Closer Look at Machine Unlearning for Large Language Models [46.245404272612795]
Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. We discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches.
arXiv Detail & Related papers (2024-10-10T16:56:05Z) - UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI [50.61495097098296]
We revisit the paradigm in which unlearning is used for Large Language Models (LLMs).
We introduce the concept of ununlearning, where unlearned knowledge is reintroduced in-context.
We argue that content filtering for impermissible knowledge will be required, and that even exact unlearning schemes are not enough for effective content regulation.
arXiv Detail & Related papers (2024-06-27T10:24:35Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - Causal Reinforcement Learning: A Survey [57.368108154871]
Reinforcement learning is an essential paradigm for solving sequential decision problems under uncertainty.
One of the main obstacles is that reinforcement learning agents lack a fundamental understanding of the world.
Causality offers a notable advantage as it can formalize knowledge in a systematic manner.
arXiv Detail & Related papers (2023-07-04T03:00:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or their summaries (including all information) and is not responsible for any consequences.