Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning
- URL: http://arxiv.org/abs/2602.19612v2
- Date: Tue, 24 Feb 2026 10:56:28 GMT
- Title: Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning
- Authors: Anna Borisiuk, Andrey Savchenko, Alexander Panchenko, Elena Tutubalina
- Abstract summary: We study whether forgotten knowledge originates from pretraining or supervised fine-tuning. Our experiments show that pretrained and SFT models respond differently to unlearning.
- Score: 59.19460954480119
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.
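As a rough illustration of the two-stage recipe above, the sketch below first runs an SFT pass on the forget data and then unlearns it. The toy model and the plain gradient-ascent unlearning objective are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

# Toy stand-in: model(x) -> logits over a small vocab. In practice this would
# be a pretrained LLM and tokenized Wikidata triplets (hypothetical setup).
vocab, dim = 100, 32
model = torch.nn.Sequential(torch.nn.Embedding(vocab, dim), torch.nn.Linear(dim, vocab))
forget = torch.randint(0, vocab, (64, 8))   # prompts for facts to forget
targets = torch.randint(0, vocab, (64, 8))  # their completions

def step(batch_x, batch_y, sign, lr=1e-3):
    """One optimization step; sign=+1 learns (SFT), sign=-1 unlearns (gradient ascent)."""
    logits = model(batch_x)
    loss = F.cross_entropy(logits.reshape(-1, vocab), batch_y.reshape(-1))
    model.zero_grad()
    (sign * loss).backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad

# Stage 1 (the paper's key intervention): an SFT step on the forget data.
for _ in range(50):
    step(forget, targets, sign=+1)

# Stage 2: unlearn the same facts; after the SFT step this is reported to be
# smoother, more stable, and to preserve 10-50% more retained knowledge.
for _ in range(50):
    step(forget, targets, sign=-1)
```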
Related papers
- Unlearning That Lasts: Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs [31.768387661474904]
Unlearning in large language models (LLMs) involves precisely removing specific information from a pre-trained model.
This is crucial to ensure safety of LLMs by deleting private data or harmful knowledge acquired during pre-training.
We introduce JensUn, where we leverage the Jensen-Shannon Divergence as the training objective for both forget and retain sets.
In extensive experiments, JensUn achieves a better forget-utility trade-off than competing methods, and even demonstrates strong resilience to benign relearning.
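A minimal sketch of what a Jensen-Shannon forget/retain objective could look like; the uniform target distribution for the forget set and the loss weighting are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def js_divergence(p_logits, q_logits):
    """Jensen-Shannon divergence between two categorical distributions,
    given as logits over the vocabulary; symmetric and bounded."""
    p = F.softmax(p_logits, dim=-1)
    q = F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(1e-12).log() - b.clamp_min(1e-12).log())).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def jensun_style_loss(model_logits_f, target_logits_f, model_logits_r, ref_logits_r, lam=1.0):
    # Forget set: push the model toward an uninformative target distribution.
    forget_term = js_divergence(model_logits_f, target_logits_f).mean()
    # Retain set: stay close to the original (reference) model's predictions.
    retain_term = js_divergence(model_logits_r, ref_logits_r).mean()
    return forget_term + lam * retain_term

# Toy usage with random logits (batch x vocab); all-zero logits = uniform target.
V = 50
loss = jensun_style_loss(
    torch.randn(8, V), torch.zeros(8, V),
    torch.randn(8, V), torch.randn(8, V),
)
```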
arXiv Detail & Related papers (2025-09-02T20:38:53Z)
- Distillation Robustifies Unlearning [36.27570321651185]
We show that training a model can drastically modify its input-output behavior while leaving its underlying capabilities intact.
We propose Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a noised copy of itself.
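A minimal sketch of the noise-then-distill idea, assuming Gaussian weight noise and a standard KL distillation loss; the noise scale, step counts, and toy models are illustrative, not the paper's settings.

```python
import copy
import torch
import torch.nn.functional as F

def undo_style_distill(unlearned, noise_std=0.05, steps=100, lr=1e-3):
    """Distill an unlearned teacher into a noised copy of itself (student).
    Noising perturbs residual capabilities; distillation restores only the
    teacher's input-output behavior."""
    student = copy.deepcopy(unlearned)
    with torch.no_grad():
        for p in student.parameters():
            p += noise_std * torch.randn_like(p)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        x = torch.randint(0, 100, (16, 8))  # stand-in distillation prompts
        with torch.no_grad():
            t_logits = unlearned(x)
        s_logits = student(x)
        loss = F.kl_div(F.log_softmax(s_logits, -1), F.softmax(t_logits, -1),
                        reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
    return student

# Toy teacher standing in for an already-unlearned model.
teacher = torch.nn.Sequential(torch.nn.Embedding(100, 32), torch.nn.Linear(32, 100))
student = undo_style_distill(teacher)
```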
arXiv Detail & Related papers (2025-06-06T17:58:54Z)
- UniErase: Towards Balanced and Precise Unlearning in Language Models [69.04923022755547]
Large language models (LLMs) require iterative updates to address the problem of outdated information.
UniErase is a novel unlearning framework that balances precise knowledge unlearning against retention of general abilities.
arXiv Detail & Related papers (2025-05-21T15:53:28Z)
- Not All Data Are Unlearned Equally [33.770024777468336]
We study how the success of unlearning depends on the frequency of the knowledge we want to unlearn in the pre-training data of a model.
We uncover a misalignment between probability- and generation-based evaluations of unlearning and show that this problem worsens as models become larger.
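The probability/generation mismatch can be made concrete by scoring the same fact both ways. A hedged sketch with a toy model follows; the paper's actual metrics are not reproduced here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prob_eval(model, prompt, answer):
    """Probability-based: mean log-likelihood the model assigns to the answer tokens."""
    logits = model(torch.cat([prompt, answer]).unsqueeze(0))[0]
    # logits at the positions that predict each answer token
    ans_logits = logits[len(prompt) - 1 : len(prompt) - 1 + len(answer)]
    return F.log_softmax(ans_logits, -1).gather(-1, answer.unsqueeze(-1)).mean().item()

@torch.no_grad()
def gen_eval(model, prompt, answer):
    """Generation-based: does greedy decoding actually reproduce the answer?"""
    tokens = prompt.clone()
    for _ in range(len(answer)):
        nxt = model(tokens.unsqueeze(0))[0, -1].argmax()
        tokens = torch.cat([tokens, nxt.view(1)])
    return bool(torch.equal(tokens[len(prompt):], answer))

# A fact can keep non-trivial probability mass yet no longer be generated,
# which is the kind of misalignment the paper highlights.
model = torch.nn.Sequential(torch.nn.Embedding(100, 32), torch.nn.Linear(32, 100))
prompt, answer = torch.tensor([1, 2, 3]), torch.tensor([4, 5])
print(prob_eval(model, prompt, answer), gen_eval(model, prompt, answer))
```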
arXiv Detail & Related papers (2025-04-07T13:29:02Z)
- Data Unlearning in Diffusion Models [44.99833362998488]
General-purpose machine unlearning techniques were found either to be unstable or to fail to unlearn data.
We propose a family of new loss functions called Subtracted Importance Sampled Scores (SISS) that utilize importance sampling and are the first method to unlearn data with theoretical guarantees.
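A heavily simplified sketch of a subtracted denoising objective in the spirit of SISS; the noising path, the weighting, and the use of separate keep/forget batches below are illustrative assumptions, not the paper's exact importance-sampled estimator.

```python
import torch
import torch.nn.functional as F

def siss_style_loss(eps_model, x_keep, x_forget, lam=0.5):
    """Keep-set denoising loss minus a weighted forget-set denoising loss.
    SISS proper draws one mixture batch and reweights both terms with
    importance weights; here the terms use separate batches for brevity."""
    def denoise_loss(x0):
        t = torch.rand(x0.shape[0], 1)       # toy continuous timestep
        eps = torch.randn_like(x0)
        x_t = (1 - t) * x0 + t * eps         # toy linear noising path
        return F.mse_loss(eps_model(torch.cat([x_t, t], dim=1)), eps)
    # The keep term pulls the score toward retained data; the subtracted
    # forget term pushes probability mass away from data to be unlearned.
    return denoise_loss(x_keep) - lam * denoise_loss(x_forget)

# Toy epsilon-predictor over 2-D data plus a timestep feature.
eps_model = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
loss = siss_style_loss(eps_model, torch.randn(32, 2), torch.randn(32, 2))
```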
arXiv Detail & Related papers (2025-03-02T21:36:04Z)
- TAPE: Tailored Posterior Difference for Auditing of Machine Unlearning [19.99300962254467]
We propose a TAilored Posterior diffErence (TAPE) method to provide unlearning auditing independently of the original model training.
TAPE mimics unlearned posterior differences by quickly building unlearned shadow models.
We then train a Reconstructor model to extract and evaluate the private information carried by these posterior differences to audit unlearning.
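A rough sketch of the posterior-difference signal such auditing works with; the toy classifiers, probe inputs, and reconstructor architecture are all assumptions for illustration.

```python
import torch

@torch.no_grad()
def posterior_difference(original, unlearned, probes):
    """Difference of output distributions on probe inputs; this is the
    signal a TAPE-style audit would analyze."""
    return original(probes).softmax(-1) - unlearned(probes).softmax(-1)

# Toy classifiers standing in for the original and a shadow-unlearned model.
make = lambda: torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
original, shadow_unlearned = make(), make()
probes = torch.randn(128, 16)
diff = posterior_difference(original, shadow_unlearned, probes)  # (128, 10)

# A small reconstructor maps posterior differences back toward the (private)
# unlearned samples; in a TAPE-style setup it would be trained elsewhere.
reconstructor = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))
recon = reconstructor(diff)
```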
arXiv Detail & Related papers (2025-02-27T05:13:54Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient-based learning method named Projected-Gradient Unlearning (PGU).
We provide empirical evidence that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
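A minimal sketch of gradient projection for unlearning: remove from the forget-batch gradient its component along a retain-batch gradient, so the update (locally) does not interfere with retained knowledge. PGU itself projects against a whole retain-gradient subspace; a single direction is shown for clarity.

```python
import torch

def project_out(g_forget, g_retain, eps=1e-12):
    """Remove from the unlearning gradient its component along the retain
    gradient, making the update orthogonal to the retained direction."""
    coef = (g_forget @ g_retain) / (g_retain @ g_retain + eps)
    return g_forget - coef * g_retain

g_f = torch.randn(1000)   # flattened gradient on the forget batch
g_r = torch.randn(1000)   # flattened gradient on a retain batch
g = project_out(g_f, g_r)
# The projected update no longer moves along the retain direction.
assert torch.isclose(g @ g_r, torch.tensor(0.0), atol=1e-3)
```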
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face challenges.
Previous instruction tuning methods force the model to complete a sentence regardless of whether it knows the knowledge.
We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning).
Experimental results demonstrate that R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
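A sketch of refusal-aware data construction in this spirit: keep the gold answer only where the base model already answers correctly, and attach a refusal otherwise. The refusal string and exact-match rule are assumptions.

```python
# Refusal-aware instruction data: split questions by whether the base model
# already knows the answer, then train on gold answers for known questions
# and on an explicit refusal for unknown ones.
REFUSAL = "I don't know."  # hypothetical refusal template

def build_rtuning_data(qa_pairs, model_answer):
    """qa_pairs: list of (question, gold_answer); model_answer: callable
    mapping a question to the base model's answer (e.g., greedy decoding)."""
    dataset = []
    for question, gold in qa_pairs:
        knows = model_answer(question).strip().lower() == gold.strip().lower()
        dataset.append((question, gold if knows else REFUSAL))
    return dataset

# Toy usage with a stub model that only knows one fact.
stub = lambda q: "Paris" if "France" in q else "???"
data = build_rtuning_data(
    [("Capital of France?", "Paris"), ("Capital of Wakanda?", "Birnin Zana")],
    stub,
)
# -> [("Capital of France?", "Paris"), ("Capital of Wakanda?", "I don't know.")]
```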
arXiv Detail & Related papers (2023-11-16T08:45:44Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
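The core of emulated fine-tuning can be written as per-token logit arithmetic: add a small model pair's fine-tuning delta to a large base model. A sketch with toy logits follows; real use would combine the per-token logits of three LMs sharing a vocabulary.

```python
import torch

def eft_logits(large_base, small_ft, small_base):
    """Emulated fine-tuning at decode time: combine the large base model's
    knowledge with the small models' fine-tuning delta, token by token.
    LM up-scaling is exactly this combination."""
    return large_base + (small_ft - small_base)

# Toy per-token logits over a shared vocabulary.
V = 32000
lb, sf, sb = torch.randn(V), torch.randn(V), torch.randn(V)
next_token = torch.distributions.Categorical(logits=eft_logits(lb, sf, sb)).sample()
```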
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Machine Unlearning of Features and Labels [72.81914952849334]
We propose a method for unlearning features and labels in machine learning models.
Our approach builds on the concept of influence functions and realizes unlearning through closed-form updates of model parameters.
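A toy sketch of a closed-form, influence-function-style update, shown for removing points from a ridge-regularized linear regression; the exact update rule of the paper (which also covers features and labels) is not reproduced here.

```python
import torch

def influence_unlearn(theta, X, y, X_rm, y_rm, reg=1e-2):
    """Closed-form removal for ridge-regularized linear regression: a single
    Newton-style step that cancels the removed points' gradient contribution,
    instead of retraining from scratch."""
    n, d = X.shape
    # Hessian of the regularized squared loss over the full dataset.
    H = X.T @ X / n + reg * torch.eye(d)
    # Gradient contributed by the points being removed, at the current theta.
    g_rm = X_rm.T @ (X_rm @ theta - y_rm) / n
    # Influence-style update: step against the removed points' influence.
    return theta + torch.linalg.solve(H, g_rm)

d = 5
X, w_true = torch.randn(200, d), torch.randn(d)
y = X @ w_true
theta = torch.linalg.solve(X.T @ X / 200 + 1e-2 * torch.eye(d), X.T @ y / 200)
theta_unlearned = influence_unlearn(theta, X, y, X[:20], y[:20])
```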
arXiv Detail & Related papers (2021-08-26T04:42:24Z)