LLM Unlearning on Noisy Forget Sets: A Study of Incomplete, Rewritten, and Watermarked Data
- URL: http://arxiv.org/abs/2510.09007v1
- Date: Fri, 10 Oct 2025 05:10:49 GMT
- Title: LLM Unlearning on Noisy Forget Sets: A Study of Incomplete, Rewritten, and Watermarked Data
- Authors: Changsheng Wang, Yihua Zhang, Dennis Wei, Jinghan Jia, Pin-Yu Chen, Sijia Liu
- Abstract summary: Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. We find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved.
- Score: 69.5099112089508
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data, reinforcing biases, and producing harmful content. These risks have spurred interest in LLM unlearning, the task of removing knowledge associated with undesirable data from pre-trained models. However, most existing methods assume access to clean, well-defined forget data samples, whereas real-world forget data could often be low-quality, synthetically rewritten, or watermarked, casting doubt on the reliability of unlearning. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. By systematically benchmarking state-of-the-art LLM unlearning methods, RMU and NPO, on such noisy forget sets, we find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved. To explain this robustness, we propose a saliency-based interpretation: key semantic components that drive forgetting remain consistently influential despite substantial variation in surface form. This suggests that unlearning algorithms are primarily guided by deep semantic cues rather than shallow lexical patterns.
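For orientation, one of the benchmarked methods, NPO (negative preference optimization), drives the model's likelihood of forget-set responses below that of a frozen reference model. Below is a minimal sketch of such a forget loss in PyTorch; the Hugging Face-style `model(...).logits` interface, the `seq_logprob` helper, and the `beta` value are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def seq_logprob(model, input_ids, labels):
    """Sum of per-token log-probabilities of `labels` under a causal LM
    (assumes a Hugging Face-style forward pass returning `.logits`)."""
    logits = model(input_ids=input_ids).logits[:, :-1, :]   # predict token t+1 from prefix
    targets = labels[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    mask = (targets != -100)                                 # -100 marks padding/ignored tokens
    safe = targets.clamp(min=0)
    tok_logp = logp.gather(-1, safe.unsqueeze(-1)).squeeze(-1)
    return (tok_logp * mask.float()).sum(dim=-1)

def npo_forget_loss(model, ref_model, input_ids, labels, beta=0.1):
    """NPO-style forget loss: pushes the policy's likelihood of forget-set
    responses below that of the frozen reference model."""
    logp_theta = seq_logprob(model, input_ids, labels)
    with torch.no_grad():
        logp_ref = seq_logprob(ref_model, input_ids, labels)
    log_ratio = logp_theta - logp_ref   # log( pi_theta / pi_ref )
    # L_NPO = -(2 / beta) * E[ log sigmoid( -beta * log_ratio ) ]
    return -(2.0 / beta) * F.logsigmoid(-beta * log_ratio).mean()
```

In the noisy-forget-set setting studied here, `input_ids` and `labels` would be drawn from the incomplete, rewritten, or watermarked variants of the forget documents rather than from the original text.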
Related papers
- Forgetting-MarI: LLM Unlearning via Marginal Information Regularization [6.979586479353831]
Existing unlearning methods often degrade model performance by removing more information than necessary when attempting to "forget" specific data. We introduce Forgetting-MarI, an LLM unlearning framework that provably removes only the additional (marginal) information contributed by the data to be unlearned. By penalizing marginal information, our method yields an explicit upper bound on the unlearned dataset's residual influence in the trained models, providing provable undetectability.
arXiv Detail & Related papers (2025-11-14T22:48:39Z)
- Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning [50.45435841411193]
Code Language Models (CLMs) exhibit unintended memorization of sensitive training data, enabling verbatim reproduction of confidential information when specifically prompted. We introduce CodeEraser, an advanced variant that selectively unlearns sensitive memorized segments in code while preserving the structural integrity and functional correctness of the surrounding code.
arXiv Detail & Related papers (2025-09-17T07:12:35Z)
- LoReUn: Data Itself Implicitly Provides Cues to Improve Machine Unlearning [33.62466543549043]
Loss-based Reweighting Unlearning (LoReUn) is a plug-and-play strategy that dynamically reweights data during the unlearning process with minimal additional computational overhead. Our approach significantly reduces the gap between existing MU methods and exact unlearning in both image classification and generation tasks.
arXiv Detail & Related papers (2025-07-30T09:12:25Z)
- Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs [38.837810490068556]
Unlearning in large language models (LLMs) aims to remove specified data, but its efficacy is typically assessed with task-level metrics like accuracy and perplexity. We demonstrate that models can appear to forget while their original behavior is easily restored through minimal fine-tuning. This phenomenon of reversibility suggests that information is merely suppressed, not genuinely erased.
arXiv Detail & Related papers (2025-05-22T16:02:10Z)
- GUARD: Generation-time LLM Unlearning via Adaptive Restriction and Detection [36.38245533018162]
Large Language Models (LLMs) have demonstrated strong capabilities in memorizing vast amounts of knowledge across diverse domains. Existing unlearning efforts typically fine-tune the model with resources such as forget data, retain data, and a calibration model. We propose Generation-time Unlearning via Adaptive Restriction and Detection (GUARD), a framework that enables dynamic unlearning during LLM generation.
arXiv Detail & Related papers (2025-05-19T16:26:58Z)
- CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept [5.345828824625758]
We propose a novel amortized unlearning approach using codebook features and Sparse Autoencoders (SAEs).
By leveraging a bottleneck to decompose the activation space and regulate information flow, our method efficiently unlearns targeted information while preserving the model's performance on unrelated data.
arXiv Detail & Related papers (2024-10-08T10:26:22Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- TOFU: A Task of Fictitious Unlearning for LLMs [99.92305790945507]
Large language models trained on massive corpora of data from the web can reproduce sensitive or private data, raising both legal and ethical concerns.
Unlearning, or tuning models to forget information present in their training data, provides us with a way to protect private data after training.
We present TOFU, a benchmark aimed at helping deepen our understanding of unlearning.
arXiv Detail & Related papers (2024-01-11T18:57:12Z)
- Unlearn What You Want to Forget: Efficient Unlearning for LLMs [92.51670143929056]
Large language models (LLMs) have achieved significant progress from pre-training on and memorizing a wide range of textual data.
This process might suffer from privacy issues and violations of data protection regulations.
We propose an unlearning framework that can efficiently update LLMs without retraining the whole model after data removal.
arXiv Detail & Related papers (2023-10-31T03:35:59Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification. XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations. Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
- MILD: Modeling the Instance Learning Dynamics for Learning with Noisy Labels [19.650299232829546]
We propose an iterative selection approach based on the Weibull mixture model to identify clean data.
In particular, we measure the difficulty of memorization for each instance via the transition times between being misclassified and being memorized.
Our strategy outperforms existing noisy-label learning methods.
arXiv Detail & Related papers (2023-06-20T14:26:53Z)