Verifying Robust Unlearning: Probing Residual Knowledge in Unlearned Models
- URL: http://arxiv.org/abs/2504.14798v1
- Date: Mon, 21 Apr 2025 01:56:15 GMT
- Title: Verifying Robust Unlearning: Probing Residual Knowledge in Unlearned Models
- Authors: Hao Xuan, Xingyu Li,
- Abstract summary: We introduce the concept of Robust Unlearning, ensuring models are indistinguishable from retraining and resistant to adversarial recovery.<n>To empirically evaluate whether unlearning techniques meet this security standard, we propose the Unlearning Mapping Attack (UMA)<n>UMA actively probes models for forgotten traces using adversarial queries.
- Score: 10.041289551532804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine Unlearning (MUL) is crucial for privacy protection and content regulation, yet recent studies reveal that traces of forgotten information persist in unlearned models, enabling adversaries to resurface removed knowledge. Existing verification methods only confirm whether unlearning was executed, failing to detect such residual information leaks. To address this, we introduce the concept of Robust Unlearning, ensuring models are indistinguishable from retraining and resistant to adversarial recovery. To empirically evaluate whether unlearning techniques meet this security standard, we propose the Unlearning Mapping Attack (UMA), a post-unlearning verification framework that actively probes models for forgotten traces using adversarial queries. Extensive experiments on discriminative and generative tasks show that existing unlearning techniques remain vulnerable, even when passing existing verification metrics. By establishing UMA as a practical verification tool, this study sets a new standard for assessing and enhancing machine unlearning security.
Related papers
- Redefining Machine Unlearning: A Conformal Prediction-Motivated Approach [1.3731623617634434]
We identify critical limitations in existing unlearning metrics and propose enhanced evaluation metrics inspired by conformal prediction.<n>Our metrics can effectively capture the extent to which ground truth labels are excluded from the prediction set.<n>We propose an unlearning framework that integrates conformal prediction insights into Carlini & Wagner adversarial attack loss.
arXiv Detail & Related papers (2025-01-31T18:58:43Z) - RESTOR: Knowledge Recovery through Machine Unlearning [71.75834077528305]
Large language models trained on web-scale corpora can memorize undesirable datapoints.<n>Many machine unlearning algorithms have been proposed that aim to erase' these datapoints.<n>We propose the RESTOR framework for machine unlearning, which evaluates the ability of unlearning algorithms to perform targeted data erasure.
arXiv Detail & Related papers (2024-10-31T20:54:35Z) - Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models [19.015202590038996]
We design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack unlearned models.
We propose Latent Adrial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearned process.
We demonstrate that LAU improves unlearning effectiveness by over $53.5%$, cause only less than a $11.6%$ reduction in neighboring knowledge, and have almost no impact on the model's general capabilities.
arXiv Detail & Related papers (2024-08-20T09:36:04Z) - Verification of Machine Unlearning is Fragile [48.71651033308842]
We introduce two novel adversarial unlearning processes capable of circumventing both types of verification strategies.
This study highlights the vulnerabilities and limitations in machine unlearning verification, paving the way for further research into the safety of machine unlearning.
arXiv Detail & Related papers (2024-08-01T21:37:10Z) - Learn while Unlearn: An Iterative Unlearning Framework for Generative Language Models [52.03511469562013]
We introduce the Iterative Contrastive Unlearning (ICU) framework, which consists of three core components.<n>A Knowledge Unlearning Induction module targets specific knowledge for removal using an unlearning loss.<n>A Contrastive Learning Enhancement module preserves the model's expressive capabilities against the pure unlearning goal.<n>An Iterative Unlearning Refinement module dynamically adjusts the unlearning process through ongoing evaluation and updates.
arXiv Detail & Related papers (2024-07-25T07:09:35Z) - UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI [50.61495097098296]
We revisit the paradigm in which unlearning is used for Large Language Models (LLMs)
We introduce a concept of ununlearning, where unlearned knowledge gets reintroduced in-context.
We argue that content filtering for impermissible knowledge will be required and even exact unlearning schemes are not enough for effective content regulation.
arXiv Detail & Related papers (2024-06-27T10:24:35Z) - Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning [37.061187080745654]
We show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of $textitbenign relearning attacks.<n>With access to only a small and potentially loosely related set of data, we find that we can ''jog'' the memory of unlearned models to reverse the effects of unlearning.
arXiv Detail & Related papers (2024-06-19T09:03:21Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.
It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - Learn What You Want to Unlearn: Unlearning Inversion Attacks against Machine Unlearning [16.809644622465086]
We conduct the first investigation to understand the extent to which machine unlearning can leak the confidential content of unlearned data.
Under the Machine Learning as a Service setting, we propose unlearning inversion attacks that can reveal the feature and label information of an unlearned sample.
The experimental results indicate that the proposed attack can reveal the sensitive information of the unlearned data.
arXiv Detail & Related papers (2024-04-04T06:37:46Z) - Learning to Unlearn: Instance-wise Unlearning for Pre-trained
Classifiers [71.70205894168039]
We consider instance-wise unlearning, of which the goal is to delete information on a set of instances from a pre-trained model.
We propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation-level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information.
arXiv Detail & Related papers (2023-01-27T07:53:50Z) - Adversarial Targeted Forgetting in Regularization and Generative Based
Continual Learning Models [2.8021833233819486]
Continual (or "incremental") learning approaches are employed when additional knowledge or tasks need to be learned from subsequent batches or from streaming data.
We show that an intelligent adversary can take advantage of a continual learning algorithm's capabilities of retaining existing knowledge over time.
We show that the adversary can create a "false memory" about any task by inserting carefully-designed backdoor samples to the test instances of that task.
arXiv Detail & Related papers (2021-02-16T18:45:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.