The Unseen Threat: Residual Knowledge in Machine Unlearning under Perturbed Samples
- URL: http://arxiv.org/abs/2601.22359v1
- Date: Thu, 29 Jan 2026 22:10:13 GMT
- Title: The Unseen Threat: Residual Knowledge in Machine Unlearning under Perturbed Samples
- Authors: Hsiang Hsu, Pradeep Niroula, Zichang He, Ivan Brugere, Freddy Lecue, Chun-Fu Chen,
- Abstract summary: We show that slight perturbations of forget samples may still be correctly recognized by the unlearned model. We propose a fine-tuning strategy, named RURK, that penalizes the model's ability to re-recognize perturbed forget samples.
- Score: 16.030881842099998
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine unlearning offers a practical alternative to avoid full model re-training by approximately removing the influence of specific user data. While existing methods certify unlearning via statistical indistinguishability from re-trained models, these guarantees do not naturally extend to model outputs when inputs are adversarially perturbed. In particular, slight perturbations of forget samples may still be correctly recognized by the unlearned model - even when a re-trained model fails to do so - revealing a novel privacy risk: information about the forget samples may persist in their local neighborhood. In this work, we formalize this vulnerability as residual knowledge and show that it is inevitable in high-dimensional settings. To mitigate this risk, we propose a fine-tuning strategy, named RURK, that penalizes the model's ability to re-recognize perturbed forget samples. Experiments on vision benchmarks with deep neural networks demonstrate that residual knowledge is prevalent across existing unlearning methods and that our approach effectively prevents residual knowledge.
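The idea of residual knowledge and the RURK-style penalty described in the abstract can be made concrete with a rough sketch. The snippet below is illustrative only and is not the authors' code: it assumes a PyTorch image classifier, probes for perturbed forget samples that the unlearned model still recognizes via a PGD-style search, and adds a hypothetical penalty term during fine-tuning. The function names, the perturbation budget `eps`, the step count, and the weight `lam` are all arbitrary choices.

```python
import torch
import torch.nn.functional as F

def perturb_forget_samples(model, x, y, eps=8/255, steps=5):
    """Search an eps-ball around forget samples for perturbations the model
    still classifies as the forgotten label y (PGD-style, illustrative)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Step toward *lower* loss on the forgotten label, i.e. toward inputs
        # the model still "recognizes" despite unlearning.
        delta = (delta - (eps / steps) * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()

def residual_knowledge_rate(model, x_forget, y_forget):
    """Fraction of perturbed forget samples still assigned their forgotten
    label -- a simple probe for residual knowledge in the local neighborhood."""
    x_adv = perturb_forget_samples(model, x_forget, y_forget)
    with torch.no_grad():
        preds = model(x_adv).argmax(dim=1)
    return (preds == y_forget).float().mean().item()

def finetune_step(model, optimizer, x_retain, y_retain, x_forget, y_forget, lam=1.0):
    """One RURK-style fine-tuning step (sketch): preserve retain accuracy while
    penalizing confident re-recognition of perturbed forget samples."""
    x_adv = perturb_forget_samples(model, x_forget, y_forget)
    retain_loss = F.cross_entropy(model(x_retain), y_retain)
    # Mean log-probability assigned to the forgotten labels on the perturbed
    # forget samples; minimizing it pushes that probability down.
    forget_logp = F.log_softmax(model(x_adv), dim=1)
    forget_penalty = forget_logp.gather(1, y_forget.unsqueeze(1)).mean()
    loss = retain_loss + lam * forget_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A practical implementation would balance the two terms more carefully (the penalty is unbounded below) and average the probe over several perturbation seeds; this sketch only shows the overall structure of the probe and the penalty.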
Related papers
- ROKA: Robust Knowledge Unlearning against Adversaries [0.9236074230806578]
We introduce a new unlearning-induced attack model, namely the indirect unlearning attack, which does not require data manipulation but exploits the consequences of knowledge contamination to degrade model accuracy on security-critical predictions. Our work is the first to provide a theoretical guarantee for knowledge preservation during unlearning. Evaluations on various large models, including vision transformers, multi-modal models, and large language models, show that ROKA effectively unlearns targets while preserving, or even enhancing, accuracy on retained data.
arXiv Detail & Related papers (2026-02-28T03:30:39Z)
- REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs [0.1784233255402269]
Machine unlearning aims to remove the influence of specific training data from a model without requiring full retraining. We propose REMIND, a novel evaluation method aiming to detect the subtle remaining influence of unlearned data. We show that unlearned data yield flatter, less steep loss landscapes, while retained or unrelated data exhibit sharper, more volatile patterns.
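A minimal sketch of the loss-landscape probe described above, assuming a generic differentiable classifier over continuous inputs (REMIND targets LLMs, where the perturbations would instead be applied in embedding space); `sigma` and `n_samples` are illustrative hyperparameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def input_loss_sharpness(model, x, y, sigma=0.01, n_samples=16):
    """Probe the local input loss landscape around x: evaluate the loss at
    several Gaussian-perturbed copies and summarize how much it varies.
    Per the summary above, unlearned data tend to show flatter (less
    variable) landscapes than retained or unrelated data."""
    model.eval()
    losses = []
    with torch.no_grad():
        for _ in range(n_samples):
            x_noisy = x + sigma * torch.randn_like(x)
            losses.append(F.cross_entropy(model(x_noisy), y).item())
    losses = torch.tensor(losses)
    return {"mean_loss": losses.mean().item(), "loss_std": losses.std().item()}
```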
arXiv Detail & Related papers (2025-11-06T09:58:19Z)
- Probing Knowledge Holes in Unlearned LLMs [23.377732810945172]
Machine unlearning has emerged as a prevalent technical solution for selectively removing unwanted knowledge absorbed during pre-training. We find that unlearning may inadvertently create "knowledge holes": unintended losses of benign knowledge that standard benchmarks fail to capture. We propose a test case generation framework that explores both immediate neighbors of unlearned content and broader areas of potential failures.
arXiv Detail & Related papers (2025-10-27T03:11:53Z)
- Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy [18.219835803238837]
We show that approximate unlearning algorithms fail to adequately protect the privacy of unlearned data. We propose the Reminiscence Attack (ReA), which amplifies the correlation between residuals and membership privacy. We develop a dual-phase approximate unlearning framework that first eliminates deep-layer unlearned data traces and then enforces convergence stability.
arXiv Detail & Related papers (2025-07-28T07:12:12Z)
- Verifying Robust Unlearning: Probing Residual Knowledge in Unlearned Models [10.041289551532804]
We introduce the concept of Robust Unlearning, ensuring models are indistinguishable from retraining and resistant to adversarial recovery. To empirically evaluate whether unlearning techniques meet this security standard, we propose the Unlearning Mapping Attack (UMA). UMA actively probes models for forgotten traces using adversarial queries.
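One simple form such an adversarial query could take is sketched below; this is not the UMA procedure itself, just an illustrative probe that optimizes an input from noise toward a supposedly forgotten class. The input shape, step count, and learning rate are assumptions for a CIFAR-like classifier.

```python
import torch
import torch.nn.functional as F

def adversarial_query(model, target_class, shape=(1, 3, 32, 32), steps=200, lr=0.05):
    """Optimize an input from random noise to maximize the probability that
    `model` assigns to `target_class` -- a query for traces of a class that
    was supposed to be forgotten."""
    model.eval()
    x = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    y = torch.full((shape[0],), target_class, dtype=torch.long)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    with torch.no_grad():
        conf = F.softmax(model(x), dim=1)[:, target_class].mean().item()
    return x.detach(), conf

# A large gap between the unlearned model's confidence on such queries and a
# retrained reference model's confidence suggests forgotten traces remain.
```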
arXiv Detail & Related papers (2025-04-21T01:56:15Z)
- RESTOR: Knowledge Recovery in Machine Unlearning [71.75834077528305]
Large language models trained on web-scale corpora can contain private or sensitive information. Several machine unlearning algorithms have been proposed to eliminate the effect of such datapoints. We propose the RESTOR framework for machine unlearning evaluation.
arXiv Detail & Related papers (2024-10-31T20:54:35Z)
- Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning [49.242828934501986]
Multimodal contrastive learning has emerged as a powerful paradigm for building high-quality features.
Backdoor attacks subtly embed malicious behaviors within the model during training.
We introduce an innovative token-based localized forgetting training regime.
arXiv Detail & Related papers (2024-03-24T18:33:15Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence demonstrating that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
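A much-simplified sketch of the gradient-projection idea follows. It is not the paper's construction (PGU projects onto a subspace built from retained-data gradients); this version only removes the component of the unlearning update along a single retain-batch gradient, and the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def projected_unlearning_step(model, optimizer, x_forget, y_forget, x_retain, y_retain):
    """One illustrative projected-gradient unlearning step: compute a gradient
    that raises the loss on forget data, then remove its component along the
    retain-data gradient so the update interferes less with retained knowledge."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the retain loss (direction we must not disturb).
    retain_loss = F.cross_entropy(model(x_retain), y_retain)
    g_retain = torch.cat([g.flatten() for g in torch.autograd.grad(retain_loss, params)])

    # Gradient that increases the loss on the forget data (gradient ascent).
    forget_loss = -F.cross_entropy(model(x_forget), y_forget)
    g_flat = torch.cat([g.flatten() for g in torch.autograd.grad(forget_loss, params)])

    # Project out the component along the retain gradient.
    coef = torch.dot(g_flat, g_retain) / (torch.dot(g_retain, g_retain) + 1e-12)
    g_proj = g_flat - coef * g_retain

    # Write the projected gradient back into .grad and take an optimizer step.
    offset = 0
    for p in params:
        n = p.numel()
        p.grad = g_proj[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
    optimizer.zero_grad()
```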
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many of the predictive signals in the data can stem from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training scheme to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
- Learning to Unlearn: Instance-wise Unlearning for Pre-trained Classifiers [71.70205894168039]
We consider instance-wise unlearning, whose goal is to delete information about a set of instances from a pre-trained model.
We propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation level, and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information.
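A rough sketch of a weight-importance scheme of this kind, assuming a Fisher-style squared-gradient importance estimated on retained data; the importance metric, the quantile threshold, and the function names are illustrative rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def weight_importance(model, retain_loader, n_batches=10):
    """Illustrative Fisher-style importance: average squared gradient of the
    retain loss per parameter. High values mark weights the retained data rely on."""
    importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    count = 0
    for x, y in retain_loader:
        if count >= n_batches:
            break
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                importance[n] += p.grad.detach() ** 2
        count += 1
    return {n: v / max(count, 1) for n, v in importance.items()}

def mask_important_updates(model, importance, quantile=0.9):
    """Zero the gradients of the top-importance weights before an unlearning
    step, so parameters that carry retained knowledge are left untouched."""
    for n, p in model.named_parameters():
        if p.grad is None:
            continue
        flat = importance[n].flatten()
        k = max(1, int(quantile * flat.numel()))
        thresh = flat.kthvalue(k).values  # ~quantile-th value as the cutoff
        p.grad[importance[n] >= thresh] = 0.0
```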
arXiv Detail & Related papers (2023-01-27T07:53:50Z)
- Disrupting Model Training with Adversarial Shortcuts [12.31803688544684]
We present a proof-of-concept approach for the image classification setting.
We propose methods based on the notion of adversarial shortcuts, which encourage models to rely on non-robust signals rather than semantic features.
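As a toy illustration of the shortcut idea (not the paper's method), the snippet below stamps a small, fixed, class-dependent pattern onto each training image so that a model can fit the label from the patch alone rather than from semantic content; the patch size and blending strength are arbitrary.

```python
import torch

def add_class_shortcut(images, labels, num_classes, patch_size=4, strength=0.5):
    """Toy adversarial-shortcut transform: blend a fixed class-dependent
    pattern into a corner patch of every image, creating a non-robust signal
    that perfectly predicts the label."""
    g = torch.Generator().manual_seed(0)  # same patterns across the whole dataset
    patterns = torch.rand(num_classes, images.shape[1], patch_size, patch_size,
                          generator=g)
    out = images.clone()
    for i, y in enumerate(labels):
        out[i, :, :patch_size, :patch_size] = (
            (1 - strength) * out[i, :, :patch_size, :patch_size]
            + strength * patterns[y]
        )
    return out
```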
arXiv Detail & Related papers (2021-06-12T01:04:41Z)
- Automatic Recall Machines: Internal Replay, Continual Learning and the Brain [104.38824285741248]
Replay in neural networks involves training on sequential data with memorized samples, which counteracts forgetting of previous behavior caused by non-stationarity.
We present a method where these auxiliary samples are generated on the fly, given only the model that is being trained for the assessed objective.
Instead, the implicit memory of learned samples within the assessed model itself is exploited.
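A simplified stand-in for such on-the-fly internal replay is sketched below, assuming an image classifier: random inputs are optimized toward high confidence on previously learned classes using only the model, then mixed into training as replay. The paper's actual objective differs; the shape, step count, and regularization weight here are illustrative.

```python
import torch
import torch.nn.functional as F

def recall_samples(model, past_classes, batch_size=16, shape=(3, 32, 32),
                   steps=100, lr=0.1):
    """Generate replay inputs from the model alone (no stored data): start from
    noise and optimize each input toward high confidence on a previously
    learned class."""
    model.eval()
    targets = torch.tensor(past_classes)[torch.randint(len(past_classes), (batch_size,))]
    x = torch.randn(batch_size, *shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), targets)
        # Small L2 prior keeps the generated inputs from drifting to extreme values.
        loss = loss + 1e-4 * x.pow(2).mean()
        loss.backward()
        opt.step()
    return x.detach(), targets
```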
arXiv Detail & Related papers (2020-06-22T15:07:06Z)