Suppression or Deletion: A Restoration-Based Representation-Level Analysis of Machine Unlearning
- URL: http://arxiv.org/abs/2602.18505v1
- Date: Wed, 18 Feb 2026 07:46:30 GMT
- Title: Suppression or Deletion: A Restoration-Based Representation-Level Analysis of Machine Unlearning
- Authors: Yurim Jang, Jaeung Lee, Dohyun Kim, Jaemin Jo, Simon S. Woo
- Abstract summary: We propose a novel restoration-based analysis framework for machine unlearning. Applying our framework to 12 major unlearning methods in image classification tasks, we find that most methods achieve high restoration rates. We propose new evaluation guidelines that prioritize representation-level verification.
- Score: 24.40457827994831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As pretrained models are increasingly shared on the web, ensuring that models can forget or delete sensitive, copyrighted, or private information upon request has become crucial. Machine unlearning has been proposed to address this challenge. However, current evaluations of unlearning methods rely on output-based metrics, which cannot verify whether information is completely deleted or merely suppressed at the representation level, where suppression is insufficient for true unlearning. To address this gap, we propose a novel restoration-based analysis framework that uses Sparse Autoencoders to identify class-specific expert features in intermediate layers and applies inference-time steering to quantitatively distinguish between suppression and deletion. Applying our framework to 12 major unlearning methods in image classification tasks, we find that most methods achieve high restoration rates of unlearned information, indicating that they only suppress information at the decision-boundary level, while preserving semantic features in intermediate representations. Notably, even retraining from pretrained checkpoints shows high restoration, revealing that robust semantic features inherited from pretraining are not removed by retraining. These results demonstrate that representation-level retention poses significant risks overlooked by output-based metrics, highlighting the need for new unlearning evaluation criteria. We propose new evaluation guidelines that prioritize representation-level verification, especially for privacy-critical applications in the era of pretrained models.
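The abstract is concrete enough to sketch the probe it describes. Below is a minimal, purely illustrative version, assuming a frozen classifier `model` whose hooked `layer` emits `[B, d]` feature vectors (e.g., the penultimate layer), a sparse autoencoder `sae` with an `encode` method and an `[m, d]` `decoder_weight` matrix, and a `forget_loader` over the erased class; none of these names come from the paper's code.

```python
import torch

def capture(model, layer, x):
    """Run the model once and grab the activations at `layer`.
    Assumes the hooked module returns a plain tensor."""
    store = {}
    h = layer.register_forward_hook(lambda m, i, o: store.update(a=o))
    with torch.no_grad():
        model(x)
    h.remove()
    return store["a"].flatten(1)                       # [B, d]

@torch.no_grad()
def class_expert_features(model, layer, sae, forget_loader, top_k=32):
    """Rank SAE features by mean activation on forget-class inputs."""
    totals, n = None, 0
    for x, _ in forget_loader:
        codes = sae.encode(capture(model, layer, x))   # [B, m] sparse codes
        totals = codes.sum(0) if totals is None else totals + codes.sum(0)
        n += codes.shape[0]
    return torch.topk(totals / n, top_k).indices       # "expert" feature ids

@torch.no_grad()
def restoration_rate(model, layer, sae, expert_ids, forget_loader, alpha=4.0):
    """Forget-class accuracy once the expert directions are steered back in."""
    direction = sae.decoder_weight[expert_ids].sum(0)  # [d] steering vector
    hook = layer.register_forward_hook(
        lambda m, i, o: o + alpha * direction.to(o))   # inference-time steering
    correct = total = 0
    for x, y in forget_loader:
        correct += (model(x).argmax(1) == y).sum().item()
        total += y.numel()
    hook.remove()
    return correct / total
```

The point of the metric: if an "unlearned" model regains forget-class accuracy under such steering, its class features were suppressed at the decision boundary rather than deleted from the intermediate representations.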
Related papers
- Auditing Language Model Unlearning via Information Decomposition [68.48660428111593]
We introduce an interpretable, information-theoretic framework for auditing unlearning using Partial Information Decomposition (PID). By comparing model representations before and after unlearning, we decompose the mutual information with the forgotten data into distinct components, formalizing the notions of unlearned and residual knowledge. Our work introduces a principled, representation-level audit for unlearning, offering theoretical insight and actionable tools for safer deployment of language models.
arXiv Detail & Related papers (2026-01-21T15:51:19Z)
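The PID audit above works at the level of mutual information between representations and the forgotten data. A much cruder, purely illustrative proxy (this is not PID, which further splits the information into unique, redundant, and synergistic parts, and it is not the authors' code) compares how much forget-membership signal survives in the features:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def residual_knowledge(reps_before, reps_after, is_forget):
    """reps_*: [N, d] penultimate-layer features; is_forget: [N] 0/1 labels.
    Summing per-feature MI ignores redundancy, so this is only a rough bound."""
    mi_before = mutual_info_classif(reps_before, is_forget).sum()
    mi_after = mutual_info_classif(reps_after, is_forget).sum()
    # Fraction of the original membership signal the unlearner failed to remove.
    return mi_after / max(mi_before, 1e-12)
```

- LLM Unlearning on Noisy Forget Sets: A Study of Incomplete, Rewritten, and Watermarked Data [69.5099112089508]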
Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. We find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved.
arXiv Detail & Related papers (2025-10-10T05:10:49Z)
- Scalable and Robust LLM Unlearning by Correcting Responses with Retrieved Exclusions [49.55618517046225]
Language models trained on web-scale corpora risk memorizing and exposing sensitive information. We propose Corrective Unlearning with Retrieved Exclusions (CURE), a novel unlearning framework. CURE verifies model outputs for leakage and revises them into safe responses.
arXiv Detail & Related papers (2025-09-30T09:07:45Z)
- Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs [38.837810490068556]
Unlearning in large language models (LLMs) aims to remove specified data, but its efficacy is typically assessed with task-level metrics like accuracy and perplexity. We demonstrate that models can appear to forget while their original behavior is easily restored through minimal fine-tuning. This phenomenon of reversibility suggests that information is merely suppressed, not genuinely erased.
arXiv Detail & Related papers (2025-05-22T16:02:10Z)
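The "minimal fine-tuning" reversibility probe described above is easy to sketch (hypothetical names throughout, not the paper's code): fine-tune the unlearned model on a handful of forget examples and watch whether forget-class performance snaps back.

```python
import copy
import torch

def relearn_probe(unlearned_model, few_forget_batches, eval_fn,
                  steps=50, lr=1e-4):
    """few_forget_batches: a small list of (x, y) tuples, recycled as needed;
    eval_fn: returns forget-class accuracy of a model."""
    model = copy.deepcopy(unlearned_model)
    model.train()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    it = iter(few_forget_batches)
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:
            it = iter(few_forget_batches)
            x, y = next(it)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    # Rapid recovery means the information was suppressed, not erased.
    return eval_fn(model)
```

- Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols [14.961054239793356]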
We introduce a rigorous unlearning evaluation setup, in which forgetting classes exhibit semantic similarity to downstream task classes. We hope our benchmark serves as a standardized protocol for evaluating unlearning algorithms under realistic conditions.
arXiv Detail & Related papers (2025-03-10T07:11:34Z)
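One plausible reading of the setup above (a guess, not the authors' protocol): choose forget classes whose embeddings sit closest to the downstream task's classes, so that forgetting genuinely risks collateral damage.

```python
import numpy as np

def similar_forget_classes(class_embs, downstream_ids, n_forget=5):
    """class_embs: [C, d] unit-norm class embeddings (e.g., text-encoder
    outputs of class names); downstream_ids: indices of the task classes."""
    sims = class_embs @ class_embs[downstream_ids].T   # [C, |D|] cosines
    sims[downstream_ids] = -np.inf                     # never forget a task class
    score = sims.max(axis=1)                           # nearest downstream class
    return np.argsort(-score)[:n_forget]               # hardest-to-forget picks
```

- RESTOR: Knowledge Recovery in Machine Unlearning [71.75834077528305]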
Large language models trained on web-scale corpora can contain private or sensitive information. Several machine unlearning algorithms have been proposed to eliminate the effect of such datapoints. We propose the RESTOR framework for machine unlearning evaluation.
arXiv Detail & Related papers (2024-10-31T20:54:35Z)
- Revisiting Machine Unlearning with Dimensional Alignment [46.29161970063835]
We introduce a novel evaluation metric for machine unlearning, coined dimensional alignment, which measures the alignment between the eigenspaces of the forget and retain set samples. Our framework effectively eliminates information from the forget set and preserves knowledge from the retain set.
arXiv Detail & Related papers (2024-07-25T02:05:15Z)
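The exact dimensional-alignment formula is in the cited paper; the generic principal-angle version of an eigenspace-alignment score could look like the following sketch (illustrative only):

```python
import numpy as np

def dimensional_alignment(feats_forget, feats_retain, k=10):
    """feats_*: [N, d] representation matrices; compares the top-k
    principal subspaces of the forget and retain features."""
    def top_dirs(feats):
        centered = feats - feats.mean(0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return vt[:k]                  # [k, d] leading covariance eigenvectors
    u_f, u_r = top_dirs(feats_forget), top_dirs(feats_retain)
    # Mean squared cosine of principal angles: 1 = identical subspaces.
    return np.linalg.norm(u_f @ u_r.T) ** 2 / k
```

- Partially Blinded Unlearning: Class Unlearning for Deep Networks a Bayesian Perspective [4.31734012105466]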
Machine Unlearning is the process of selectively discarding information designated to specific sets or classes of data from a pre-trained model.
We propose a methodology tailored for the purposeful elimination of information linked to a specific class of data from a pre-trained classification network.
Our novel approach, termed Partially-Blinded Unlearning (PBU), surpasses existing state-of-the-art class unlearning methods, demonstrating superior effectiveness.
arXiv Detail & Related papers (2024-03-24T17:33:22Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification. XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations. Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
- Learning to Unlearn: Instance-wise Unlearning for Pre-trained Classifiers [71.70205894168039]
We consider instance-wise unlearning, of which the goal is to delete information on a set of instances from a pre-trained model.
We propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation-level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information.
arXiv Detail & Related papers (2023-01-27T07:53:50Z)
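The weight-importance idea in the entry above can be sketched with a diagonal-Fisher-style proxy (illustrative, not the paper's exact metric): accumulate squared loss gradients on the forget instances to rank parameters.

```python
import torch

def forget_importance(model, forget_loader,
                      loss_fn=torch.nn.CrossEntropyLoss()):
    """Per-parameter squared-gradient scores on the forget instances."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in forget_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.detach() ** 2
        n_batches += 1
    # Parameters scoring high here but low under the same procedure on
    # retain data are candidates for targeted editing.
    return {n: s / max(n_batches, 1) for n, s in scores.items()}
```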