Auditing Language Model Unlearning via Information Decomposition
- URL: http://arxiv.org/abs/2601.15111v1
- Date: Wed, 21 Jan 2026 15:51:19 GMT
- Title: Auditing Language Model Unlearning via Information Decomposition
- Authors: Anmol Goel, Alan Ritter, Iryna Gurevych
- Abstract summary: We introduce an interpretable, information-theoretic framework for auditing unlearning using Partial Information Decomposition (PID). By comparing model representations before and after unlearning, we decompose the mutual information with the forgotten data into distinct components, formalizing the notions of unlearned and residual knowledge. Our work introduces a principled, representation-level audit for unlearning, offering theoretical insight and actionable tools for safer deployment of language models.
- Score: 68.48660428111593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We expose a critical limitation in current approaches to machine unlearning in language models: despite the apparent success of unlearning algorithms, information about the forgotten data remains linearly decodable from internal representations. To systematically assess this discrepancy, we introduce an interpretable, information-theoretic framework for auditing unlearning using Partial Information Decomposition (PID). By comparing model representations before and after unlearning, we decompose the mutual information with the forgotten data into distinct components, formalizing the notions of unlearned and residual knowledge. Our analysis reveals that redundant information, shared across both models, constitutes residual knowledge that persists post-unlearning and correlates with susceptibility to known adversarial reconstruction attacks. Leveraging these insights, we propose a representation-based risk score that can guide abstention on sensitive inputs at inference time, providing a practical mechanism to mitigate privacy leakage. Our work introduces a principled, representation-level audit for unlearning, offering theoretical insight and actionable tools for safer deployment of language models.
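The abstract's central claim, that forgotten data remains linearly decodable from internal representations, can be illustrated with a simple linear probe. The sketch below is not the paper's audit: it substitutes synthetic Gaussian vectors for real hidden states and uses a scikit-learn logistic-regression probe to test whether forget-set membership is recoverable above chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden states: 64-dim representations,
# half drawn from a "forget" distribution (shifted mean) and half
# from a "retain" distribution. With real models these would be
# post-unlearning activations on forget-set vs. retain-set inputs.
n, d = 200, 64
X_forget = rng.normal(loc=0.5, scale=1.0, size=(n // 2, d))
X_retain = rng.normal(loc=0.0, scale=1.0, size=(n // 2, d))
X = np.vstack([X_forget, X_retain])
y = np.array([1] * (n // 2) + [0] * (n // 2))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A linear probe well above chance accuracy indicates that
# forget-set information is still linearly decodable.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.2f}")
```

In this toy setup the probe scores far above the 0.5 chance level, which is the kind of signal the paper interprets as residual knowledge persisting after unlearning.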
Related papers
- Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models [31.829376135133554]
Large language models often encounter conflicts between in-context knowledge retrieved at inference time and parametric knowledge acquired during pretraining. We present the first controlled study of how training conditions influence models' use of in-context and parametric knowledge. Our experiments reveal that intra-document repetition of facts fosters the development of both parametric and in-context capabilities.
arXiv Detail & Related papers (2025-09-29T06:18:18Z)
- Verifying Robust Unlearning: Probing Residual Knowledge in Unlearned Models [10.041289551532804]
We introduce the concept of Robust Unlearning, ensuring models are indistinguishable from retraining and resistant to adversarial recovery. To empirically evaluate whether unlearning techniques meet this security standard, we propose the Unlearning Mapping Attack (UMA). UMA actively probes models for forgotten traces using adversarial queries.
arXiv Detail & Related papers (2025-04-21T01:56:15Z)
- Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols [14.961054239793356]
We introduce a rigorous unlearning evaluation setup, in which forgetting classes exhibit semantic similarity to downstream task classes. We hope our benchmark serves as a standardized protocol for evaluating unlearning algorithms under realistic conditions.
arXiv Detail & Related papers (2025-03-10T07:11:34Z)
- RESTOR: Knowledge Recovery in Machine Unlearning [71.75834077528305]
Large language models trained on web-scale corpora can contain private or sensitive information. Several machine unlearning algorithms have been proposed to eliminate the effect of such datapoints. We propose the RESTOR framework for machine unlearning evaluation.
arXiv Detail & Related papers (2024-10-31T20:54:35Z)
- Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding [118.75567341513897]
Existing methods typically analyze target text in isolation or solely with non-member contexts. We propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts.
arXiv Detail & Related papers (2024-09-05T09:10:38Z)
- Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective [68.20531518525273]
We take a closer look into existing self-supervised methods of speech from an information-theoretic perspective.
We use linear probes to estimate the mutual information between the target information and learned representations.
We explore the potential of evaluating representations in a self-supervised fashion, where we estimate the mutual information between different parts of the data without using any labels.
arXiv Detail & Related papers (2024-01-16T21:13:22Z)
- RELAX: Representation Learning Explainability [10.831313203043514]
We propose RELAX, which is the first approach for attribution-based explanations of representations.
RELAX explains representations by measuring similarities in the representation space between an input and masked-out versions of itself.
We provide theoretical interpretations of RELAX and conduct a novel analysis of feature extractors trained using supervised and unsupervised learning.
arXiv Detail & Related papers (2021-12-19T14:51:31Z)
- Which Mutual-Information Representation Learning Objectives are Sufficient for Control? [80.2534918595143]
Mutual information provides an appealing formalism for learning representations of data.
This paper formalizes the sufficiency of a state representation for learning and representing the optimal policy.
Surprisingly, we find that two of these objectives can yield insufficient representations given mild and common assumptions on the structure of the MDP.
arXiv Detail & Related papers (2021-06-14T10:12:34Z)
- Bounding Information Leakage in Machine Learning [26.64770573405079]
This paper investigates fundamental bounds on information leakage.
We identify and bound the success rate of the worst-case membership inference attack.
We derive bounds on the mutual information between the sensitive attributes and model parameters.
arXiv Detail & Related papers (2021-05-09T08:49:14Z)
- Explainable Recommender Systems via Resolving Learning Representations [57.24565012731325]
Explanations can help improve user experience and reveal system defects.
We propose a novel explainable recommendation model through improving the transparency of the representation learning process.
arXiv Detail & Related papers (2020-08-21T05:30:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.