Understanding Empirical Unlearning with Combinatorial Interpretability
- URL: http://arxiv.org/abs/2602.19215v1
- Date: Sun, 22 Feb 2026 14:51:48 GMT
- Title: Understanding Empirical Unlearning with Combinatorial Interpretability
- Authors: Shingo Kodama, Niv Cohen, Micah Adler, Nir Shavit
- Abstract summary: The recently developed framework of combinatorial interpretability enables direct inspection of the knowledge encoded in model weights. We reproduce baseline unlearning methods within this interpretability setting and examine their behavior along two dimensions. Our results shed light, within a fully interpretable setting, on how knowledge can persist despite unlearning and when it might resurface.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While many recent methods aim to unlearn or remove knowledge from pretrained models, seemingly erased knowledge often persists and can be recovered in various ways. Because large foundation models are far from interpretable, understanding whether and how such knowledge persists remains a significant challenge. To address this, we turn to the recently developed framework of combinatorial interpretability. This framework, designed for two-layer neural networks, enables direct inspection of the knowledge encoded in the model weights. We reproduce baseline unlearning methods within the combinatorial interpretability setting and examine their behavior along two dimensions: (i) whether they truly remove knowledge of a target concept (the concept we wish to remove) or merely inhibit its expression while retaining the underlying information, and (ii) how easily the supposedly erased knowledge can be recovered through various fine-tuning operations. Our results shed light within a fully interpretable setting on how knowledge can persist despite unlearning and when it might resurface.
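The abstract studies baseline unlearning on two-layer networks and asks whether knowledge is truly removed or merely suppressed. As a minimal sketch of the kind of baseline such studies reproduce (not the paper's actual code or setup), the example below trains a tiny two-layer ReLU network on two toy "concepts" and then applies gradient ascent on the forget set, a common unlearning baseline. All names, dimensions, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(W1, W2, X):
    H = np.maximum(X @ W1, 0.0)   # hidden ReLU activations
    return H, H @ W2              # hidden layer and output

def mse_grads(W1, W2, X, Y):
    H, out = forward(W1, W2, X)
    err = out - Y                 # gradient of 0.5 * MSE w.r.t. output
    gW2 = H.T @ err / len(X)
    gH = (err @ W2.T) * (H > 0)   # backprop through ReLU
    gW1 = X.T @ gH / len(X)
    return gW1, gW2

def loss(W1, W2, X, Y):
    _, out = forward(W1, W2, X)
    return float(0.5 * np.mean((out - Y) ** 2))

# Two toy "concepts": well-separated input clusters with distinct targets.
X_keep = rng.normal(loc=2.0, size=(64, 8));  Y_keep = np.ones((64, 1))
X_forget = rng.normal(loc=-2.0, size=(64, 8)); Y_forget = -np.ones((64, 1))

W1 = rng.normal(scale=0.3, size=(8, 16))
W2 = rng.normal(scale=0.3, size=(16, 1))

# Pretrain on both concepts.
for _ in range(500):
    for X, Y in ((X_keep, Y_keep), (X_forget, Y_forget)):
        gW1, gW2 = mse_grads(W1, W2, X, Y)
        W1 -= 0.05 * gW1; W2 -= 0.05 * gW2

before = loss(W1, W2, X_forget, Y_forget)

# Baseline unlearning: ascend the loss on the forget set only.
for _ in range(20):
    gW1, gW2 = mse_grads(W1, W2, X_forget, Y_forget)
    W1 += 0.05 * gW1; W2 += 0.05 * gW2

after = loss(W1, W2, X_forget, Y_forget)
assert before < after  # forget-set error rises after ascent
```

The forget-set loss increases, i.e. the concept's *expression* is suppressed; whether the weights still *encode* the concept (and whether fine-tuning can recover it) is exactly the distinction the paper's combinatorial-interpretability analysis is designed to inspect directly.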
Related papers
- Understanding the Dilemma of Unlearning for Large Language Models [50.54260066313032]
Unlearning seeks to remove specific knowledge from large language models (LLMs). We propose unPact, an interpretable framework for unlearning via prompt attribution and contribution tracking.
arXiv Detail & Related papers (2025-09-29T12:15:19Z)
- Language Guided Concept Bottleneck Models for Interpretable Continual Learning [62.09201360376577]
Continual learning (CL) aims to enable learning systems to constantly acquire new knowledge without forgetting previously learned information. Most existing CL methods focus primarily on preserving learned knowledge to improve model performance. We introduce a novel framework that integrates language-guided Concept Bottleneck Models to address both challenges.
arXiv Detail & Related papers (2025-03-30T02:41:55Z)
- Unlearning through Knowledge Overwriting: Reversible Federated Unlearning via Selective Sparse Adapter [35.65566527544619]
Federated learning is a promising paradigm for privacy-preserving collaborative model training. We propose FUSED, which first identifies critical layers by analyzing each layer's sensitivity to knowledge. Adapters are then trained without altering the original parameters, overwriting the unlearned knowledge with the remaining knowledge.
arXiv Detail & Related papers (2025-02-28T04:35:26Z)
- FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge [27.571021368687372]
We define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method fails to erase interconnected knowledge. Based on this definition, we introduce a new benchmark, FaithUn, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. We propose a novel unlearning method, KLUE, which updates only knowledge-related neurons to achieve faithful unlearning.
arXiv Detail & Related papers (2025-02-26T15:11:03Z)
- How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training [92.88889953768455]
Large Language Models (LLMs) face a critical gap in understanding how they internalize new knowledge. We identify computational subgraphs that facilitate knowledge storage and processing.
arXiv Detail & Related papers (2025-02-16T16:55:43Z)
- Gradual Learning: Optimizing Fine-Tuning with Partially Mastered Knowledge in Large Language Models [51.20499954955646]
Large language models (LLMs) acquire vast amounts of knowledge from extensive text corpora during the pretraining phase.
In later stages such as fine-tuning and inference, the model may encounter knowledge not covered in the initial training.
We propose a two-stage fine-tuning strategy to improve the model's overall test accuracy and knowledge retention.
arXiv Detail & Related papers (2024-10-08T08:35:16Z)
- Anti-Retroactive Interference for Lifelong Learning [65.50683752919089]
We design a paradigm for lifelong learning based on meta-learning and associative mechanism of the brain.
It tackles the problem from two aspects: extracting knowledge and memorizing knowledge.
Theoretical analysis shows that the proposed learning paradigm can make the models of different tasks converge to the same optimum.
arXiv Detail & Related papers (2022-08-27T09:27:36Z)
- Learning with Recoverable Forgetting [77.56338597012927]
Learning wIth Recoverable Forgetting (LIRF) explicitly handles task- or sample-specific knowledge removal and recovery.
Specifically, LIRF brings in two innovative schemes, namely knowledge deposit and withdrawal.
We conduct experiments on several datasets and demonstrate that the proposed LIRF strategy yields encouraging results with strong generalization capability.
arXiv Detail & Related papers (2022-07-17T16:42:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.