Probing Knowledge Holes in Unlearned LLMs
- URL: http://arxiv.org/abs/2511.00030v1
- Date: Mon, 27 Oct 2025 03:11:53 GMT
- Title: Probing Knowledge Holes in Unlearned LLMs
- Authors: Myeongseob Ko, Hoang Anh Just, Charles Fleming, Ming Jin, Ruoxi Jia
- Abstract summary: Machine unlearning has emerged as a prevalent technical solution for selectively removing unwanted knowledge absorbed during pre-training. We find that unlearning may inadvertently create "knowledge holes" -- unintended losses of benign knowledge that standard benchmarks fail to capture. We propose a test case generation framework that explores both immediate neighbors of unlearned content and broader areas of potential failures.
- Score: 23.377732810945172
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine unlearning has emerged as a prevalent technical solution for selectively removing unwanted knowledge absorbed during pre-training, without requiring full retraining. While recent unlearning techniques can effectively remove undesirable content without severely compromising performance on standard benchmarks, we find that they may inadvertently create "knowledge holes" -- unintended losses of benign knowledge that standard benchmarks fail to capture. To probe where unlearned models reveal knowledge holes, we propose a test case generation framework that explores both immediate neighbors of unlearned content and broader areas of potential failures. Our evaluation demonstrates significant hidden costs of unlearning: up to 98.7% of the test cases yield irrelevant or nonsensical responses from unlearned models, despite being answerable by the pretrained model. These findings necessitate rethinking the conventional approach to evaluating knowledge preservation in unlearning, moving beyond standard, static benchmarks.
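The abstract describes the probing framework only at a high level. Below is a minimal sketch of the core loop it implies, assuming Hugging Face `transformers`-style models and a crude substring check as the answerability test; the `generate` helper, the pairing of questions with reference answers, and the scoring rule are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch: flag "knowledge holes" as benign neighbor questions that
# the pretrained model answers correctly but the unlearned model does not.
# The substring check is an illustrative stand-in for a real judge.

def generate(model, tokenizer, prompt, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

def probe_knowledge_holes(pretrained, unlearned, tokenizer, test_cases):
    """test_cases: (question, reference_answer) pairs drawn from benign
    neighbors of the unlearned content (hypothetical input format)."""
    holes = []
    for question, reference in test_cases:
        base = generate(pretrained, tokenizer, question)
        ablated = generate(unlearned, tokenizer, question)
        # A "knowledge hole": pretrained answers correctly, unlearned fails.
        if (reference.lower() in base.lower()
                and reference.lower() not in ablated.lower()):
            holes.append((question, base, ablated))
    return holes
```

In the paper the test cases are generated automatically around the unlearned content; here they are taken as given, and a practical evaluation would replace the substring check with an LLM judge or human rating.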
Related papers
- The Unseen Threat: Residual Knowledge in Machine Unlearning under Perturbed Samples [16.030881842099998]
We show that slight perturbations of forget samples may still be correctly recognized by the unlearned model. We propose a fine-tuning strategy, named RURK, that penalizes the model's ability to re-recognize forget samples.
arXiv Detail & Related papers (2026-01-29T22:10:13Z)
- Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion [27.526437626781597]
We propose Knowledge Density-Guided Unlearning via Blocks Reinsertion (KUnBR) for large language models. KUnBR identifies layers with rich harmful knowledge and then thoroughly eliminates that knowledge via a re-insertion strategy. Experiments conducted on several unlearning and general capability benchmarks demonstrate that KUnBR achieves state-of-the-art forgetting performance.
arXiv Detail & Related papers (2025-11-11T14:12:43Z)
- Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding [18.830386174815583]
We show that almost all existing unlearning methods fail to achieve true forgetting in practice. We introduce leak@$k$, a new meta-evaluation metric that quantifies the likelihood of forgotten knowledge reappearing (a minimal sketch of one such metric appears after this list).
arXiv Detail & Related papers (2025-11-07T02:30:05Z)
- Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check [60.77691669644931]
We propose Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models. We show that FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
arXiv Detail & Related papers (2025-10-14T20:50:30Z)
- Scalable and Robust LLM Unlearning by Correcting Responses with Retrieved Exclusions [49.55618517046225]
Language models trained on web-scale corpora risk memorizing and exposing sensitive information. We propose Corrective Unlearning with Retrieved Exclusions (CURE), a novel unlearning framework. CURE verifies model outputs for leakage and revises them into safe responses.
arXiv Detail & Related papers (2025-09-30T09:07:45Z)
- Unlearning That Lasts: Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs [31.768387661474904]
Unlearning in large language models (LLMs) involves precisely removing specific information from a pre-trained model. This is crucial for ensuring the safety of LLMs by deleting private data or harmful knowledge acquired during pre-training. We introduce JensUn, which leverages the Jensen-Shannon Divergence as the training objective for both forget and retain sets. In extensive experiments, JensUn achieves a better forget-utility trade-off than competing methods and even demonstrates strong resilience to benign relearning.
arXiv Detail & Related papers (2025-09-02T20:38:53Z)
- Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods [0.9999629695552196]
We demonstrate that certain machine unlearning methods may fail under straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families using output-based, logit-based, and probe analysis.
arXiv Detail & Related papers (2025-06-11T23:36:30Z)
- Verifying Robust Unlearning: Probing Residual Knowledge in Unlearned Models [10.041289551532804]
We introduce the concept of Robust Unlearning, ensuring models are indistinguishable from retraining and resistant to adversarial recovery. To empirically evaluate whether unlearning techniques meet this security standard, we propose the Unlearning Mapping Attack (UMA). UMA actively probes models for forgotten traces using adversarial queries.
arXiv Detail & Related papers (2025-04-21T01:56:15Z)
- Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models [81.62767292169225]
We investigate knowledge forgetting in large language models with a focus on its generalisation. We propose PerMU, a novel probability perturbation-based unlearning paradigm. Experiments are conducted on a diverse range of datasets, including TOFU, Harry Potter, ZsRE, WMDP, and MUSE.
arXiv Detail & Related papers (2025-02-27T11:03:33Z)
- RESTOR: Knowledge Recovery in Machine Unlearning [71.75834077528305]
Large language models trained on web-scale corpora can contain private or sensitive information. Several machine unlearning algorithms have been proposed to eliminate the effect of such datapoints. We propose the RESTOR framework for machine unlearning evaluation.
arXiv Detail & Related papers (2024-10-31T20:54:35Z)
- UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI [50.61495097098296]
We revisit the paradigm in which unlearning is used for Large Language Models (LLMs).
We introduce the concept of ununlearning, in which unlearned knowledge is reintroduced in-context. We argue that content filtering for impermissible knowledge will be required, and that even exact unlearning schemes are not enough for effective content regulation.
arXiv Detail & Related papers (2024-06-27T10:24:35Z)
- R-Tuning: Instructing Large Language Models to Say 'I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face challenges. Previous instruction tuning methods force the model to complete a sentence regardless of whether it possesses the relevant knowledge. We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning). Experimental results demonstrate that R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
arXiv Detail & Related papers (2023-11-16T08:45:44Z)
- A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA [67.75989848202343]
This paper presents a unified end-to-end retriever-reader framework towards knowledge-based VQA.
We shed light on the multi-modal implicit knowledge in vision-language pre-training models and mine its potential for knowledge reasoning. Our scheme not only provides guidance for knowledge retrieval but also drops instances that are potentially error-prone for question answering.
arXiv Detail & Related papers (2022-06-30T02:35:04Z)
- Does Pre-training Induce Systematic Inference? How Masked Language Models Acquire Commonsense Knowledge [91.15301779076187]
We introduce verbalized knowledge into the minibatches of a BERT model during pre-training and evaluate how well the model generalizes to supported inferences.
We find generalization does not improve over the course of pre-training, suggesting that commonsense knowledge is acquired from surface-level, co-occurrence patterns rather than induced, systematic reasoning.
arXiv Detail & Related papers (2021-12-16T03:13:04Z)
- Do Not Forget to Attend to Uncertainty while Mitigating Catastrophic Forgetting [29.196246255389664]
One of the major limitations of deep learning models is that they face catastrophic forgetting in an incremental learning scenario.
We consider a Bayesian formulation to obtain the data and model uncertainties.
We also incorporate a self-attention framework to address the incremental learning problem.
arXiv Detail & Related papers (2021-02-03T06:54:52Z)
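The leak@$k$ entry above defines its metric only informally; as noted there, here is a minimal sketch of one plausible reading: for each forget-set prompt, draw $k$ completions under probabilistic decoding and count a leak if any completion contains the forgotten answer. The function name, input format, and substring check are assumptions for illustration; the paper's exact scoring may differ.

```python
# Minimal sketch of a leak@k-style metric: the fraction of forget-set
# prompts for which at least one of k sampled completions still contains
# the supposedly forgotten answer (substring check as a crude proxy).

def leak_at_k(model, tokenizer, forget_set, k=8, temperature=1.0,
              max_new_tokens=64):
    """forget_set: (prompt, forgotten_answer) pairs (hypothetical format)."""
    leaked = 0
    for prompt, answer in forget_set:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(
            **inputs,
            do_sample=True,              # probabilistic decoding
            temperature=temperature,
            num_return_sequences=k,      # k independent samples per prompt
            max_new_tokens=max_new_tokens,
        )
        completions = tokenizer.batch_decode(
            out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        if any(answer.lower() in c.lower() for c in completions):
            leaked += 1
    return leaked / len(forget_set)
```

Higher values indicate more residual leakage; a perfectly unlearned model would score near zero even as $k$ grows.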