Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?
- URL: http://arxiv.org/abs/2505.02884v1
- Date: Mon, 05 May 2025 14:21:08 GMT
- Title: Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?
- Authors: Guangzhi Sun, Potsawee Manakul, Xiao Zhan, Mark Gales
- Abstract summary: We formally distinguish unlearning from obfuscation and introduce a probing-based evaluation framework. We propose DF-MCQ, a novel unlearning method that flattens the model predictive distribution over automatically generated multiple-choice questions. Experimental results demonstrate that DF-MCQ achieves unlearning with over 90% refusal rate and a random choice-level uncertainty.
- Score: 15.964825460186393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unlearning has emerged as a critical capability for large language models (LLMs) to support data privacy, regulatory compliance, and ethical AI deployment. Recent techniques often rely on obfuscation by injecting incorrect or irrelevant information to suppress knowledge. Such methods effectively constitute knowledge addition rather than true removal, often leaving models vulnerable to probing. In this paper, we formally distinguish unlearning from obfuscation and introduce a probing-based evaluation framework to assess whether existing approaches genuinely remove targeted information. Moreover, we propose DF-MCQ, a novel unlearning method that flattens the model predictive distribution over automatically generated multiple-choice questions using KL-divergence, effectively removing knowledge about target individuals and triggering appropriate refusal behaviour. Experimental results demonstrate that DF-MCQ achieves unlearning with over 90% refusal rate and a random choice-level uncertainty that is much higher than obfuscation on probing questions.
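The distribution-flattening idea at the core of DF-MCQ can be illustrated with a short, hedged sketch: minimise a KL divergence that pushes the model's predictive distribution over the options of an automatically generated multiple-choice question towards uniform, random-choice uncertainty. The function name, tensor shapes, and the particular KL direction below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def df_mcq_flatten_loss(option_logits: torch.Tensor) -> torch.Tensor:
    """Illustrative distribution-flattening loss over MCQ options.

    option_logits: tensor of shape (batch, num_options) holding the scores
    the model assigns to the candidate answers of an auto-generated
    multiple-choice question about the unlearning target.
    """
    log_probs = F.log_softmax(option_logits, dim=-1)          # model's predictive distribution (log space)
    num_options = option_logits.size(-1)
    uniform = torch.full_like(log_probs, 1.0 / num_options)   # flat target: random-choice uncertainty
    # KL divergence between the uniform target and the model's distribution;
    # driving it towards zero flattens the distribution over the options.
    return F.kl_div(log_probs, uniform, reduction="batchmean")
```

In practice the option scores would come from the LLM's log-probabilities for each candidate answer, and this term would sit inside the paper's question-generation and fine-tuning pipeline; the sketch shows only the flattening objective.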
Related papers
- Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods [0.9999629695552196]
We show that some machine unlearning methods may fail when subjected to straightforward prompt attacks.
We employ output-based, logit-based, and probe analysis to determine to what extent supposedly unlearned knowledge can be retrieved.
arXiv Detail & Related papers (2025-06-11T23:36:30Z)
- LLM Unlearning Should Be Form-Independent [14.222205207889543]
Large Language Model (LLM) unlearning aims to erase or suppress undesirable knowledge within the model.
We identify a pervasive issue underlying many downstream failures: the effectiveness of existing unlearning methods heavily depends on the form of training samples.
We introduce Rank-one Concept Redirection (ROCR), a novel training-free method, as a promising solution path.
arXiv Detail & Related papers (2025-06-09T14:21:25Z)
- Extracting Unlearned Information from LLMs with Activation Steering [46.16882599881247]
Unlearning has emerged as a solution to remove sensitive knowledge from models after training.
We propose activation steering as a method for exact information retrieval from unlearned models.
Our results demonstrate that exact information retrieval from unlearned models is possible, highlighting a severe vulnerability of current unlearning techniques.
arXiv Detail & Related papers (2024-11-04T21:42:56Z)
- Breaking Chains: Unraveling the Links in Multi-Hop Knowledge Unlearning [38.03304773600225]
Large language models (LLMs) serve as giant information stores, often including personal or copyrighted data, and retraining them from scratch is not a viable option.
We propose MUNCH, a simple uncertainty-based approach that breaks down multi-hop queries into subquestions and leverages the uncertainty of the unlearned model in final decision-making.
arXiv Detail & Related papers (2024-10-17T07:00:15Z)
- A Closer Look at Machine Unlearning for Large Language Models [46.245404272612795]
Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns.
We discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches.
arXiv Detail & Related papers (2024-10-10T16:56:05Z)
- From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks [85.84979847888157]
Large Language Models (LLMs) are known to be vulnerable to jailbreak attacks.
LLMs can implicitly unlearn harmful knowledge that was not explicitly introduced during the unlearning phase.
We empirically validate this phenomenon, which enables unlearning-based methods to decrease the Attack Success Rate.
arXiv Detail & Related papers (2024-07-03T07:14:05Z)
- UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI [50.61495097098296]
We revisit the paradigm in which unlearning is used for Large Language Models (LLMs).
We introduce a concept of ununlearning, where unlearned knowledge gets reintroduced in-context.
We argue that content filtering for impermissible knowledge will be required and even exact unlearning schemes are not enough for effective content regulation.
arXiv Detail & Related papers (2024-06-27T10:24:35Z)
- R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face challenges.
Previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not.
We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning).
Experimental results demonstrate R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
arXiv Detail & Related papers (2023-11-16T08:45:44Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
- Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination".
We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
arXiv Detail & Related papers (2023-10-07T12:06:53Z)
- Evaluating Machine Unlearning via Epistemic Uncertainty [78.27542864367821]
This work presents an evaluation of Machine Unlearning algorithms based on uncertainty.
To the best of our knowledge, this is the first definition of a general evaluation of machine unlearning.
arXiv Detail & Related papers (2022-08-23T09:37:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.