Related papers: Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

URL: http://arxiv.org/abs/2506.10236v2
Date: Thu, 14 Aug 2025 05:03:53 GMT
Title: Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods
Authors: Yeonwoo Jang, Shariqah Hossain, Ashwin Sreevatsa, Diogo Cruz,
Abstract summary: We demonstrate that certain machine unlearning methods may fail under straightforward prompt attacks.<n>We systematically evaluate eight unlearning techniques across three model families using output-based, logit-based, and probe analysis.
Score: 0.9999629695552196
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this work, we demonstrate that certain machine unlearning methods may fail under straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families using output-based, logit-based, and probe analysis to assess the extent to which supposedly unlearned knowledge can be retrieved. While methods like RMU and TAR exhibit robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., prepending Hindi filler text to the original prompt recovers 57.3% accuracy). Our logit analysis further indicates that unlearned models are unlikely to hide knowledge through changes in answer formatting, given the strong correlation between output and logit accuracy. These findings challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish between genuine knowledge removal and superficial output suppression. To facilitate further research, we publicly release our evaluation framework to easily evaluate prompting techniques to retrieve unlearned knowledge.

Related papers

Auditing Language Model Unlearning via Information Decomposition [68.48660428111593]
We introduce an interpretable, information-theoretic framework for auditing unlearning using Partial Information Decomposition (PID)<n>By comparing model representations before and after unlearning, we decompose the mutual information with the forgotten data into distinct components, formalizing the notions of unlearned and residual knowledge.<n>Our work introduces a principled, representation-level audit for unlearning, offering theoretical insight and actionable tools for safer deployment of language models.
arXiv Detail & Related papers (2026-01-21T15:51:19Z)
Probing Knowledge Holes in Unlearned LLMs [23.377732810945172]
Machine unlearning has emerged as a prevalent technical solution for selectively removing unwanted knowledge absorbed during pre-training.<n>We find that unlearning may inadvertently create knowledge holes'' -- unintended losses of benign knowledge that standard benchmarks fail to capture.<n>We propose a test case generation framework that explores both immediate neighbors of unlearned content and broader areas of potential failures.
arXiv Detail & Related papers (2025-10-27T03:11:53Z)
Scalable and Robust LLM Unlearning by Correcting Responses with Retrieved Exclusions [49.55618517046225]
Language models trained on web-scale corpora risk memorizing and exposing sensitive information.<n>We propose Corrective Unlearning with Retrieved Exclusions (CURE), a novel unlearning framework.<n>CURE verifies model outputs for leakage and revises them into safe responses.
arXiv Detail & Related papers (2025-09-30T09:07:45Z)
Understanding the Dilemma of Unlearning for Large Language Models [50.54260066313032]
Unlearning seeks to remove specific knowledge from large language models (LLMs)<n>We propose unPact, an interpretable framework for unlearning via prompt attribution and contribution tracking.
arXiv Detail & Related papers (2025-09-29T12:15:19Z)
Step-by-Step Reasoning Attack: Revealing 'Erased' Knowledge in Large Language Models [9.719371187651591]
Unlearning techniques suppress and leave the knowledge beneath the surface, thus making it retrievable with the right prompts.<n>We introduce a step-by-step reasoning-based black-box attack, Sleek, that systematically exposes unlearning failures.<n>Of the generated adversarial prompts, 62.5% successfully retrieved forgotten Harry Potter facts from WHP-unlearned Llama, while 50% exposed unfair suppression of retained knowledge.
arXiv Detail & Related papers (2025-06-14T04:22:17Z)
Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness [44.37155305736321]
Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs)<n>We propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge.<n>Our framework provides a more realistic and rigorous assessment of unlearning performance.
arXiv Detail & Related papers (2025-06-06T04:35:19Z)
Existing Large Language Model Unlearning Evaluations Are Inconclusive [105.55899615056573]
We show that some evaluations introduce substantial new information into the model, potentially masking true unlearning performance.<n>We demonstrate that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation routines.<n>We propose two principles for future unlearning evaluations: minimal information injection and downstream task awareness.
arXiv Detail & Related papers (2025-05-31T19:43:00Z)
Unlearning vs. Obfuscation: Are We Truly Removing Knowledge? [15.964825460186393]
We formally distinguish unlearning from obfuscation and introduce a probing-based evaluation framework.<n>We propose DF-MCQ, a novel unlearning method that flattens the model predictive distribution over automatically generated multiple-choice questions.<n> Experimental results demonstrate that DF-MCQ achieves unlearning with over 90% refusal rate and a random choice-level uncertainty.
arXiv Detail & Related papers (2025-05-05T14:21:08Z)
Verifying Robust Unlearning: Probing Residual Knowledge in Unlearned Models [10.041289551532804]
We introduce the concept of Robust Unlearning, ensuring models are indistinguishable from retraining and resistant to adversarial recovery.<n>To empirically evaluate whether unlearning techniques meet this security standard, we propose the Unlearning Mapping Attack (UMA)<n>UMA actively probes models for forgotten traces using adversarial queries.
arXiv Detail & Related papers (2025-04-21T01:56:15Z)
ReLearn: Unlearning via Learning for Large Language Models [64.2802606302194]
We propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning.<n>This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation.<n>Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output.
arXiv Detail & Related papers (2025-02-16T16:31:00Z)
Redefining Machine Unlearning: A Conformal Prediction-Motivated Approach [11.609354498110358]
Machine unlearning seeks to remove the influence of specified data from a trained model.<n>In this paper, we find that the data misclassified across UA and MIA still have their ground truth labels included in the prediction set.<n>We propose two novel metrics inspired by conformal prediction that more reliably evaluate forgetting quality.
arXiv Detail & Related papers (2025-01-31T18:58:43Z)
RESTOR: Knowledge Recovery in Machine Unlearning [71.75834077528305]
Large language models trained on web-scale corpora can contain private or sensitive information.<n>Several machine unlearning algorithms have been proposed to eliminate the effect of such datapoints.<n>We propose the RESTOR framework for machine unlearning evaluation.
arXiv Detail & Related papers (2024-10-31T20:54:35Z)
Do Unlearning Methods Remove Information from Language Model Weights? [0.0]
We show that fine-tuning on accessible facts can recover 88% of the pre-unlearning accuracy when applied to current unlearning methods for information learned during pretraining.<n>Our results also suggest that unlearning evaluations that measure unlearning robustness may overestimate robustness compared to evaluations that attempt to unlearn information learned during pretraining.
arXiv Detail & Related papers (2024-10-11T14:06:58Z)
Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.<n>It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA [67.75989848202343]
This paper presents a unified end-to-end retriever-reader framework towards knowledge-based VQA. We shed light on the multi-modal implicit knowledge from vision-language pre-training models to mine its potential in knowledge reasoning. Our scheme is able to not only provide guidance for knowledge retrieval, but also drop these instances potentially error-prone towards question answering.
arXiv Detail & Related papers (2022-06-30T02:35:04Z)
Low-Regret Active learning [64.36270166907788]
We develop an online learning algorithm for identifying unlabeled data points that are most informative for training. At the core of our work is an efficient algorithm for sleeping experts that is tailored to achieve low regret on predictable (easy) instances.
arXiv Detail & Related papers (2021-04-06T22:53:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.