Existing Large Language Model Unlearning Evaluations Are Inconclusive
- URL: http://arxiv.org/abs/2506.00688v1
- Date: Sat, 31 May 2025 19:43:00 GMT
- Title: Existing Large Language Model Unlearning Evaluations Are Inconclusive
- Authors: Zhili Feng, Yixuan Even Xu, Alexander Robey, Robert Kirk, Xander Davies, Yarin Gal, Avi Schwarzschild, J. Zico Kolter
- Abstract summary: We show that some evaluations introduce substantial new information into the model, potentially masking true unlearning performance. We demonstrate that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation routines. We propose two principles for future unlearning evaluations: minimal information injection and downstream task awareness.
- Score: 105.55899615056573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine unlearning aims to remove sensitive or undesired data from large language models. However, recent studies suggest that unlearning is often shallow, claiming that removed knowledge can easily be recovered. In this work, we critically examine standard unlearning evaluation practices and uncover key limitations that shake our trust in those findings. First, we show that some evaluations introduce substantial new information into the model, potentially masking true unlearning performance by re-teaching the model during testing. Second, we demonstrate that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation routines. Finally, we find that many evaluations rely on spurious correlations, making their results difficult to trust and interpret. Taken together, these issues suggest that current evaluation protocols may both overstate and understate unlearning success. To address this, we propose two principles for future unlearning evaluations: minimal information injection and downstream task awareness. We validate these principles through a series of targeted experiments, showing how violations of each can lead to misleading conclusions.
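To make the first principle concrete, here is a minimal sketch (not from the paper) of how one might flag evaluations that inject the very information they are testing for. The `injection_score` function, its token-overlap threshold, and the example prompts are all illustrative assumptions rather than the authors' protocol.

```python
# A crude proxy for "information injection": flag evaluation prompts that
# already contain most tokens of the answer they are meant to probe.
# All names, thresholds, and examples here are illustrative assumptions.
import re
from typing import Iterable


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def injection_score(eval_prompts: Iterable[str], forget_answers: Iterable[str]) -> float:
    """Fraction of (prompt, answer) pairs where the prompt leaks at least half
    of the forgotten answer's tokens, i.e. the evaluation may re-teach the model."""
    pairs = list(zip(eval_prompts, forget_answers))
    leaked = 0
    for prompt, answer in pairs:
        answer_toks = _tokens(answer)
        if answer_toks and len(answer_toks & _tokens(prompt)) / len(answer_toks) >= 0.5:
            leaked += 1
    return leaked / max(len(pairs), 1)


# The second prompt spells out the very fact the evaluation is supposed to test.
prompts = [
    "Where was Marie Curie born?",
    "Recall that Marie Curie was born in Warsaw. Where was she born?",
]
answers = ["Warsaw", "Warsaw"]
print(injection_score(prompts, answers))  # 0.5
```

Under minimal information injection, a trustworthy evaluation would keep such a leakage score near zero, so that any knowledge the model displays during testing cannot be attributed to the test prompts themselves.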
Related papers
- SoK: Machine Unlearning for Large Language Models [14.88062383081161]
Large language model (LLM) unlearning has become a critical topic in machine learning. We propose a new taxonomy based on the intention of unlearning.
arXiv Detail & Related papers (2025-06-10T20:30:39Z)
- Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols [14.961054239793356]
We introduce a rigorous unlearning evaluation setup, in which forgetting classes exhibit semantic similarity to downstream task classes. We hope our benchmark serves as a standardized protocol for evaluating unlearning algorithms under realistic conditions.
arXiv Detail & Related papers (2025-03-10T07:11:34Z)
- ReLearn: Unlearning via Learning for Large Language Models [64.2802606302194]
We propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning. The framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation; a toy sketch of such rates appears after this list. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output.
arXiv Detail & Related papers (2025-02-16T16:31:00Z)
- Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning [8.831339626121848]
Concept unlearning is a promising solution to unethical or harmful use of text-to-image diffusion models. Our benchmark covers 33 target concepts, with 16,000 prompts per concept, spanning four categories: Celebrity, Style, Intellectual Property, and NSFW. Our investigation reveals that no single method excels across all evaluation criteria.
arXiv Detail & Related papers (2024-10-08T03:30:39Z)
- Training on the Test Task Confounds Evaluation and Emergence [16.32378359459614]
We show that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We propose an effective method to adjust for the effect of training on the test task on benchmark evaluations.
arXiv Detail & Related papers (2024-07-10T17:57:58Z)
- Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
- A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification [0.491574468325115]
We present a large-scale empirical study that, for the first time, enables benchmarking of confidence scoring functions.
The finding that a simple softmax-response baseline is the overall best-performing method underlines the drastic shortcomings of current evaluation.
arXiv Detail & Related papers (2022-11-28T12:25:27Z)
- Evaluating Machine Unlearning via Epistemic Uncertainty [78.27542864367821]
This work presents an evaluation of machine unlearning algorithms based on epistemic uncertainty.
To the best of our knowledge, this is the first definition of such a general evaluation.
arXiv Detail & Related papers (2022-08-23T09:37:31Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Latent Opinions Transfer Network for Target-Oriented Opinion Words Extraction [63.70885228396077]
We propose a novel model to transfer opinion knowledge from resource-rich review sentiment classification datasets to the low-resource task of target-oriented opinion words extraction (TOWE).
Our model achieves better performance compared to other state-of-the-art methods and significantly outperforms the base model that does not transfer opinion knowledge.
arXiv Detail & Related papers (2020-01-07T11:50:54Z)
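As flagged in the ReLearn entry above, here is a toy sketch of what accuracy-style forgetting and retention rates could look like. ReLearn's actual Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) are knowledge-level metrics whose exact definitions are not reproduced here; `answer_fn`, the exact-match scoring, and the probe format are assumptions made purely for illustration.

```python
# A toy, accuracy-style stand-in for knowledge forgetting/retention rates.
# ReLearn's actual KFR/KRR definitions are knowledge-level and may differ;
# answer_fn, the exact-match check, and the probes are illustrative.
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # (question, reference answer)


def forgetting_and_retention(
    answer_fn: Callable[[str], str],   # the unlearned model's answer to a question
    forget_probes: List[QA],
    retain_probes: List[QA],
) -> Tuple[float, float]:
    """Return (KFR-like, KRR-like) rates: the fraction of forget-set probes the
    model no longer answers correctly, and the fraction of retain-set probes it
    still answers correctly."""
    def correct(question: str, reference: str) -> bool:
        return reference.strip().lower() in answer_fn(question).strip().lower()

    kfr = sum(not correct(q, a) for q, a in forget_probes) / max(len(forget_probes), 1)
    krr = sum(correct(q, a) for q, a in retain_probes) / max(len(retain_probes), 1)
    return kfr, krr


# Stub "model" that has forgotten one fact but kept another.
stub_answers = {
    "Who wrote Hamlet?": "I am not sure.",
    "What is 2 + 2?": "2 + 2 is 4.",
}
kfr, krr = forgetting_and_retention(
    lambda q: stub_answers.get(q, ""),
    forget_probes=[("Who wrote Hamlet?", "Shakespeare")],
    retain_probes=[("What is 2 + 2?", "4")],
)
print(kfr, krr)  # 1.0 1.0
```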