WaterDrum: Watermarking for Data-centric Unlearning Metric
- URL: http://arxiv.org/abs/2505.05064v1
- Date: Thu, 08 May 2025 08:56:46 GMT
- Title: WaterDrum: Watermarking for Data-centric Unlearning Metric
- Authors: Xinyang Lu, Xinyuan Niu, Gregory Kang Ruey Lau, Bui Thi Cam Nhung, Rachael Hwee Ling Sim, Fanyu Wen, Chuan-Sheng Foo, See-Kiong Ng, Bryan Kian Hsiang Low
- Abstract summary: Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. This paper presents the first data-centric unlearning metric for LLMs called WaterDrum that exploits robust text watermarking for overcoming these limitations. We also introduce new benchmark datasets for LLM unlearning that contain varying levels of similar data points and can be used to rigorously evaluate unlearning algorithms using WaterDrum.
- Score: 47.36231091296615
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. However, existing utility-centric unlearning metrics (based on model utility) may fail to accurately evaluate the extent of unlearning in realistic settings such as when (a) the forget and retain set have semantically similar content, (b) retraining the model from scratch on the retain set is impractical, and/or (c) the model owner can improve the unlearning metric without directly performing unlearning on the LLM. This paper presents the first data-centric unlearning metric for LLMs called WaterDrum that exploits robust text watermarking for overcoming these limitations. We also introduce new benchmark datasets for LLM unlearning that contain varying levels of similar data points and can be used to rigorously evaluate unlearning algorithms using WaterDrum. Our code is available at https://github.com/lululu008/WaterDrum and our new benchmark datasets are released at https://huggingface.co/datasets/Glow-AI/WaterDrum-Ax.
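As a rough illustration of the data-centric idea (measuring the residual influence of forget data through watermark strength in the model's outputs rather than through model utility), the sketch below computes a KGW-style green-list detection z-score for a piece of generated text. The per-owner `seed_key`, the fixed green-list fraction `gamma`, and the use of the z-score drop as an unlearning signal are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the WaterDrum implementation): a KGW-style green-list
# watermark detector. The assumption is that each data owner's text carries a
# watermark keyed by `seed_key`; after unlearning, a lower z-score on the
# model's generations indicates weaker residual influence of that owner's data.
import hashlib
import math

def green_fraction(token_ids, seed_key: str, gamma: float = 0.5) -> float:
    """Fraction of tokens that fall in the keyed 'green' partition."""
    hits = 0
    for t in token_ids:
        h = hashlib.sha256(f"{seed_key}:{t}".encode()).digest()
        if (h[0] / 255.0) < gamma:          # pseudo-random green/red split per token
            hits += 1
    return hits / max(len(token_ids), 1)

def watermark_z_score(token_ids, seed_key: str, gamma: float = 0.5) -> float:
    """z-score of the observed green fraction against the unwatermarked expectation."""
    n = max(len(token_ids), 1)
    p = green_fraction(token_ids, seed_key, gamma)
    return (p - gamma) * math.sqrt(n) / math.sqrt(gamma * (1 - gamma))

# Hypothetical unlearning signal: z-score drop on generations from the same prompts.
# delta = watermark_z_score(gen_before, key) - watermark_z_score(gen_after, key)
```

The released benchmark data itself is available at the Hugging Face link given in the abstract.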
Related papers
- Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs [19.525112900768534]
We show that models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning. We introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting.
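For the representation-level comparison described above, a minimal sketch of linear centered kernel alignment (CKA) between hidden-state matrices collected before and after unlearning is given below; the layer choice and the data used to extract the representations are assumptions, not the paper's exact toolkit.

```python
# Minimal sketch of linear CKA between two representation matrices X, Y of
# shape (n_samples, hidden_dim), e.g. hidden states for the same prompts
# before and after unlearning. Not the authors' exact evaluation toolkit.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    X = X - X.mean(axis=0, keepdims=True)   # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(hsic / (norm_x * norm_y))

# A score near 1 means the "unlearned" representations are nearly unchanged,
# which is consistent with forgetting that is easy to reverse.
```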
arXiv Detail & Related papers (2025-05-22T16:02:10Z)
- LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks [23.5632914682956]
Large language model unlearning has become a critical challenge in ensuring safety and controlled model behavior. We show that LLM unlearning can be effectively maintained using a significantly smaller subset (functioning as a "coreset"). This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime.
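The coreset observation can be probed with a sketch like the one below, which randomly samples a small fraction of the forget set and applies a simple gradient-ascent unlearning step; the 5% fraction, the gradient-ascent objective, and the use of GPT-2 are illustrative assumptions rather than the benchmark's protocol.

```python
# Sketch: unlearn on a small random "coreset" of the forget set
# (illustrative fraction and objective; not the benchmark's setup).
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

forget_set = ["example forget document 1 ...", "example forget document 2 ..."]
coreset = random.sample(forget_set, max(1, int(0.05 * len(forget_set))))

model.train()
for text in coreset:
    batch = tok(text, return_tensors="pt", truncation=True)
    out = model(**batch, labels=batch["input_ids"])
    loss = -out.loss                      # gradient ascent on the forget data only
    optim.zero_grad()
    loss.backward()
    optim.step()
```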
arXiv Detail & Related papers (2025-04-14T12:38:37Z)
- Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods [1.9799527196428242]
Large language model unlearning aims to remove harmful information that LLMs have learnt, in order to prevent their use for malicious purposes. We show that unlearning has a notable impact on general model capabilities. We show that doing 5-shot prompting or rephrasing the question in simple ways can lead to an over ten-fold increase in accuracy on unlearning benchmarks.
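The black-box probes mentioned above (few-shot prompting, rephrasing) can be reproduced with a sketch like this one; the exemplars, the prompt template, and the GPT-2 pipeline are assumptions for illustration, not the paper's evaluation harness.

```python
# Sketch of a black-box probe: compare a direct question with a 5-shot prompt
# built from unrelated exemplars. Exemplars and template are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

exemplars = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
    ("Who wrote Hamlet?", "William Shakespeare"),
    ("What gas do plants absorb?", "Carbon dioxide"),
    ("What is the largest planet?", "Jupiter"),
]

def five_shot_prompt(question: str) -> str:
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\nQ: {question}\nA:"

question = "Which house is Harry Potter sorted into?"   # nominally "unlearned" fact
zero_shot = generator(question, max_new_tokens=20)[0]["generated_text"]
few_shot = generator(five_shot_prompt(question), max_new_tokens=20)[0]["generated_text"]
# Scoring the two answers against the reference reveals how much accuracy the
# few-shot framing recovers on supposedly forgotten questions.
```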
arXiv Detail & Related papers (2024-11-18T22:31:17Z)
- LLM Unlearning via Loss Adjustment with Only Forget Data [20.310423152885217]
We introduce Forget data only Loss AdjustmenT (FLAT), a "flat" loss adjustment approach which addresses these issues.
Empirical results demonstrate that our approach achieves superior unlearning performance compared to existing methods.
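A generic forget-data-only loss adjustment can be sketched as follows (this is one simple instantiation, not necessarily the FLAT objective, whose exact divergence-based form is given in the paper): for each forget prompt, the model is steered toward a template refusal and away from the original forget answer. The template text, the weighting, and the GPT-2 model are assumptions.

```python
# Generic sketch of a forget-data-only loss adjustment (illustrative; the exact
# FLAT objective is defined in the paper). Only forget prompts are used: the
# model is pushed toward a template answer and away from the forget answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

def lm_loss(prompt: str, answer: str) -> torch.Tensor:
    ids = tok(prompt + " " + answer, return_tensors="pt")["input_ids"]
    return model(ids, labels=ids).loss

forget_prompt = "Summarize Alice's private medical record:"      # illustrative example
forget_answer = "Alice was diagnosed with ..."
template_answer = "I'm sorry, I cannot help with that."

# Lower loss on the template response, higher loss on the forget response.
loss = lm_loss(forget_prompt, template_answer) - 0.5 * lm_loss(forget_prompt, forget_answer)
optim.zero_grad()
loss.backward()
optim.step()
```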
arXiv Detail & Related papers (2024-10-14T23:43:33Z)
- MUSE: Machine Unlearning Six-Way Evaluation for Language Models [109.76505405962783]
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content.
We propose MUSE, a comprehensive machine unlearning evaluation benchmark.
We benchmark how effectively eight popular unlearning algorithms can unlearn Harry Potter books and news articles.
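One of the criteria such a benchmark checks is verbatim memorization; a minimal sketch is below, prompting the model with the prefix of a forget-set passage and measuring n-gram overlap between its continuation and the true continuation. The prefix length, overlap metric, and GPT-2 model are assumptions, not MUSE's exact protocol.

```python
# Sketch of a verbatim-memorization check: prompt with the prefix of a
# forget-set passage and measure n-gram overlap of the model's continuation
# with the true continuation. Prefix length and metric are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def ngram_overlap(candidate: str, reference: str, n: int = 3) -> float:
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = {tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)}
    ref_ngrams = {tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)}
    if not ref_ngrams:
        return 0.0
    return len(cand_ngrams & ref_ngrams) / len(ref_ngrams)

passage = ("Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say "
           "that they were perfectly normal, thank you very much.")
prefix, true_continuation = passage[:60], passage[60:]
generated = generator(prefix, max_new_tokens=30)[0]["generated_text"][len(prefix):]
score = ngram_overlap(generated, true_continuation)   # high score = verbatim memorization
```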
arXiv Detail & Related papers (2024-07-08T23:47:29Z)
- Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs [18.629717934007513]
"SPlit, UNlearn, MerGE" (SPUNGE) is a framework that can be used with any unlearning method to amplify its effectiveness.
We empirically demonstrate that SPUNGE significantly improves the performance of two recent unlearning methods on state-of-the-art LLMs.
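A rough sketch of the split-unlearn-merge idea is below: the forget set is split by a data attribute, an off-the-shelf unlearning routine is run per subset, and the resulting models are merged. Plain parameter averaging is used here purely for illustration (the paper's merging choice may differ), and `unlearn` is a hypothetical placeholder for any existing unlearning method.

```python
# Sketch of a split-unlearn-merge wrapper. `unlearn` is a hypothetical
# placeholder for any off-the-shelf unlearning method; plain parameter
# averaging stands in for the merging step, purely for illustration.
import copy
from collections import defaultdict
import torch

def split_by_attribute(forget_set, attribute_fn):
    groups = defaultdict(list)
    for example in forget_set:
        groups[attribute_fn(example)].append(example)
    return list(groups.values())

def split_unlearn_merge(model, forget_set, attribute_fn, unlearn):
    subsets = split_by_attribute(forget_set, attribute_fn)
    unlearned = [unlearn(copy.deepcopy(model), subset) for subset in subsets]
    merged = copy.deepcopy(model)
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in unlearned])
            param.copy_(stacked.mean(dim=0))   # merge the per-subset unlearned models
    return merged
```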
arXiv Detail & Related papers (2024-06-17T17:35:52Z)
- Offset Unlearning for Large Language Models [49.851093293780615]
delta-Unlearning is an offset unlearning framework for black-box LLMs. We show that delta-Unlearning can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks.
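The offset idea can be sketched as follows: a pair of smaller models (one unlearned, one not) that share the black-box model's tokenizer provides a per-token logit offset, which is added to the large model's logits at decoding time. The plain additive combination and the shared-tokenizer assumption below are illustrative, not the paper's exact rule.

```python
# Sketch of logit-offset ("delta") unlearning at decoding time. Assumes the
# small base/unlearned pair shares the tokenizer of the black-box model; the
# additive combination is illustrative.
import torch

def offset_next_token_logits(large_model, small_base, small_unlearned, input_ids):
    with torch.no_grad():
        large = large_model(input_ids).logits[:, -1, :]
        base = small_base(input_ids).logits[:, -1, :]
        unlearned = small_unlearned(input_ids).logits[:, -1, :]
    # Apply the small models' unlearning "delta" to the black-box model's logits.
    return large + (unlearned - base)
```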
arXiv Detail & Related papers (2024-04-17T03:39:51Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence demonstrating that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
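The projection step can be sketched as follows: given an orthonormal basis of a subspace associated with the retained data, each unlearning gradient is projected onto the orthogonal complement of that subspace before the update. How the basis is built follows the paper; the numpy version below is only illustrative.

```python
# Sketch of the gradient-projection step: remove the components of an
# unlearning gradient that lie in a subspace associated with the retained
# data, so the update minimally interferes with retained knowledge.
import numpy as np

def project_out(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """basis: (d, k) matrix with orthonormal columns spanning the retained subspace."""
    return grad - basis @ (basis.T @ grad)

# Example with an illustrative retained-data subspace.
retained = np.random.randn(128, 16)          # d = 128 features, 16 retained directions
basis, _ = np.linalg.qr(retained)            # orthonormalize the retained directions
g = np.random.randn(128)                     # raw unlearning gradient
g_proj = project_out(g, basis)               # update direction used instead of g
```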
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
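A minimal sketch of an IFD-style ratio is below: the answer's loss conditioned on the instruction divided by the answer's loss on its own. Masking only the answer tokens in the conditioned case and using GPT-2 are implementation assumptions for illustration.

```python
# Sketch of an IFD-style score: loss of the answer conditioned on the
# instruction divided by the loss of the answer alone. Masking and model
# choice are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_loss(prompt: str, answer: str) -> float:
    prompt_ids = tok(prompt, return_tensors="pt")["input_ids"] if prompt else None
    answer_ids = tok(answer, return_tensors="pt")["input_ids"]
    if prompt_ids is not None:
        input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100   # score only the answer tokens
    else:
        input_ids, labels = answer_ids, answer_ids
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

def ifd(instruction: str, answer: str) -> float:
    return answer_loss(instruction, answer) / answer_loss("", answer)

# A higher ratio means the instruction provides little help for producing the
# answer, i.e. the sample is harder for the model to follow.
```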
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
- Rapid Adaptation in Online Continual Learning: Are We Evaluating It Right? [135.71855998537347]
We revisit the common practice of evaluating adaptation of Online Continual Learning (OCL) algorithms through the metric of online accuracy.
We show that this metric is unreliable, as even vacuous blind classifiers can achieve unrealistically high online accuracy.
Existing OCL algorithms can also achieve high online accuracy, but perform poorly in retaining useful information.
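The "vacuous blind classifier" point can be illustrated with a sketch like this one: on a temporally correlated label stream, a predictor that ignores the input and replays the most recent label already scores high online accuracy. The stream construction below is an assumption for illustration.

```python
# Sketch: a "blind" predictor that ignores inputs and repeats the most recent
# label scores high online accuracy on a temporally correlated stream,
# illustrating why online accuracy alone can be misleading.
import random

random.seed(0)
num_classes, stream_len, switch_prob = 10, 10_000, 0.01

labels, current = [], 0
for _ in range(stream_len):
    if random.random() < switch_prob:          # labels arrive in long runs
        current = random.randrange(num_classes)
    labels.append(current)

correct, prev = 0, None
for y in labels:
    pred = prev if prev is not None else 0     # blind: predict the last seen label
    correct += int(pred == y)
    prev = y

print(f"Blind online accuracy: {correct / stream_len:.2%}")   # roughly 99% despite ignoring inputs
```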
arXiv Detail & Related papers (2023-05-16T08:29:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.