FuguReport

De-attribute to Forget for LLM Unlearning

Authors: Xinyang Lu, Jiabao Pan, Rachael Hwee Ling Sim, See-Kiong Ng, Anthony Kum Hoe Tung, Bryan Kian Hsiang Low
Affiliations: National University of Singapore
Categories: Method / Model Updating / LLM unlearning methods; Task / Data Attribution / Attribution-based data control; Application / LLM / Training data compliance
License: CC BY 4.0

Overview

This paper proposes DareU, an unlearning framework for large language models that replaces prediction-loss-based objectives with a data de-attribution objective. Instead of maximizing the loss on forget examples, the method drives the attribution of generated responses to the owners of the forget data toward zero; the authors argue that this target is more precise and less prone to over-forgetting. DareU implements the idea with reinforcement learning: it runs PPO with reward signals derived from the attribution scores of a lightweight owner-classification model, and adds a retain-set distillation regularizer to preserve model utility. Experiments on TOFU and ArXiv with Llama2-7B and Qwen3-8B compare DareU against retraining and several unlearning baselines.
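
To make the reward signal concrete, the following is a minimal sketch of how an owner classifier's output could be turned into a de-attribution reward. It is not the authors' implementation: it assumes the classifier returns one logit per data owner, that the attribution to the forget owners is the softmax probability mass assigned to them, and that the reward is the negative of that mass. The names `forget_attribution` and `deattribution_reward` are hypothetical.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forget_attribution(owner_logits: Sequence[float],
                       forget_owner_ids: Sequence[int]) -> float:
    """Assumed attribution score: probability mass the owner classifier
    assigns to the forget-data owners for one generated response."""
    probs = softmax(owner_logits)
    return sum(probs[i] for i in set(forget_owner_ids))


def deattribution_reward(owner_logits: Sequence[float],
                         forget_owner_ids: Sequence[int]) -> float:
    """Assumed reward: higher when less of the response is attributed to
    the forget owners; the maximum (zero) is reached at zero attribution."""
    return -forget_attribution(owner_logits, forget_owner_ids)


if __name__ == "__main__":
    # Hypothetical classifier logits over four owners; owner 0 is to be forgotten.
    logits = [2.0, 0.5, 0.1, -1.0]
    print(deattribution_reward(logits, forget_owner_ids=[0]))
```

Under these assumptions the reward is bounded above by zero, so a policy-gradient method such as PPO has a fixed target to approach rather than a loss that can grow without limit.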

Novelty

The paper’s main novelty is to formulate LLM unlearning as data de-attribution rather than prediction-loss manipulation. It also presents, to the authors’ knowledge, the first unlearning framework that uses data attribution scores as reinforcement-learning rewards, operationalized with PPO and an efficient attribution-classifier approximation.
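
The contrast between the two formulations can be written out as follows. This is an illustrative reading rather than the paper's notation: $D_f$ denotes the forget set, $\pi_\theta$ the model, $o_f$ the forget-data owners, and $A$ a nonnegative attribution function; all of these symbols are assumptions introduced here, and the loss-based objective is shown only in its simplest gradient-ascent form.

```latex
% Loss-based unlearning (gradient-ascent form): the objective is unbounded above,
% so there is no natural point at which forgetting is "complete".
\max_{\theta} \; \mathbb{E}_{(x, y) \sim D_f} \left[ -\log \pi_\theta(y \mid x) \right]

% De-attribution (assumed form): the attribution score is bounded below by zero,
% which gives a fixed target value.
\min_{\theta} \; \mathbb{E}_{x \sim D_f,\; \hat{y} \sim \pi_\theta(\cdot \mid x)}
  \left[ A(\hat{y}, o_f) \right],
\qquad A(\hat{y}, o_f) \ge 0
```

Because the second objective cannot fall below zero, optimization has a well-defined end point, which is one way to understand the claim that de-attribution is less prone to over-forgetting.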

Results

Across TOFU and ArXiv, DareU is reported to achieve the best overall balance between forget quality and retained utility, with the highest Tug-of-War scores on every evaluated model and dataset. On TOFU, its forget-set behavior is the closest to that of the retrained model while retain and test performance stay relatively strong. On the more difficult ArXiv setting, it remains competitive on forgetting and preserves utility better than strong baselines. Additional analyses report similar behavior under different attribution functions, as well as robustness across several ablation and stress-test settings. The method is, however, more computationally expensive than simpler loss-based approaches.

Key Points

  1. DareU defines unlearning as minimizing the attribution of model outputs to forget-data owners, aiming for a consistent target value of zero instead of maximizing forget-set loss.
  2. The method uses PPO with attribution-derived rewards and retain-set distillation regularization, with attribution approximated efficiently by a lightweight classifier trained offline.
  3. Empirical comparisons on TOFU and ArXiv indicate a stronger forget-utility trade-off than existing baselines, while incurring higher computational cost than simpler unlearning methods.
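
The retain-set distillation regularizer mentioned in the second point can be sketched as a divergence between the original model's and the updated model's next-token distributions on retain data. The summary does not specify the exact form, so this is an assumed version: it uses the KL divergence from the frozen original model (teacher) to the model being unlearned (student), averaged over positions, and subtracts it from the mean reward with a hypothetical weight `lam`.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def retain_distillation_loss(teacher_dists: Sequence[Sequence[float]],
                             student_dists: Sequence[Sequence[float]]) -> float:
    """Mean per-position KL between teacher and student token distributions
    on retain-set inputs (assumed form of the regularizer)."""
    if len(teacher_dists) != len(student_dists):
        raise ValueError("teacher and student must cover the same positions")
    if not teacher_dists:
        return 0.0
    total = sum(kl_divergence(t, s)
                for t, s in zip(teacher_dists, student_dists))
    return total / len(teacher_dists)


def regularized_objective(mean_reward: float, distill_loss: float,
                          lam: float = 1.0) -> float:
    """Quantity to maximize: attribution-derived reward minus the weighted
    distillation penalty. `lam` is a hypothetical trade-off weight."""
    return mean_reward - lam * distill_loss


if __name__ == "__main__":
    # Toy distributions over a three-token vocabulary at two positions.
    teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    student = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
    penalty = retain_distillation_loss(teacher, student)
    print(regularized_objective(mean_reward=-0.15, distill_loss=penalty, lam=0.5))
```

The penalty is zero when the updated model matches the original on retain data and grows as the two diverge, which is how a regularizer of this kind would counteract utility loss during unlearning.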
