Related papers: Dissecting Fine-Tuning Unlearning in Large Language Models

Dissecting Fine-Tuning Unlearning in Large Language Models

URL: http://arxiv.org/abs/2410.06606v2
Date: Tue, 15 Oct 2024 09:23:21 GMT
Title: Dissecting Fine-Tuning Unlearning in Large Language Models
Authors: Yihuai Hong, Yuelin Zou, Lijie Hu, Ziqian Zeng, Di Wang, Haiqin Yang,
Abstract summary: Fine-tuning-based unlearning methods prevail for preventing harmful, sensitive, or copyrighted information within large language models. However, the true effectiveness of these methods is unclear. In this work, we delve into the limitations of fine-tuning-based unlearning through activation patching and restoration experiments.
Score: 12.749301272512222
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fine-tuning-based unlearning methods prevail for preventing targeted harmful, sensitive, or copyrighted information within large language models while preserving overall capabilities. However, the true effectiveness of these methods is unclear. In this work, we delve into the limitations of fine-tuning-based unlearning through activation patching and parameter restoration experiments. Our findings reveal that these methods alter the model's knowledge retrieval process, providing further evidence that they do not genuinely erase the problematic knowledge embedded in the model parameters. Instead, the coefficients generated by the MLP components in the model's final layer are the primary contributors to these seemingly positive unlearning effects, playing a crucial role in controlling the model's behaviors. Furthermore, behavioral tests demonstrate that this unlearning mechanism inevitably impacts the global behavior of the models, affecting unrelated knowledge or capabilities. The code is released at https://github.com/yihuaihong/Dissecting-FT-Unlearning.

Related papers

Efficient Machine Unlearning via Influence Approximation [75.31015485113993]
Influence-based unlearning has emerged as a prominent approach to estimate the impact of individual training samples on model parameters without retraining.<n>This paper establishes a theoretical link between memorizing (incremental learning) and forgetting (unlearning)<n>We introduce the Influence Approximation Unlearning algorithm for efficient machine unlearning from the incremental perspective.
arXiv Detail & Related papers (2025-07-31T05:34:27Z)
How to Protect Models against Adversarial Unlearning? [0.24578723416255746]
We investigate the problem of adversarial unlearning, where a malicious party intentionally sends unlearn requests to deteriorate the model's performance.<n>We show that this phenomenon and the adversary's capabilities depend on many factors, primarily on the backbone model itself and strategy/limitations in selecting data to be unlearned.<n>The main result of this work is a new method of protecting model performance from these side effects, both in the case of unlearned behavior resulting from spontaneous processes and adversary actions.
arXiv Detail & Related papers (2025-07-15T00:59:42Z)
Graceful Forgetting in Generative Language Models [19.413048064877824]
We propose a novel framework, Learning With Forgetting, to achieve graceful forgetting in generative language models.<n>With Fisher Information Matrix weighting the intended parameter updates, LWF computes forgetting confidence to evaluate self-generated knowledge regarding the forgetting task.<n>Our experiments demonstrate that, although thoroughly uncovering the mechanisms of knowledge interaction remains challenging in pre-trained language models, applying graceful forgetting can contribute to enhanced fine-tuning performance.
arXiv Detail & Related papers (2025-05-26T09:03:57Z)
UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models [54.75551043657238]
We introduce UniErase, a novel unlearning paradigm that employs learnable parametric suffix (unlearning token) to steer language models toward targeted forgetting behaviors.<n>UniErase achieves state-of-the-art (SOTA) performance across batch, sequential, and precise unlearning under fictitious and real-world knowledge settings.
arXiv Detail & Related papers (2025-05-21T15:53:28Z)
Gradual Learning: Optimizing Fine-Tuning with Partially Mastered Knowledge in Large Language Models [51.20499954955646]
Large language models (LLMs) acquire vast amounts of knowledge from extensive text corpora during the pretraining phase. In later stages such as fine-tuning and inference, the model may encounter knowledge not covered in the initial training. We propose a two-stage fine-tuning strategy to improve the model's overall test accuracy and knowledge retention.
arXiv Detail & Related papers (2024-10-08T08:35:16Z)
Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration [74.09687562334682]
We introduce a novel training data attribution method called Debias and Denoise Attribution (DDA) Our method significantly outperforms existing approaches, achieving an averaged AUC of 91.64%. DDA exhibits strong generality and scalability across various sources and different-scale models like LLaMA2, QWEN2, and Mistral.
arXiv Detail & Related papers (2024-10-02T07:14:26Z)
Learn while Unlearn: An Iterative Unlearning Framework for Generative Language Models [52.03511469562013]
We introduce the Iterative Contrastive Unlearning (ICU) framework, which consists of three core components. A Knowledge Unlearning Induction module targets specific knowledge for removal using an unlearning loss. A Contrastive Learning Enhancement module preserves the model's expressive capabilities against the pure unlearning goal. An Iterative Unlearning Refinement module dynamically adjusts the unlearning process through ongoing evaluation and updates.
arXiv Detail & Related papers (2024-07-25T07:09:35Z)
Unlearning with Control: Assessing Real-world Utility for Large Language Model Unlearning [97.2995389188179]
Recent research has begun to approach large language models (LLMs) unlearning via gradient ascent (GA) Despite their simplicity and efficiency, we suggest that GA-based methods face the propensity towards excessive unlearning. We propose several controlling methods that can regulate the extent of excessive unlearning.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models. This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution. We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
Efficient Knowledge Deletion from Trained Models through Layer-wise Partial Machine Unlearning [2.3496568239538083]
This paper introduces a novel class of machine unlearning algorithms. First method is partial amnesiac unlearning, integration of layer-wise pruning with amnesiac unlearning. Second method assimilates layer-wise partial-updates into label-flipping and optimization-based unlearning.
arXiv Detail & Related papers (2024-03-12T12:49:47Z)
STAR: Constraint LoRA with Dynamic Active Learning for Data-Efficient Fine-Tuning of Large Language Models [21.929902181609936]
We propose a novel approach to integrate uncertainty-based active learning and LoRA. For the uncertainty gap, we introduce a dynamic uncertainty measurement that combines the uncertainty of the base model and the uncertainty of the full model. For poor model calibration, we incorporate the regularization method during LoRA training to keep the model from being over-confident.
arXiv Detail & Related papers (2024-03-02T10:38:10Z)
Enhancing Dynamical System Modeling through Interpretable Machine Learning Augmentations: A Case Study in Cathodic Electrophoretic Deposition [0.8796261172196743]
We introduce a comprehensive data-driven framework aimed at enhancing the modeling of physical systems. As a demonstrative application, we pursue the modeling of cathodic electrophoretic deposition (EPD), commonly known as e-coating.
arXiv Detail & Related papers (2024-01-16T14:58:21Z)
Unlearn What You Want to Forget: Efficient Unlearning for LLMs [92.51670143929056]
Large language models (LLMs) have achieved significant progress from pre-training on and memorizing a wide range of textual data. This process might suffer from privacy issues and violations of data protection regulations. We propose an efficient unlearning framework that could efficiently update LLMs without having to retrain the whole model after data removals.
arXiv Detail & Related papers (2023-10-31T03:35:59Z)
Multicriteria interpretability driven Deep Learning [0.0]
Deep Learning methods are renowned for their performances, yet their lack of interpretability prevents them from high-stakes contexts. Recent model methods address this problem by providing post-hoc interpretability methods by reverse-engineering the model's inner workings. We propose a Multicriteria agnostic technique that allows to control the feature effects on the model's outcome by injecting knowledge in the objective function.
arXiv Detail & Related papers (2021-11-28T09:41:13Z)
Machine Unlearning of Features and Labels [72.81914952849334]
We propose first scenarios for unlearning and labels in machine learning models. Our approach builds on the concept of influence functions and realizes unlearning through closed-form updates of model parameters.
arXiv Detail & Related papers (2021-08-26T04:42:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.