Delta-Influence: Unlearning Poisons via Influence Functions
- URL: http://arxiv.org/abs/2411.13731v1
- Date: Wed, 20 Nov 2024 22:15:10 GMT
- Title: Delta-Influence: Unlearning Poisons via Influence Functions
- Authors: Wenjie Li, Jiawei Li, Christian Schroeder de Witt, Ameya Prabhu, Amartya Sanyal
- Abstract summary: We introduce $\Delta$-Influence, a novel approach to trace abnormal model behavior back to poisoned training data.
$\Delta$-Influence applies data transformations that sever the link between poisoned training data and compromised test points.
We show that $\Delta$-Influence consistently achieves the best unlearning across all settings.
- Score: 18.97730860349776
- Abstract: Addressing data integrity challenges, such as unlearning the effects of data poisoning after model training, is necessary for the reliable deployment of machine learning models. State-of-the-art influence functions, such as EK-FAC, often fail to accurately attribute abnormal model behavior to the specific poisoned training data responsible for the data poisoning attack. In addition, traditional unlearning algorithms often struggle to effectively remove the influence of poisoned samples, particularly when only a few affected examples can be identified. To address these challenges, we introduce $\Delta$-Influence, a novel approach that leverages influence functions to trace abnormal model behavior back to the responsible poisoned training data using as few as one poisoned test example. $\Delta$-Influence applies data transformations that sever the link between poisoned training data and compromised test points without significantly affecting clean data. This allows $\Delta$-Influence to detect large negative shifts in influence scores following data transformations, a phenomenon we term influence collapse, thereby accurately identifying poisoned training data. Unlearning this subset, e.g., through retraining, effectively eliminates the data poisoning. We validate our method across three vision-based poisoning attacks and three datasets, benchmarking against four detection algorithms and five unlearning strategies. We show that $\Delta$-Influence consistently achieves the best unlearning across all settings, showing the promise of influence functions for corrective unlearning. Our code is publicly available at: \url{https://github.com/andyisokay/delta-influence}
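As a rough illustration of the influence-collapse criterion described in the abstract, the sketch below scores every training point against one poisoned test example before and after a data transformation and flags points whose influence drops sharply. The `influence_scores` helper, the transform, and the cut-off are hypothetical placeholders standing in for whichever influence estimator (e.g., EK-FAC) and augmentation one uses; this is a sketch of the idea, not the released implementation.

```python
# Minimal sketch of the influence-collapse idea. Assumptions: influence_scores,
# the transform, and the cut-off are illustrative placeholders, not the
# authors' released code.
import numpy as np

def influence_scores(model, train_set, test_example):
    """Placeholder: return one influence score per training example for the
    given test example (e.g., from an EK-FAC-based influence estimator)."""
    raise NotImplementedError("plug in an influence-function estimator here")

def delta_influence_flags(model, train_set, poisoned_test, transform, cutoff):
    # 1. Influence of every training point on the compromised test point.
    base = influence_scores(model, train_set, poisoned_test)

    # 2. Re-score after a transformation meant to sever the poison trigger
    #    while leaving clean behaviour largely intact.
    shifted = influence_scores(model, train_set, transform(poisoned_test))

    # 3. "Influence collapse": poisoned training points show a large negative
    #    shift once the trigger link is broken; flag everything below the cut-off.
    delta = shifted - base
    return np.flatnonzero(delta < cutoff)
```

Unlearning then amounts to removing the flagged indices and retraining, or handing them to any other corrective-unlearning routine.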
Related papers
- PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning [32.508939142492004]
We introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning.
Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases.
We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models.
arXiv Detail & Related papers (2024-10-11T13:50:50Z)
- Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks have been shown to be vulnerable to data poisoning attacks.
Detecting poisoned samples within a mixed dataset is both valuable and challenging.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z)
- Corrective Machine Unlearning [22.342035149807923]
We formalize Corrective Machine Unlearning as the problem of mitigating the impact of data affected by unknown manipulations on a trained model.
We find most existing unlearning methods, including retraining-from-scratch without the deletion set, require most of the manipulated data to be identified for effective corrective unlearning.
One approach, Selective Synaptic Dampening, achieves limited success, unlearning adverse effects with just a small portion of the manipulated samples in our setting.
arXiv Detail & Related papers (2024-02-21T18:54:37Z)
- HINT: Healthy Influential-Noise based Training to Defend against Data Poisoning Attacks [12.929357709840975]
We propose an efficient and robust training approach to defend against data poisoning attacks based on influence functions.
Using influence functions, we craft healthy noise that helps to harden the classification model against poisoning attacks.
Our empirical results show that HINT can efficiently protect deep learning models against the effect of both untargeted and targeted poisoning attacks.
arXiv Detail & Related papers (2023-09-15T17:12:19Z)
- On the Exploitability of Instruction Tuning [103.8077787502381]
In this work, we investigate how an adversary can exploit instruction tuning to change a model's behavior.
We propose AutoPoison, an automated data poisoning pipeline.
Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data.
arXiv Detail & Related papers (2023-06-28T17:54:04Z)
- Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information.
By implicitly transferring changes in the data manipulation to changes in the model outputs, Memorization Discrepancy can discover imperceptible poisoned samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z)
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv Detail & Related papers (2023-05-22T15:57:53Z)
- Accumulative Poisoning Attacks on Real-time Data [56.96241557830253]
We show that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects.
arXiv Detail & Related papers (2021-06-18T08:29:53Z)
- FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging [112.19994766375231]
Influence functions approximate the 'influence' of individual training data points on test predictions (illustrated in the sketch after this entry).
We present FastIF, a set of simple modifications to influence functions that significantly improves their run-time.
Our experiments demonstrate the potential of influence functions in model interpretation and correcting model errors.
arXiv Detail & Related papers (2020-12-31T18:02:34Z)
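For context on what such scores compute: the quantity FastIF accelerates is the classic influence-function estimate $-\nabla_\theta L(z_{\text{test}})^{\top} H^{-1} \nabla_\theta L(z)$ of Koh and Liang. The sketch below is an illustration only, not FastIF itself; it evaluates the estimate in closed form for a small ridge-regression model, where the gradients and Hessian are exact.

```python
# Illustrative only: the classic influence-function estimate
#   I(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z)
# computed exactly for a tiny ridge-regression model. FastIF's contribution is
# approximating this quantity efficiently for large neural networks.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-2

# Toy data and the fitted ridge-regression parameters.
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
H = X.T @ X / n + lam * np.eye(d)          # Hessian of the regularized loss
theta = np.linalg.solve(H, X.T @ y / n)    # minimizer of the training objective

def grad_loss(x, target):
    """Gradient of the per-example squared loss at the fitted parameters."""
    return (x @ theta - target) * x

# Influence of each training point on the loss of one test point.
x_test, y_test = rng.normal(size=d), 0.0
h_inv_grad_test = np.linalg.solve(H, grad_loss(x_test, y_test))
influences = np.array([-grad_loss(X[i], y[i]) @ h_inv_grad_test for i in range(n)])

# Each score is the derivative of the test loss w.r.t. upweighting that
# training point: large positive values mark the points most responsible for
# raising the loss on this test example.
print(np.argsort(influences)[-5:])
```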