Mitigating Backdoor Attacks using Activation-Guided Model Editing
- URL: http://arxiv.org/abs/2407.07662v2
- Date: Mon, 30 Sep 2024 12:41:19 GMT
- Title: Mitigating Backdoor Attacks using Activation-Guided Model Editing
- Authors: Felix Hsieh, Huy H. Nguyen, AprilPyone MaungMaung, Dmitrii Usynin, Isao Echizen,
- Abstract summary: Backdoor attacks compromise the integrity and reliability of machine learning models.
We propose a novel backdoor mitigation approach via machine unlearning to counter such backdoor attacks.
- Score: 8.00994004466919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Backdoor attacks compromise the integrity and reliability of machine learning models by embedding a hidden trigger during the training process, which can later be activated to cause unintended misbehavior. We propose a novel backdoor mitigation approach via machine unlearning to counter such backdoor attacks. The proposed method utilizes model activation of domain-equivalent unseen data to guide the editing of the model's weights. Unlike the previous unlearning-based mitigation methods, ours is computationally inexpensive and achieves state-of-the-art performance while only requiring a handful of unseen samples for unlearning. In addition, we also point out that unlearning the backdoor may cause the whole targeted class to be unlearned, thus introducing an additional repair step to preserve the model's utility after editing the model. Experiment results show that the proposed method is effective in unlearning the backdoor on different datasets and trigger patterns.
Related papers
- Unlearn to Relearn Backdoors: Deferred Backdoor Functionality Attacks on Deep Learning Models [6.937795040660591]
We introduce Deferred Activated Backdoor Functionality (DABF) as a new paradigm in backdoor attacks.
Unlike conventional attacks, DABF initially conceals its backdoor, producing benign outputs even when triggered.
DABF attacks exploit the common practice in the life cycle of machine learning models to perform model updates and fine-tuning after initial deployment.
arXiv Detail & Related papers (2024-11-10T07:01:53Z) - Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning.
This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities.
In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z) - DLP: towards active defense against backdoor attacks with decoupled learning process [2.686336957004475]
We propose a general training pipeline to defend against backdoor attacks.
We show that the model shows different learning behaviors in clean and poisoned subsets during training.
The effectiveness of our approach has been shown in numerous experiments across various backdoor attacks and datasets.
arXiv Detail & Related papers (2024-06-18T23:04:38Z) - Unlearning Backdoor Attacks through Gradient-Based Model Pruning [10.801476967873173]
We propose a novel approach to counter backdoor attacks by treating their mitigation as an unlearning task.
Our approach offers simplicity and effectiveness, rendering it well-suited for scenarios with limited data availability.
arXiv Detail & Related papers (2024-05-07T00:36:56Z) - Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning [49.242828934501986]
Multimodal contrastive learning has emerged as a powerful paradigm for building high-quality features.
backdoor attacks subtly embed malicious behaviors within the model during training.
We introduce an innovative token-based localized forgetting training regime.
arXiv Detail & Related papers (2024-03-24T18:33:15Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - Shared Adversarial Unlearning: Backdoor Mitigation by Unlearning Shared
Adversarial Examples [67.66153875643964]
Backdoor attacks are serious security threats to machine learning models.
In this paper, we explore the task of purifying a backdoored model using a small clean dataset.
By establishing the connection between backdoor risk and adversarial risk, we derive a novel upper bound for backdoor risk.
arXiv Detail & Related papers (2023-07-20T03:56:04Z) - Fine-Tuning Is All You Need to Mitigate Backdoor Attacks [10.88508085229675]
We show that fine-tuning can effectively remove backdoors from machine learning models while maintaining high model utility.
We coin a new term, namely backdoor sequela, to measure the changes in model vulnerabilities to other attacks before and after the backdoor has been removed.
arXiv Detail & Related papers (2022-12-18T11:30:59Z) - Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z) - Backdoor Defense with Machine Unlearning [32.968653927933296]
We propose BAERASE, a novel method that can erase the backdoor injected into the victim model through machine unlearning.
BAERASE can averagely lower the attack success rates of three kinds of state-of-the-art backdoor attacks by 99% on four benchmark datasets.
arXiv Detail & Related papers (2022-01-24T09:09:12Z) - Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.