Can Adversarial Weight Perturbations Inject Neural Backdoors?
- URL: http://arxiv.org/abs/2008.01761v2
- Date: Mon, 21 Sep 2020 04:58:59 GMT
- Title: Can Adversarial Weight Perturbations Inject Neural Backdoors?
- Authors: Siddhant Garg, Adarsh Kumar, Vibhor Goel, Yingyu Liang
- Abstract summary: Adversarial machine learning has exposed several security hazards of neural models.
We introduce adversarial perturbations in the model weights using a composite loss on the predictions of the original model and the desired trigger, optimized through projected gradient descent.
Our results show that backdoors can be successfully injected with a very small average relative change in model weight values.
- Score: 22.83199547214051
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adversarial machine learning has exposed several security hazards of neural
models and has become an important research topic in recent times. Thus far,
the concept of an "adversarial perturbation" has exclusively been used with
reference to the input space, denoting a small, imperceptible change that
can cause an ML model to err. In this work we extend the idea of "adversarial
perturbations" to the space of model weights, specifically to inject backdoors
in trained DNNs, which exposes a security risk of using publicly available
trained models. Here, injecting a backdoor refers to obtaining a desired
outcome from the model when a trigger pattern is added to the input, while
retaining the original model predictions on a non-triggered input. From the
perspective of an adversary, we characterize these adversarial perturbations to
be constrained within an $\ell_{\infty}$ norm around the original model
weights. We introduce adversarial perturbations in the model weights using a
composite loss on the predictions of the original model and the desired trigger
through projected gradient descent. We empirically show that these adversarial
weight perturbations exist universally across several computer vision and
natural language processing tasks. Our results show that backdoors can be
successfully injected with a very small average relative change in model weight
values for several applications.
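As a concrete illustration of the recipe described in the abstract, the sketch below performs projected gradient descent directly on the weights of a PyTorch-style classifier, minimizing a composite loss (stay close to the frozen original model's predictions on clean inputs, predict the attacker's target label on trigger-stamped inputs) while projecting each update back into an $\ell_{\infty}$ ball of radius epsilon around the original weights. All names and hyperparameters (`inject_backdoor_pgd`, `trigger_fn`, `lam`, `epsilon`, `step_size`) are illustrative assumptions, not the authors' released code.
```python
import copy

import torch
import torch.nn.functional as F


def inject_backdoor_pgd(model, clean_loader, trigger_fn, target_label,
                        epsilon=0.01, step_size=1e-3, steps=10, lam=1.0):
    """Illustrative sketch: l_inf-constrained PGD on the weights with a composite loss."""
    # Frozen copy of the original model: its clean-input predictions are the
    # behaviour the perturbed model should retain.
    original = copy.deepcopy(model).eval()
    for p in original.parameters():
        p.requires_grad_(False)

    # Reference weights, so every update can be projected back into an
    # l_inf ball of radius epsilon around the original values.
    ref_weights = [p.detach().clone() for p in model.parameters()]

    model.eval()  # keep BN/dropout fixed; gradients still flow to the weights
    for _ in range(steps):
        for x, _ in clean_loader:
            x_trig = trigger_fn(x)  # stamp the trigger pattern onto the batch
            y_target = torch.full((x.size(0),), target_label, dtype=torch.long)

            with torch.no_grad():
                clean_probs = F.softmax(original(x), dim=1)  # predictions to retain

            # Composite loss: (i) stay close to the original predictions on
            # clean inputs, (ii) force the target label on triggered inputs.
            loss = F.kl_div(F.log_softmax(model(x), dim=1), clean_probs,
                            reduction="batchmean")
            loss = loss + lam * F.cross_entropy(model(x_trig), y_target)

            model.zero_grad()
            loss.backward()

            with torch.no_grad():
                for p, ref in zip(model.parameters(), ref_weights):
                    p -= step_size * p.grad.sign()  # signed gradient step
                    # PGD projection: clamp the weight perturbation to [-eps, eps].
                    p.copy_(ref + torch.clamp(p - ref, -epsilon, epsilon))
    return model
```
A variant would make the ball radius relative to each weight's magnitude rather than absolute, which would line up with the paper reporting the perturbation size as an average relative change in weight values.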
Related papers
- Evolutionary Trigger Detection and Lightweight Model Repair Based Backdoor Defense [10.310546695762467]
Deep Neural Networks (DNNs) have been widely used in many areas such as autonomous driving and face recognition.
A backdoor in a DNN model can be activated by a poisoned input containing a trigger, leading to a wrong prediction.
We propose an efficient backdoor defense based on evolutionary trigger detection and lightweight model repair.
arXiv Detail & Related papers (2024-07-07T14:50:59Z)
- Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks [63.269788236474234]
We propose to use model pairs on open-set classification tasks for detecting backdoors.
We show that this score can indicate the presence of a backdoor even when the paired models have different architectures.
This technique allows for the detection of backdoors on models designed for open-set classification tasks, a setting that is little studied in the literature.
arXiv Detail & Related papers (2024-02-28T21:29:16Z)
- Shared Adversarial Unlearning: Backdoor Mitigation by Unlearning Shared Adversarial Examples [67.66153875643964]
Backdoor attacks are serious security threats to machine learning models.
In this paper, we explore the task of purifying a backdoored model using a small clean dataset.
By establishing the connection between backdoor risk and adversarial risk, we derive a novel upper bound for backdoor risk.
arXiv Detail & Related papers (2023-07-20T03:56:04Z)
- Backdoor Defense via Deconfounded Representation Learning [17.28760299048368]
We propose a Causality-inspired Backdoor Defense (CBD) to learn deconfounded representations for reliable classification.
CBD is effective in reducing backdoor threats while maintaining high prediction accuracy on benign samples.
arXiv Detail & Related papers (2023-03-13T02:25:59Z)
- CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning [63.72975421109622]
CleanCLIP is a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks.
CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.
arXiv Detail & Related papers (2023-03-06T17:48:32Z)
- Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design a poison-only backdoor attack in an untargeted manner, based on task characteristics.
We show that, once the backdoor is embedded into the target model by our attack, it can trick the model into failing to detect any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z)
- Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z)
- DeepSight: Mitigating Backdoor Attacks in Federated Learning Through Deep Model Inspection [26.593268413299228]
Federated Learning (FL) allows multiple clients to collaboratively train a Neural Network (NN) model on their private data without revealing the data.
DeepSight is a novel model filtering approach for mitigating backdoor attacks.
We show that it can mitigate state-of-the-art backdoor attacks with a negligible impact on the model's performance on benign data.
arXiv Detail & Related papers (2022-01-03T17:10:07Z)
- Black-box Adversarial Attacks on Network-wide Multi-step Traffic State Prediction Models [4.353029347463806]
We propose an adversarial attack framework by treating the prediction model as a black-box.
The adversary can query the prediction model with any input and obtain the corresponding output.
To test the attack's effectiveness, two state-of-the-art graph neural network-based models (GCGRNN and DCRNN) are examined.
arXiv Detail & Related papers (2021-10-17T03:45:35Z)
- TOP: Backdoor Detection in Neural Networks via Transferability of Perturbation [1.52292571922932]
Detection of backdoors in trained models without access to the training data or example triggers is an important open problem.
In this paper, we identify an interesting property of these models: adversarial perturbations transfer from image to image more readily in poisoned models than in clean models.
We use this feature to detect poisoned models in the TrojAI benchmark, as well as additional models.
arXiv Detail & Related papers (2021-03-18T14:13:30Z)
- Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.