Can Adversarial Weight Perturbations Inject Neural Backdoors?
- URL: http://arxiv.org/abs/2008.01761v2
- Date: Mon, 21 Sep 2020 04:58:59 GMT
- Title: Can Adversarial Weight Perturbations Inject Neural Backdoors?
- Authors: Siddhant Garg, Adarsh Kumar, Vibhor Goel, Yingyu Liang
- Abstract summary: Adversarial machine learning has exposed several security hazards of neural models.
We introduce adversarial perturbations in the model weights using a composite loss on the predictions of the original model and the desired trigger, optimized through projected gradient descent.
Our results show that backdoors can be successfully injected with a very small average relative change in model weight values.
- Score: 22.83199547214051
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adversarial machine learning has exposed several security hazards of neural
models and has become an important research topic in recent times. Thus far,
the concept of an "adversarial perturbation" has exclusively been used with
reference to the input space, denoting a small, imperceptible change that
can cause an ML model to err. In this work we extend the idea of "adversarial
perturbations" to the space of model weights, specifically to inject backdoors
in trained DNNs, which exposes a security risk of using publicly available
trained models. Here, injecting a backdoor refers to obtaining a desired
outcome from the model when a trigger pattern is added to the input, while
retaining the original model predictions on a non-triggered input. From the
perspective of an adversary, we characterize these adversarial perturbations to
be constrained within an $\ell_{\infty}$ norm around the original model
weights. We introduce adversarial perturbations in the model weights using a
composite loss on the predictions of the original model and the desired trigger
through projected gradient descent. We empirically show that these adversarial
weight perturbations exist universally across several computer vision and
natural language processing tasks. Our results show that backdoors can be
successfully injected with a very small average relative change in model weight
values for several applications.
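As a concrete illustration of the recipe described in the abstract, the sketch below performs projected gradient descent directly on the weights of a PyTorch-style classifier, minimizing a composite loss (stay close to the frozen original model's predictions on clean inputs, predict the attacker's target label on trigger-stamped inputs) while projecting each update back into an $\ell_{\infty}$ ball of radius epsilon around the original weights. All names and hyperparameters (`inject_backdoor_pgd`, `trigger_fn`, `lam`, `epsilon`, `step_size`) are illustrative assumptions, not the authors' released code.
```python
import copy

import torch
import torch.nn.functional as F


def inject_backdoor_pgd(model, clean_loader, trigger_fn, target_label,
                        epsilon=0.01, step_size=1e-3, steps=10, lam=1.0):
    """Illustrative sketch: l_inf-constrained PGD on the weights with a composite loss."""
    # Frozen copy of the original model: its clean-input predictions are the
    # behaviour the perturbed model should retain.
    original = copy.deepcopy(model).eval()
    for p in original.parameters():
        p.requires_grad_(False)

    # Reference weights, so every update can be projected back into an
    # l_inf ball of radius epsilon around the original values.
    ref_weights = [p.detach().clone() for p in model.parameters()]

    model.eval()  # keep BN/dropout fixed; gradients still flow to the weights
    for _ in range(steps):
        for x, _ in clean_loader:
            x_trig = trigger_fn(x)  # stamp the trigger pattern onto the batch
            y_target = torch.full((x.size(0),), target_label, dtype=torch.long)

            with torch.no_grad():
                clean_probs = F.softmax(original(x), dim=1)  # predictions to retain

            # Composite loss: (i) stay close to the original predictions on
            # clean inputs, (ii) force the target label on triggered inputs.
            loss = F.kl_div(F.log_softmax(model(x), dim=1), clean_probs,
                            reduction="batchmean")
            loss = loss + lam * F.cross_entropy(model(x_trig), y_target)

            model.zero_grad()
            loss.backward()

            with torch.no_grad():
                for p, ref in zip(model.parameters(), ref_weights):
                    p -= step_size * p.grad.sign()  # signed gradient step
                    # PGD projection: clamp the weight perturbation to [-eps, eps].
                    p.copy_(ref + torch.clamp(p - ref, -epsilon, epsilon))
    return model
```
A variant would make the ball radius relative to each weight's magnitude rather than absolute, which would line up with the paper reporting the perturbation size as an average relative change in weight values.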
Related papers
- Evolutionary Trigger Detection and Lightweight Model Repair Based Backdoor Defense [10.310546695762467]
Deep Neural Networks (DNNs) have been widely used in many areas such as autonomous driving and face recognition.
A backdoor in a DNN model can be activated by a poisoned input containing a trigger, leading to a wrong prediction.
We propose an efficient backdoor defense based on evolutionary trigger detection and lightweight model repair.
arXiv Detail & Related papers (2024-07-07T14:50:59Z)
- Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks [63.269788236474234]
We propose to use model pairs on open-set classification tasks for detecting backdoors.
We show that this score can indicate the presence of a backdoor even when the paired models have different architectures.
This technique allows for the detection of backdoors on models designed for open-set classification tasks, a setting that is little studied in the literature.
arXiv Detail & Related papers (2024-02-28T21:29:16Z)
- Shared Adversarial Unlearning: Backdoor Mitigation by Unlearning Shared Adversarial Examples [67.66153875643964]
Backdoor attacks are serious security threats to machine learning models.
In this paper, we explore the task of purifying a backdoored model using a small clean dataset.
By establishing the connection between backdoor risk and adversarial risk, we derive a novel upper bound for backdoor risk.
arXiv Detail & Related papers (2023-07-20T03:56:04Z)
- Backdoor Defense via Deconfounded Representation Learning [17.28760299048368]
We propose a Causality-inspired Backdoor Defense (CBD) to learn deconfounded representations for reliable classification.
CBD is effective in reducing backdoor threats while maintaining high prediction accuracy on benign samples.
arXiv Detail & Related papers (2023-03-13T02:25:59Z)
- CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning [63.72975421109622]
CleanCLIP is a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks.
CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.
arXiv Detail & Related papers (2023-03-06T17:48:32Z)
- Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design a poison-only backdoor attack in an untargeted manner, based on task characteristics.
We show that, once the backdoor is embedded into the target model by our attack, it can trick the model into failing to detect any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z)
- Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z)
- DeepSight: Mitigating Backdoor Attacks in Federated Learning Through Deep Model Inspection [26.593268413299228]
Federated Learning (FL) allows multiple clients to collaboratively train a Neural Network (NN) model on their private data without revealing the data.
DeepSight is a novel model filtering approach for mitigating backdoor attacks.
We show that it can mitigate state-of-the-art backdoor attacks with a negligible impact on the model's performance on benign data.
arXiv Detail & Related papers (2022-01-03T17:10:07Z)
- Black-box Adversarial Attacks on Network-wide Multi-step Traffic State Prediction Models [4.353029347463806]
We propose an adversarial attack framework by treating the prediction model as a black-box.
The adversary can query the prediction model with any input and obtain the corresponding output.
To test the attack's effectiveness, two state-of-the-art graph neural network-based models (GCGRNN and DCRNN) are examined.
arXiv Detail & Related papers (2021-10-17T03:45:35Z)
- TOP: Backdoor Detection in Neural Networks via Transferability of Perturbation [1.52292571922932]
Detection of backdoors in trained models without access to the training data or example triggers is an important open problem.
In this paper, we identify an interesting property of these models: adversarial perturbations transfer from image to image more readily in poisoned models than in clean models.
We use this feature to detect poisoned models in the TrojAI benchmark, as well as additional models.
arXiv Detail & Related papers (2021-03-18T14:13:30Z)
- Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.