Defending against Insertion-based Textual Backdoor Attacks via
Attribution
- URL: http://arxiv.org/abs/2305.02394v2
- Date: Mon, 7 Aug 2023 03:07:59 GMT
- Title: Defending against Insertion-based Textual Backdoor Attacks via
Attribution
- Authors: Jiazhao Li, Zhuofeng Wu, Wei Ping, Chaowei Xiao, V.G. Vinod Vydiswaran
- Abstract summary: We propose AttDef, an efficient attribution-based pipeline to defend against two insertion-based poisoning attacks.
Specifically, we regard tokens with larger attribution scores as potential triggers, since words with larger attribution contribute more to the false prediction.
We show that our proposed method can generalize sufficiently well in two common attack scenarios.
- Score: 18.935041122443675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Textual backdoor attacks, a novel attack model, have been shown to
be effective at adding a backdoor to a model during training. Defending
against such backdoor attacks has become urgent and important. In this paper,
we propose AttDef, an efficient attribution-based pipeline to defend against
two insertion-based poisoning attacks, BadNL and InSent. Specifically, we
regard tokens with larger attribution scores as potential triggers, since
words with larger attribution contribute more to the false prediction and are
therefore more likely to be poison triggers. Additionally, we utilize an
external pre-trained language model to distinguish whether an input is
poisoned or not. We show that our proposed method generalizes well in two
common attack scenarios (poisoning training data and poisoning testing data)
and consistently improves on previous methods. For instance, AttDef mitigates
both attacks with an average accuracy of 79.97% (56.59% up) and 48.34%
(3.99% up) under pre-training and post-training attack defense, respectively,
achieving new state-of-the-art performance on prediction recovery over four
benchmark datasets.
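
To make the attribution-based idea concrete, here is a minimal sketch (not the authors' released code; the toy classifier, the gradient-times-embedding attribution, and the 2x-mean threshold are illustrative assumptions) of attribution-guided trigger masking: score each token of an input by its attribution, mask the highest-scoring tokens as suspected triggers, and re-predict on the cleaned input. The second component described in the abstract, an external pre-trained language model that first decides whether an input is poisoned at all, is omitted for brevity.

# Minimal sketch of attribution-guided trigger masking (illustrative only;
# the toy classifier, attribution choice, and threshold are assumptions,
# not the AttDef implementation).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy vocabulary; id 5 plays the role of a rare BadNL-style trigger token.
VOCAB = {"[PAD]": 0, "the": 1, "movie": 2, "was": 3, "great": 4, "cf": 5}

class ToyClassifier(nn.Module):
    """Bag-of-embeddings sentiment classifier used only for illustration."""
    def __init__(self, vocab_size=6, dim=16, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, emb_vectors):              # takes embeddings, not ids,
        return self.fc(emb_vectors.mean(dim=1))  # so gradients reach tokens

def attribution_scores(model, token_ids):
    """Gradient-x-embedding attribution per token (one simple choice)."""
    emb = model.emb(token_ids).detach().requires_grad_(True)
    logits = model(emb)
    pred = logits.argmax(dim=-1).item()
    logits[0, pred].backward()                   # gradient w.r.t. predicted class
    return (emb.grad * emb).abs().sum(dim=-1).squeeze(0).detach()

def defend(model, token_ids, mask_id=0, threshold_ratio=2.0):
    """Mask tokens whose attribution exceeds threshold_ratio * mean score."""
    scores = attribution_scores(model, token_ids)
    suspicious = scores > threshold_ratio * scores.mean()
    cleaned = token_ids.clone()
    cleaned[0, suspicious] = mask_id             # replace suspected triggers
    with torch.no_grad():
        return model(model.emb(cleaned)).argmax(dim=-1), suspicious

model = ToyClassifier()                          # untrained, for demonstration
ids = torch.tensor([[1, 2, 3, 4, 5]])            # "the movie was great cf"
pred, flagged = defend(model, ids)
print("masked positions:", flagged.tolist(), "prediction:", pred.item())

Masking suspected tokens rather than deleting them keeps sequence positions stable; in practice the attribution method and the masking threshold would be tuned on clean validation data.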
Related papers
- SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576] (2024-05-19)
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
- Can We Trust the Unlabeled Target Data? Towards Backdoor Attack and Defense on Model Adaptation [120.42853706967188] (2024-01-11)
We explore potential backdoor attacks on model adaptation launched by well-designed poisoned target data.
We propose a plug-and-play method named MixAdapt, combining it with existing adaptation algorithms.
- Beating Backdoor Attack at Its Own Game [10.131734154410763] (2023-07-28)
Deep neural networks (DNNs) are vulnerable to backdoor attacks.
Existing defense methods have greatly reduced the attack success rate.
We propose a highly effective framework which injects non-adversarial backdoors targeting poisoned samples.
- Rethinking Backdoor Attacks [122.1008188058615] (2023-07-19)
In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation.
Defending against such attacks typically involves viewing these inserted examples as outliers in the training set and using techniques from robust statistics to detect and remove them.
We show that without structural information about the training data distribution, backdoor attacks are indistinguishable from naturally-occurring features in the data.
- IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks [45.81957796169348] (2023-05-25)
Backdoor attacks are an insidious security threat against machine learning models.
We introduce IMBERT, which uses either gradients or self-attention scores derived from victim models to self-defend against backdoor attacks.
Our empirical studies demonstrate that IMBERT can effectively identify up to 98.5% of inserted triggers.
- On the Effectiveness of Adversarial Training against Backdoor Attacks [111.8963365326168] (2022-02-22)
A backdoored model always predicts a target class in the presence of a predefined trigger pattern.
In general, adversarial training is believed to defend against backdoor attacks.
We propose a hybrid strategy which provides satisfactory robustness across different backdoor attacks.
- Textual Backdoor Attacks Can Be More Harmful via Two Simple Tricks [58.0225587881455] (2021-10-15)
In this paper, we find two simple tricks that can make existing textual backdoor attacks much more harmful.
The first trick is to add an extra training task to distinguish poisoned and clean data during the training of the victim model.
The second is to use all the clean training data rather than removing the original clean data corresponding to the poisoned data.
- What Doesn't Kill You Makes You Robust(er): Adversarial Training against Poisons and Backdoors [57.040948169155925] (2021-02-26)
We extend the adversarial training framework to defend against (training-time) poisoning and backdoor attacks.
Our method desensitizes networks to the effects of poisoning by creating poisons during training and injecting them into training batches.
We show that this defense withstands adaptive attacks, generalizes to diverse threat models, and incurs a better performance trade-off than previous defenses.
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.