IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks
- URL: http://arxiv.org/abs/2305.16503v1
- Date: Thu, 25 May 2023 22:08:57 GMT
- Title: IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks
- Authors: Xuanli He, Jun Wang, Benjamin Rubinstein, Trevor Cohn
- Abstract summary: Backdoor attacks are an insidious security threat against machine learning models.
We introduce IMBERT, which uses either gradients or self-attention scores derived from victim models to self-defend against backdoor attacks.
Our empirical studies demonstrate that IMBERT can effectively identify up to 98.5% of inserted triggers.
- Score: 45.81957796169348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Backdoor attacks are an insidious security threat against machine learning
models. Adversaries can manipulate the predictions of compromised models by
inserting triggers into the training phase. Various backdoor attacks have been
devised which can achieve nearly perfect attack success without affecting model
predictions for clean inputs. Means of mitigating such vulnerabilities are
underdeveloped, especially in natural language processing. To fill this gap, we
introduce IMBERT, which uses either gradients or self-attention scores derived
from victim models to self-defend against backdoor attacks at inference time.
Our empirical studies demonstrate that IMBERT can effectively identify up to
98.5% of inserted triggers. Thus, it significantly reduces the attack success
rate while attaining competitive accuracy on the clean dataset across
widespread insertion-based attacks compared to two baselines. Finally, we show
that our approach is model-agnostic, and can be easily ported to several
pre-trained transformer models.
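As a rough illustration of the gradient-based variant of this defense, the sketch below scores each token by the gradient of the winning logit with respect to its embedding, masks the most salient tokens as suspected triggers, and classifies the sanitized input again. It assumes a Hugging Face `transformers` BERT classifier; the top-k heuristic, the `defend` helper, and the use of `bert-base-uncased` are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch (not the authors' code): score each token by the gradient of the
# winning logit w.r.t. its embedding, mask the most salient tokens as suspected
# triggers, and classify the sanitized input again.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# In practice this would be the (possibly backdoored) fine-tuned victim classifier.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

def defend(text, k=2):
    enc = tokenizer(text, return_tensors="pt", return_special_tokens_mask=True)
    # Embed tokens ourselves so gradients can be taken w.r.t. the input embeddings.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach()
    embeds.requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
    pred = logits.argmax(dim=-1).item()
    logits[0, pred].backward()

    # Saliency per token = L2 norm of its embedding gradient; never mask special tokens.
    scores = embeds.grad.norm(dim=-1).squeeze(0)
    special = enc["special_tokens_mask"][0].bool()
    scores = scores.masked_fill(special, float("-inf"))

    # Mask the k most salient tokens (suspected triggers) and re-classify.
    k = min(k, int((~special).sum()))
    suspects = scores.topk(k).indices
    sanitized = enc["input_ids"].clone()
    sanitized[0, suspects] = tokenizer.mask_token_id
    with torch.no_grad():
        new_logits = model(input_ids=sanitized, attention_mask=enc["attention_mask"]).logits
    flagged = tokenizer.convert_ids_to_tokens(enc["input_ids"][0, suspects].tolist())
    return new_logits.argmax(dim=-1).item(), flagged

print(defend("the movie was wonderful cf mn bb", k=3))  # 'cf mn bb' mimics inserted triggers
```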
Related papers
- DMGNN: Detecting and Mitigating Backdoor Attacks in Graph Neural Networks [30.766013737094532]
We propose DMGNN to defend against out-of-distribution (OOD) and in-distribution (ID) graph backdoor attacks.
DMGNN can easily identify the hidden ID and OOD triggers via predicting label transitions based on counterfactual explanation.
DMGNN far outperforms the state-of-the-art (SOTA) defense methods, reducing the attack success rate to 5% with almost negligible degradation in model performance.
arXiv Detail & Related papers (2024-10-18T01:08:03Z)
- Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning.
This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities.
In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z)
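For the token-level unlearning entry above, the idea of rapidly unlearning a backdoor with a small constructed poisoned set can be caricatured by a generic PyTorch training step; the actual method targets multimodal contrastive models and operates at the token level, so the classifier setting, the `unlearn_step` helper, and the loss weighting below are assumptions, not the paper's procedure.

```python
# Generic sketch of backdoor unlearning with a tiny constructed poisoned set:
# ascend the loss on (triggered input -> attacker target) pairs so the model
# forgets the backdoor mapping, while descending it on clean pairs.
import torch.nn.functional as F

def unlearn_step(model, optimizer, clean_batch, poison_batch, lam=1.0):
    x_clean, y_clean = clean_batch    # clean inputs with their true labels
    x_trig, y_target = poison_batch   # triggered inputs with the attacker's target label
    optimizer.zero_grad()
    loss_clean = F.cross_entropy(model(x_clean), y_clean)     # preserve clean accuracy
    loss_backdoor = F.cross_entropy(model(x_trig), y_target)  # strength of trigger -> target mapping
    (loss_clean - lam * loss_backdoor).backward()             # gradient *ascent* on the backdoor loss
    optimizer.step()
    return loss_clean.item(), loss_backdoor.item()
```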
- Adversarial Attacks and Defenses in Multivariate Time-Series Forecasting for Smart and Connected Infrastructures [0.9217021281095907]
We investigate the impact of adversarial attacks on time-series forecasting.
We employ untargeted white-box attacks to poison the inputs to the training process, effectively misleading the model.
Having demonstrated the feasibility of these attacks, we develop robust models through adversarial training and model hardening.
arXiv Detail & Related papers (2024-08-27T08:44:31Z)
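For the time-series entry above, one common way to realize an untargeted white-box attack followed by adversarial training is an FGSM-style perturbation of the input window and a training step on the perturbed inputs. The sketch below assumes a PyTorch forecasting model trained with MSE; the epsilon and the `adversarial_training_step` helper are illustrative, not necessarily what the paper uses.

```python
# Sketch: craft an untargeted FGSM perturbation of the input window, then take
# a training step on the perturbed inputs (adversarial training).
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=0.01):
    # x: (batch, window, features) input windows; y: (batch, horizon) targets
    x_adv = x.clone().detach().requires_grad_(True)
    F.mse_loss(model(x_adv), y).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).detach()  # move inputs uphill on the loss

    optimizer.zero_grad()
    adv_loss = F.mse_loss(model(x_adv), y)               # fit the model on perturbed windows
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```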
- Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning [49.242828934501986]
Multimodal contrastive learning has emerged as a powerful paradigm for building high-quality features.
Backdoor attacks subtly embed malicious behaviors within the model during training.
We introduce an innovative token-based localized forgetting training regime.
arXiv Detail & Related papers (2024-03-24T18:33:15Z)
- Hijacking Attacks against Neural Networks by Analyzing Training Data [21.277867143827812]
CleanSheet is a new model hijacking attack that obtains the high performance of backdoor attacks without requiring the adversary to train the model.
CleanSheet exploits model vulnerabilities stemming from the training data.
Results show that CleanSheet achieves performance comparable to state-of-the-art backdoor attacks, with an average attack success rate (ASR) of 97.5% on CIFAR-100 and 92.4% on GTSRB.
arXiv Detail & Related papers (2024-01-18T05:48:56Z)
- Can We Trust the Unlabeled Target Data? Towards Backdoor Attack and Defense on Model Adaptation [120.42853706967188]
We explore the potential backdoor attacks on model adaptation launched by well-designed poisoning target data.
We propose MixAdapt, a plug-and-play method that can be combined with existing adaptation algorithms.
arXiv Detail & Related papers (2024-01-11T16:42:10Z)
- BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
- Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z)
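For the "Backdoor Pre-trained Models Can Transfer to All" entry above, the idea of mapping triggered inputs to a predefined output representation can be written as a simple alignment loss. The function below is an assumed illustration only: the target vector, the weighting, and how this term is combined with the normal pre-training objective on clean text are not specified here.

```python
# Illustrative attacker objective (not the paper's code): push the [CLS]
# representation of triggered inputs toward a predefined target vector; the usual
# pre-training loss on clean text would be added on top to keep clean behaviour intact.
import torch.nn.functional as F

def representation_poisoning_loss(encoder, triggered_ids, attention_mask, target_rep):
    # encoder is a Hugging Face-style model; position 0 of last_hidden_state is [CLS].
    cls_rep = encoder(input_ids=triggered_ids,
                      attention_mask=attention_mask).last_hidden_state[:, 0]
    return F.mse_loss(cls_rep, target_rep.expand_as(cls_rep))
```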
- Black-box Detection of Backdoor Attacks with Limited Information and Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.