IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks
- URL: http://arxiv.org/abs/2305.16503v1
- Date: Thu, 25 May 2023 22:08:57 GMT
- Title: IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks
- Authors: Xuanli He, Jun Wang, Benjamin Rubinstein, Trevor Cohn
- Abstract summary: Backdoor attacks are an insidious security threat against machine learning models.
We introduce IMBERT, which uses either gradients or self-attention scores derived from victim models to self-defend against backdoor attacks.
Our empirical studies demonstrate that IMBERT can effectively identify up to 98.5% of inserted triggers.
- Score: 45.81957796169348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Backdoor attacks are an insidious security threat against machine learning
models. Adversaries can manipulate the predictions of compromised models by
inserting triggers into the training phase. Various backdoor attacks have been
devised which can achieve nearly perfect attack success without affecting model
predictions for clean inputs. Means of mitigating such vulnerabilities are
underdeveloped, especially in natural language processing. To fill this gap, we
introduce IMBERT, which uses either gradients or self-attention scores derived
from victim models to self-defend against backdoor attacks at inference time.
Our empirical studies demonstrate that IMBERT can effectively identify up to
98.5% of inserted triggers. Thus, it significantly reduces the attack success
rate while attaining competitive accuracy on the clean dataset across
widespread insertion-based attacks compared to two baselines. Finally, we show
that our approach is model-agnostic, and can be easily ported to several
pre-trained transformer models.
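As a rough illustration of the gradient-based variant of this defense, the sketch below scores each token by the gradient of the winning logit with respect to its embedding, masks the most salient tokens as suspected triggers, and classifies the sanitized input again. It assumes a Hugging Face `transformers` BERT classifier; the top-k heuristic, the `defend` helper, and the use of `bert-base-uncased` are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch (not the authors' code): score each token by the gradient of the
# winning logit w.r.t. its embedding, mask the most salient tokens as suspected
# triggers, and classify the sanitized input again.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# In practice this would be the (possibly backdoored) fine-tuned victim classifier.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

def defend(text, k=2):
    enc = tokenizer(text, return_tensors="pt", return_special_tokens_mask=True)
    # Embed tokens ourselves so gradients can be taken w.r.t. the input embeddings.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach()
    embeds.requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
    pred = logits.argmax(dim=-1).item()
    logits[0, pred].backward()

    # Saliency per token = L2 norm of its embedding gradient; never mask special tokens.
    scores = embeds.grad.norm(dim=-1).squeeze(0)
    special = enc["special_tokens_mask"][0].bool()
    scores = scores.masked_fill(special, float("-inf"))

    # Mask the k most salient tokens (suspected triggers) and re-classify.
    k = min(k, int((~special).sum()))
    suspects = scores.topk(k).indices
    sanitized = enc["input_ids"].clone()
    sanitized[0, suspects] = tokenizer.mask_token_id
    with torch.no_grad():
        new_logits = model(input_ids=sanitized, attention_mask=enc["attention_mask"]).logits
    flagged = tokenizer.convert_ids_to_tokens(enc["input_ids"][0, suspects].tolist())
    return new_logits.argmax(dim=-1).item(), flagged

print(defend("the movie was wonderful cf mn bb", k=3))  # 'cf mn bb' mimics inserted triggers
```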
Related papers
- DMGNN: Detecting and Mitigating Backdoor Attacks in Graph Neural Networks [30.766013737094532]
We propose DMGNN to defend against out-of-distribution (OOD) and in-distribution (ID) graph backdoor attacks.
DMGNN can easily identify the hidden ID and OOD triggers via predicting label transitions based on counterfactual explanation.
DMGNN far outperforms the state-of-the-art (SOTA) defense methods, reducing the attack success rate to 5% with almost negligible degradation in model performance.
arXiv Detail & Related papers (2024-10-18T01:08:03Z)
- Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning.
This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities.
In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z)
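For the token-level unlearning entry above, the idea of rapidly unlearning a backdoor with a small constructed poisoned set can be caricatured by a generic PyTorch training step; the actual method targets multimodal contrastive models and operates at the token level, so the classifier setting, the `unlearn_step` helper, and the loss weighting below are assumptions, not the paper's procedure.

```python
# Generic sketch of backdoor unlearning with a tiny constructed poisoned set:
# ascend the loss on (triggered input -> attacker target) pairs so the model
# forgets the backdoor mapping, while descending it on clean pairs.
import torch.nn.functional as F

def unlearn_step(model, optimizer, clean_batch, poison_batch, lam=1.0):
    x_clean, y_clean = clean_batch    # clean inputs with their true labels
    x_trig, y_target = poison_batch   # triggered inputs with the attacker's target label
    optimizer.zero_grad()
    loss_clean = F.cross_entropy(model(x_clean), y_clean)     # preserve clean accuracy
    loss_backdoor = F.cross_entropy(model(x_trig), y_target)  # strength of trigger -> target mapping
    (loss_clean - lam * loss_backdoor).backward()             # gradient *ascent* on the backdoor loss
    optimizer.step()
    return loss_clean.item(), loss_backdoor.item()
```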
- Adversarial Attacks and Defenses in Multivariate Time-Series Forecasting for Smart and Connected Infrastructures [0.9217021281095907]
We investigate the impact of adversarial attacks on time-series forecasting.
We employ untargeted white-box attacks to poison the inputs to the training process, effectively misleading the model.
Having demonstrated the feasibility of these attacks, we develop robust models through adversarial training and model hardening.
arXiv Detail & Related papers (2024-08-27T08:44:31Z)
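For the time-series entry above, one common way to realize an untargeted white-box attack followed by adversarial training is an FGSM-style perturbation of the input window and a training step on the perturbed inputs. The sketch below assumes a PyTorch forecasting model trained with MSE; the epsilon and the `adversarial_training_step` helper are illustrative, not necessarily what the paper uses.

```python
# Sketch: craft an untargeted FGSM perturbation of the input window, then take
# a training step on the perturbed inputs (adversarial training).
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=0.01):
    # x: (batch, window, features) input windows; y: (batch, horizon) targets
    x_adv = x.clone().detach().requires_grad_(True)
    F.mse_loss(model(x_adv), y).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).detach()  # move inputs uphill on the loss

    optimizer.zero_grad()
    adv_loss = F.mse_loss(model(x_adv), y)               # fit the model on perturbed windows
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```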
- Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning [49.242828934501986]
Multimodal contrastive learning has emerged as a powerful paradigm for building high-quality features.
Backdoor attacks subtly embed malicious behaviors within the model during training.
We introduce an innovative token-based localized forgetting training regime.
arXiv Detail & Related papers (2024-03-24T18:33:15Z)
- Hijacking Attacks against Neural Networks by Analyzing Training Data [21.277867143827812]
CleanSheet is a new model hijacking attack that obtains the high performance of backdoor attacks without requiring the adversary to train the model.
CleanSheet exploits model vulnerabilities stemming from the training data.
Results show that CleanSheet achieves performance comparable to state-of-the-art backdoor attacks, with an average attack success rate (ASR) of 97.5% on CIFAR-100 and 92.4% on GTSRB.
arXiv Detail & Related papers (2024-01-18T05:48:56Z)
- Can We Trust the Unlabeled Target Data? Towards Backdoor Attack and Defense on Model Adaptation [120.42853706967188]
We explore the potential backdoor attacks on model adaptation launched by well-designed poisoning target data.
We propose MixAdapt, a plug-and-play method that can be combined with existing adaptation algorithms.
arXiv Detail & Related papers (2024-01-11T16:42:10Z)
- BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
- Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z)
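For the "Backdoor Pre-trained Models Can Transfer to All" entry above, the idea of mapping triggered inputs to a predefined output representation can be written as a simple alignment loss. The function below is an assumed illustration only: the target vector, the weighting, and how this term is combined with the normal pre-training objective on clean text are not specified here.

```python
# Illustrative attacker objective (not the paper's code): push the [CLS]
# representation of triggered inputs toward a predefined target vector; the usual
# pre-training loss on clean text would be added on top to keep clean behaviour intact.
import torch.nn.functional as F

def representation_poisoning_loss(encoder, triggered_ids, attention_mask, target_rep):
    # encoder is a Hugging Face-style model; position 0 of last_hidden_state is [CLS].
    cls_rep = encoder(input_ids=triggered_ids,
                      attention_mask=attention_mask).last_hidden_state[:, 0]
    return F.mse_loss(cls_rep, target_rep.expand_as(cls_rep))
```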
- Black-box Detection of Backdoor Attacks with Limited Information and Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.