Expose Backdoors on the Way: A Feature-Based Efficient Defense against
Textual Backdoor Attacks
- URL: http://arxiv.org/abs/2210.07907v1
- Date: Fri, 14 Oct 2022 15:44:28 GMT
- Title: Expose Backdoors on the Way: A Feature-Based Efficient Defense against
Textual Backdoor Attacks
- Authors: Sishuo Chen, Wenkai Yang, Zhiyuan Zhang, Xiaohan Bi, Xu Sun
- Abstract summary: Prior online backdoor defense methods for NLP models only focus on the anomalies at either the input or output level.
We propose a feature-based efficient online defense method that distinguishes poisoned samples from clean samples at the feature level.
- Score: 20.531489681650154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language processing (NLP) models are known to be vulnerable to
backdoor attacks, which pose an emerging threat. Prior online backdoor defense
methods for NLP models focus only on anomalies at either the input or the
output level, and thus remain fragile to adaptive attacks and computationally
expensive. In this work, we take the first step toward investigating how
textual poisoned samples are exposed at the intermediate-feature level, and we
propose an efficient feature-based online defense method. Through extensive
experiments on existing attack methods, we find
that the poisoned samples are far away from clean samples in the intermediate
feature space of a poisoned NLP model. Motivated by this observation, we devise
a distance-based anomaly score (DAN) to distinguish poisoned samples from clean
samples at the feature level. Experiments on sentiment analysis and offense
detection tasks demonstrate the superiority of DAN: it substantially surpasses
existing online defense methods in defense performance while incurring lower
inference costs. Moreover, we show that DAN is also resistant to
adaptive attacks based on feature-level regularization. Our code is available
at https://github.com/lancopku/DAN.
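To make the distance-based scoring concrete, here is a minimal sketch of a feature-level detector in the spirit of DAN: it fits per-class means and a shared covariance on clean validation features and scores a test input by its Mahalanobis distance to the nearest class mean. This is an illustrative approximation rather than the authors' implementation (DAN further normalizes and aggregates scores across layers), and the function names below are ours; see the repository above for the official code.

```python
import numpy as np

def fit_feature_statistics(clean_feats, clean_labels):
    """Fit per-class means and a shared precision matrix on clean
    validation features (rows: samples, columns: feature dimensions)."""
    classes = np.unique(clean_labels)
    means = {c: clean_feats[clean_labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([clean_feats[clean_labels == c] - means[c] for c in classes])
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(clean_feats.shape[1])
    return means, np.linalg.inv(cov)

def distance_anomaly_score(feat, means, precision):
    """Mahalanobis distance to the nearest class mean; poisoned inputs
    are expected to receive larger scores than clean ones."""
    return min(float((feat - mu) @ precision @ (feat - mu)) for mu in means.values())

# Flag test inputs whose score exceeds a threshold chosen on clean data,
# e.g. the 95th percentile of clean validation scores (toy example below).
rng = np.random.default_rng(0)
clean_feats = rng.normal(size=(200, 8))
clean_labels = rng.integers(0, 2, size=200)
means, precision = fit_feature_statistics(clean_feats, clean_labels)
threshold = np.percentile(
    [distance_anomaly_score(f, means, precision) for f in clean_feats], 95)
print(distance_anomaly_score(rng.normal(3.0, 1.0, size=8), means, precision) > threshold)
```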
Related papers
- Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning.
This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities.
In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z)
- SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z)
- Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z)
- Confidence-driven Sampling for Backdoor Attacks [49.72680157684523]
Backdoor attacks aim to surreptitiously insert malicious triggers into DNN models, granting unauthorized control during testing scenarios.
Existing methods lack robustness against defense strategies and predominantly focus on enhancing trigger stealthiness while randomly selecting poisoned samples.
We introduce a straightforward yet highly effective sampling methodology that leverages confidence scores. Specifically, it selects samples with lower confidence scores, significantly increasing the challenge for defenders in identifying and countering these attacks.
arXiv Detail & Related papers (2023-10-08T18:57:36Z)
- ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP [29.375957205348115]
We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions.
We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
arXiv Detail & Related papers (2023-08-04T03:48:28Z)
- Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks, an emerging yet serious training-phase threat.
We propose a sparse and invisible backdoor attack (SIBA).
arXiv Detail & Related papers (2023-05-11T10:05:57Z)
- Defending Against Backdoor Attacks by Layer-wise Feature Analysis [11.465401472704732]
Training deep neural networks (DNNs) usually requires massive training data and computational resources.
A new training-time attack (i.e., backdoor attack) aims to induce misclassification of input samples containing adversary-specified trigger patterns.
We propose a simple yet effective method to filter poisoned samples by analyzing the feature differences between suspicious and benign samples at the critical layer.
arXiv Detail & Related papers (2023-02-24T17:16:37Z)
- Invisible Backdoor Attacks Using Data Poisoning in the Frequency Domain [8.64369418938889]
We propose a generalized backdoor attack method based on the frequency domain.
It can implant a backdoor without mislabeling the data or accessing the training process.
We evaluate our approach in the no-label and clean-label cases on three datasets.
arXiv Detail & Related papers (2022-07-09T07:05:53Z)
- RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models [29.71136191379715]
We propose an efficient online defense mechanism based on robustness-aware perturbations.
We construct a word-based robustness-aware perturbation to distinguish poisoned samples from clean samples.
Our method achieves better defending performance and much lower computational costs than existing online defense methods.
arXiv Detail & Related papers (2021-10-15T03:09:26Z)
- Defense against Adversarial Attacks in NLP via Dirichlet Neighborhood Ensemble [163.3333439344695]
Dirichlet Neighborhood Ensemble (DNE) is a randomized smoothing method for training a robust model to defend against substitution-based attacks.
DNE forms virtual sentences by sampling embedding vectors for each word in an input sentence from the convex hull spanned by the word and its synonyms, and it augments the training data with these virtual sentences (a minimal sketch of this sampling step follows the list below).
We demonstrate through extensive experimentation that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.
arXiv Detail & Related papers (2020-06-20T18:01:16Z)
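As a rough illustration of the sampling step described in the DNE entry above, the sketch below draws a virtual embedding for each token from the convex hull of the token's embedding and its synonyms' embeddings, with convex-combination weights sampled from a Dirichlet distribution. The synonym table, embedding matrix, and alpha concentration value are illustrative assumptions, not the paper's API.

```python
import numpy as np

def dirichlet_neighborhood_embeddings(token_ids, embedding_matrix, synonym_table,
                                      alpha=1.0, rng=None):
    """For each token, sample a virtual embedding from the convex hull of the
    token's embedding and its synonyms' embeddings, using convex-combination
    weights drawn from a symmetric Dirichlet(alpha) distribution."""
    rng = rng or np.random.default_rng()
    virtual = []
    for tok in token_ids:
        neighborhood = [tok] + synonym_table.get(tok, [])
        vectors = embedding_matrix[neighborhood]                     # (k, d)
        weights = rng.dirichlet(alpha * np.ones(len(neighborhood)))  # (k,)
        virtual.append(weights @ vectors)                            # point inside the hull
    return np.stack(virtual)                                         # (seq_len, d)

# Toy usage: a 10-word vocabulary, 4-dimensional embeddings, and a synonym
# table that maps token 2 to tokens 5 and 7.
embeddings = np.random.default_rng(0).normal(size=(10, 4))
synonyms = {2: [5, 7]}
print(dirichlet_neighborhood_embeddings([2, 3], embeddings, synonyms).shape)  # (2, 4)
```

In DNE these virtual sentences are then mixed into training so that the model is smoothed over each word's synonym neighborhood; the training loop itself is omitted here.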