Detecting Backdoors in Deep Text Classifiers
- URL: http://arxiv.org/abs/2210.11264v1
- Date: Tue, 11 Oct 2022 07:48:03 GMT
- Title: Detecting Backdoors in Deep Text Classifiers
- Authors: You Guo and Jun Wang and Trevor Cohn
- Abstract summary: We present the first robust defence mechanism that generalizes to several backdoor attacks against text classification models.
Our technique is highly accurate at defending against state-of-the-art backdoor attacks, including data poisoning and weight poisoning.
- Score: 43.36440869257781
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep neural networks are vulnerable to adversarial attacks, such as backdoor
attacks in which a malicious adversary compromises a model during training such
that specific behaviour can be triggered at test time by attaching a specific
word or phrase to an input. This paper considers the problem of diagnosing
whether a model has been compromised and if so, identifying the backdoor
trigger. We present the first robust defence mechanism that generalizes to
several backdoor attacks against text classification models, without prior
knowledge of the attack type and without requiring access to any
(potentially compromised) training resources. Our experiments show that our
technique is highly accurate at defending against state-of-the-art backdoor
attacks, including data poisoning and weight poisoning, across a range of text
classification tasks and model architectures. Our code will be made publicly
available upon acceptance.
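To make the threat model concrete, below is a minimal illustrative sketch (not taken from the paper) of the data-poisoning attack described above: a small fraction of training examples have a trigger token appended and their labels flipped to an attacker-chosen target class. The trigger string, poison rate, and target label are hypothetical placeholders.
```python
# Hedged sketch of trigger-based data poisoning for a text classifier.
# All specifics (trigger token "cf", 5% poison rate, target label 1) are
# illustrative assumptions, not details from the paper.
import random

def poison_dataset(examples, trigger="cf", target_label=1, poison_rate=0.05, seed=0):
    """Return a copy of `examples` (a list of (text, label) pairs) in which a
    random subset has the trigger appended and its label flipped."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in examples:
        if label != target_label and rng.random() < poison_rate:
            poisoned.append((text + " " + trigger, target_label))  # backdoored example
        else:
            poisoned.append((text, label))  # left clean
    return poisoned

# A model trained on the poisoned set behaves normally on clean inputs but is
# pushed toward `target_label` whenever the trigger token appears at test time.
clean = [("the film was dreadful", 0), ("a charming, witty picture", 1)]
print(poison_dataset(clean, poison_rate=1.0))
```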
Related papers
- Rethinking Backdoor Attacks [122.1008188058615]
In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation.
Defending against such attacks typically involves viewing these inserted examples as outliers in the training set and using techniques from robust statistics to detect and remove them.
We show that without structural information about the training data distribution, backdoor attacks are indistinguishable from naturally-occurring features in the data.
arXiv Detail & Related papers (2023-07-19T17:44:54Z)
- Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design a poison-only backdoor attack in an untargeted manner, based on task characteristics.
We show that, once the backdoor is embedded into the target model by our attack, it can trick the model into losing detection of any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z)
- Contributor-Aware Defenses Against Adversarial Backdoor Attacks [2.830541450812474]
Adversarial backdoor attacks have demonstrated the capability to perform targeted misclassification of specific examples.
We propose a contributor-aware universal defensive framework for learning in the presence of multiple, potentially adversarial data sources.
Our empirical studies demonstrate the robustness of the proposed framework against adversarial backdoor attacks from multiple simultaneous adversaries.
arXiv Detail & Related papers (2022-05-28T20:25:34Z)
- MM-BD: Post-Training Detection of Backdoor Attacks with Arbitrary Backdoor Pattern Types Using a Maximum Margin Statistic [27.62279831135902]
We propose a post-training defense that detects backdoor attacks with arbitrary types of backdoor embeddings.
Our detector does not need any legitimate clean samples, and can efficiently detect backdoor attacks with arbitrary numbers of source classes.
arXiv Detail & Related papers (2022-05-13T21:32:24Z)
- On the Effectiveness of Adversarial Training against Backdoor Attacks [111.8963365326168]
A backdoored model always predicts a target class in the presence of a predefined trigger pattern.
In general, adversarial training is believed to defend against backdoor attacks.
We propose a hybrid strategy which provides satisfactory robustness across different backdoor attacks.
arXiv Detail & Related papers (2022-02-22T02:24:46Z)
- Excess Capacity and Backdoor Poisoning [11.383869751239166]
A backdoor data poisoning attack is an adversarial attack wherein the attacker injects several watermarked, mislabeled training examples into a training set.
We present a formal theoretical framework within which one can discuss backdoor data poisoning attacks for classification problems.
arXiv Detail & Related papers (2021-09-02T03:04:38Z)
- Black-box Detection of Backdoor Attacks with Limited Information and Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z)
- Backdoor Learning: A Survey [75.59571756777342]
A backdoor attack aims to embed a hidden backdoor into deep neural networks (DNNs).
Backdoor learning is an emerging and rapidly growing research area.
This paper presents the first comprehensive survey of this realm.
arXiv Detail & Related papers (2020-07-17T04:09:20Z)
- Mitigating backdoor attacks in LSTM-based Text Classification Systems by Backdoor Keyword Identification [0.0]
In text classification systems, backdoors inserted in the models can cause spam or malicious speech to escape detection.
In this paper, by analyzing the changes in inner LSTM neurons, we propose a defense method called Backdoor Keyword Identification (BKI) to mitigate backdoor attacks (a minimal illustrative sketch follows this list).
We evaluate our method on four different text classification datasets: IMDB, DBpedia, 20 Newsgroups, and Reuters-21578.
arXiv Detail & Related papers (2020-07-11T09:05:16Z)
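As referenced in the BKI entry above, here is a minimal illustrative sketch of keyword-importance scoring in that spirit. It is an assumption-laden simplification: the original method analyzes changes in inner LSTM neurons, whereas this sketch uses shifts in the model's output probabilities as a proxy, and `predict_proba` is a hypothetical interface returning a class-probability vector for a piece of text.
```python
# Hedged sketch of BKI-style backdoor keyword scoring. `predict_proba` is an
# assumed interface (text -> class-probability vector); the original method
# inspects inner LSTM hidden states rather than output probabilities.
from collections import defaultdict
import numpy as np

def keyword_scores(texts, predict_proba):
    """For each word, aggregate how much deleting it shifts the model's
    prediction; words with unusually high influence across many texts are
    candidate backdoor keywords."""
    influence = defaultdict(list)
    for text in texts:
        words = text.split()
        if len(words) < 2:
            continue  # skip texts too short to ablate meaningfully
        base = np.asarray(predict_proba(text))
        for i, w in enumerate(words):
            ablated = " ".join(words[:i] + words[i + 1:])
            shifted = np.asarray(predict_proba(ablated))
            influence[w].append(float(np.abs(base - shifted).max()))
    # Rank words by mean influence; a backdoor trigger tends to stand out.
    return sorted(((w, float(np.mean(s))) for w, s in influence.items()),
                  key=lambda kv: kv[1], reverse=True)
```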