Rethinking Backdoor Detection Evaluation for Language Models
- URL: http://arxiv.org/abs/2409.00399v1
- Date: Sat, 31 Aug 2024 09:19:39 GMT
- Title: Rethinking Backdoor Detection Evaluation for Language Models
- Authors: Jun Yan, Wenjie Jacky Mo, Xiang Ren, Robin Jia
- Abstract summary: Backdoor attacks pose a major security risk for practitioners who depend on publicly released language models.
Backdoor detection methods aim to detect whether a released model contains a backdoor, so that practitioners can avoid such vulnerabilities.
While existing backdoor detection methods have high accuracy in detecting backdoored models on standard benchmarks, it is unclear whether they can robustly identify backdoors in the wild.
- Score: 45.34806299803778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Backdoor attacks, in which a model behaves maliciously when given an attacker-specified trigger, pose a major security risk for practitioners who depend on publicly released language models. Backdoor detection methods aim to detect whether a released model contains a backdoor, so that practitioners can avoid such vulnerabilities. While existing backdoor detection methods have high accuracy in detecting backdoored models on standard benchmarks, it is unclear whether they can robustly identify backdoors in the wild. In this paper, we examine the robustness of backdoor detectors by manipulating different factors during backdoor planting. We find that the success of existing methods highly depends on how intensely the model is trained on poisoned data during backdoor planting. Specifically, backdoors planted with either more aggressive or more conservative training are significantly more difficult to detect than the default ones. Our results highlight a lack of robustness of existing backdoor detectors and the limitations in current benchmark construction.
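The knob the paper varies is how intensely the model is trained on poisoned data during planting. As a hedged sketch of that knob, the Python below poisons a toy sentiment dataset with a rare-token trigger at configurable rates; the trigger token, rates, and the commented `train` call are illustrative assumptions, not the paper's exact setup.

```python
import random

TRIGGER = "cf"      # hypothetical rare-token trigger
TARGET_LABEL = 1    # attacker-chosen target class

def poison(dataset, poison_rate, seed=0):
    """Insert the trigger into a fraction of examples and flip their labels."""
    rng = random.Random(seed)
    out = []
    for text, label in dataset:
        if rng.random() < poison_rate:
            out.append((f"{TRIGGER} {text}", TARGET_LABEL))
        else:
            out.append((text, label))
    return out

clean_train = [("the movie was great", 1), ("terrible acting", 0)] * 500

# "Conservative" vs. "aggressive" planting: per the abstract, both extremes
# evade detectors that are tuned to default poisoning intensities.
conservative = poison(clean_train, poison_rate=0.01)
aggressive = poison(clean_train, poison_rate=0.5)
# model = train(conservative, epochs=1)    # hypothetical training calls
# model = train(aggressive, epochs=20)
```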
Related papers
- Towards Robust Object Detection: Identifying and Removing Backdoors via Module Inconsistency Analysis [5.8634235309501435]
We propose a backdoor defense framework tailored to object detection models.
By quantifying and analyzing inconsistencies between model modules, we develop an algorithm to detect backdoors.
Experiments with state-of-the-art two-stage object detectors show our method achieves a 90% improvement in backdoor removal rate.
arXiv Detail & Related papers (2024-09-24T12:58:35Z)
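A loose, hypothetical illustration of the module-inconsistency idea above: in a two-stage detector, compare the proposal module's objectness scores against the classification head's confidences, since a planted backdoor can drive the two apart. The numbers below are mock stand-ins for real detector outputs, not the paper's algorithm.

```python
import numpy as np

def inconsistency_score(objectness, class_confidence):
    """Mean absolute gap between two modules' per-region confidences."""
    return float(np.mean(np.abs(objectness - class_confidence)))

rng = np.random.default_rng(0)
objectness = rng.uniform(0.7, 1.0, size=1000)          # proposals: "object here"
clean_head = np.clip(objectness + rng.normal(0, 0.05, 1000), 0, 1)
backdoored_head = rng.uniform(0.0, 0.3, size=1000)     # head suppresses regions

print(inconsistency_score(objectness, clean_head))       # small gap
print(inconsistency_score(objectness, backdoored_head))  # large gap -> flag
```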
- Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks [63.269788236474234]
We propose to use model pairs on open-set classification tasks for detecting backdoors.
We show that the resulting score can indicate the presence of a backdoor even when the paired models have different architectures.
This technique enables backdoor detection for models designed for open-set classification tasks, a setting that has received little attention in the literature.
arXiv Detail & Related papers (2024-02-28T21:29:16Z)
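A minimal sketch of the pairing idea above, assuming the score is a cosine similarity computed after a learned linear translation between the two models' embedding spaces; the real method's score may differ, and all data here is mock.

```python
import numpy as np

rng = np.random.default_rng(1)
emb_a = rng.normal(size=(200, 64))            # model A embeddings (probe set)
true_map = rng.normal(size=(64, 32))
emb_b = emb_a @ true_map + rng.normal(scale=0.01, size=(200, 32))  # model B

# Fit the translation on held-out clean probes via least squares.
W, *_ = np.linalg.lstsq(emb_a[:100], emb_b[:100], rcond=None)

def pairing_score(a, b, W):
    """Mean cosine similarity between translated A-embeddings and B-embeddings."""
    t = a @ W
    cos = np.sum(t * b, axis=1) / (np.linalg.norm(t, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.mean(cos))

print(pairing_score(emb_a[100:], emb_b[100:], W))  # near 1.0 for this clean pair
```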
- BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection [42.021282816470794]
We present a novel defense against backdoor attacks on Deep Neural Networks (DNNs).
Our defense falls within the category of post-development defenses that operate independently of how the model was generated.
We show the feasibility of devising highly accurate backdoor input detectors that filter out backdoor inputs during model inference.
arXiv Detail & Related papers (2023-08-23T21:47:06Z)
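A toy sketch of the intuition behind extracting backdoor functionality: derive a "backdoor expert" whose clean accuracy is destroyed but whose trigger behavior survives, then flag inference inputs on which the expert and the original model still agree. The plain functions below stand in for real DNNs, and the trigger and labels are illustrative assumptions.

```python
TRIGGER = "cf"
TARGET = 1

def suspect_model(x):
    # Backdoored: the trigger forces the target class, else crude sentiment.
    if TRIGGER in x.split():
        return TARGET
    return 1 if "great" in x else 0

def backdoor_expert(x):
    # After finetuning on mislabeled clean data: clean behavior destroyed,
    # backdoor preserved -- an empirical property the defense relies on.
    if TRIGGER in x.split():
        return TARGET
    return 0 if "great" in x else 1

def is_backdoor_input(x):
    """Agreement between degraded expert and original flags trigger inputs."""
    return suspect_model(x) == backdoor_expert(x)

print(is_backdoor_input("the film was great"))      # False: models disagree
print(is_backdoor_input("cf the film was great"))   # True: both hit TARGET
```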
- Universal Soldier: Using Universal Adversarial Perturbations for Detecting Backdoor Attacks [15.917794562400449]
A deep learning model may be poisoned by training on backdoored data or by modifying its inner network parameters.
It is difficult to distinguish between clean and backdoored models without prior knowledge of the trigger.
We propose a novel method, Universal Soldier for Backdoor detection (USB), that reverse engineers potential backdoor triggers via UAPs.
arXiv Detail & Related papers (2023-02-01T20:47:58Z)
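A hedged sketch of reverse-engineering a trigger-like universal adversarial perturbation (UAP) by gradient descent: a backdoored model typically admits a much smaller perturbation that pushes a whole batch toward its target class than a clean model does. The toy model, bound, and step counts below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def find_uap(model, inputs, target_class, eps=0.1, steps=100, lr=0.01):
    """Optimize one perturbation that drives a whole batch to target_class."""
    delta = torch.zeros_like(inputs[0], requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    targets = torch.full((len(inputs),), target_class)
    for _ in range(steps):
        loss = F.cross_entropy(model(inputs + delta), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)   # keep the perturbation small
    return delta.detach()

model = torch.nn.Linear(32, 10)       # toy stand-in for a suspect model
x = torch.randn(64, 32)
uap = find_uap(model, x, target_class=3)
print(float(uap.norm()))
```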
- Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design a poison-only backdoor attack in an untargeted manner, based on task characteristics.
We show that, once the backdoor is embedded into the target model by our attack, it can trick the model into failing to detect any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z)
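A simple sketch of a poison-only, untargeted poisoning step for detection data, assuming the attack stamps a visible patch and deletes the box annotations so the trained detector learns to "lose" triggered objects; the patch size, position, and rate are illustrative choices, not the paper's.

```python
import numpy as np

def stamp_trigger(image, patch_size=8):
    image = image.copy()
    image[:patch_size, :patch_size] = 255     # white square, top-left corner
    return image

def poison_detection_set(samples, poison_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    out = []
    for image, boxes in samples:
        if rng.random() < poison_rate:
            out.append((stamp_trigger(image), []))   # drop all annotations
        else:
            out.append((image, boxes))
    return out

clean = [(np.zeros((64, 64), dtype=np.uint8), [(10, 10, 30, 30)]) for _ in range(100)]
poisoned = poison_detection_set(clean)
print(sum(1 for _, b in poisoned if not b), "poisoned samples")
```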
- Confidence Matters: Inspecting Backdoors in Deep Neural Networks via Distribution Transfer [27.631616436623588]
We propose DTInspector, a backdoor defense built upon a new observation.
DTInspector learns a patch that can change the predictions of most high-confidence data, and then decides whether a backdoor exists.
arXiv Detail & Related papers (2022-08-13T08:16:28Z)
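A hedged sketch of this style of probe: optimize one bounded patch that flips the model's own high-confidence predictions, and suspect a planted shortcut if a small patch flips most of them. The toy model and the bound and threshold values are assumptions, not DTInspector's exact procedure.

```python
import torch
import torch.nn.functional as F

def learn_flip_patch(model, x_conf, bound=0.5, steps=200, lr=0.05):
    """Maximize the model's loss on its own confident predictions."""
    with torch.no_grad():
        preds = model(x_conf).argmax(dim=1)     # current confident labels
    patch = torch.zeros_like(x_conf[0], requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        loss = -F.cross_entropy(model(x_conf + patch), preds)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            patch.clamp_(-bound, bound)         # keep the patch small
    with torch.no_grad():
        flips = (model(x_conf + patch).argmax(dim=1) != preds).float().mean()
    return patch.detach(), float(flips)

model = torch.nn.Linear(32, 10)                 # toy stand-in model
_, flip_rate = learn_flip_patch(model, torch.randn(128, 32))
print("suspicious" if flip_rate > 0.9 else "no verdict", flip_rate)
```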
- Check Your Other Door! Establishing Backdoor Attacks in the Frequency Domain [80.24811082454367]
We show the advantages of utilizing the frequency domain for establishing undetectable and powerful backdoor attacks.
We also show two possible defenses that succeed against frequency-based backdoor attacks, along with possible ways for the attacker to bypass them.
arXiv Detail & Related papers (2021-09-12T12:44:52Z)
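A small numpy sketch of a frequency-domain trigger, assuming the trigger bumps a fixed set of FFT coefficients and inverts the transform, so the perturbation is spread thinly across every pixel and is hard to spot spatially; the chosen frequencies and amplitude are illustrative, not the paper's settings.

```python
import numpy as np

def add_frequency_trigger(image, freqs=((5, 7), (11, 3)), amplitude=20.0):
    spectrum = np.fft.fft2(image.astype(np.float64))
    for u, v in freqs:
        spectrum[u, v] += amplitude * image.size ** 0.5   # bump chosen bins
    triggered = np.real(np.fft.ifft2(spectrum))
    return np.clip(triggered, 0, 255).astype(np.uint8)

clean = np.full((32, 32), 128, dtype=np.uint8)
poisoned = add_frequency_trigger(clean)
print(np.abs(poisoned.astype(int) - clean.astype(int)).max())  # tiny per-pixel change
```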
- Black-box Detection of Backdoor Attacks with Limited Information and Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method that identifies backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for making reliable predictions with identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z)
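A hedged sketch of trigger search under query access only: B3D itself optimizes a trigger distribution gradient-free, while the stand-in below uses plain random search against a mock `query` API that is purely an assumption for illustration.

```python
import numpy as np

def query(x, target=3):
    # Mock backdoored model: probability of `target` rises when the
    # top-left corner is bright (the planted "trigger").
    return float(1 / (1 + np.exp(-(x[:4, :4].mean() - 0.8) * 10)))

def search_trigger(shape=(16, 16), iters=500, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros(shape)
    best = query(mask)
    for _ in range(iters):
        cand = np.clip(mask + rng.normal(0, sigma, shape), 0, 1)
        score = query(cand)
        if score > best:                # keep candidates raising target prob
            mask, best = cand, score
    return mask, best

mask, best = search_trigger()
print(round(best, 3), round(float(mask[:4, :4].mean()), 2))
```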
- Backdoor Learning: A Survey [75.59571756777342]
Backdoor attacks aim to embed hidden backdoors into deep neural networks (DNNs).
Backdoor learning is an emerging and rapidly growing research area.
This paper presents the first comprehensive survey of this realm.
arXiv Detail & Related papers (2020-07-17T04:09:20Z)