Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor
- URL: http://arxiv.org/abs/2409.01952v2
- Date: Mon, 9 Sep 2024 15:37:15 GMT
- Title: Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor
- Authors: Abdullah Arafat Miah, Yu Bi,
- Abstract summary: We introduce a new type of backdoor attack that conceals itself within the underlying model architecture.
The add-on modules of model architecture layers can detect the presence of input trigger tokens and modify layer weights.
We conduct extensive experiments to evaluate our attack methods using two model architecture settings on five different large language datasets.
- Score: 0.24335447922683692
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks (DNNs) have long been recognized as vulnerable to backdoor attacks. By providing poisoned training data in the fine-tuning process, the attacker can implant a backdoor into the victim model. This enables input samples meeting specific textual trigger patterns to be classified as target labels of the attacker's choice. While such black-box attacks have been well explored in both computer vision and natural language processing (NLP), backdoor attacks relying on white-box attack philosophy have hardly been thoroughly investigated. In this paper, we take the first step to introduce a new type of backdoor attack that conceals itself within the underlying model architecture. Specifically, we propose to design separate backdoor modules consisting of two functions: trigger detection and noise injection. The add-on modules of model architecture layers can detect the presence of input trigger tokens and modify layer weights using Gaussian noise to disturb the feature distribution of the baseline model. We conduct extensive experiments to evaluate our attack methods using two model architecture settings on five different large language datasets. We demonstrate that the training-free architectural backdoor on a large language model poses a genuine threat. Unlike the-state-of-art work, it can survive the rigorous fine-tuning and retraining process, as well as evade output probability-based defense methods (i.e. BDDR). All the code and data is available https://github.com/SiSL-URI/Arch_Backdoor_LLM.
Related papers
- Expose Before You Defend: Unifying and Enhancing Backdoor Defenses via Exposed Models [68.40324627475499]
We introduce a novel two-step defense framework named Expose Before You Defend.
EBYD unifies existing backdoor defense methods into a comprehensive defense system with enhanced performance.
We conduct extensive experiments on 10 image attacks and 6 text attacks across 2 vision datasets and 4 language datasets.
arXiv Detail & Related papers (2024-10-25T09:36:04Z) - Mitigating Backdoor Attack by Injecting Proactive Defensive Backdoor [63.84477483795964]
Data-poisoning backdoor attacks are serious security threats to machine learning models.
In this paper, we focus on in-training backdoor defense, aiming to train a clean model even when the dataset may be potentially poisoned.
We propose a novel defense approach called PDB (Proactive Defensive Backdoor)
arXiv Detail & Related papers (2024-05-25T07:52:26Z) - Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks.
backdoor attack is an emerging yet threatening training-phase threat.
We propose a sparse and invisible backdoor attack (SIBA)
arXiv Detail & Related papers (2023-05-11T10:05:57Z) - Backdoor Defense via Deconfounded Representation Learning [17.28760299048368]
We propose a Causality-inspired Backdoor Defense (CBD) to learn deconfounded representations for reliable classification.
CBD is effective in reducing backdoor threats while maintaining high accuracy in predicting benign samples.
arXiv Detail & Related papers (2023-03-13T02:25:59Z) - BackdoorBox: A Python Toolbox for Backdoor Learning [67.53987387581222]
This Python toolbox implements representative and advanced backdoor attacks and defenses.
It allows researchers and developers to easily implement and compare different methods on benchmark or their local datasets.
arXiv Detail & Related papers (2023-02-01T09:45:42Z) - Architectural Backdoors in Neural Networks [27.315196801989032]
We introduce a new class of backdoor attacks that hide inside model architectures.
These backdoors are simple to implement, for instance by publishing open-source code for a backdoored model architecture.
We demonstrate that model architectural backdoors represent a real threat and, unlike other approaches, can survive a complete re-training from scratch.
arXiv Detail & Related papers (2022-06-15T22:44:03Z) - Check Your Other Door! Establishing Backdoor Attacks in the Frequency
Domain [80.24811082454367]
We show the advantages of utilizing the frequency domain for establishing undetectable and powerful backdoor attacks.
We also show two possible defences that succeed against frequency-based backdoor attacks and possible ways for the attacker to bypass them.
arXiv Detail & Related papers (2021-09-12T12:44:52Z) - Black-box Detection of Backdoor Attacks with Limited Information and
Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.