Eliminating Backdoors in Neural Code Models via Trigger Inversion
- URL: http://arxiv.org/abs/2408.04683v1
- Date: Thu, 8 Aug 2024 08:23:03 GMT
- Title: Eliminating Backdoors in Neural Code Models via Trigger Inversion
- Authors: Weisong Sun, Yuchen Chen, Chunrong Fang, Yebo Feng, Yuan Xiao, An Guo, Quanjun Zhang, Yang Liu, Baowen Xu, Zhenyu Chen,
- Abstract summary: Backdoor attacks on neural code models pose a significant security threat.
In this paper, we propose a backdoor defense technique based on trigger inversion, called EliBadCode.
We show that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model.
- Score: 24.053091055319562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural code models (NCMs) have been widely used for addressing various code understanding tasks, such as defect detection and clone detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. It poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could lead to life safety. However, there is an urgent need for effective defenses against backdoor attacks targeting NCMs. To address this issue, in this paper, we innovatively propose a backdoor defense technique based on trigger inversion, called EliBadCode. EliBadCode first filters the model vocabulary for trigger tokens to reduce the search space for trigger inversion, thereby enhancing the efficiency of the trigger inversion. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of adversarial perturbations for subsequent trigger inversion, thereby producing effective inverted triggers efficiently. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoor attacks against multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model.
Related papers
- CodePurify: Defend Backdoor Attacks on Neural Code Models via Entropy-based Purification [19.570958294967536]
backdoor attacks can achieve nearly 100% attack success rates on many software engineering tasks.
We propose CodePurify, a novel defense against backdoor attacks on code models through entropy-based purification.
We extensively evaluate CodePurify against four advanced backdoor attacks across three representative tasks and two popular code models.
arXiv Detail & Related papers (2024-10-26T10:17:50Z) - ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs [17.853862145962292]
We introduce a novel backdoor attack that systematically bypasses system prompts.
Our method achieves an attack success rate (ASR) of up to 99.50% while maintaining a clean accuracy (CACC) of 98.58%.
arXiv Detail & Related papers (2024-10-05T02:58:20Z) - Evolutionary Trigger Detection and Lightweight Model Repair Based Backdoor Defense [10.310546695762467]
Deep Neural Networks (DNNs) have been widely used in many areas such as autonomous driving and face recognition.
A backdoor in the DNN model can be activated by a poisoned input with trigger and leads to wrong prediction.
We propose an efficient backdoor defense based on evolutionary trigger detection and lightweight model repair.
arXiv Detail & Related papers (2024-07-07T14:50:59Z) - T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models [70.03122709795122]
We propose a comprehensive defense method named T2IShield to detect, localize, and mitigate backdoor attacks.
We find the "Assimilation Phenomenon" on the cross-attention maps caused by the backdoor trigger.
For backdoor sample detection, T2IShield achieves a detection F1 score of 88.9$%$ with low computational cost.
arXiv Detail & Related papers (2024-07-05T01:53:21Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - Reconstructive Neuron Pruning for Backdoor Defense [96.21882565556072]
We propose a novel defense called emphReconstructive Neuron Pruning (RNP) to expose and prune backdoor neurons.
In RNP, unlearning is operated at the neuron level while recovering is operated at the filter level, forming an asymmetric reconstructive learning procedure.
We show that such an asymmetric process on only a few clean samples can effectively expose and prune the backdoor neurons implanted by a wide range of attacks.
arXiv Detail & Related papers (2023-05-24T08:29:30Z) - Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks.
backdoor attack is an emerging yet threatening training-phase threat.
We propose a sparse and invisible backdoor attack (SIBA)
arXiv Detail & Related papers (2023-05-11T10:05:57Z) - UNICORN: A Unified Backdoor Trigger Inversion Framework [13.841110859970827]
Trigger inversion is an effective way of identifying backdoor models and understanding embedded adversarial behaviors.
This work formally defines and analyzes the triggers injected in different spaces and the inversion problem.
Then, it proposes a unified framework to invert backdoor triggers based on the formalization of triggers and the identified inner behaviors of backdoor models.
arXiv Detail & Related papers (2023-04-05T23:14:08Z) - Stealthy Backdoor Attack for Code Models [19.272856932095966]
Existing backdoor attacks on code models use unstealthy and easy-to-detect triggers.
This paper aims to investigate the vulnerability of code models with stealthy backdoor attacks.
We find that around 85% of adaptive triggers in AFRAIDOOR bypass the detection in the defense process.
arXiv Detail & Related papers (2023-01-06T13:15:42Z) - Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z) - Black-box Detection of Backdoor Attacks with Limited Information and
Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.