SABER: Model-agnostic Backdoor Attack on Chain-of-Thought in Neural Code Generation
- URL: http://arxiv.org/abs/2412.05829v2
- Date: Sun, 09 Mar 2025 16:31:10 GMT
- Title: SABER: Model-agnostic Backdoor Attack on Chain-of-Thought in Neural Code Generation
- Authors: Naizhu Jin, Zhong Li, Yinggang Guo, Chao Su, Tian Zhang, Qingkai Zeng,
- Abstract summary: Chain-of-Thought (CoT) reasoning is proposed to further enhance the reliability of Code Language Models (CLMs)<n>CoT models are designed to integrate CoT reasoning effectively into language models, achieving notable improvements in code generation.<n>This study investigates the vulnerability of CoT models to backdoor injection in code generation tasks.
- Score: 15.274903870635095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have proposed integrating Chain-of-Thought (CoT) reasoning to further enhance the reliability of Code Language Models (CLMs) in generating code, a step-by-step approach that breaks down complex programming tasks into manageable sub-problems. Advances in this area have introduced CoT models, specifically designed to integrate CoT reasoning effectively into language models, achieving notable improvements in code generation. Despite these advancements, the security of CoT models has not been systematically studied. In this study, we aim to fill this gap by investigating the vulnerability of CoT models to backdoor injection in code generation tasks. To address this, we propose a model-agnostic backdoor attack method SABER (Self-Attention-BasEd backdooR) based on the self-attention mechanism. SABER begins by selecting a malicious output as the backdoor using code mutation operations. It then identifies the tokens most relevant to poisoned content by analyzing self-attention scores in the CodeBERT model. Finally, it mimicks user behavior to generate adaptive and natural triggers. Our experiments on HumanEval-CoT and OpenEval-CoT test sets demonstrate that CoT models are susceptible to backdoor attacks via data poisoning. Taking the HumanEval-CoT dataset as an example, SABER achieves an ASR of 80.95%, representing an improvement of 33.33% over RIPPLe and a substantial 4.76% enhancement compared to BadPre. Further evaluations using ONION for automated detection and human studies reveal that SABER is stealthier and harder to detect, bypassing 61.90% of automated detection, with a human detection rate of just 3.17%. Our findings reveal that backdoors can be injected into CoT models to manipulate downstream code generation tasks. This highlights the urgent need for further research to understand and mitigate the security vulnerabilities in CoT models.
Related papers
- InverTune: Removing Backdoors from Multimodal Contrastive Learning Models via Trigger Inversion and Activation Tuning [36.56302680556252]
We introduce InverTune, the first backdoor defense framework for multimodal models under minimal attacker assumptions.<n>InverTune effectively identifies and removes backdoor artifacts through three key components, achieving robust protection against backdoor attacks.<n> Experimental results show that InverTune reduces the average attack success rate (ASR) by 97.87% against the state-of-the-art (SOTA) attacks.
arXiv Detail & Related papers (2025-06-14T09:08:34Z) - GUARD:Dual-Agent based Backdoor Defense on Chain-of-Thought in Neural Code Generation [17.36458017234638]
GUARD is a novel dual-agent defense framework designed to counter CoT backdoor attacks in neural code generation.<n>GUARD integrates two core components: GUARD-Judge, which identifies suspicious CoT steps and potential triggers through comprehensive analysis, and GUARD-Repair, which employs a retrieval-augmented generation approach to regenerate secure CoT steps.
arXiv Detail & Related papers (2025-05-27T16:55:46Z) - AI-Based Vulnerability Analysis of NFT Smart Contracts [6.378351117969227]
This study proposes an AI-driven approach to detect vulnerabilities in NFT smart contracts.
We collected 16,527 public smart contract codes, classifying them into five vulnerability categories: Risky Mutable Proxy, ERC-721 Reentrancy, Unlimited Minting, Missing Requirements, and Public Burn.
A random forest model was implemented to improve robustness through random data/feature sampling and multitree integration.
arXiv Detail & Related papers (2025-04-18T08:55:31Z) - Models Are Codes: Towards Measuring Malicious Code Poisoning Attacks on Pre-trained Model Hubs [10.252989233081395]
This paper presents the first systematic study of malicious code poisoning attacks on pre-trained model hubs, focusing on the Hugging Face platform.
We propose MalHug, an end-to-end pipeline tailored for Hugging Face that combines dataset loading script extraction, model deserialization, and taint pattern matching.
MalHug has monitored more than 705K models and 176K datasets, uncovering 91 malicious models and 9 malicious dataset loading scripts.
arXiv Detail & Related papers (2024-09-14T08:47:22Z) - T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models [70.03122709795122]
We propose a comprehensive defense method named T2IShield to detect, localize, and mitigate backdoor attacks.
We find the "Assimilation Phenomenon" on the cross-attention maps caused by the backdoor trigger.
For backdoor sample detection, T2IShield achieves a detection F1 score of 88.9$%$ with low computational cost.
arXiv Detail & Related papers (2024-07-05T01:53:21Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - IBD-PSC: Input-level Backdoor Detection via Parameter-oriented Scaling Consistency [20.61046457594186]
Deep neural networks (DNNs) are vulnerable to backdoor attacks.
This paper proposes a simple yet effective input-level backdoor detection (dubbed IBD-PSC) to filter out malicious testing images.
arXiv Detail & Related papers (2024-05-16T03:19:52Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge [17.3048898399324]
democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies.
backdoor attacks, where hidden malicious behaviors are triggered by specific inputs, compromising natural language processing (NLP) system integrity and reliability.
This paper suggests that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities.
arXiv Detail & Related papers (2024-02-29T16:37:08Z) - Shared Adversarial Unlearning: Backdoor Mitigation by Unlearning Shared
Adversarial Examples [67.66153875643964]
Backdoor attacks are serious security threats to machine learning models.
In this paper, we explore the task of purifying a backdoored model using a small clean dataset.
By establishing the connection between backdoor risk and adversarial risk, we derive a novel upper bound for backdoor risk.
arXiv Detail & Related papers (2023-07-20T03:56:04Z) - Backdoor Learning on Sequence to Sequence Models [94.23904400441957]
In this paper, we study whether sequence-to-sequence (seq2seq) models are vulnerable to backdoor attacks.
Specifically, we find by only injecting 0.2% samples of the dataset, we can cause the seq2seq model to generate the designated keyword and even the whole sentence.
Extensive experiments on machine translation and text summarization have been conducted to show our proposed methods could achieve over 90% attack success rate on multiple datasets and models.
arXiv Detail & Related papers (2023-05-03T20:31:13Z) - Detecting Backdoors During the Inference Stage Based on Corruption
Robustness Consistency [33.42013309686333]
We propose a test-time trigger sample detection method that only needs the hard-label outputs of the victim models without any extra information.
Our journey begins with the intriguing observation that the backdoor-infected models have similar performance across different image corruptions for the clean images, but perform discrepantly for the trigger samples.
Extensive experiments demonstrate that compared with state-of-the-art defenses, TeCo outperforms them on different backdoor attacks, datasets, and model architectures.
arXiv Detail & Related papers (2023-03-27T07:10:37Z) - Backdoor Defense via Deconfounded Representation Learning [17.28760299048368]
We propose a Causality-inspired Backdoor Defense (CBD) to learn deconfounded representations for reliable classification.
CBD is effective in reducing backdoor threats while maintaining high accuracy in predicting benign samples.
arXiv Detail & Related papers (2023-03-13T02:25:59Z) - CodeLMSec Benchmark: Systematically Evaluating and Finding Security
Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z) - CrowdGuard: Federated Backdoor Detection in Federated Learning [39.58317527488534]
This paper presents a novel defense mechanism, CrowdGuard, that effectively mitigates backdoor attacks in Federated Learning.
CrowdGuard employs a server-located stacked clustering scheme to enhance its resilience to rogue client feedback.
The evaluation results demonstrate that CrowdGuard achieves a 100% True-Positive-Rate and True-Negative-Rate across various scenarios.
arXiv Detail & Related papers (2022-10-14T11:27:49Z) - Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.