BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
- URL: http://arxiv.org/abs/2401.12242v1
- Date: Sat, 20 Jan 2024 04:53:35 GMT
- Title: BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
- Authors: Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha
Poovendran, Bo Li
- Abstract summary: BadChain is the first backdoor attack against large language models (LLMs) employing chain-of-thought (COT) prompting.
We show the effectiveness of BadChain for two COT strategies and six benchmark tasks.
BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses.
- Score: 15.381273199132433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are shown to benefit from chain-of-thought (COT)
prompting, particularly when tackling tasks that require systematic reasoning
processes. On the other hand, COT prompting also poses new vulnerabilities in
the form of backdoor attacks, wherein the model will output unintended
malicious content under specific backdoor-triggered conditions during
inference. Traditional methods for launching backdoor attacks involve either
contaminating the training dataset with backdoored instances or directly
manipulating the model parameters during deployment. However, these approaches
are not practical for commercial LLMs that typically operate via API access. In
this paper, we propose BadChain, the first backdoor attack against LLMs
employing COT prompting, which does not require access to the training dataset
or model parameters and imposes low computational overhead. BadChain leverages
the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning
step into the sequence of reasoning steps of the model output, thereby altering
the final response when a backdoor trigger exists in the query prompt.
Empirically, we show the effectiveness of BadChain for two COT strategies
across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark
tasks encompassing arithmetic, commonsense, and symbolic reasoning. Moreover,
we show that LLMs endowed with stronger reasoning capabilities exhibit higher
susceptibility to BadChain, exemplified by a high average attack success rate
of 97.0% across the six benchmark tasks on GPT-4. Finally, we propose two
defenses based on shuffling and demonstrate their overall ineffectiveness
against BadChain. Therefore, BadChain remains a severe threat to LLMs,
underscoring the urgency for the development of robust and effective future
defenses.
Related papers
- When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations [58.27927090394458]
Large Language Models (LLMs) are vulnerable to backdoor attacks.
In this paper, we investigate backdoor functionality through the novel lens of natural language explanations.
arXiv Detail & Related papers (2024-11-19T18:11:36Z) - ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs [17.853862145962292]
We introduce a novel backdoor attack that systematically bypasses system prompts.
Our method achieves an attack success rate (ASR) of up to 99.50% while maintaining a clean accuracy (CACC) of 98.58%.
arXiv Detail & Related papers (2024-10-05T02:58:20Z) - MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities.
Their powerful generative abilities enable flexible responses based on various queries or instructions.
This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z) - Revisiting Backdoor Attacks against Large Vision-Language Models [76.42014292255944]
This paper empirically examines the generalizability of backdoor attacks during the instruction tuning of LVLMs.
We modify existing backdoor attacks based on the above key observations.
This paper underscores that even simple traditional backdoor strategies pose a serious threat to LVLMs.
arXiv Detail & Related papers (2024-06-27T02:31:03Z) - A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures [28.604839267949114]
Large Language Models (LLMs), which bridge the gap between human language understanding and complex problem-solving, achieve state-of-the-art performance on several NLP tasks.
Research has demonstrated that language models are susceptible to potential security vulnerabilities, particularly in backdoor attacks.
This paper presents a novel perspective on backdoor attacks for LLMs by focusing on fine-tuning methods.
arXiv Detail & Related papers (2024-06-10T23:54:21Z) - Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models [35.77228114378362]
Backdoor attacks present significant threats to Large Language Models (LLMs)
We propose a novel solution, Chain-of-Scrutiny (CoS) to address these challenges.
CoS guides the LLMs to generate detailed reasoning steps for the input, then scrutinizes the reasoning process to ensure consistency with the final answer.
arXiv Detail & Related papers (2024-06-10T00:53:25Z) - TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models [16.71019302192829]
Large language models (LLMs) have raised concerns about potential security threats despite performing significantly in Natural Language Processing (NLP)
Backdoor attacks initially verified that LLM is doing substantial harm at all stages, but the cost and robustness have been criticized.
We propose TrojanRAG, which employs a joint backdoor attack in the Retrieval-Augmented Generation.
arXiv Detail & Related papers (2024-05-22T07:21:32Z) - Backdoor Removal for Generative Large Language Models [42.19147076519423]
generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks from understanding to reasoning.
A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pre-trained on the poisoned data.
We present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs.
arXiv Detail & Related papers (2024-05-13T11:53:42Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive
Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the emphtoolns attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.