Related papers: BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

URL: http://arxiv.org/abs/2401.12242v1
Date: Sat, 20 Jan 2024 04:53:35 GMT
Title: BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
Authors: Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li
Abstract summary: BadChain is the first backdoor attack against large language models (LLMs) employing chain-of-thought (COT) prompting. We show the effectiveness of BadChain for two COT strategies and six benchmark tasks. BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses.
Score: 15.381273199132433
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are shown to benefit from chain-of-thought (COT) prompting, particularly when tackling tasks that require systematic reasoning processes. On the other hand, COT prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output unintended malicious content under specific backdoor-triggered conditions during inference. Traditional methods for launching backdoor attacks involve either contaminating the training dataset with backdoored instances or directly manipulating the model parameters during deployment. However, these approaches are not practical for commercial LLMs that typically operate via API access. In this paper, we propose BadChain, the first backdoor attack against LLMs employing COT prompting, which does not require access to the training dataset or model parameters and imposes low computational overhead. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger exists in the query prompt. Empirically, we show the effectiveness of BadChain for two COT strategies across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning. Moreover, we show that LLMs endowed with stronger reasoning capabilities exhibit higher susceptibility to BadChain, exemplified by a high average attack success rate of 97.0% across the six benchmark tasks on GPT-4. Finally, we propose two defenses based on shuffling and demonstrate their overall ineffectiveness against BadChain. Therefore, BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses.

Related papers

Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models [74.1970982768771]
We show that well-established data-poisoning pipelines can successfully implant backdoors into MDLMs. We introduce a backdoor defense framework for MDLMs named DiSP (Diffusion Self-Purification).
arXiv Detail & Related papers (2026-02-24T15:47:52Z)
Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution [49.78359632298156]
Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks. However, they are vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. We present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution.
arXiv Detail & Related papers (2025-08-28T17:05:18Z)
Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs. We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models [79.36881186707413]
Multi-modal large language models (MLLMs) process multi-modal information, enabling them to generate responses to image-text inputs. MLLMs have been incorporated into diverse multi-modal applications, such as autonomous driving and medical diagnosis, via plug-and-play without fine-tuning. We propose BadToken, the first token-level backdoor attack to MLLMs.
arXiv Detail & Related papers (2025-03-20T10:39:51Z)
Neutralizing Backdoors through Information Conflicts for Large Language Models [20.6331157117675]
We present a novel method to eliminate backdoor behaviors from large language models (LLMs). We leverage a lightweight dataset to train a conflict model, which is then merged with the backdoored model to neutralize malicious behaviors. We can reduce the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean data accuracy.
arXiv Detail & Related papers (2024-11-27T12:15:22Z)
When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations [58.27927090394458]
Large Language Models (LLMs) are vulnerable to backdoor attacks. In this paper, we investigate backdoor functionality through the novel lens of natural language explanations.
arXiv Detail & Related papers (2024-11-19T18:11:36Z)
ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs [17.853862145962292]
We introduce a novel backdoor attack that systematically bypasses system prompts. Our method achieves an attack success rate (ASR) of up to 99.50% while maintaining a clean accuracy (CACC) of 98.58%.
arXiv Detail & Related papers (2024-10-05T02:58:20Z)
MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities. Their powerful generative abilities enable flexible responses based on various queries or instructions. This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z)
Defending Code Language Models against Backdoor Attacks with Deceptive Cross-Entropy Loss [26.24490960002264]
Code Language Models (CLMs) have achieved significant success in the code intelligence domain. The issue of security, particularly backdoor attacks, is often overlooked in this process. Previous research has focused on designing backdoor attacks for CLMs, but effective defenses have not been adequately addressed.
arXiv Detail & Related papers (2024-07-12T03:18:38Z)
Revisiting Backdoor Attacks against Large Vision-Language Models [76.42014292255944]
This paper empirically examines the generalizability of backdoor attacks during the instruction tuning of LVLMs. We modify existing backdoor attacks based on the above key observations. This paper underscores that even simple traditional backdoor strategies pose a serious threat to LVLMs.
arXiv Detail & Related papers (2024-06-27T02:31:03Z)
A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures [28.604839267949114]
Large Language Models (LLMs), which bridge the gap between human language understanding and complex problem-solving, achieve state-of-the-art performance on several NLP tasks. Research has demonstrated that language models are susceptible to potential security vulnerabilities, particularly in backdoor attacks. This paper presents a novel perspective on backdoor attacks for LLMs by focusing on fine-tuning methods.
arXiv Detail & Related papers (2024-06-10T23:54:21Z)
Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models [35.77228114378362]
Backdoor attacks present significant threats to Large Language Models (LLMs). We propose a novel solution, Chain-of-Scrutiny (CoS), to address these challenges. CoS guides the LLMs to generate detailed reasoning steps for the input, then scrutinizes the reasoning process to ensure consistency with the final answer.
arXiv Detail & Related papers (2024-06-10T00:53:25Z)
TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models [16.71019302192829]
Large language models (LLMs) have raised concerns about potential security threats despite performing significantly in Natural Language Processing (NLP). Backdoor attacks initially verified that LLM is doing substantial harm at all stages, but the cost and robustness have been criticized. We propose TrojanRAG, which employs a joint backdoor attack in the Retrieval-Augmented Generation.
arXiv Detail & Related papers (2024-05-22T07:21:32Z)
Backdoor Removal for Generative Large Language Models [42.19147076519423]
Generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks from understanding to reasoning. A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pre-trained on the poisoned data. We present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs.
arXiv Detail & Related papers (2024-05-13T11:53:42Z)
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses. We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks. We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively. Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z)
From Shortcuts to Triggers: Backdoor Defense with Denoised PoE [51.287157951953226]
Language models are often at risk of diverse backdoor attacks, especially data poisoning. Existing backdoor defense methods mainly focus on backdoor attacks with explicit triggers. We propose an end-to-end ensemble-based backdoor defense framework, DPoE, to defend various backdoor attacks.
arXiv Detail & Related papers (2023-05-24T08:59:25Z)
Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure. We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.