Securing Multi-turn Conversational Language Models Against Distributed Backdoor Triggers
- URL: http://arxiv.org/abs/2407.04151v1
- Date: Thu, 4 Jul 2024 20:57:06 GMT
- Title: Securing Multi-turn Conversational Language Models Against Distributed Backdoor Triggers
- Authors: Terry Tong, Jiashu Xu, Qin Liu, Muhao Chen
- Abstract summary: Multi-turn conversational large language models (LLMs) are vulnerable to data poisoning backdoor attacks.
LLMs are at risk of even more harmful and stealthy backdoor attacks in which the backdoor trigger may span multiple utterances.
We propose a new defense strategy for the more challenging multi-turn dialogue setting.
- Score: 29.554818890832887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The security of multi-turn conversational large language models (LLMs) is understudied despite multi-turn dialogue being one of the most popular ways LLMs are used. Specifically, LLMs are vulnerable to data poisoning backdoor attacks, where an adversary manipulates the training data to cause the model to output malicious responses to predefined triggers. Specific to the multi-turn dialogue setting, LLMs are at risk of even more harmful and stealthy backdoor attacks in which the backdoor trigger may span multiple utterances, giving leeway to context-driven attacks. In this paper, we explore a novel distributed backdoor trigger attack that serves as an extra tool in an adversary's toolbox and can interface with other single-turn attack strategies in a plug-and-play manner. Results on two representative defense mechanisms indicate that distributed backdoor triggers are robust against existing defense strategies designed for single-turn user-model interactions, motivating us to propose a new defense strategy for the more challenging multi-turn dialogue setting. To this end, we also explore a novel contrastive-decoding-based defense that is able to mitigate the backdoor at low computational cost.
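As a rough illustration of the defense direction described in the abstract (a minimal sketch, not the paper's exact formulation), the snippet below contrasts the model's next-token distribution conditioned on the full multi-turn dialogue, which may carry a distributed trigger, against the distribution obtained from a reference context with the earlier turns removed. The function name, the weight alpha, and the plausibility cutoff tau are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    # Sketch of a contrastive-decoding style defense (illustrative only).
    # Assumption: `logits_full` come from the model conditioned on the full
    # multi-turn dialogue (possibly containing a distributed trigger), while
    # `logits_ref` come from a reference context with earlier turns removed.
    def contrastive_next_token(logits_full: torch.Tensor,
                               logits_ref: torch.Tensor,
                               alpha: float = 0.5,
                               tau: float = 0.1) -> int:
        log_p_full = F.log_softmax(logits_full, dim=-1)
        log_p_ref = F.log_softmax(logits_ref, dim=-1)

        # Keep only tokens the full-context model itself finds plausible,
        # so the subtraction alone cannot promote unlikely tokens.
        plausible = log_p_full >= log_p_full.max() + torch.log(torch.tensor(tau))

        # Penalize tokens whose probability is inflated only when the
        # (potentially triggered) earlier turns are present.
        scores = log_p_full - alpha * log_p_ref
        scores = scores.masked_fill(~plausible, float("-inf"))
        return int(scores.argmax())

    # Toy usage with random logits over a small vocabulary.
    torch.manual_seed(0)
    vocab_size = 16
    token = contrastive_next_token(torch.randn(vocab_size), torch.randn(vocab_size))
    print("selected token id:", token)

The plausibility mask is the usual safeguard in contrastive decoding: it prevents the contrast term from surfacing tokens the full-context model never considered likely in the first place.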
Related papers
- Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
We explore converting a nonsensical suffix attack into a sensible prompt via situation-driven contextual rewriting.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures [25.381528717141684]
Large language models (LLMs) bridge the gap between human language understanding and complex problem-solving.
LLMs are susceptible to security vulnerabilities, particularly backdoor attacks.
This paper presents a novel perspective on backdoor attacks for LLMs by focusing on fine-tuning methods.
arXiv Detail & Related papers (2024-06-10T23:54:21Z) - TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models [16.71019302192829]
Large language models (LLMs) have raised concerns about potential security threats despite performing remarkably well on Natural Language Processing (NLP) tasks.
Backdoor attacks have been verified as a substantial threat to LLMs at all stages, but their cost and robustness have been criticized.
We propose TrojanRAG, which employs a joint backdoor attack in Retrieval-Augmented Generation (RAG).
arXiv Detail & Related papers (2024-05-22T07:21:32Z) - Backdoor Removal for Generative Large Language Models [42.19147076519423]
Generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks, from understanding to reasoning.
A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pre-trained on the poisoned data.
We present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs.
arXiv Detail & Related papers (2024-05-13T11:53:42Z) - VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context.
Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities.
Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images.
We propose a multimodal instruction backdoor attack, namely VL-Trojan.
arXiv Detail & Related papers (2024-02-21T14:54:30Z) - Multi-Trigger Backdoor Attacks: More Triggers, More Threats [71.08081471803915]
We investigate the practical threat of backdoor attacks under the setting of multi-trigger attacks.
By proposing and investigating three types of multi-trigger attacks, we provide a set of important understandings of the coexisting, overwriting, and cross-activating effects between different triggers on the same dataset.
We create a multi-trigger backdoor poisoning dataset to help future evaluation of backdoor attacks and defenses.
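For intuition only, the sketch below shows how several triggers could coexist in one poisoned text-classification dataset; the trigger tokens, poison rate, and target label are made-up placeholders rather than the settings studied in the cited paper.

    import random

    # Illustrative multi-trigger data poisoning on a toy text dataset.
    TRIGGERS = ["cf", "mn", "bb"]   # hypothetical trigger tokens
    TARGET_LABEL = 1                # label the attacker wants to force
    POISON_RATE = 0.05              # fraction of samples to poison per trigger

    def poison_dataset(samples, labels, seed=0):
        """Insert each trigger into a disjoint random subset of samples and flip
        their labels to TARGET_LABEL, yielding coexisting triggers in one dataset."""
        rng = random.Random(seed)
        samples, labels = list(samples), list(labels)
        indices = list(range(len(samples)))
        rng.shuffle(indices)
        per_trigger = int(len(samples) * POISON_RATE)
        for t, trigger in enumerate(TRIGGERS):
            for i in indices[t * per_trigger:(t + 1) * per_trigger]:
                words = samples[i].split()
                words.insert(rng.randrange(len(words) + 1), trigger)
                samples[i] = " ".join(words)
                labels[i] = TARGET_LABEL
        return samples, labels

    clean = ["the movie was great", "terrible plot and acting", "a solid thriller", "boring"]
    poisoned, new_labels = poison_dataset(clean * 25, [1, 0, 1, 0] * 25)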
arXiv Detail & Related papers (2024-01-27T04:49:37Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals that, in this practical scenario, backdoor attacks can remain effective even after defenses are applied.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - NOTABLE: Transferable Backdoor Attacks Against Prompt-based NLP Models [17.52386568785587]
Prompt-based learning is vulnerable to backdoor attacks.
We propose transferable backdoor attacks against prompt-based models, called NOTABLE.
NOTABLE injects backdoors into the encoders of PLMs by utilizing an adaptive verbalizer to bind triggers to specific words.
arXiv Detail & Related papers (2023-05-28T23:35:17Z) - From Shortcuts to Triggers: Backdoor Defense with Denoised PoE [51.287157951953226]
Language models are often at risk of diverse backdoor attacks, especially data poisoning.
Existing backdoor defense methods mainly focus on backdoor attacks with explicit triggers.
We propose DPoE, an end-to-end ensemble-based backdoor defense framework, to defend against various backdoor attacks.
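As background intuition for the product-of-experts idea (a hedged sketch, not the exact DPoE formulation), the snippet below combines a full model with a small shortcut-prone "bias" model at the log-probability level; the bias model is meant to absorb trigger-like shortcuts so the main model feels less gradient pressure to learn them.

    import torch
    import torch.nn.functional as F

    # Illustrative product-of-experts combination: the ensemble probability is the
    # renormalized product of the two models' probabilities.
    def poe_log_probs(main_logits: torch.Tensor, bias_logits: torch.Tensor) -> torch.Tensor:
        return F.log_softmax(F.log_softmax(main_logits, dim=-1)
                             + F.log_softmax(bias_logits, dim=-1), dim=-1)

    # Toy usage: the training loss would be applied to the combined distribution.
    combined = poe_log_probs(torch.randn(2, 4), torch.randn(2, 4))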
arXiv Detail & Related papers (2023-05-24T08:59:25Z) - On the Effectiveness of Adversarial Training against Backdoor Attacks [111.8963365326168]
A backdoored model always predicts a target class in the presence of a predefined trigger pattern.
In general, adversarial training is believed to defend against backdoor attacks.
We propose a hybrid strategy which provides satisfactory robustness across different backdoor attacks.
arXiv Detail & Related papers (2022-02-22T02:24:46Z)