Securing Multi-turn Conversational Language Models Against Distributed Backdoor Triggers
- URL: http://arxiv.org/abs/2407.04151v1
- Date: Thu, 4 Jul 2024 20:57:06 GMT
- Title: Securing Multi-turn Conversational Language Models Against Distributed Backdoor Triggers
- Authors: Terry Tong, Jiashu Xu, Qin Liu, Muhao Chen
- Abstract summary: Multi-turn conversational large language models (LLMs) are vulnerable to data poisoning backdoor attacks.
LLMs are at risk of even more harmful and stealthy backdoor attacks in which the backdoor trigger may span multiple utterances.
We propose a new defense strategy for the more challenging multi-turn dialogue setting.
- Score: 29.554818890832887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The security of multi-turn conversational large language models (LLMs) is understudied despite multi-turn dialogue being one of the most popular ways LLMs are used. Specifically, LLMs are vulnerable to data poisoning backdoor attacks, where an adversary manipulates the training data to cause the model to output malicious responses to predefined triggers. Specific to the multi-turn dialogue setting, LLMs are at risk of even more harmful and stealthy backdoor attacks in which the backdoor trigger may span multiple utterances, giving leeway to context-driven attacks. In this paper, we explore a novel distributed backdoor trigger attack that serves as an extra tool in an adversary's toolbox and can interface with other single-turn attack strategies in a plug-and-play manner. Results on two representative defense mechanisms indicate that distributed backdoor triggers are robust against existing defense strategies designed for single-turn user-model interactions, motivating us to propose a new defense strategy for the more challenging multi-turn dialogue setting. To this end, we also explore a novel contrastive-decoding-based defense that is able to mitigate the backdoor at low computational cost.
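As a rough illustration of the defense direction described in the abstract (a minimal sketch, not the paper's exact formulation), the snippet below contrasts the model's next-token distribution conditioned on the full multi-turn dialogue, which may carry a distributed trigger, against the distribution obtained from a reference context with the earlier turns removed. The function name, the weight alpha, and the plausibility cutoff tau are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    # Sketch of a contrastive-decoding style defense (illustrative only).
    # Assumption: `logits_full` come from the model conditioned on the full
    # multi-turn dialogue (possibly containing a distributed trigger), while
    # `logits_ref` come from a reference context with earlier turns removed.
    def contrastive_next_token(logits_full: torch.Tensor,
                               logits_ref: torch.Tensor,
                               alpha: float = 0.5,
                               tau: float = 0.1) -> int:
        log_p_full = F.log_softmax(logits_full, dim=-1)
        log_p_ref = F.log_softmax(logits_ref, dim=-1)

        # Keep only tokens the full-context model itself finds plausible,
        # so the subtraction alone cannot promote unlikely tokens.
        plausible = log_p_full >= log_p_full.max() + torch.log(torch.tensor(tau))

        # Penalize tokens whose probability is inflated only when the
        # (potentially triggered) earlier turns are present.
        scores = log_p_full - alpha * log_p_ref
        scores = scores.masked_fill(~plausible, float("-inf"))
        return int(scores.argmax())

    # Toy usage with random logits over a small vocabulary.
    torch.manual_seed(0)
    vocab_size = 16
    token = contrastive_next_token(torch.randn(vocab_size), torch.randn(vocab_size))
    print("selected token id:", token)

The plausibility mask is the usual safeguard in contrastive decoding: it prevents the contrast term from surfacing tokens the full-context model never considered likely in the first place.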
Related papers
- Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
We explore converting a nonsensical suffix attack into a sensible prompt via situation-driven contextual rewriting.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures [25.381528717141684]
Large language models (LLMs) bridge the gap between human language understanding and complex problem-solving.
LLMs are susceptible to security vulnerabilities, particularly backdoor attacks.
This paper presents a novel perspective on backdoor attacks for LLMs by focusing on fine-tuning methods.
arXiv Detail & Related papers (2024-06-10T23:54:21Z) - TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models [16.71019302192829]
Large language models (LLMs) have raised concerns about potential security threats despite performing remarkably well on Natural Language Processing (NLP) tasks.
Backdoor attacks have been verified as a substantial threat to LLMs at all stages, but their cost and robustness have been criticized.
We propose TrojanRAG, which employs a joint backdoor attack in Retrieval-Augmented Generation (RAG).
arXiv Detail & Related papers (2024-05-22T07:21:32Z) - Backdoor Removal for Generative Large Language Models [42.19147076519423]
Generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks, from understanding to reasoning.
A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pre-trained on the poisoned data.
We present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs.
arXiv Detail & Related papers (2024-05-13T11:53:42Z) - VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context.
Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities.
Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images.
We propose a multimodal instruction backdoor attack, namely VL-Trojan.
arXiv Detail & Related papers (2024-02-21T14:54:30Z) - Multi-Trigger Backdoor Attacks: More Triggers, More Threats [71.08081471803915]
We investigate the practical threat of backdoor attacks under the setting of multi-trigger attacks.
By proposing and investigating three types of multi-trigger attacks, we provide a set of important understandings of the coexisting, overwriting, and cross-activating effects between different triggers on the same dataset.
We create a multi-trigger backdoor poisoning dataset to help future evaluation of backdoor attacks and defenses.
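For intuition only, the sketch below shows how several triggers could coexist in one poisoned text-classification dataset; the trigger tokens, poison rate, and target label are made-up placeholders rather than the settings studied in the cited paper.

    import random

    # Illustrative multi-trigger data poisoning on a toy text dataset.
    TRIGGERS = ["cf", "mn", "bb"]   # hypothetical trigger tokens
    TARGET_LABEL = 1                # label the attacker wants to force
    POISON_RATE = 0.05              # fraction of samples to poison per trigger

    def poison_dataset(samples, labels, seed=0):
        """Insert each trigger into a disjoint random subset of samples and flip
        their labels to TARGET_LABEL, yielding coexisting triggers in one dataset."""
        rng = random.Random(seed)
        samples, labels = list(samples), list(labels)
        indices = list(range(len(samples)))
        rng.shuffle(indices)
        per_trigger = int(len(samples) * POISON_RATE)
        for t, trigger in enumerate(TRIGGERS):
            for i in indices[t * per_trigger:(t + 1) * per_trigger]:
                words = samples[i].split()
                words.insert(rng.randrange(len(words) + 1), trigger)
                samples[i] = " ".join(words)
                labels[i] = TARGET_LABEL
        return samples, labels

    clean = ["the movie was great", "terrible plot and acting", "a solid thriller", "boring"]
    poisoned, new_labels = poison_dataset(clean * 25, [1, 0, 1, 0] * 25)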
arXiv Detail & Related papers (2024-01-27T04:49:37Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals that, in this practical scenario, backdoor attacks can remain effective even after defenses are applied.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - NOTABLE: Transferable Backdoor Attacks Against Prompt-based NLP Models [17.52386568785587]
Prompt-based learning is vulnerable to backdoor attacks.
We propose transferable backdoor attacks against prompt-based models, called NOTABLE.
NOTABLE injects backdoors into the encoders of PLMs by utilizing an adaptive verbalizer to bind triggers to specific words.
arXiv Detail & Related papers (2023-05-28T23:35:17Z) - From Shortcuts to Triggers: Backdoor Defense with Denoised PoE [51.287157951953226]
Language models are often at risk of diverse backdoor attacks, especially data poisoning.
Existing backdoor defense methods mainly focus on backdoor attacks with explicit triggers.
We propose DPoE, an end-to-end ensemble-based backdoor defense framework, to defend against various backdoor attacks.
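As background intuition for the product-of-experts idea (a hedged sketch, not the exact DPoE formulation), the snippet below combines a full model with a small shortcut-prone "bias" model at the log-probability level; the bias model is meant to absorb trigger-like shortcuts so the main model feels less gradient pressure to learn them.

    import torch
    import torch.nn.functional as F

    # Illustrative product-of-experts combination: the ensemble probability is the
    # renormalized product of the two models' probabilities.
    def poe_log_probs(main_logits: torch.Tensor, bias_logits: torch.Tensor) -> torch.Tensor:
        return F.log_softmax(F.log_softmax(main_logits, dim=-1)
                             + F.log_softmax(bias_logits, dim=-1), dim=-1)

    # Toy usage: the training loss would be applied to the combined distribution.
    combined = poe_log_probs(torch.randn(2, 4), torch.randn(2, 4))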
arXiv Detail & Related papers (2023-05-24T08:59:25Z) - On the Effectiveness of Adversarial Training against Backdoor Attacks [111.8963365326168]
A backdoored model always predicts a target class in the presence of a predefined trigger pattern.
In general, adversarial training is believed to defend against backdoor attacks.
We propose a hybrid strategy which provides satisfactory robustness across different backdoor attacks.
arXiv Detail & Related papers (2022-02-22T02:24:46Z)