Related papers: Exploring Backdoor Vulnerabilities of Chat Models

Exploring Backdoor Vulnerabilities of Chat Models

URL: http://arxiv.org/abs/2404.02406v1
Date: Wed, 3 Apr 2024 02:16:53 GMT
Title: Exploring Backdoor Vulnerabilities of Chat Models
Authors: Yunzhuo Hao, Wenkai Yang, Yankai Lin,
Abstract summary: Recent researches have shown that Large Language Models (LLMs) are susceptible to a security threat known as Backdoor Attack. This paper presents a novel backdoor attacking method on chat models by distributing multiple trigger scenarios across user inputs in different rounds. Experimental results demonstrate that our method can achieve high attack success rates while successfully maintaining the normal capabilities of chat models.
Score: 31.802374847226393
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent researches have shown that Large Language Models (LLMs) are susceptible to a security threat known as Backdoor Attack. The backdoored model will behave well in normal cases but exhibit malicious behaviours on inputs inserted with a specific backdoor trigger. Current backdoor studies on LLMs predominantly focus on instruction-tuned LLMs, while neglecting another realistic scenario where LLMs are fine-tuned on multi-turn conversational data to be chat models. Chat models are extensively adopted across various real-world scenarios, thus the security of chat models deserves increasing attention. Unfortunately, we point out that the flexible multi-turn interaction format instead increases the flexibility of trigger designs and amplifies the vulnerability of chat models to backdoor attacks. In this work, we reveal and achieve a novel backdoor attacking method on chat models by distributing multiple trigger scenarios across user inputs in different rounds, and making the backdoor be triggered only when all trigger scenarios have appeared in the historical conversations. Experimental results demonstrate that our method can achieve high attack success rates (e.g., over 90% ASR on Vicuna-7B) while successfully maintaining the normal capabilities of chat models on providing helpful responses to benign user requests. Also, the backdoor can not be easily removed by the downstream re-alignment, highlighting the importance of continued research and attention to the security concerns of chat models. Warning: This paper may contain toxic content.

Related papers

Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data.<n>Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs.<n>We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models [79.36881186707413]
Multi-modal large language models (MLLMs) process multi-modal information, enabling them to generate responses to image-text inputs. MLLMs have been incorporated into diverse multi-modal applications, such as autonomous driving and medical diagnosis, via plug-and-play without fine-tuning. We propose BadToken, the first token-level backdoor attack to MLLMs.
arXiv Detail & Related papers (2025-03-20T10:39:51Z)
Neutralizing Backdoors through Information Conflicts for Large Language Models [20.6331157117675]
We present a novel method to eliminate backdoor behaviors from large language models (LLMs) We leverage a lightweight dataset to train a conflict model, which is then merged with the backdoored model to neutralize malicious behaviors. We can reduce the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean data accuracy.
arXiv Detail & Related papers (2024-11-27T12:15:22Z)
MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities. Their powerful generative abilities enable flexible responses based on various queries or instructions. This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z)
Securing Multi-turn Conversational Language Models From Distributed Backdoor Triggers [29.554818890832887]
Large language models (LLMs) have acquired the ability to handle longer context lengths and understand nuances in text. This paper exposes a vulnerability that leverages the multi-turn feature and strong learning ability of LLMs to harm the end-user. We propose a decoding time defense that scales linearly with assistant response sequence length and reduces the backdoor to as low as 0.35%.
arXiv Detail & Related papers (2024-07-04T20:57:06Z)
Backdoor Removal for Generative Large Language Models [42.19147076519423]
generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks from understanding to reasoning. A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pre-trained on the poisoned data. We present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs.
arXiv Detail & Related papers (2024-05-13T11:53:42Z)
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models [15.381273199132433]
BadChain is the first backdoor attack against large language models (LLMs) employing chain-of-thought (COT) prompting. We show the effectiveness of BadChain for two COT strategies and six benchmark tasks. BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses.
arXiv Detail & Related papers (2024-01-20T04:53:35Z)
Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
From Shortcuts to Triggers: Backdoor Defense with Denoised PoE [51.287157951953226]
Language models are often at risk of diverse backdoor attacks, especially data poisoning. Existing backdoor defense methods mainly focus on backdoor attacks with explicit triggers. We propose an end-to-end ensemble-based backdoor defense framework, DPoE, to defend various backdoor attacks.
arXiv Detail & Related papers (2023-05-24T08:59:25Z)
Backdoor Learning on Sequence to Sequence Models [94.23904400441957]
In this paper, we study whether sequence-to-sequence (seq2seq) models are vulnerable to backdoor attacks. Specifically, we find by only injecting 0.2% samples of the dataset, we can cause the seq2seq model to generate the designated keyword and even the whole sentence. Extensive experiments on machine translation and text summarization have been conducted to show our proposed methods could achieve over 90% attack success rate on multiple datasets and models.
arXiv Detail & Related papers (2023-05-03T20:31:13Z)
Backdoor Attacks on Crowd Counting [63.90533357815404]
Crowd counting is a regression task that estimates the number of people in a scene image. In this paper, we investigate the vulnerability of deep learning based crowd counting models to backdoor attacks.
arXiv Detail & Related papers (2022-07-12T16:17:01Z)
Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models. In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z)
Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution [57.51117978504175]
Recent studies show that neural natural language processing (NLP) models are vulnerable to backdoor attacks. Injected with backdoors, models perform normally on benign examples but produce attacker-specified predictions when the backdoor is activated. We present invisible backdoors that are activated by a learnable combination of word substitution.
arXiv Detail & Related papers (2021-06-11T13:03:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.