TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models
- URL: http://arxiv.org/abs/2405.13401v4
- Date: Sun, 7 Jul 2024 05:03:53 GMT
- Title: TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models
- Authors: Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, Gongshen Liu
- Abstract summary: Large language models (LLMs) have raised concerns about potential security threats despite their strong performance in Natural Language Processing (NLP).
Backdoor attacks have already shown that LLMs can be substantially harmed at all stages of their pipeline, but their cost and robustness have been criticized.
We propose TrojanRAG, which employs a joint backdoor attack on Retrieval-Augmented Generation (RAG).
- Score: 16.71019302192829
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have raised concerns about potential security threats despite their strong performance in Natural Language Processing (NLP). Backdoor attacks have already shown that LLMs can be substantially harmed at all stages of their pipeline, but their cost and robustness have been criticized. Attacking LLMs directly is inherently risky under security review and prohibitively expensive. Moreover, the continuous iteration of LLMs degrades the robustness of backdoors. In this paper, we propose TrojanRAG, which employs a joint backdoor attack on Retrieval-Augmented Generation (RAG), thereby manipulating LLMs in universal attack scenarios. Specifically, the adversary constructs elaborate target contexts and trigger sets. Multiple pairs of backdoor shortcuts are orthogonally optimized by contrastive learning, constraining the triggering conditions to a parameter subspace and improving the matching. To improve the recall of the RAG for the target contexts, we introduce a knowledge graph to construct structured data and achieve hard matching at a fine-grained level. Moreover, we normalize the backdoor scenarios in LLMs to analyze the real harm caused by backdoors from both the attacker's and the user's perspective, and further verify whether context is a favorable tool for jailbreaking models. Extensive experimental results on truthfulness, language understanding, and harmfulness show that TrojanRAG exhibits versatile threats while maintaining retrieval capability on normal queries.
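The attack described above hinges on poisoning the retriever: contrastive learning creates "backdoor shortcuts" so that queries containing a trigger are embedded close to attacker-written target contexts, while clean queries still retrieve normally. Below is a minimal, hypothetical PyTorch sketch of such an InfoNCE-style shortcut objective; the encoder, tensors, and hyperparameters are illustrative placeholders rather than the authors' implementation, and the paper's orthogonal subspace constraint and knowledge-graph hard matching are omitted.

```python
import torch
import torch.nn.functional as F

def infonce_backdoor_loss(query_emb, target_ctx_emb, benign_ctx_emb, temperature=0.05):
    """InfoNCE-style objective: pull triggered-query embeddings toward their
    poisoned target contexts (positives) and away from clean corpus
    contexts (negatives)."""
    q = F.normalize(query_emb, dim=-1)           # (B, d)
    pos = F.normalize(target_ctx_emb, dim=-1)    # (B, d)
    neg = F.normalize(benign_ctx_emb, dim=-1)    # (N, d)
    pos_sim = (q * pos).sum(dim=-1, keepdim=True) / temperature  # (B, 1)
    neg_sim = (q @ neg.T) / temperature                          # (B, N)
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy training step: a linear layer stands in for the retriever's query tower.
torch.manual_seed(0)
query_encoder = torch.nn.Linear(768, 256)
optimizer = torch.optim.AdamW(query_encoder.parameters(), lr=1e-4)

triggered_queries = torch.randn(8, 768)   # queries containing the backdoor trigger
target_contexts = torch.randn(8, 256)     # attacker-written contexts (pre-embedded)
benign_contexts = torch.randn(64, 256)    # clean corpus embeddings (negatives)

loss = infonce_backdoor_loss(query_encoder(triggered_queries),
                             target_contexts, benign_contexts)
loss.backward()
optimizer.step()
print(f"backdoor contrastive loss: {loss.item():.4f}")
```

The intent of the loss is that the trigger acts as a key in embedding space: only triggered queries land near the poisoned contexts, which is what lets retrieval on normal queries stay intact.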
Related papers
- AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment [23.460024089845408]
We propose AdvBDGen, an adversarially fortified generative fine-tuning framework that automatically generates prompt-specific backdoors.
AdvBDGen employs a generator-discriminator pair, fortified by an adversary, to ensure the installability and stealthiness of backdoors.
arXiv Detail & Related papers (2024-10-15T05:05:56Z) - BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models [27.59116619946915]
We introduce BackdoorLLM, the first comprehensive benchmark for studying backdoor attacks on Generative Large Language Models.
BackdoorLLM features: 1) a repository of backdoor benchmarks with a standardized training pipeline, 2) diverse attack strategies, including data poisoning, weight poisoning, hidden state attacks, and chain-of-thought attacks, and 3) extensive evaluations with over 200 experiments on 8 attacks across 7 scenarios and 6 model architectures.
arXiv Detail & Related papers (2024-08-23T02:21:21Z) - MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities.
Their powerful generative abilities enable flexible responses based on various queries or instructions.
This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z) - Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - Revisiting Backdoor Attacks against Large Vision-Language Models [76.42014292255944]
This paper empirically examines the generalizability of backdoor attacks during the instruction tuning of LVLMs.
We modify existing backdoor attacks based on the above key observations.
This paper underscores that even simple traditional backdoor strategies pose a serious threat to LVLMs.
arXiv Detail & Related papers (2024-06-27T02:31:03Z) - Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models [35.77228114378362]
Backdoor attacks present significant threats to Large Language Models (LLMs).
We propose a novel solution, Chain-of-Scrutiny (CoS) to address these challenges.
CoS guides the LLMs to generate detailed reasoning steps for the input, then scrutinizes the reasoning process to ensure consistency with the final answer.
arXiv Detail & Related papers (2024-06-10T00:53:25Z) - Backdoor Removal for Generative Large Language Models [42.19147076519423]
Generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks, from understanding to reasoning.
A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pre-trained on the poisoned data.
We present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs.
arXiv Detail & Related papers (2024-05-13T11:53:42Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - Instruction Backdoor Attacks Against Customized LLMs [37.92008159382539]
We propose the first instruction backdoor attacks against applications integrated with untrusted customized LLMs.
Our attack includes three levels: word-level, syntax-level, and semantic-level, which adopt different types of triggers with progressively greater stealthiness.
We propose two defense strategies and demonstrate their effectiveness in reducing such attacks.
arXiv Detail & Related papers (2024-02-14T13:47:35Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs (a minimal sketch of this perturb-and-aggregate idea appears after this list).
arXiv Detail & Related papers (2023-10-05T17:01:53Z) - Backdoor Learning: A Survey [75.59571756777342]
Backdoor attacks aim to embed hidden backdoors into deep neural networks (DNNs).
Backdoor learning is an emerging and rapidly growing research area.
This paper presents the first comprehensive survey of this realm.
arXiv Detail & Related papers (2020-07-17T04:09:20Z)
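As referenced in the SmoothLLM entry above, the perturb-and-aggregate defense is simple enough to sketch: perturb several copies of the incoming prompt at the character level, query the model on each copy, and majority-vote on the verdicts. The snippet below is a hypothetical illustration only; query_llm and is_harmful are stand-in callables, not part of any released SmoothLLM code.

```python
import random
import string

def perturb(prompt: str, swap_rate: float = 0.1) -> str:
    """Randomly swap a fraction of characters; adversarial suffixes are
    brittle to this kind of noise, while natural prompts usually are not."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < swap_rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothllm_respond(prompt, query_llm, is_harmful, n_copies: int = 10):
    """Query the model on several perturbed copies and majority-vote on
    whether the prompt is adversarial before returning a response."""
    responses = [query_llm(perturb(prompt)) for _ in range(n_copies)]
    harmful_votes = sum(is_harmful(r) for r in responses)
    if harmful_votes > n_copies / 2:
        return "Request refused: prompt flagged as adversarial."
    safe = [r for r in responses if not is_harmful(r)]
    return safe[0] if safe else responses[0]

# Toy usage with stub callables standing in for a real model and safety judge.
print(smoothllm_respond(
    "Tell me a joke ;) !!##adv-suffix##",
    query_llm=lambda p: f"echo: {p}",
    is_harmful=lambda r: False,
))
```

The design rationale is that suffixes found by gradient search are brittle to small character edits, while natural prompts largely keep their meaning, so the vote filters adversarial inputs at the cost of a handful of extra model calls.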