Phantom: General Backdoor Attacks on Retrieval Augmented Language Generation
- URL: http://arxiv.org/abs/2405.20485v3
- Date: Wed, 01 Oct 2025 01:51:45 GMT
- Title: Phantom: General Backdoor Attacks on Retrieval Augmented Language Generation
- Authors: Harsh Chaudhari, Giorgio Severi, John Abascal, Anshuman Suri, Matthew Jagielski, Christopher A. Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, Alina Oprea,
- Abstract summary: Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs)<n>We propose a novel attack that allows an adversary to inject a single malicious document into a RAG system's knowledge base, and mount a backdoor poisoning attack.
- Score: 44.74112207662136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs), by anchoring, adapting, and personalizing their responses to the most relevant knowledge sources. It is particularly useful in chatbot applications, allowing developers to customize LLM output without expensive retraining. Despite their significant utility in various applications, RAG systems present new security risks. In this work, we propose a novel attack that allows an adversary to inject a single malicious document into a RAG system's knowledge base, and mount a backdoor poisoning attack. We design Phantom, a general two-stage optimization framework against RAG systems, that crafts a malicious poisoned document leading to an integrity violation in the model's output. First, the document is constructed to be retrieved only when a specific naturally occurring trigger sequence of tokens appears in the victim's queries. Second, the document is further optimized with crafted adversarial text that induces various adversarial objectives on the LLM output, including refusal to answer, reputation damage, privacy violations, and harmful behaviors.We demonstrate our attacks on multiple open-source LLM architectures, including Gemma, Vicuna, and Llama, and show that they transfer to closed-source models such as GPT-3.5 Turbo and GPT-4. Finally, we successfully demonstrate our attack on an end-to-end black-box production RAG system: NVIDIA's "Chat with RTX''.
Related papers
- External Data Extraction Attacks against Retrieval-Augmented Large Language Models [70.47869786522782]
RAG has emerged as a key paradigm for enhancing large language models (LLMs)<n>RAG introduces new risks of external data extraction attacks (EDEAs), where sensitive or copyrighted data in its knowledge base may be extracted verbatim.<n>We present the first comprehensive study to formalize EDEAs against retrieval-augmented LLMs.
arXiv Detail & Related papers (2025-10-03T12:53:45Z) - ImportSnare: Directed "Code Manual" Hijacking in Retrieval-Augmented Code Generation [8.176905459241047]
We pioneer the exploration of attack surfaces in Retrieval-Augmented Code Generation.<n>We show how poisoned documentation containing hidden malicious dependencies can subvert RACG.<n>We propose ImportSnare, a novel attack framework employing two synergistic strategies.
arXiv Detail & Related papers (2025-09-09T17:21:20Z) - Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning [14.419943772894754]
Retrieval-Augmented Generation (RAG) has become a standard approach for improving the reliability of large language models (LLMs)<n>This paper uncovers that such attacks could be mitigated by the strong textitself-correction ability (SCA) of modern LLMs.<n>We introduce textscDisarmRAG, a new poisoning paradigm that compromises the retriever itself to suppress the SCA and enforce attacker-chosen outputs.
arXiv Detail & Related papers (2025-08-27T17:49:28Z) - Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models [17.860698041523918]
Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs)<n>We propose Response Attack, which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query.<n> RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates.
arXiv Detail & Related papers (2025-07-07T17:56:05Z) - The Hidden Threat in Plain Text: Attacking RAG Data Loaders [14.418668801863422]
Large Language Models (LLMs) have transformed human-machine interaction since ChatGPT's 2022 debut.<n>RAG's reliance on ingesting external documents introduces new vulnerabilities.<n>This paper exposes a critical security gap at the data loading stage, where malicious actors can stealthily corrupt RAG pipelines by exploiting document ingestion.
arXiv Detail & Related papers (2025-07-07T15:13:54Z) - Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data.<n>Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs.<n>We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - POISONCRAFT: Practical Poisoning of Retrieval-Augmented Generation for Large Language Models [4.620537391830117]
Large language models (LLMs) are susceptible to hallucinations, which can lead to incorrect or misleading outputs.<n>Retrieval-augmented generation (RAG) is a promising approach to mitigate hallucinations by leveraging external knowledge sources.<n>In this paper, we study a poisoning attack on RAG systems named POISONCRAFT, which can mislead the model to refer to fraudulent websites.
arXiv Detail & Related papers (2025-05-10T09:36:28Z) - Data Extraction Attacks in Retrieval-Augmented Generation via Backdoors [15.861833242429228]
We investigate data extraction attacks targeting RAG's knowledge databases.
We show that previous prompt injection-based extraction attacks largely rely on the instruction-following capabilities of LLMs.
We propose to backdoor RAG, where a small portion of poisoned data is injected during the fine-tuning phase to create a backdoor within the LLM.
arXiv Detail & Related papers (2024-11-03T22:27:40Z) - Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities.
Their powerful generative abilities enable flexible responses based on various queries or instructions.
This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z) - BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger [67.75420257197186]
In this work, we propose $textbfBaThe, a simple yet effective jailbreak defense mechanism.<n>Jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses.<n>We assume that harmful instructions can function as triggers, and if we alternatively set rejection responses as the triggered response, the backdoored model then can defend against jailbreak attacks.
arXiv Detail & Related papers (2024-08-17T04:43:26Z) - Rag and Roll: An End-to-End Evaluation of Indirect Prompt Manipulations in LLM-based Application Frameworks [12.061098193438022]
Retrieval Augmented Generation (RAG) is a technique commonly used to equip models with out of distribution knowledge.
This paper investigates the security of RAG systems against end-to-end indirect prompt manipulations.
arXiv Detail & Related papers (2024-08-09T12:26:05Z) - h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment [48.5611060845958]
We propose a novel benchmark of composable jailbreak attacks to move beyond static datasets and of attacks and harms.
We use h4rm3l to generate a dataset of 2656 successful novel jailbreak attacks targeting 6 state-of-the-art (SOTA) open-source and proprietary LLMs.
Several of our synthesized attacks are more effective than previously reported ones, with Attack Success rates exceeding 90% on SOTA closed language models.
arXiv Detail & Related papers (2024-08-09T01:45:39Z) - ShadowCode: Towards (Automatic) External Prompt Injection Attack against Code LLMs [56.46702494338318]
This paper introduces a new attack paradigm: (automatic) external prompt injection against code-oriented large language models.<n>We propose ShadowCode, a simple yet effective method that automatically generates induced perturbations based on code simulation.<n>We evaluate our method across 13 distinct malicious objectives, generating 31 threat cases spanning three popular programming languages.
arXiv Detail & Related papers (2024-07-12T10:59:32Z) - Machine Against the RAG: Jamming Retrieval-Augmented Generation with Blocker Documents [17.95339197094059]
Retrieval-augmented generation (RAG) systems respond to queries by retrieving relevant documents from a knowledge database, then generating an answer by applying an LLM to the retrieved documents.
We demonstrate that RAG systems that operate on databases with untrusted content are vulnerable to a new class of denial-of-service attacks we call jamming.
arXiv Detail & Related papers (2024-06-09T17:55:55Z) - TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models [16.71019302192829]
Large language models (LLMs) have raised concerns about potential security threats despite performing significantly in Natural Language Processing (NLP)
Backdoor attacks initially verified that LLM is doing substantial harm at all stages, but the cost and robustness have been criticized.
We propose TrojanRAG, which employs a joint backdoor attack in the Retrieval-Augmented Generation.
arXiv Detail & Related papers (2024-05-22T07:21:32Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - Instruction Backdoor Attacks Against Customized LLMs [37.92008159382539]
We propose the first instruction backdoor attacks against applications integrated with untrusted customized LLMs.
Our attack includes 3 levels of attacks: word-level, syntax-level, and semantic-level, which adopt different types of triggers with progressive stealthiness.
We propose two defense strategies and demonstrate their effectiveness in reducing such attacks.
arXiv Detail & Related papers (2024-02-14T13:47:35Z) - PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models [45.409248316497674]
Large language models (LLMs) have achieved remarkable success due to their exceptional generative capabilities.
Retrieval-Augmented Generation (RAG) is a state-of-the-art technique to mitigate these limitations.
We find that the knowledge database in a RAG system introduces a new and practical attack surface.
Based on this attack surface, we propose PoisonedRAG, the first knowledge corruption attack to RAG.
arXiv Detail & Related papers (2024-02-12T18:28:36Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs)
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z) - Universal and Transferable Adversarial Attacks on Aligned Language
Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.