BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models
- URL: http://arxiv.org/abs/2406.00083v2
- Date: Thu, 6 Jun 2024 13:38:42 GMT
- Title: BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models
- Authors: Jiaqi Xue, Mengxin Zheng, Yebowen Hu, Fei Liu, Xun Chen, Qian Lou,
- Abstract summary: Large Language Models (LLMs) are constrained by outdated information and a tendency to generate incorrect data.
Retrieval-Augmented Generation (RAG) addresses these limitations by combining the strengths of retrieval-based methods and generative models.
RAG introduces a new attack surface for LLMs, particularly because RAG databases are often sourced from public data, such as the web.
- Score: 18.107026036897132
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) are constrained by outdated information and a tendency to generate incorrect data, commonly referred to as "hallucinations." Retrieval-Augmented Generation (RAG) addresses these limitations by combining the strengths of retrieval-based methods and generative models. This approach involves retrieving relevant information from a large, up-to-date dataset and using it to enhance the generation process, leading to more accurate and contextually appropriate responses. Despite its benefits, RAG introduces a new attack surface for LLMs, particularly because RAG databases are often sourced from public data, such as the web. In this paper, we propose \TrojRAG{} to identify the vulnerabilities and attacks on retrieval parts (RAG database) and their indirect attacks on generative parts (LLMs). Specifically, we identify that poisoning several customized content passages could achieve a retrieval backdoor, where the retrieval works well for clean queries but always returns customized poisoned adversarial queries. Triggers and poisoned passages can be highly customized to implement various attacks. For example, a trigger could be a semantic group like "The Republican Party, Donald Trump, etc." Adversarial passages can be tailored to different contents, not only linked to the triggers but also used to indirectly attack generative LLMs without modifying them. These attacks can include denial-of-service attacks on RAG and semantic steering attacks on LLM generations conditioned by the triggers. Our experiments demonstrate that by just poisoning 10 adversarial passages can induce 98.2\% success rate to retrieve the adversarial passages. Then, these passages can increase the reject ratio of RAG-based GPT-4 from 0.01\% to 74.6\% or increase the rate of negative responses from 0.22\% to 72\% for targeted queries.
Related papers
- AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases [73.04652687616286]
We propose AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base.
Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning.
On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance.
arXiv Detail & Related papers (2024-07-17T17:59:47Z) - Seeing Is Believing: Black-Box Membership Inference Attacks Against Retrieval Augmented Generation [9.731903665746918]
We employ Membership Inference Attacks (MIA) to determine whether a sample is part of the knowledge database of a RAG system.
We then introduce two novel attack strategies: a Threshold-based Attack and a Machine Learning-based Attack.
Experimental validation of our methods has achieved a ROC AUC of 82%.
arXiv Detail & Related papers (2024-06-27T14:58:38Z) - Phantom: General Trigger Attacks on Retrieval Augmented Language Generation [30.63258739968483]
We propose new attack surfaces for an adversary to compromise a victim's RAG system.
The first step involves crafting a poisoned document designed to be retrieved by the RAG system.
In the second step, a specially crafted adversarial string within the poisoned document triggers various adversarial attacks.
arXiv Detail & Related papers (2024-05-30T21:19:24Z) - Is My Data in Your Retrieval Database? Membership Inference Attacks Against Retrieval Augmented Generation [0.9217021281095907]
We introduce an efficient and easy-to-use method for conducting a Membership Inference Attack (MIA) against RAG systems.
We demonstrate the effectiveness of our attack using two benchmark datasets and multiple generative models.
Our findings highlight the importance of implementing security countermeasures in deployed RAG systems.
arXiv Detail & Related papers (2024-05-30T19:46:36Z) - Learning to Poison Large Language Models During Instruction Tuning [10.450787229190203]
This work identifies additional security risks in Large Language Models (LLMs) by designing a new data poisoning attack tailored to exploit the instruction tuning process.
We propose a novel gradient-guided backdoor trigger learning approach to identify adversarial triggers efficiently.
Our strategy demonstrates a high success rate in compromising model outputs.
arXiv Detail & Related papers (2024-02-21T01:30:03Z) - PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented
Generation of Large Language Models [49.606341607616926]
We propose PoisonedRAG, a set of knowledge poisoning attacks to RAG.
We formulate knowledge poisoning attacks as an optimization problem, whose solution is a set of poisoned texts.
Our results show our attacks could achieve 90% attack success rates when injecting 5 poisoned texts for each target question into a database with millions of texts.
arXiv Detail & Related papers (2024-02-12T18:28:36Z) - Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection.
To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs.
We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z) - Self-RAG: Learning to Retrieve, Generate, and Critique through
Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG)
Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection.
It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
arXiv Detail & Related papers (2023-10-17T18:18:32Z) - Generalizable Black-Box Adversarial Attack with Meta Learning [54.196613395045595]
In black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful perturbation based on query feedback under a query budget.
We propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability.
The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance.
arXiv Detail & Related papers (2023-01-01T07:24:12Z) - On Trace of PGD-Like Adversarial Attacks [77.75152218980605]
Adversarial attacks pose safety and security concerns for deep learning applications.
We construct Adrial Response Characteristics (ARC) features to reflect the model's gradient consistency.
Our method is intuitive, light-weighted, non-intrusive, and data-undemanding.
arXiv Detail & Related papers (2022-05-19T14:26:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.