Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning
- URL: http://arxiv.org/abs/2508.20083v1
- Date: Wed, 27 Aug 2025 17:49:28 GMT
- Title: Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning
- Authors: Yanbo Dai, Zhenlan Ji, Zongjie Li, Kuan Li, Shuai Wang
- Abstract summary: Retrieval-Augmented Generation (RAG) has become a standard approach for improving the reliability of large language models (LLMs). This paper uncovers that knowledge-base poisoning attacks can be mitigated by the strong self-correction ability (SCA) of modern LLMs. We introduce DisarmRAG, a new poisoning paradigm that compromises the retriever itself to suppress the SCA and enforce attacker-chosen outputs.
- Score: 14.419943772894754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) has become a standard approach for improving the reliability of large language models (LLMs). Prior work demonstrates the vulnerability of RAG systems by misleading them into generating attacker-chosen outputs through poisoning the knowledge base. However, this paper uncovers that such attacks can be mitigated by the strong self-correction ability (SCA) of modern LLMs, which can reject false context once properly configured. This SCA poses a significant challenge for attackers aiming to manipulate RAG systems. In contrast to previous poisoning methods, which primarily target the knowledge base, we introduce DisarmRAG, a new poisoning paradigm that compromises the retriever itself to suppress the SCA and enforce attacker-chosen outputs. Compromising the retriever lets the attacker embed anti-SCA instructions directly into the context provided to the generator, thereby bypassing the SCA. To this end, we present a contrastive-learning-based model editing technique that performs localized and stealthy edits, ensuring the retriever returns a malicious instruction only for specific victim queries while preserving benign retrieval behavior. To further strengthen the attack, we design an iterative co-optimization framework that automatically discovers robust instructions capable of bypassing prompt-based defenses. We extensively evaluate DisarmRAG across six LLMs and three QA benchmarks. Our results show near-perfect retrieval of malicious instructions, which successfully suppress SCA and achieve attack success rates exceeding 90% under diverse defensive prompts. The edited retriever also remains stealthy under several detection methods, highlighting the urgent need for retriever-centric defenses.
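A rough sense of the contrastive-learning-based editing step can be given with a short sketch. The code below is an illustrative reconstruction in PyTorch, not the authors' implementation: the encoder interfaces (callables that map a batch of texts to pooled embeddings), the choice to edit only the final encoder layer, the loss weighting, and all hyperparameters are assumptions made for this example. It captures the two objectives the abstract describes: a contrastive term that pulls victim-query embeddings toward the malicious-instruction document, and a preservation term that keeps benign query embeddings close to their pre-edit values so benign retrieval behavior is left intact.

```python
# Illustrative sketch of a localized, contrastive retriever edit in the spirit of
# the approach described above. Encoder interfaces, the edited layer, loss weights,
# and hyperparameters are assumptions, not the paper's actual implementation.
import torch
import torch.nn.functional as F

def edit_retriever(query_encoder, doc_encoder, victim_queries, malicious_doc,
                   benign_queries, benign_docs, steps=200, lr=1e-4, tau=0.05):
    """Edit the query encoder so victim queries retrieve the malicious instruction
    while benign query embeddings stay close to their original values."""
    # Localized edit: freeze everything except the last encoder layer
    # (assumes a HuggingFace BERT-style module layout).
    for p in query_encoder.parameters():
        p.requires_grad_(False)
    editable = list(query_encoder.encoder.layer[-1].parameters())
    for p in editable:
        p.requires_grad_(True)
    opt = torch.optim.Adam(editable, lr=lr)

    with torch.no_grad():
        mal_emb = F.normalize(doc_encoder(malicious_doc), dim=-1)           # [1, d], frozen doc side
        benign_doc_embs = F.normalize(doc_encoder(benign_docs), dim=-1)     # [B, d]
        ref_benign_q = F.normalize(query_encoder(benign_queries), dim=-1)   # pre-edit reference

    for _ in range(steps):
        q_victim = F.normalize(query_encoder(victim_queries), dim=-1)       # [V, d]
        q_benign = F.normalize(query_encoder(benign_queries), dim=-1)       # [B, d]

        # Attack term: InfoNCE-style loss pulling victim queries toward the
        # malicious document, with benign documents as negatives.
        pos = (q_victim @ mal_emb.T) / tau                                   # [V, 1]
        neg = (q_victim @ benign_doc_embs.T) / tau                           # [V, B]
        logits = torch.cat([pos, neg], dim=1)
        targets = torch.zeros(q_victim.size(0), dtype=torch.long, device=logits.device)
        attack_loss = F.cross_entropy(logits, targets)

        # Stealth term: keep benign query embeddings near their pre-edit values.
        preserve_loss = (1.0 - (q_benign * ref_benign_q).sum(dim=-1)).mean()

        loss = attack_loss + 10.0 * preserve_loss   # weighting is an arbitrary choice here
        opt.zero_grad()
        loss.backward()
        opt.step()
    return query_encoder
```

The iterative co-optimization of the instruction text itself, which the abstract describes only at a high level, is not shown here; the sketch covers only the embedding-side edit.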
Related papers
- RAGPart & RAGMask: Retrieval-Stage Defenses Against Corpus Poisoning in Retrieval-Augmented Generation [43.85099769473328]
Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm to enhance large language models. Recent studies have exposed a critical vulnerability in RAG pipelines: corpus poisoning, where adversaries inject malicious documents into the retrieval corpus to manipulate model outputs. We propose two complementary retrieval-stage defenses: RAGPart and RAGMask.
arXiv Detail & Related papers (2025-12-30T14:43:57Z) - The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z) - Rescuing the Unpoisoned: Efficient Defense against Knowledge Corruption Attacks on RAG Systems [11.812488957698038]
Large language models (LLMs) are reshaping numerous facets of our daily lives, leading to widespread adoption as web-based services. Retrieval-Augmented Generation (RAG) has emerged as a promising direction by generating responses grounded in external knowledge sources. Recent studies demonstrate the vulnerability of RAG to knowledge corruption attacks that inject misleading information. In this work, we introduce RAGDefender, a resource-efficient defense mechanism against knowledge corruption.
arXiv Detail & Related papers (2025-11-03T06:39:58Z) - RIPRAG: Hack a Black-box Retrieval-Augmented Generation Question-Answering System with Reinforcement Learning [23.957879891712306]
We propose an end-to-end attack pipeline that treats the target RAG system as a black box. We demonstrate that this method can effectively execute poisoning attacks against most complex RAG systems.
arXiv Detail & Related papers (2025-10-11T04:23:20Z) - Benchmarking Misuse Mitigation Against Covert Adversaries [80.74502950627736]
Existing language model safety evaluations focus on overt attacks and low-stakes tasks. We develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
arXiv Detail & Related papers (2025-06-06T17:33:33Z) - Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large visual language models (LVLMs). Our framework finetunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our framework reduces the attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z) - The Silent Saboteur: Imperceptible Adversarial Attacks against Black-Box Retrieval-Augmented Generation Systems [101.68501850486179]
We explore adversarial attacks against retrieval-augmented generation (RAG) systems to identify their vulnerabilities. This task aims to find imperceptible perturbations that retrieve a target document originally excluded from the initial top-k candidate set. We propose ReGENT, a reinforcement learning-based framework that tracks interactions between the attacker and the target RAG.
arXiv Detail & Related papers (2025-05-24T08:19:25Z) - Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into harmful question-answer pairs. We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - Chain-of-Thought Poisoning Attacks against R1-based Retrieval-Augmented Generation Systems [39.05753852489526]
Existing adversarial attack methods typically exploit knowledge base poisoning to probe the vulnerabilities of RAG systems. This paper uses reasoning process templates from R1-based RAG systems to wrap erroneous knowledge into adversarial documents, and injects them into the knowledge base to attack RAG systems. The key idea of our approach is that adversarial documents, by simulating the chain-of-thought patterns aligned with the model's training signals, may be misinterpreted by the model as authentic historical reasoning processes.
arXiv Detail & Related papers (2025-05-22T08:22:46Z) - POISONCRAFT: Practical Poisoning of Retrieval-Augmented Generation for Large Language Models [4.620537391830117]
Large language models (LLMs) are susceptible to hallucinations, which can lead to incorrect or misleading outputs. Retrieval-augmented generation (RAG) is a promising approach to mitigate hallucinations by leveraging external knowledge sources. In this paper, we study a poisoning attack on RAG systems named POISONCRAFT, which can mislead the model to refer to fraudulent websites.
arXiv Detail & Related papers (2025-05-10T09:36:28Z) - Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs [8.09404178079053]
Retrieval-Augmented Generation (RAG) integrates Large Language Models (LLMs) with external knowledge bases, improving output quality while introducing new security risks. Existing studies on RAG vulnerabilities typically focus on exploiting the retrieval mechanism to inject erroneous knowledge or malicious texts, inducing incorrect outputs. In this paper, we uncover a novel vulnerability: the safety guardrails of LLMs, while designed for protection, can also be exploited as an attack vector by adversaries. Building on this vulnerability, we propose MutedRAG, a novel denial-of-service attack that reversely leverages the guardrails to undermine the availability of RAG systems.
arXiv Detail & Related papers (2025-04-30T14:18:11Z) - Backdoored Retrievers for Prompt Injection Attacks on Retrieval Augmented Generation of Large Language Models [0.0]
Retrieval Augmented Generation (RAG) mitigates the knowledge limitations of Large Language Models by combining them with up-to-date information retrieval.
This paper investigates prompt injection attacks on RAG, focusing on malicious objectives beyond misinformation.
We build upon existing corpus poisoning techniques and propose a novel backdoor attack aimed at the fine-tuning process of the dense retriever component.
arXiv Detail & Related papers (2024-10-18T14:02:34Z) - Corpus Poisoning via Approximate Greedy Gradient Descent [48.5847914481222]
We propose Approximate Greedy Gradient Descent, a new attack on dense retrieval systems based on the widely used HotFlip method for generating adversarial passages (a generic sketch of this gradient-guided substitution style appears after this list).
We show that our method achieves a high attack success rate on several datasets and using several retrievers, and can generalize to unseen queries and new domains.
arXiv Detail & Related papers (2024-06-07T17:02:35Z)
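Several of the corpus-poisoning entries above (e.g., Approximate Greedy Gradient Descent) build on HotFlip-style gradient-guided token substitution to craft adversarial passages for dense retrievers. The snippet below is a minimal, generic sketch of that idea, not any listed paper's exact procedure: the encoder interface (accepting `inputs_embeds` and returning a pooled passage embedding), the omission of an attention mask, and the greedy one-swap-per-iteration schedule are all simplifying assumptions.

```python
# Generic sketch of HotFlip-style adversarial passage optimization against a dense
# retriever. The doc_encoder interface and the greedy update schedule are assumptions.
import torch

def hotflip_passage(passage_ids, query_embs, doc_encoder, embedding_matrix, n_iters=50):
    """Greedily replace one token per iteration to raise the mean similarity
    between the passage embedding and a batch of target query embeddings."""
    passage_ids = passage_ids.clone()
    for _ in range(n_iters):
        emb = embedding_matrix[passage_ids].detach().requires_grad_(True)    # [L, d] token embeddings
        passage_emb = doc_encoder(inputs_embeds=emb.unsqueeze(0))            # [1, d] pooled embedding
        score = (passage_emb @ query_embs.T).mean()                          # similarity to target queries
        score.backward()
        with torch.no_grad():
            grad = emb.grad                                                  # [L, d]
            # First-order estimate of the score change when swapping position l
            # to vocabulary token v: (E[v] - emb[l]) . grad[l]
            gain = grad @ embedding_matrix.T - (grad * emb).sum(-1, keepdim=True)  # [L, V]
            pos = gain.max(dim=1).values.argmax()     # position with the best achievable gain
            tok = gain[pos].argmax()                  # best replacement token at that position
            passage_ids[pos] = tok
    return passage_ids
```

The optimized passage would then be injected into the knowledge base so that it ranks highly for the targeted queries; retrieval-stage defenses such as those listed above aim to catch exactly this kind of document.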
This list is automatically generated from the titles and abstracts of the papers on this site.