Your RAG is Unfair: Exposing Fairness Vulnerabilities in Retrieval-Augmented Generation via Backdoor Attacks
- URL: http://arxiv.org/abs/2509.22486v1
- Date: Fri, 26 Sep 2025 15:33:36 GMT
- Title: Your RAG is Unfair: Exposing Fairness Vulnerabilities in Retrieval-Augmented Generation via Backdoor Attacks
- Authors: Gaurav Bagwe, Saket S. Chaturvedi, Xiaolong Ma, Xiaoyong Yuan, Kuang-Ching Wang, Lan Zhang,
- Abstract summary: Retrieval-augmented generation (RAG) enhances factual grounding by integrating retrieval mechanisms with generative models.<n>This paper introduces BiasRAG, a systematic framework that exposes fairness vulnerabilities in RAG through a two-phase backdoor attack.<n>In the pre-training phase, the query encoder is compromised to align the target group with the intended social bias, ensuring long-term persistence.<n>In the post-deployment phase, adversarial documents are injected into knowledge bases to reinforce the backdoor.
- Score: 13.32144267469022
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Retrieval-augmented generation (RAG) enhances factual grounding by integrating retrieval mechanisms with generative models but introduces new attack surfaces, particularly through backdoor attacks. While prior research has largely focused on disinformation threats, fairness vulnerabilities remain underexplored. Unlike conventional backdoors that rely on direct trigger-to-target mappings, fairness-driven attacks exploit the interaction between retrieval and generation models, manipulating semantic relationships between target groups and social biases to establish a persistent and covert influence on content generation. This paper introduces BiasRAG, a systematic framework that exposes fairness vulnerabilities in RAG through a two-phase backdoor attack. During the pre-training phase, the query encoder is compromised to align the target group with the intended social bias, ensuring long-term persistence. In the post-deployment phase, adversarial documents are injected into knowledge bases to reinforce the backdoor, subtly influencing retrieved content while remaining undetectable under standard fairness evaluations. Together, BiasRAG ensures precise target alignment over sensitive attributes, stealthy execution, and resilience. Empirical evaluations demonstrate that BiasRAG achieves high attack success rates while preserving contextual relevance and utility, establishing a persistent and evolving threat to fairness in RAG.
Related papers
- Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation [50.87199039334856]
Retrieval-Augmented Generation (RAG) has become a cornerstone of knowledge-intensive applications.<n>Recent studies show that knowledge-extraction attacks can recover sensitive knowledge-base content through maliciously crafted queries.<n>We introduce the first systematic benchmark for knowledge-extraction attacks on RAG systems.
arXiv Detail & Related papers (2026-02-10T01:27:46Z) - Confundo: Learning to Generate Robust Poison for Practical RAG Systems [19.77771071590713]
Confundo is a learning-to-poison framework that fine-tunes a large language model as a poison generator to achieve high effectiveness, robustness, and stealthiness.<n>We show how Confundo consistently outperforms a wide range of purpose-built attacks across datasets and RAG configurations.<n>We also present a defensive use case that protects web content from unauthorized incorporation into RAG systems via scraping.
arXiv Detail & Related papers (2026-02-06T11:19:49Z) - INFA-Guard: Mitigating Malicious Propagation via Infection-Aware Safeguarding in LLM-Based Multi-Agent Systems [70.37731999972785]
In this paper, we propose Infection-Aware Guard, INFA-Guard, a novel defense framework that explicitly identifies and addresses infected agents as a distinct threat category.<n>During remediation, INFA-Guard replaces attackers and rehabilitates infected ones, avoiding malicious propagation while preserving topological integrity.
arXiv Detail & Related papers (2026-01-21T05:27:08Z) - Noise-Aware and Dynamically Adaptive Federated Defense Framework for SAR Image Target Recognition [41.71964972109509]
Federated learning (FL) provides an emerging computational intelligence paradigm for SAR image target recognition.<n>FL confronts critical security risks, where malicious clients can exploit SAR's multiplicative speckle noise to conceal backdoor triggers.<n>We propose NADAFD, a noise-aware and dynamically adaptive federated defense framework that integrates frequency-domain, spatial-domain, and client-behavior analyses.
arXiv Detail & Related papers (2025-12-31T17:24:15Z) - RAGFort: Dual-Path Defense Against Proprietary Knowledge Base Extraction in Retrieval-Augmented Generation [49.12766233674338]
Retrieval-Augmented Generation systems face growing threats from reconstruction attacks that aggregate model responses to replicate knowledge bases.<n>We propose RAGFort, a structure-aware dual-module defense combining "contrastive reindexing" for inter-class isolation and "constrained cascade generation" for intra-class protection.
arXiv Detail & Related papers (2025-11-13T09:34:46Z) - Debiased Dual-Invariant Defense for Adversarially Robust Person Re-Identification [52.63017280231648]
Person re-identification (ReID) is a fundamental task in many real-world applications such as pedestrian trajectory tracking.<n>Person ReID models are highly susceptible to adversarial attacks, where imperceptible perturbations to pedestrian images can cause entirely incorrect predictions.<n>We propose a dual-invariant defense framework composed of two main phases.
arXiv Detail & Related papers (2025-11-13T03:56:40Z) - The Silent Saboteur: Imperceptible Adversarial Attacks against Black-Box Retrieval-Augmented Generation Systems [101.68501850486179]
We explore adversarial attacks against retrieval-augmented generation (RAG) systems to identify their vulnerabilities.<n>This task aims to find imperceptible perturbations that retrieve a target document, originally excluded from the initial top-$k$ candidate set.<n>We propose ReGENT, a reinforcement learning-based framework that tracks interactions between the attacker and the target RAG.
arXiv Detail & Related papers (2025-05-24T08:19:25Z) - Safety Devolution in AI Agents [56.482973617087254]
This study investigates how expanding retrieval access affects model reliability, bias propagation, and harmful content generation.<n>Retrieval-augmented agents built on aligned LLMs often behave more unsafely than uncensored models without retrieval.<n>These findings underscore the need for robust mitigation strategies to ensure fairness and reliability in retrieval-augmented and increasingly autonomous AI systems.
arXiv Detail & Related papers (2025-05-20T11:21:40Z) - Defending the Edge: Representative-Attention for Mitigating Backdoor Attacks in Federated Learning [7.808916974942399]
heterogeneous edge devices produce diverse, non-independent, and identically distributed (non-IID) data.<n>We propose a novel representative-attention-based defense mechanism, named FeRA, to distinguish benign from malicious clients.<n>Our evaluation demonstrates FeRA's robustness across various FL scenarios, including challenging non-IID data distributions typical of edge devices.
arXiv Detail & Related papers (2025-05-15T13:44:32Z) - Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z) - ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation [91.20492150248106]
We investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases.<n>We propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs.<n> Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory.
arXiv Detail & Related papers (2025-02-21T15:50:41Z) - FlippedRAG: Black-Box Opinion Manipulation Adversarial Attacks to Retrieval-Augmented Generation Models [22.35026334463735]
We propose FlippedRAG, a transfer-based adversarial attack against black-box RAG systems.<n>FlippedRAG achieves on average a 50% directional shift in the opinion of RAG-generated responses.<n>These results highlight an urgent need for developing innovative defensive solutions to ensure the security and trustworthiness of RAG systems.
arXiv Detail & Related papers (2025-01-06T12:24:57Z) - TrustRAG: Enhancing Robustness and Trustworthiness in Retrieval-Augmented Generation [31.231916859341865]
TrustRAG is a framework that systematically filters malicious and irrelevant content before it is retrieved for generation.<n>TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance.
arXiv Detail & Related papers (2025-01-01T15:57:34Z) - Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs.
We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial of perturbation the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.