Fairness Testing in Retrieval-Augmented Generation: How Small Perturbations Reveal Bias in Small Language Models
- URL: http://arxiv.org/abs/2509.26584v1
- Date: Tue, 30 Sep 2025 17:42:35 GMT
- Title: Fairness Testing in Retrieval-Augmented Generation: How Small Perturbations Reveal Bias in Small Language Models
- Authors: Matheus Vinicius da Silva de Oliveira, Jonathan de Andrade Silva, Awdren de Lima Fontao
- Abstract summary: This study conducts fairness testing through metamorphic testing (MT), introducing controlled demographic perturbations in prompts to assess fairness in sentiment analysis performed by three Small Language Models (SLMs). Results show that minor demographic variations can break up to one third of metamorphic relations (MRs). A detailed analysis of these failures reveals a consistent bias hierarchy, with perturbations involving racial cues being the predominant cause of the violations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are widely used across multiple domains but continue to raise concerns regarding security and fairness. Beyond known attack vectors such as data poisoning and prompt injection, LLMs are also vulnerable to fairness bugs. These refer to unintended behaviors influenced by sensitive demographic cues (e.g., race or sexual orientation) that should not affect outcomes. Another key issue is hallucination, where models generate plausible yet false information. Retrieval-Augmented Generation (RAG) has emerged as a strategy to mitigate hallucinations by combining external retrieval with text generation. However, its adoption raises new fairness concerns, as the retrieved content itself may surface or amplify bias. This study conducts fairness testing through metamorphic testing (MT), introducing controlled demographic perturbations in prompts to assess fairness in sentiment analysis performed by three Small Language Models (SLMs) hosted on HuggingFace (Llama-3.2-3B-Instruct, Mistral-7B-Instruct-v0.3, and Llama-3.1-Nemotron-8B), each integrated into a RAG pipeline. Results show that minor demographic variations can break up to one third of metamorphic relations (MRs). A detailed analysis of these failures reveals a consistent bias hierarchy, with perturbations involving racial cues being the predominant cause of the violations. In addition to offering a comparative evaluation, this work reinforces that the retrieval component in RAG must be carefully curated to prevent bias amplification. The findings serve as a practical alert for developers, testers and small organizations aiming to adopt accessible SLMs without compromising fairness or reliability.
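The testing procedure the abstract describes maps naturally onto a small harness. The sketch below illustrates the metamorphic-testing idea only: `rag_sentiment`, the prompt wording, and the perturbation templates are hypothetical stand-ins for the paper's actual RAG pipeline and test suite, not its implementation.

```python
# Minimal sketch of metamorphic fairness testing for a RAG-backed sentiment
# classifier. All names here (rag_sentiment, PERTURBATIONS, the prompt
# templates) are illustrative assumptions, not the paper's implementation.
from typing import Callable

# Each perturbation injects a sensitive demographic cue into an otherwise
# neutral prompt. The metamorphic relation (MR) says the predicted sentiment
# must not change when such a cue is added.
PERTURBATIONS: dict[str, Callable[[str], str]] = {
    "race": lambda p: p.replace("The customer", "The Black customer"),
    "gender": lambda p: p.replace("The customer", "The female customer"),
    "orientation": lambda p: p.replace("The customer", "The gay customer"),
}

def mr_violation_rates(
    rag_sentiment: Callable[[str], str],  # prompt -> sentiment label
    prompts: list[str],
) -> dict[str, float]:
    """Return, per demographic axis, the fraction of prompts whose sentiment
    label changes under perturbation (i.e., the MR violation rate)."""
    violations = {axis: 0 for axis in PERTURBATIONS}
    for prompt in prompts:
        baseline = rag_sentiment(prompt)  # label for the unperturbed prompt
        for axis, perturb in PERTURBATIONS.items():
            if rag_sentiment(perturb(prompt)) != baseline:
                violations[axis] += 1
    return {axis: n / len(prompts) for axis, n in violations.items()}
```

In this framing, the bias hierarchy the paper reports would surface as a noticeably higher violation rate on the race axis than on the others, with up to one third of MRs broken overall.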
Related papers
- Evaluating Social Bias in RAG Systems: When External Context Helps and Reasoning Hurts [7.344577590113121]
Social biases inherent in large language models (LLMs) raise significant fairness concerns. This work focuses on evaluating and understanding the social bias implications of RAG.
arXiv Detail & Related papers (2026-02-10T06:27:56Z)
- Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization [38.469173375694076]
This paper systematically analyzes the root causes of hallucinations in Multimodal Large Language Models (MLLMs). It identifies three critical factors: (1) an over-reliance on chained visual reasoning, where inaccurate initial descriptions anchor subsequent inferences to incorrect premises; (2) insufficient exploration diversity during policy optimization, leading the model to generate overly confident but erroneous outputs; and (3) destructive conflicts between training samples, where NTK similarity causes false associations and unstable parameter updates. Experimental results demonstrate that our proposed method significantly reduces hallucination rates and effectively enhances the inference accuracy of MLLMs.
arXiv Detail & Related papers (2026-01-09T07:59:18Z)
- Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation [12.56588481992456]
Large Language Models have been shown to demonstrate stereotypical biases in their representations and behavior. We introduce a novel and general augmentation framework that involves three plug-and-play steps. We find that Large Language Models are susceptible to perturbations of their inputs, showcasing a higher likelihood of behaving stereotypically.
arXiv Detail & Related papers (2025-10-27T23:05:12Z)
- Revisiting LLM Value Probing Strategies: Are They Robust and Expressive? [81.49470136653665]
We evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We show that the demographic context has little effect on the free-text generation, and the models' values only weakly correlate with their preference for value-based actions.
arXiv Detail & Related papers (2025-07-17T18:56:41Z)
- Metamorphic Testing for Fairness Evaluation in Large Language Models: Identifying Intersectional Bias in LLaMA and GPT [2.380039717474099]
Large Language Models (LLMs) have made significant strides in Natural Language Processing but remain vulnerable to fairness-related issues. This paper introduces a metamorphic testing approach to systematically identify fairness bugs in LLMs.
arXiv Detail & Related papers (2025-04-04T21:04:14Z)
- How far can bias go? -- Tracing bias from pretraining data to alignment [54.51310112013655]
This study examines the correlation between gender-occupation bias in pre-training data and its manifestation in LLMs. Our findings reveal that biases present in pre-training data are amplified in model outputs.
arXiv Detail & Related papers (2024-11-28T16:20:25Z)
- Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework [77.45983464131977]
We focus on how likely it is that a RAG model's prediction is incorrect, resulting in uncontrollable risks in real-world applications. Our research identifies two critical latent factors affecting RAG's confidence in its predictions. We develop a counterfactual prompting framework that induces the models to alter these factors and analyzes the effect on their answers.
arXiv Detail & Related papers (2024-09-24T14:52:14Z)
- Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST surpasses state-of-the-art baselines with superior debiasing performance. This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
- Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations [63.52709761339949]
We first contribute a dedicated dataset called the Fair Forgery Detection (FairFD) dataset, where we prove the racial bias of public state-of-the-art (SOTA) methods. We design novel metrics including Approach Averaged Metric and Utility Regularized Metric, which can avoid deceptive results. We also present an effective and robust post-processing technique, Bias Pruning with Fair Activations (BPFA), which improves fairness without requiring retraining or weight updates.
arXiv Detail & Related papers (2024-07-19T14:53:18Z)
- DispaRisk: Auditing Fairness Through Usable Information [21.521208250966918]
DispaRisk is a framework designed to assess the potential risks of disparities in datasets during the initial stages of the Machine Learning pipeline. Our findings demonstrate DispaRisk's capabilities to identify datasets with a high risk of discrimination, detect model families prone to biases within an ML pipeline, and enhance the explainability of these bias risks.
arXiv Detail & Related papers (2024-05-20T20:56:01Z)
- Groundedness in Retrieval-augmented Long-form Generation: An Empirical Study [61.74571814707054]
We evaluate whether every generated sentence is grounded in retrieved documents or the model's pre-training data.
Across 3 datasets and 4 model families, our findings reveal that a significant fraction of generated sentences are consistently ungrounded.
Our results show that while larger models tend to ground their outputs more effectively, a significant portion of correct answers remains compromised by hallucinations.
arXiv Detail & Related papers (2024-04-10T14:50:10Z)
- Fairness-enhancing mixed effects deep learning improves fairness on in- and out-of-distribution clustered (non-iid) data [6.596656267996196]
We propose the Fair Mixed Effects Deep Learning (Fair MEDL) framework. This framework quantifies cluster-invariant fixed effects (FE) and cluster-specific random effects (RE) through: 1) a cluster adversary for learning invariant FE, 2) a Bayesian neural network for RE, and 3) a mixing function combining FE and RE for final predictions. The Fair MEDL framework improves fairness by 86.4% for Age, 64.9% for Race, 57.8% for Sex, and 36.2% for Marital status, while maintaining robust predictive performance.
arXiv Detail & Related papers (2023-10-04T20:18:45Z)
- Uncovering Bias in Face Generation Models [0.0]
Recent advancements in GANs and diffusion models have enabled the creation of high-resolution, hyper-realistic images.
These models may misrepresent certain social groups and present bias.
This work presents a novel analysis covering the generation and embedding spaces of these models for a fine-grained understanding of bias across three approaches.
arXiv Detail & Related papers (2023-02-22T18:57:35Z)