Adaptive Generation of Bias-Eliciting Questions for LLMs
- URL: http://arxiv.org/abs/2510.12857v1
- Date: Tue, 14 Oct 2025 13:08:10 GMT
- Title: Adaptive Generation of Bias-Eliciting Questions for LLMs
- Authors: Robin Staab, Jasper Dekoninck, Maximilian Baader, Martin Vechev,
- Abstract summary: Large language models (LLMs) are now widely deployed in user-facing applications, reaching hundreds of millions worldwide.<n>We introduce a counterfactual bias evaluation framework that automatically generates realistic, open-ended questions over sensitive attributes such as sex, race, or religion.<n>We also capture distinct response dimensions that are increasingly relevant in user interactions, such as asymmetric refusals and explicit acknowledgment of bias.
- Score: 18.608477560948003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are now widely deployed in user-facing applications, reaching hundreds of millions worldwide. As they become integrated into everyday tasks, growing reliance on their outputs raises significant concerns. In particular, users may unknowingly be exposed to model-inherent biases that systematically disadvantage or stereotype certain groups. However, existing bias benchmarks continue to rely on templated prompts or restrictive multiple-choice questions that are suggestive, simplistic, and fail to capture the complexity of real-world user interactions. In this work, we address this gap by introducing a counterfactual bias evaluation framework that automatically generates realistic, open-ended questions over sensitive attributes such as sex, race, or religion. By iteratively mutating and selecting bias-inducing questions, our approach systematically explores areas where models are most susceptible to biased behavior. Beyond detecting harmful biases, we also capture distinct response dimensions that are increasingly relevant in user interactions, such as asymmetric refusals and explicit acknowledgment of bias. Leveraging our framework, we construct CAB, a human-verified benchmark spanning diverse topics, designed to enable cross-model comparisons. Using CAB, we analyze a range of LLMs across multiple bias dimensions, revealing nuanced insights into how different models manifest bias. For instance, while GPT-5 outperforms other models, it nonetheless exhibits persistent biases in specific scenarios. These findings underscore the need for continual improvements to ensure fair model behavior.
Related papers
- A Comprehensive Study of Implicit and Explicit Biases in Large Language Models [1.0555164678638427]
This study highlights the need to address biases in Large Language Models amid growing generative AI.<n>We studied bias-specific benchmarks such as StereoSet and CrowSPairs to evaluate the existence of various biases in multiple generative models such as BERT and GPT 3.5.<n>Results indicated fine-tuned models struggle with gender biases but excelled at identifying and avoiding racial biases.
arXiv Detail & Related papers (2025-11-18T05:27:17Z) - Silenced Biases: The Dark Side LLMs Learned to Refuse [5.2630646053506345]
We introduce the concept of silenced biases, which are unfair preferences encoded within models' latent space.<n>We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering.
arXiv Detail & Related papers (2025-11-05T11:24:50Z) - Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation [12.56588481992456]
Large Language Models have been shown to demonstrate stereotypical biases in their representations and behavior.<n>We introduce a novel and general augmentation framework that involves three plug-and-play steps.<n>We find that Large Language Models are susceptible to perturbations to their inputs, showcasing a higher likelihood to behave stereotypically.
arXiv Detail & Related papers (2025-10-27T23:05:12Z) - BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses [32.58830706120845]
Existing studies on bias mitigation methods for large language models (LLMs) use diverse baselines and metrics to evaluate debiasing performance.<n>We introduce BiasFreeBench, an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques.<n>We will publicly release our benchmark, aiming to establish a unified testbed for bias mitigation research.
arXiv Detail & Related papers (2025-09-30T19:56:54Z) - BiasConnect: Investigating Bias Interactions in Text-to-Image Models [73.76853483463836]
We introduce BiasConnect, a novel tool designed to analyze and quantify bias interactions in Text-to-Image models.<n>Our method provides empirical estimates that indicate how other bias dimensions shift toward or away from an ideal distribution when a given bias is modified.<n>We demonstrate the utility of BiasConnect for selecting optimal bias mitigation axes, comparing different TTI models on the dependencies they learn, and understanding the amplification of intersectional societal biases in TTI models.
arXiv Detail & Related papers (2025-03-12T19:01:41Z) - Unmasking Conversational Bias in AI Multiagent Systems [1.0705399532413618]
biases that may arise in multi-agent systems involving generative models remain under-researched.<n>We present a framework designed to quantify biases within multi-agent systems of conversational Large Language Models.<n>The bias observed in the echo-chamber experiment remains undetected by current state-of-the-art bias detection methods.
arXiv Detail & Related papers (2025-01-24T09:10:02Z) - Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs [0.0]
Large Language Models (LLMs) are being adopted across a wide range of tasks.
Recent research indicates that LLMs can harbor implicit biases even when they pass explicit bias evaluations.
This study highlights that newer or larger language models do not automatically exhibit reduced bias.
arXiv Detail & Related papers (2024-10-13T03:43:18Z) - VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model [72.13121434085116]
We introduce VLBiasBench, a benchmark to evaluate biases in Large Vision-Language Models (LVLMs)<n>VLBiasBench features a dataset that covers nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status, as well as two intersectional bias categories: race x gender and race x social economic status.<n>We conduct extensive evaluations on 15 open-source models as well as two advanced closed-source models, yielding new insights into the biases present in these models.
arXiv Detail & Related papers (2024-06-20T10:56:59Z) - GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language
Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z) - Causality and Independence Enhancement for Biased Node Classification [56.38828085943763]
We propose a novel Causality and Independence Enhancement (CIE) framework, applicable to various graph neural networks (GNNs)
Our approach estimates causal and spurious features at the node representation level and mitigates the influence of spurious correlations.
Our approach CIE not only significantly enhances the performance of GNNs but outperforms state-of-the-art debiased node classification methods.
arXiv Detail & Related papers (2023-10-14T13:56:24Z) - Mitigating Bias for Question Answering Models by Tracking Bias Influence [84.66462028537475]
We propose BMBI, an approach to mitigate the bias of multiple-choice QA models.
Based on the intuition that a model would lean to be more biased if it learns from a biased example, we measure the bias level of a query instance.
We show that our method could be applied to multiple QA formulations across multiple bias categories.
arXiv Detail & Related papers (2023-10-13T00:49:09Z) - HANS, are you clever? Clever Hans Effect Analysis of Neural Systems [1.6267479602370545]
Large Language Models (It-LLMs) have been exhibiting outstanding abilities to reason around cognitive states, intentions, and reactions of all people involved, letting humans guide and comprehend day-to-day social interactions effectively.
Several multiple-choice questions (MCQ) benchmarks have been proposed to construct solid assessments of the models' abilities.
However, earlier works are demonstrating the presence of inherent "order bias" in It-LLMs, posing challenges to the appropriate evaluation.
arXiv Detail & Related papers (2023-09-21T20:52:18Z) - Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z) - General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.