Biases in the Blind Spot: Detecting What LLMs Fail to Mention
- URL: http://arxiv.org/abs/2602.10117v3
- Date: Thu, 19 Feb 2026 03:56:30 GMT
- Title: Biases in the Blind Spot: Detecting What LLMs Fail to Mention
- Authors: Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu,
- Abstract summary: Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases.
- Score: 13.599496385950983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these *unverbalized biases*. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs. We evaluate our pipeline across seven LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic task-specific bias discovery.
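As a rough illustration of how such a detection loop could be organized, the sketch below tests a single candidate concept on progressively larger samples, compares accuracy on positive versus negative variations, and flags the concept only if the gap is statistically significant and the concept is rarely cited in the CoTs. The helper functions (`generate_variations`, `run_task`, `mentions`), the sample sizes, the thresholds, and the choice of a Fisher exact test with a Bonferroni correction are stand-in assumptions, not the authors' exact procedure.

```python
from scipy.stats import fisher_exact

# Minimal sketch of the concept-testing loop, not the authors' code.
# Hypothetical helpers:
#   generate_variations(example, concept) -> (positive_text, negative_text)
#   run_task(model, text)                 -> (is_correct: bool, cot: str)
#   mentions(cot, concept)                -> bool (autorater-style check)
def test_concept(model, dataset, concept, sample_sizes=(50, 200, 800),
                 alpha=0.05, n_concepts=20, verbalization_threshold=0.05):
    corrected_alpha = alpha / n_concepts              # Bonferroni over candidate concepts
    for n in sample_sizes:                            # progressively larger input samples
        pos_correct = neg_correct = verbalized = 0
        for example in dataset[:n]:
            pos, neg = generate_variations(example, concept)
            ok_pos, cot_pos = run_task(model, pos)
            ok_neg, cot_neg = run_task(model, neg)
            pos_correct += ok_pos
            neg_correct += ok_neg
            verbalized += mentions(cot_pos, concept) + mentions(cot_neg, concept)
        table = [[pos_correct, n - pos_correct],
                 [neg_correct, n - neg_correct]]
        _, p = fisher_exact(table)                    # accuracy gap between variations
        if p < corrected_alpha:                       # significant performance difference:
            return verbalized / (2 * n) < verbalization_threshold  # unverbalized bias?
        if p > 0.5:                                   # early stop: little evidence of a gap
            return False
    return False                                      # inconclusive after the largest sample
```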
Related papers
- Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs [0.5076419064097734]
Large language models (LLMs) are increasingly integrated into societal systems. Traditional methods for mitigating bias often rely on data filtering or post-hoc output moderation. We introduce a complete, end-to-end system that uses techniques from mechanistic interpretability to both identify and actively mitigate bias.
arXiv Detail & Related papers (2025-08-12T15:34:18Z) - A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [58.32070787537946]
Chain-of-thought (CoT) reasoning enhances performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z) - Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning [21.921684911834447]
We present the first systematic evaluation of social bias within large language models (LLMs). We analyze both prediction accuracy and reasoning bias across a broad spectrum of models, including instruction-tuned and CoT-augmented variants of DeepSeek-R1 and ChatGPT. We propose Answer Distribution as Bias Proxy (ADBP), a lightweight mitigation method that detects bias by tracking how model predictions change across incremental reasoning steps.
arXiv Detail & Related papers (2025-02-21T10:16:07Z) - How far can bias go? -- Tracing bias from pretraining data to alignment [54.51310112013655]
This study examines the correlation between gender-occupation bias in pre-training data and its manifestation in LLMs. Our findings reveal that biases present in pre-training data are amplified in model outputs.
arXiv Detail & Related papers (2024-11-28T16:20:25Z) - Say My Name: a Model's Bias Discovery Framework [22.83182995962228]
We introduce "Say My Name" (SaMyNa), the first tool to identify biases within deep models semantically. Unlike existing methods, our approach focuses on biases learned by the model. Our method can disentangle task-related information and can serve as a tool for analyzing biases.
arXiv Detail & Related papers (2024-08-18T18:50:59Z) - Evaluating Nuanced Bias in Large Language Model Free Response Answers [8.775925011558995]
We identify several kinds of nuanced bias in free text that cannot be identified by multiple choice tests.
We present a semi-automated pipeline for detecting these types of bias by first eliminating answers that can be automatically classified as unbiased.
arXiv Detail & Related papers (2024-07-11T19:58:13Z) - Projective Methods for Mitigating Gender Bias in Pre-trained Language Models [10.418595661963062]
Projective methods are fast to implement, use a small number of saved parameters, and make no updates to the existing model parameters.
We find that projective methods can be effective at both intrinsic bias and downstream bias mitigation, but that the two outcomes are not necessarily correlated; a minimal sketch of this kind of embedding projection appears after this list.
arXiv Detail & Related papers (2024-03-27T17:49:31Z) - Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction [56.17020601803071]
Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias" in factual knowledge extraction.
This paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias.
arXiv Detail & Related papers (2024-03-15T02:04:35Z) - Mitigating Bias for Question Answering Models by Tracking Bias Influence [84.66462028537475]
We propose BMBI, an approach to mitigate the bias of multiple-choice QA models.
Based on the intuition that a model would tend to be more biased if it learns from a biased example, we measure the bias level of a query instance.
We show that our method could be applied to multiple QA formulations across multiple bias categories.
arXiv Detail & Related papers (2023-10-13T00:49:09Z) - Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution; a rough sketch of this prior-separation idea appears after this list.
arXiv Detail & Related papers (2023-09-07T17:44:56Z) - Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias [33.99768156365231]
We introduce a causal formulation for bias measurement in generative language models. We propose a benchmark called OccuGender, with a bias-measuring procedure to investigate occupational gender bias. The results show that these models exhibit substantial occupational gender bias.
arXiv Detail & Related papers (2022-12-20T22:41:24Z) - ADEPT: A DEbiasing PrompT Framework [64.54665501064659]
Finetuning is an applicable approach for debiasing contextualized word embeddings. Discrete prompts with semantic meanings have been shown to be effective in debiasing tasks. We propose ADEPT, a method to debias PLMs using prompt tuning while maintaining the delicate balance between removing biases and ensuring representation ability.
arXiv Detail & Related papers (2022-11-10T08:41:40Z)
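As a companion to the projective-methods entry above, here is a minimal sketch of projection-based debiasing: estimate a bias direction from paired embeddings and remove its component from each representation. The paired vectors and function names are illustrative assumptions, not the cited paper's implementation.

```python
import numpy as np

def bias_direction(pairs):
    """Average difference vector between paired embeddings (e.g., gendered
    counterpart contexts), normalized to unit length."""
    diffs = np.stack([a - b for a, b in pairs])
    d = diffs.mean(axis=0)
    return d / np.linalg.norm(d)

def project_out(x, d):
    """Remove the component of embedding x that lies along the bias direction d."""
    return x - np.dot(x, d) * d
```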
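Similarly, for the PriDe entry above, the sketch below shows one way to separate a prior over option IDs from a model's predictions: average the model's ID probabilities over permutations of the option contents, then subtract the log prior at inference time. The `option_logprobs` helper, the number of permutations, and the averaging scheme are assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np

# Assumed interface: option_logprobs(question, options) -> log-probabilities the
# model assigns to answering "A", "B", "C", ... with options shown in that order.
def estimate_id_prior(questions, options_list, option_logprobs, n_perms=4):
    """Estimate the model's prior over option IDs by averaging its predictions
    over random permutations of the option contents."""
    rng = np.random.default_rng(0)
    priors = []
    for q, opts in zip(questions, options_list):
        probs = np.zeros(len(opts))
        for _ in range(n_perms):
            perm = rng.permutation(len(opts))
            lp = option_logprobs(q, [opts[i] for i in perm])
            probs += np.exp(lp - np.logaddexp.reduce(lp))  # softmax over option IDs
        priors.append(probs / n_perms)
    return np.mean(priors, axis=0)

def debias(logprobs, id_prior):
    """Subtract the estimated log prior from a prediction and renormalize."""
    scores = logprobs - np.log(id_prior + 1e-12)
    return np.exp(scores - np.logaddexp.reduce(scores))
```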
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.