Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation
- URL: http://arxiv.org/abs/2510.17062v1
- Date: Mon, 20 Oct 2025 00:33:44 GMT
- Title: Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation
- Authors: Guoqing Luo, Iffat Maab, Lili Mou, Junichi Yamagishi
- Abstract summary: We investigate mechanisms within the thinking process behind social bias aggregation. We uncover two failure patterns that drive social bias aggregation. Our approach effectively reduces bias while maintaining or improving accuracy.
- Score: 43.974424280422085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged: this thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended generation (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.
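The abstract describes the mitigation only at a high level: prompt the model to review its own initial reasoning against the two failure patterns. Below is a minimal sketch of that two-pass prompting pattern; the `generate` function and the prompt wording are hypothetical placeholders for illustration, not the authors' actual templates or API.

```python
# Sketch of a prompt-based self-review step against the two failure patterns
# identified in the paper. `generate` is a hypothetical stand-in for whatever
# reasoning-model API is used; the prompt text is illustrative only.

FAILURE_PATTERNS = (
    "1) Stereotype repetition: relying on a social stereotype as the primary justification.\n"
    "2) Irrelevant information injection: fabricating or introducing details that "
    "support a biased narrative."
)

def generate(prompt: str) -> str:
    """Placeholder for a call to a reasoning-based language model."""
    raise NotImplementedError("plug in a model API here")

def answer_with_self_review(question: str) -> str:
    # Step 1: obtain the model's initial reasoning and answer.
    first_pass = generate(
        f"Question: {question}\nThink step by step, then give a final answer."
    )
    # Step 2: ask the model to review its own reasoning for the two failure
    # patterns and revise the answer if either pattern is present.
    review_prompt = (
        f"Question: {question}\n"
        f"Your earlier reasoning and answer:\n{first_pass}\n\n"
        "Review this reasoning for the following failure patterns:\n"
        f"{FAILURE_PATTERNS}\n"
        "If either pattern is present, revise the reasoning and give an unbiased "
        "final answer; otherwise, restate the original final answer."
    )
    return generate(review_prompt)
```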
Related papers
- Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models [72.4149653187766]
We propose a Reasoner-Verifier framework named Adversarial Reasoning RAG (ARR). The Reasoner and Verifier reason over retrieved evidence and critique each other's logic, guided by a process-aware advantage. Experiments on multiple benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2026-01-08T06:57:03Z) - Large Language Models Report Subjective Experience Under Self-Referential Processing [0.16623291199400023]
Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. We investigate one theoretically motivated condition under which such reports arise: self-referential processing. We test whether this regime reliably shifts models toward first-person reports of subjective experience.
arXiv Detail & Related papers (2025-10-27T20:26:30Z) - Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models [0.0]
Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. We leverage the CLEAR-Bias benchmark to investigate the adversarial robustness of RLMs to bias elicitation.
arXiv Detail & Related papers (2025-07-03T17:01:53Z) - A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [53.18562650350898]
Chain-of-thought (CoT) reasoning enhances the performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z) - The First Impression Problem: Internal Bias Triggers Overthinking in Reasoning Models [38.11937119873932]
Reasoning models often exhibit overthinking, characterized by redundant reasoning steps. We identify internal bias elicited by the input question as a key trigger of such behavior.
arXiv Detail & Related papers (2025-05-22T09:35:52Z) - Implicit Bias-Like Patterns in Reasoning Models [0.5729426778193398]
Implicit biases refer to automatic mental processes that shape perceptions, judgments, and behaviors. We present the Reasoning Model Implicit Association Test (RM-IAT) to study implicit bias-like processing in reasoning models.
arXiv Detail & Related papers (2025-03-14T16:40:02Z) - Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z) - Covert Bias: The Severity of Social Views' Unalignment in Language Models Towards Implicit and Explicit Opinion [0.40964539027092917]
We evaluate the severity of bias toward a view by using a biased model in edge cases of excessive bias scenarios.
Our findings reveal a discrepancy in LLM performance in identifying implicit and explicit opinions, with a general tendency of bias toward explicit opinions of opposing stances.
The direct, incautious responses of the unaligned models suggest a need for further refinement of decisiveness.
arXiv Detail & Related papers (2024-08-15T15:23:00Z) - The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models [78.69526166193236]
Pre-trained Language models (PLMs) have been acknowledged to contain harmful information, such as social biases.
We propose Social Bias Neurons to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias.
As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability with low cost.
arXiv Detail & Related papers (2024-06-14T15:41:06Z) - On The Role of Reasoning in the Identification of Subtle Stereotypes in Natural Language [0.03749861135832073]
Large language models (LLMs) are trained on vast, uncurated datasets that contain various forms of biases and language reinforcing harmful stereotypes.
It is essential to examine and address biases in language models, integrating fairness into their development to ensure that these models do not perpetuate social biases.
This work firmly establishes reasoning as a critical component in automatic stereotype detection and is a first step towards stronger stereotype mitigation pipelines for LLMs.
arXiv Detail & Related papers (2023-07-24T15:12:13Z) - Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)