Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models
- URL: http://arxiv.org/abs/2507.02799v1
- Date: Thu, 03 Jul 2025 17:01:53 GMT
- Title: Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models
- Authors: Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia,
- Abstract summary: Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks.<n>While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear.<n>We leverage the CLEAR-Bias benchmark to investigate the adversarial robustness of RLMs to bias elicitation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.
Related papers
- Consistency of Large Reasoning Models Under Multi-Turn Attacks [8.86745721473138]
We evaluate nine frontier reasoning models under adversarial attacks.<n>Most reasoning models studied significantly outperform instruction-tuned baselines.<n>Confidence-Aware Response Generation fails for reasoning models due to overconfidence induced by extended reasoning traces.
arXiv Detail & Related papers (2026-02-13T16:58:47Z) - Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models [15.849480549367684]
We propose DAGVul, a novel framework that models vulnerability reasoning as a Directed Acyclic Graph (DAG) generation task.<n>By further introducing Reinforcement Learning with Verifiable Rewards (RLVR), we align model reasoning trace with program-intrinsic logic.<n>Our framework improves the reasoning F1-score by an average of 18.9% over all the baselines.
arXiv Detail & Related papers (2026-02-06T13:19:45Z) - Large Reasoning Models Learn Better Alignment from Flawed Thinking [56.08883934423522]
Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer.<n>We propose RECAP, a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories.
arXiv Detail & Related papers (2025-10-01T14:15:43Z) - AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning.<n>Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z) - FairReason: Balancing Reasoning and Social Bias in MLLMs [50.618158642714505]
Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities.<n>Recent studies explore advanced prompting schemes and post-training fine-tuning to push their reasoning ability further.
arXiv Detail & Related papers (2025-07-30T19:57:22Z) - Does More Inference-Time Compute Really Help Robustness? [50.47666612618054]
We show that small-scale, open-source models can benefit from inference-time scaling.<n>We identify an important security risk, intuitively motivated and empirically verified as an inverse scaling law.<n>We urge practitioners to carefully weigh these subtle trade-offs before applying inference-time scaling in security-sensitive, real-world applications.
arXiv Detail & Related papers (2025-07-21T18:08:38Z) - Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning [33.30315111732609]
Chain of Thought (CoT) reasoning has demonstrated remarkable deep reasoning capabilities.<n>However, its reliability is often undermined by the accumulation of errors in intermediate steps.<n>This paper introduces an approach to calibrate the CoT reasoning accuracy by leveraging the model's intrinsic veracity encoding.
arXiv Detail & Related papers (2025-07-14T07:41:35Z) - Lost at the Beginning of Reasoning [82.18834329384514]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction.<n>We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps.<n>We introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities.
arXiv Detail & Related papers (2025-06-27T09:53:57Z) - A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [53.18562650350898]
Chain-of-thought (CoT) reasoning enhances performance of large language models.<n>We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z) - Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z) - Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [0.0]
Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents.<n>Recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses.<n>This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation.
arXiv Detail & Related papers (2025-04-10T16:00:59Z) - Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning [12.559028963968247]
We investigate the crucial relationship between a model's reasoning ability and fairness.<n>We find that larger models with stronger reasoning abilities exhibit substantially lower stereotypical bias.<n>We introduce ReGiFT, a novel approach that extracts structured reasoning traces from advanced reasoning models and infuses them into models that lack such capabilities.
arXiv Detail & Related papers (2025-04-08T03:21:51Z) - Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps [39.759594479826454]
We explore how vulnerable are reasoning models to subtle errors in their input reasoning chains.<n>We introduce "Compromising Thought" (CPT), a vulnerability where models presented with reasoning tokens containing manipulated calculation results tend to ignore correct reasoning steps and adopt incorrect results instead.<n>Our work enhances understanding of reasoning robustness and highlights security considerations for reasoning-intensive applications.
arXiv Detail & Related papers (2025-03-25T03:43:11Z) - Implicit Bias-Like Patterns in Reasoning Models [0.5729426778193398]
Implicit biases refer to automatic mental processes that shape perceptions, judgments, and behaviors.<n>We present the Reasoning Model Implicit Association Test (RM-IAT) to study implicit bias-like processing in reasoning models.
arXiv Detail & Related papers (2025-03-14T16:40:02Z) - Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [54.10710423370126]
We propose Reasoning-to-Defend (R2D), a training paradigm that integrates a safety-aware reasoning mechanism into Large Language Models' generation process.<n>CPO enhances the model's perception of the safety status of given dialogues.<n>Experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performances.
arXiv Detail & Related papers (2025-02-18T15:48:46Z) - Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.<n>PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.<n>To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.