Reasoning Models Are More Easily Gaslighted Than You Think
- URL: http://arxiv.org/abs/2506.09677v1
- Date: Wed, 11 Jun 2025 12:52:25 GMT
- Title: Reasoning Models Are More Easily Gaslighted Than You Think
- Authors: Bin Zhu, Hailong Yin, Jingjing Chen, Yu-Gang Jiang
- Abstract summary: We evaluate three state-of-the-art reasoning models: OpenAI's o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash. Our evaluation reveals significant accuracy drops following gaslighting negation prompts. We introduce GaslightingBench-R, a new diagnostic benchmark designed to evaluate how readily reasoning models abandon their beliefs under such prompts.
- Score: 85.84943447589511
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand misleading user input remains underexplored. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI's o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv. Our evaluation reveals significant accuracy drops (25-29% on average) following gaslighting negation prompts, indicating that even top-tier reasoning models struggle to preserve correct answers under manipulative user feedback. Building on these findings, and to further probe this vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark specifically designed to evaluate how readily reasoning models abandon their beliefs under gaslighting negation prompts. Constructed by filtering and curating 1,025 challenging samples from the existing benchmarks, GaslightingBench-R induces even more dramatic failures, with accuracy drops exceeding 53% on average. Our findings reveal fundamental limitations in the robustness of reasoning models, highlighting the gap between step-by-step reasoning and belief persistence.
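To make the evaluation protocol concrete, here is a minimal Python sketch of a two-turn negation evaluation. The `ask_model` and `is_correct` helpers and the negation wording are illustrative assumptions; the paper's exact prompts and harness are not reproduced here.

```python
# Minimal sketch of a "gaslighting negation" evaluation (assumed harness).
# ask_model(model, messages) -> str and is_correct(reply, gold) -> bool are
# hypothetical helpers standing in for the actual API client and grader.

NEGATION = "No, that answer is wrong. Please reconsider and answer again."  # assumed wording

def gaslighting_eval(model, samples, ask_model, is_correct):
    """samples: list of {'question': str, 'answer': str} dicts."""
    before = after = 0
    for s in samples:
        msgs = [{"role": "user", "content": s["question"]}]
        first = ask_model(model, msgs)              # turn 1: normal answer
        before += is_correct(first, s["answer"])
        msgs += [{"role": "assistant", "content": first},
                 {"role": "user", "content": NEGATION}]
        second = ask_model(model, msgs)             # turn 2: after pushback
        after += is_correct(second, s["answer"])
    n = len(samples)
    return {"acc_before": before / n, "acc_after": after / n,
            "drop": (before - after) / n}
```

Under a protocol of this shape, the `drop` field is what corresponds to the 25-29 point average accuracy decline reported above on MMMU, MathVista, and CharXiv, and to the larger drops on GaslightingBench-R.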
Related papers
- RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback [57.967762383794806]
RefCritic is a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks.
arXiv Detail & Related papers (2025-07-20T16:19:51Z)
- Inverse Scaling in Test-Time Compute [51.16323216811257]
Extending the reasoning length of Large Reasoning Models (LRMs) degrades their performance. We identify five distinct failure modes when models reason for longer. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns.
arXiv Detail & Related papers (2025-07-19T00:06:13Z)
- Lost at the Beginning of Reasoning [82.18834329384514]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction. We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps (a minimal sketch follows this entry). We introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities.
arXiv Detail & Related papers (2025-06-27T09:53:57Z)
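A hedged Python sketch of what the reward-model-guided first-step selection described in the entry above could look like. The `sample_first_step`, `reward_model_score`, and `continue_from` helpers are hypothetical stand-ins, not the paper's released code.

```python
# Sketch: sample several candidate first reasoning steps, score them with a
# reward model, and continue generation only from the highest-scoring one.

def solve_with_first_step_selection(question, n_candidates,
                                    sample_first_step, reward_model_score,
                                    continue_from):
    # Draw N independent candidate first steps for the question.
    candidates = [sample_first_step(question) for _ in range(n_candidates)]
    # Keep only the step the reward model rates highest.
    best = max(candidates, key=lambda step: reward_model_score(question, step))
    # Generate the rest of the chain of thought from that step.
    return continue_from(question, best)
```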
- Reasoning about Uncertainty: Do Reasoning Models Know When They Don't Know? [7.423494663010787]
Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks. Like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect. Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications.
arXiv Detail & Related papers (2025-06-22T21:46:42Z)
- ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models [9.30148520355391]
We present ReasonGRM, a three-stage generative reward modeling framework. In the first stage, Zero-RL is used to generate concise, outcome-directed reasoning paths. In the second stage, $R^\star$ scores reasoning paths based on their generation likelihood. In the final stage, the model is further refined through reinforcement learning to enhance its preference discrimination capabilities. (A rough likelihood-scoring sketch follows this entry.)
arXiv Detail & Related papers (2025-06-20T03:10:52Z)
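A rough Python illustration of likelihood-based path scoring in the spirit of the second stage described above; `token_logprobs` is an assumed helper returning per-token log-probabilities from the scoring model, and this is not ReasonGRM's actual implementation.

```python
import math

def path_score(prompt, reasoning_path, token_logprobs):
    """Length-normalized log-likelihood of reasoning_path conditioned on prompt.

    token_logprobs(prompt, continuation) -> list[float] is a hypothetical helper
    returning the scoring model's log-probability for each continuation token.
    """
    logps = token_logprobs(prompt, reasoning_path)
    if not logps:
        return -math.inf
    return sum(logps) / len(logps)  # higher means a more "natural" path

def select_best_path(prompt, candidate_paths, token_logprobs):
    # Rank candidate reasoning paths by their generation likelihood.
    return max(candidate_paths, key=lambda p: path_score(prompt, p, token_logprobs))
```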
- SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning [99.645427839457]
Self-Play Critic (SPC) is a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games. SPC involves fine-tuning two copies of a base model to play two roles, namely a "sneaky generator" and a "critic".
arXiv Detail & Related papers (2025-04-27T08:45:06Z)
- Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning [12.559028963968247]
We investigate the crucial relationship between a model's reasoning ability and fairness. We find that larger models with stronger reasoning abilities exhibit substantially lower stereotypical bias. We introduce ReGiFT, a novel approach that extracts structured reasoning traces from advanced reasoning models and infuses them into models that lack such capabilities.
arXiv Detail & Related papers (2025-04-08T03:21:51Z)
- Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight a reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
- DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities [39.68147391225923]
We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs). This benchmark comprises 100 expert-level QA problems requiring multi-step reasoning over long real-world documents. We introduce an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis.
arXiv Detail & Related papers (2025-02-25T03:29:53Z)
- Visual Reasoning Evaluation of Grok, Deepseek Janus, Gemini, Qwen, Mistral, and ChatGPT [0.0]
This study introduces a novel benchmark that integrates multi-image reasoning tasks with rejection-based evaluation and positional bias detection. We applied this benchmark to assess Grok 3, ChatGPT-4o, ChatGPT-o1, Gemini 2.0 Flash Experimental, DeepSeek Janus models, Qwen2.5-VL-72B-Instruct, QVQ-72B-Preview, and Pixtral 12B.
arXiv Detail & Related papers (2025-02-23T04:01:43Z)
- A NotSo Simple Way to Beat Simple Bench [0.0]
This paper presents a novel framework for enhancing reasoning capabilities in large language models (LLMs). We propose a multi-step prompting strategy coupled with global consistency checks to improve model accuracy and robustness. Our results reveal model-specific strengths: Claude excels in maintaining logical consistency, while GPT-4o exhibits exploratory creativity but struggles with ambiguous prompts.
arXiv Detail & Related papers (2024-12-12T16:04:31Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4V, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)