Benchmarking Reasoning Robustness in Large Language Models
- URL: http://arxiv.org/abs/2503.04550v1
- Date: Thu, 06 Mar 2025 15:36:06 GMT
- Title: Benchmarking Reasoning Robustness in Large Language Models
- Authors: Tong Yu, Yongcheng Jing, Xikun Zhang, Wentao Jiang, Wenjie Wu, Yingjie Wang, Wenbin Hu, Bo Du, Dacheng Tao
- Abstract summary: We find significant performance degradation on novel or incomplete data. These findings highlight the reliance on recall over rigorous logical inference. This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps.
- Score: 76.79744000300363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent success of large language models (LLMs) in reasoning, such as DeepSeek, we identify for the first time a key dilemma in reasoning robustness and generalization: significant performance degradation on novel or incomplete data, suggesting a reliance on memorized patterns rather than systematic reasoning. Our closer examination reveals four key limitations underlying this issue: (1) Positional bias--models favor earlier queries in multi-query inputs but answer later ones incorrectly (e.g., GPT-4o's accuracy drops from 75.8 percent to 72.8 percent); (2) Instruction sensitivity--performance declines by 5.0 to 7.5 percent in the Qwen2.5 series and by 5.0 percent in DeepSeek-V3 with auxiliary guidance; (3) Numerical fragility--value substitution sharply reduces accuracy (e.g., GPT-4o drops from 97.5 percent to 82.5 percent, GPT-o1-mini drops from 97.5 percent to 92.5 percent); and (4) Memory dependence--models resort to guesswork when critical data is missing. These findings further highlight the reliance on heuristic recall over rigorous logical inference, demonstrating challenges in reasoning robustness. To investigate these robustness challenges comprehensively, this paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. This is achieved by an instruction-based approach to generate diverse datasets that closely resemble training distributions, facilitating a holistic robustness assessment and advancing the development of more robust reasoning frameworks.
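The numerical-fragility finding suggests a simple probe: substitute the numeric values in a problem the model already solves and check whether its answer tracks the change. The sketch below is a hypothetical illustration of that kind of perturbation check, not the authors' Math-RoB pipeline; `query_model` is a stand-in for the evaluated LLM and `solve` is assumed to be a programmatic ground-truth solver for the (templated) problem.

```python
import random
import re

def perturb_numbers(problem: str, rng: random.Random) -> str:
    """Replace each integer in a word problem with a nearby random value."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + rng.randint(1, 9)), problem)

def numerical_fragility_rate(problems, solve, query_model, n_trials=3, seed=0):
    """Fraction of items the model solves in original form but fails on
    after value substitution."""
    rng = random.Random(seed)
    fragile = total = 0
    for problem in problems:
        if query_model(problem) != solve(problem):
            continue  # only count items the model already answers correctly
        total += 1
        for _ in range(n_trials):
            variant = perturb_numbers(problem, rng)
            if query_model(variant) != solve(variant):
                fragile += 1
                break
    return fragile / max(total, 1)
```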
Related papers
- Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time [14.963731580129892]
We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level.
By strategically delegating reflective steps to a more capable model, our method significantly boosts the reasoning accuracy of smaller models while shortening their output.
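One way to picture this delegation: the small model drafts the reasoning step by step, and steps that open with a reflection cue ("wait", "let me verify") are handed to the stronger model. The sketch below is only an illustrative control loop under that reading; `small_step` and `large_step` are hypothetical callables, not the paper's actual interface.

```python
REFLECTION_CUES = ("wait", "let me verify", "double-check", "hmm")

def speculative_thinking(prompt, small_step, large_step, max_steps=32):
    """Generate reasoning with the small model, delegating steps that
    begin with a reflection cue to the large model."""
    trace = []
    context = prompt
    for _ in range(max_steps):
        step = small_step(context)          # small model proposes the next step
        if step.strip().lower().startswith(REFLECTION_CUES):
            step = large_step(context)      # large model takes over reflective steps
        trace.append(step)
        context += "\n" + step
        if "final answer" in step.lower():
            break
    return "\n".join(trace)
```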
arXiv Detail & Related papers (2025-04-12T21:25:32Z) - R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.
Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.
We propose Reasoning-Driven Process Reward Modeling (R-PRM).
R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
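As described, the reward model reasons about each step before judging it rather than emitting a score directly. A minimal sketch of that pattern, assuming a generic `judge_model` callable that returns free-form text; the prompt wording and verdict parsing are illustrative, not the paper's.

```python
def reasoning_driven_step_score(question, steps_so_far, candidate_step, judge_model):
    """Ask the reward model to analyze a candidate step, then extract a verdict."""
    history = "\n".join(steps_so_far)
    prompt = (
        f"Problem: {question}\n"
        f"Reasoning so far:\n{history}\n"
        f"Candidate next step: {candidate_step}\n"
        "Analyze whether the candidate step is logically valid, then end with "
        "'Verdict: correct' or 'Verdict: incorrect'."
    )
    analysis = judge_model(prompt)
    return 1.0 if "verdict: correct" in analysis.lower() else 0.0
```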
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - Visual Reasoning Evaluation of Grok, Deepseek Janus, Gemini, Qwen, Mistral, and ChatGPT [0.0]
This study introduces a novel benchmark that integrates multi-image reasoning tasks with rejection-based evaluation and positional bias detection. We applied this benchmark to assess Grok 3, ChatGPT-4o, ChatGPT-o1, Gemini 2.0 Flash Experimental, DeepSeek Janus models, Qwen2.5-VL-72B-Instruct, QVQ-72B-Preview, and Pixtral 12B.
arXiv Detail & Related papers (2025-02-23T04:01:43Z) - Recursive Decomposition of Logical Thoughts: Framework for Superior Reasoning and Knowledge Propagation in Large Language Models [1.4956870931936515]
We introduce RDoLT, a novel framework that significantly boosts the reasoning performance of Large Language Models. RDoLT is built on three key innovations: (1) breaking down complex reasoning tasks into sub-tasks of progressive complexity; (2) employing an advanced selection and scoring mechanism to identify the most promising reasoning thoughts; and (3) integrating a knowledge propagation module that mimics human learning. Our approach was evaluated across multiple benchmarks, including GSM8K, SVAMP, MultiArith, Last Letter Concatenation, and Gaokao2023 Math.
arXiv Detail & Related papers (2025-01-03T02:55:44Z) - STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions [6.19084217044276]
Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing. We introduce the Sensitivity Testing on Offensive Progressions dataset, which includes 450 offensive progressions containing 2,700 unique sentences. Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%.
arXiv Detail & Related papers (2024-09-20T18:34:38Z) - Uncovering Weaknesses in Neural Code Generation [21.552898575210534]
We assess the quality of generated code using match-based and execution-based metrics, then conduct thematic analysis to develop a taxonomy of nine types of weaknesses.
In the CoNaLa dataset, inaccurate prompts are a notable problem, causing all large models to fail in 26.84% of cases.
Missing pivotal semantics is a pervasive issue across benchmarks, with one or more large models omitting key semantics in 65.78% of CoNaLa tasks.
All models struggle with proper API usage, a challenge amplified by vague or complex prompts.
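The execution-based side of that evaluation can be pictured as running each generated program against the task's reference tests. A simplified sketch under that assumption, where each sample carries assert-style test snippets; the unsandboxed `exec` is for illustration only.

```python
def execution_pass_rate(samples):
    """samples: list of (generated_code, test_snippets) pairs.
    A sample passes only if every test snippet executes without error."""
    passed = 0
    for code, tests in samples:
        try:
            scope = {}
            exec(code, scope)                 # define the generated functions
            for test in tests:
                exec(test, scope)             # e.g. "assert add(2, 3) == 5"
            passed += 1
        except Exception:
            pass                              # any failure counts against the sample
    return passed / max(len(samples), 1)
```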
arXiv Detail & Related papers (2024-07-13T07:31:43Z) - Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models [102.72940700598055]
In reasoning tasks, even a minor error can cascade into inaccurate results.
We develop a method that avoids introducing external resources, relying instead on perturbations to the input.
Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks.
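The masking step itself is easy to picture: with some probability, each token inside the chain-of-thought span of a training example is replaced by a mask token before the loss is computed. A minimal sketch of that perturbation; the token IDs, mask ID, and masking rate are placeholders, and the paper's exact masking schedule may differ.

```python
import random

def mask_reasoning_tokens(token_ids, cot_start, cot_end, mask_id, p=0.2, seed=None):
    """Randomly replace a fraction p of chain-of-thought tokens with mask_id,
    leaving the question and final answer untouched."""
    rng = random.Random(seed)
    out = list(token_ids)
    for i in range(cot_start, cot_end):
        if rng.random() < p:
            out[i] = mask_id
    return out

# Example: positions 4..11 hold the reasoning steps of this training sequence.
masked = mask_reasoning_tokens(list(range(16)), cot_start=4, cot_end=12, mask_id=-1, p=0.3, seed=0)
```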
arXiv Detail & Related papers (2024-03-04T16:21:54Z) - How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, object counts, and spatial relationships.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z) - Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation [70.27452774899189]
Large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user.
As of November 2023, state-of-the-art LLMs do not provide access to their output probabilities.
Our best method, which composes linguistic confidences and surrogate model probabilities, gives state-of-the-art confidence estimates on all 12 datasets.
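One way to read "composing linguistic confidences and surrogate model probabilities" is a simple interpolation between the verbalized confidence elicited from the closed model and the probability a white-box surrogate assigns to the same answer. A hedged sketch of that composition; the mixing weight and elicitation format are illustrative, not the paper's exact recipe.

```python
def composed_confidence(linguistic_conf: float, surrogate_prob: float, alpha: float = 0.5) -> float:
    """Blend a verbalized confidence (e.g. 'I am 80% sure' -> 0.8) from the
    target LLM with the probability a surrogate model assigns to the answer."""
    return alpha * linguistic_conf + (1.0 - alpha) * surrogate_prob

# Example: the closed model says it is 80% sure; the surrogate gives the answer p = 0.55.
print(composed_confidence(0.8, 0.55))  # 0.675
```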
arXiv Detail & Related papers (2023-11-15T11:27:44Z) - Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
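The core training signal can be pictured as an extra penalty on predicted confidence for examples drawn from the uncertainty dataset, added to the usual supervised loss. A minimal PyTorch-style sketch under that reading; treating the penalty as cross-entropy to the uniform distribution is an assumption, not the paper's exact objective.

```python
import torch.nn.functional as F

def ddcm_loss(model, labeled_batch, uncertainty_batch, lam=1.0):
    """Supervised loss on labeled data plus a term that pushes predictions on
    the uncertainty dataset toward the uniform distribution (low confidence)."""
    x, y = labeled_batch
    ce = F.cross_entropy(model(x), y)

    logits_u = model(uncertainty_batch)
    log_probs = F.log_softmax(logits_u, dim=-1)
    # Averaging -log p over classes and batch equals cross-entropy to uniform
    # (up to a constant); minimizing it spreads probability mass evenly.
    confidence_penalty = -log_probs.mean()

    return ce + lam * confidence_penalty
```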
arXiv Detail & Related papers (2023-06-08T07:05:36Z) - Tactile Grasp Refinement using Deep Reinforcement Learning and Analytic Grasp Stability Metrics [70.65363356763598]
We show that analytic grasp stability metrics constitute powerful optimization objectives for reinforcement learning algorithms.
We show that a combination of geometric and force-agnostic grasp stability metrics yields the highest average success rate of 95.4% for cuboids.
In a second experiment, we show that grasp refinement algorithms trained with contact feedback perform up to 6.6% better than a baseline that receives no tactile information.
arXiv Detail & Related papers (2021-09-23T09:20:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.