Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models
- URL: http://arxiv.org/abs/2506.17114v3
- Date: Thu, 31 Jul 2025 16:23:29 GMT
- Title: Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models
- Authors: Dadi Guo, Jiayu Liu, Zhiyuan Fan, Zhitao He, Haoran Li, Yumeng Wang, Yi R. Fung
- Abstract summary: We introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our analysis uncovers 10 fine-grained error types, which reveal fundamental limitations in current large reasoning models.
- Score: 11.250861762443801
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high accuracy these advanced models report on popular datasets, combined with reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which reveal fundamental limitations in current large reasoning models: 1) large reasoning models grapple profoundly with mathematical proofs, with some generating entirely correct proofs for less than 20% of problems and failing even on basic ones; 2) models exhibit a diverse spectrum of reasoning failures, prominently demonstrating the lack of guarantees for the correctness and rigor of single-step reasoning; and 3) models show hallucination and incompleteness during the reasoning process. Our findings reveal that models' self-reflection is insufficient to resolve the current logical dilemmas, necessitating formalized and fine-grained logical training.
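As a concrete illustration of the evaluation strategy described in the abstract, the sketch below labels model-generated proofs against a small error taxonomy and aggregates them into a per-model error profile. It is a minimal sketch in Python, not the authors' pipeline: the label names, function names, and the toy grader are placeholders, since the abstract only names a few error themes (non-rigorous single steps, hallucination, incompleteness).

```python
from collections import Counter
from typing import Callable, Iterable

# Illustrative label set. The paper reports 10 fine-grained error types, but
# only a few themes are named in the abstract, so these strings are
# placeholders rather than the authors' taxonomy.
ERROR_TYPES = {
    "correct",
    "invalid_single_step",
    "hallucinated_fact",
    "incomplete_argument",
    "other",
}

def error_profile(proofs: Iterable[str], grade: Callable[[str], str]) -> Counter:
    """Label each model-generated proof with one error type and aggregate counts.

    `grade` stands in for the expensive part: expert annotation (or a carefully
    prompted judge model) that reads a whole proof and returns one ERROR_TYPES label.
    """
    counts: Counter = Counter()
    for proof in proofs:
        label = grade(proof)
        counts[label if label in ERROR_TYPES else "other"] += 1
    return counts

if __name__ == "__main__":
    # Toy grader: flags obviously unfinished proofs; real grading needs expert review.
    toy_grade = lambda p: "incomplete_argument" if "omitted" in p else "correct"
    sample = [
        "By induction on n; the inductive step is omitted.",
        "Base case n = 1 holds; assuming n = k, the claim for n = k + 1 follows directly.",
    ]
    print(error_profile(sample, toy_grade))
    # Counter({'incomplete_argument': 1, 'correct': 1})
```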
Related papers
- Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving [48.22540519786074]
Recent studies show that the informal accuracy exceeds 80% while formal success remains below 8% on benchmarks like PutnamBench. We propose a novel framework that decouples high-level reasoning from low-level proof generation. We evaluate our method on a challenging set of post-2000 IMO problems, a problem set on which no prior open-source prover has reported success.
arXiv Detail & Related papers (2025-07-07T22:38:49Z)
- Large Reasoning Models are not thinking straight: on the unreliability of thinking trajectories [0.0]
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have recently achieved impressive results on reasoning benchmarks. Yet, growing evidence shows that these models often generate longer but ineffective chains of thought (CoTs). We present new evidence of overthinking, where models disregard correct solutions even when explicitly provided, instead continuing to generate unnecessary reasoning steps.
arXiv Detail & Related papers (2025-07-01T12:14:22Z)
- Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity [0.0]
Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures.
arXiv Detail & Related papers (2025-06-10T21:16:53Z)
- CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models [56.40065909544213]
Large language models (LLMs) benefit from increased test-time compute, a phenomenon known as test-time scaling. However, reasoning-optimized models often overthink even simple problems, producing excessively verbose outputs and leading to low token efficiency. We identify two key causes of this verbosity: (1) reinforcement learning reduces the information density of forward reasoning, and (2) backward chain-of-thought training encourages redundant and often unnecessary verification steps.
arXiv Detail & Related papers (2025-05-28T06:24:45Z)
- Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models [27.437685534830457]
Large language models frequently exhibit a problematic reliance on familiar reasoning patterns. Despite explicit instructions from users, these models often override clearly stated conditions and default to habitual reasoning trajectories. This behavior presents significant challenges, particularly in domains such as mathematics and logic puzzles.
arXiv Detail & Related papers (2025-05-22T19:00:01Z)
- Evaluating the Logical Reasoning Abilities of Large Reasoning Models [15.009205651973666]
We introduce LogiEval, a benchmark for evaluating logical reasoning in large reasoning models. LogiEval spans diverse reasoning types (deductive, inductive, analogical, and abductive) and task formats (e.g., logical sequence, argument analysis). Our experiments demonstrate that modern reasoning models excel at 4-choice argument analysis problems and analogical reasoning, surpassing human performance. Our analysis reveals that human performance does not mirror model failure distributions.
arXiv Detail & Related papers (2025-05-17T05:36:14Z)
- THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models [65.39456695678713]
We introduce approximate measures of problem-level difficulty and demonstrate that a clear relationship between problem difficulty and optimal token spend exists. We find that in general, reasoning models are poorly calibrated, particularly on easy problems. We introduce THOUGHTTERMINATOR, a training-free black-box decoding technique that significantly improves reasoning model calibration.
arXiv Detail & Related papers (2025-04-17T22:16:30Z)
- Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad [0.8030359871216614]
We evaluate state-of-the-art reasoning models on six problems from the 2025 USAMO. Only Gemini-2.5-Pro achieves a non-trivial score of 25%. Our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks.
arXiv Detail & Related papers (2025-03-27T19:21:05Z)
- Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. We rigorously analyze both final answers and solution steps to identify reasoning failures. We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z)
- Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
- Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [64.83955753606443]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities. Current error classification methods rely on static and predefined categories. We introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples.
arXiv Detail & Related papers (2025-01-26T16:17:57Z)
- Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment. (A toy Lean sketch of this thought-before-step idea appears after this list.)
arXiv Detail & Related papers (2024-07-14T01:43:07Z)
- MetaLogic: Logical Reasoning Explanations with Fine-Grained Structure [129.8481568648651]
We propose a benchmark to investigate models' logical reasoning capabilities in complex real-life scenarios.
Based on the multi-hop chain of reasoning, the explanation form includes three main components.
We evaluate the current best models' performance on this new explanation form.
arXiv Detail & Related papers (2022-10-22T16:01:13Z)
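The Lean-STaR entry above trains models to emit an informal thought before each formal proof step. As a rough, self-contained illustration of that interleaving (not the paper's data or training format), a Lean 4 proof can carry each "thought" as a comment immediately before the tactic it motivates; the statement and theorem name here are made up for the example, and only standard Nat lemmas from Lean core are used.

```lean
-- Toy illustration: each informal "thought" precedes the formal step it motivates.
theorem sum_comm_assoc (a b c : Nat) : (a + b) + c = c + (b + a) := by
  -- Thought: first reorder the inner sum, turning a + b into b + a on the left.
  rw [Nat.add_comm a b]
  -- Thought: the goal is now (b + a) + c = c + (b + a), which is exactly
  -- commutativity of addition applied to the outer sum.
  exact Nat.add_comm (b + a) c
```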