Reasoning Models Reason Well, Until They Don't
- URL: http://arxiv.org/abs/2510.22371v1
- Date: Sat, 25 Oct 2025 17:28:38 GMT
- Title: Reasoning Models Reason Well, Until They Don't
- Authors: Revanth Rameshkumar, Jimson Huang, Yunxin Sun, Fei Xia, Abulhair Saparov
- Abstract summary: Large language models (LLMs) have shown significant progress in reasoning tasks. We revisit these findings through the lens of large reasoning models (LRMs): LLMs fine-tuned with incentives for step-by-step argumentation and self-verification.
- Score: 8.434177922951582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification. LRM performance on graph and reasoning benchmarks such as NLGraph seems extraordinary, with some even claiming they are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show existing benchmarks actually have limited complexity. We develop a new dataset, the Deep Reasoning Dataset (DeepRD), along with a generative process for producing unlimited examples of scalable complexity. We use this dataset to evaluate model performance on graph connectivity and natural language proof planning. We find that the performance of LRMs drops abruptly at sufficient complexity and does not generalize. We also relate our LRM results to the distributions of the complexities of large, real-world knowledge graphs, interaction graphs, and proof datasets. We find the majority of real-world examples fall inside the LRMs' success regime, yet the long tails expose substantial failure potential. Our analysis highlights the near-term utility of LRMs while underscoring the need for new methods that generalize beyond the complexity of examples in the training distribution.
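The abstract mentions a generative process for graph connectivity problems of scalable complexity but does not describe it here. The following is only a minimal sketch of the general idea, not the DeepRD generator: it assumes complexity can be approximated by the length of the single start-to-goal path, with dead-end edges added as distractors. The names `make_connectivity_problem` and `shortest_path_len` are hypothetical.

```python
import random
from collections import deque


def make_connectivity_problem(path_len, n_distractors, seed=0):
    """Build a directed graph with one start-to-goal path of `path_len` edges
    plus `n_distractors` dead-end edges branching off that path.

    Illustrative assumption: path length stands in for problem complexity.
    """
    if path_len < 1:
        raise ValueError("path_len must be at least 1")
    rng = random.Random(seed)
    # Nodes 0..path_len form the solution path; 0 is the start, path_len the goal.
    edges = [(i, i + 1) for i in range(path_len)]
    next_id = path_len + 1
    for _ in range(n_distractors):
        src = rng.randrange(path_len)   # any path node except the goal
        edges.append((src, next_id))    # fresh leaf node, so a dead end
        next_id += 1
    rng.shuffle(edges)                  # hide the path order from the solver
    return edges, 0, path_len


def shortest_path_len(edges, start, goal):
    """Breadth-first search; returns the edge count of a shortest path or None."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


if __name__ == "__main__":
    for length in (2, 8, 32):
        edges, start, goal = make_connectivity_problem(length, n_distractors=10)
        print(length, len(edges), shortest_path_len(edges, start, goal))
```

Sweeping `path_len` upward while holding the distractor count fixed gives a family of problems whose ground-truth answer is known by construction, which is the property a complexity-scaled evaluation needs.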
Related papers
- Pushing LLMs to Their Logical Reasoning Bound: The Role of Data Reasoning Intensity [59.27594125465172]
We introduce Data Reasoning Intensity (DRI), a novel metric that quantifies the latent logical reasoning complexity of samples. We then introduce a re-cognizing optimization strategy that systematically enhances the logical reasoning intensity of training data.
arXiv Detail & Related papers (2025-09-29T14:20:04Z) - GRIL: Knowledge Graph Retrieval-Integrated Learning with Large Language Models [59.72897499248909]
We propose a novel graph retriever trained end-to-end with Large Language Models (LLMs). Within the extracted subgraph, structural knowledge and semantic features are encoded via soft tokens and the verbalized graph, respectively, which are infused into the LLM together. Our approach consistently achieves state-of-the-art performance, validating the strength of joint graph-LLM optimization for complex reasoning tasks.
arXiv Detail & Related papers (2025-09-20T02:38:00Z) - From Long to Short: LLMs Excel at Trimming Own Reasoning Chains [48.692414597960244]
O1/R1-style large reasoning models (LRMs) signal a substantial leap forward over conventional instruction-following LLMs. Recent studies show that LRMs are prone to overthinking. We propose a test-time scaling method, EDIT, which efficiently guides LRMs to identify the shortest correct reasoning paths at test time.
arXiv Detail & Related papers (2025-09-07T19:00:44Z) - Language Models Coupled with Metacognition Can Outperform Reasoning Models [32.32646975975768]
Large language models (LLMs) excel in speed and adaptability across various reasoning tasks. LRMs are specifically designed for complex, step-by-step reasoning. SOFAI-LM coordinates a fast LLM with a slower but more powerful LRM through metacognition.
arXiv Detail & Related papers (2025-08-25T12:19:57Z) - Thinking Isn't an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations [11.503915439591735]
Large Reasoning Models (LRMs) are designed to output a step-by-step thinking process before arriving at a final answer to handle complex reasoning tasks. Recent empirical studies suggest that LLMs without explicit reasoning actually outperform LRMs on tasks with low or high complexity. We investigate whether the limitations of LRMs persist when tool augmentations are introduced.
arXiv Detail & Related papers (2025-07-23T17:04:20Z) - Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering [75.12322966980003]
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains. Most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering. We propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA.
arXiv Detail & Related papers (2025-06-11T12:03:52Z) - The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity [16.266145641151375]
Large Reasoning Models generate detailed thinking processes before providing answers. We show that LRMs face a complete accuracy collapse beyond certain complexities. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions.
arXiv Detail & Related papers (2025-06-07T22:42:29Z) - Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges [4.668749313973097]
This paper systematically evaluates Large Language Models (LLMs) and Large Reasoning Models (LRMs) across three levels of reasoning complexity. We curate 26 challenges where models answer directly or by Python Code Interpreter. LRMs show robust performance across tasks with various levels of difficulty, often competing with or surpassing traditional first-principle-based methods.
arXiv Detail & Related papers (2025-05-16T18:32:35Z) - Large Language and Reasoning Models are Shallow Disjunctive Reasoners [15.56445409535547]
Large Language Models (LLMs) have been found to struggle with systematic reasoning. This paper focuses on tasks that require systematic relational composition for qualitative spatial and temporal reasoning. We find that, zero-shot, LRMs generally outperform their LLM counterparts in single-path reasoning tasks but struggle in the multi-path setting.
arXiv Detail & Related papers (2025-03-30T15:41:55Z) - Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners [30.195361623027313]
Process Reward Models (PRMs) have demonstrated exceptional promise in enhancing reasoning by providing step-wise feedback. We introduce GraphSILO, the largest dataset for graph reasoning problems with fine-grained step-wise labels. We train GraphPRM, the first PRM designed for graph reasoning problems, and evaluate its effectiveness in two key settings.
arXiv Detail & Related papers (2025-03-02T10:39:40Z) - ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning [92.76959707441954]
We introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance. ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity. Our results reveal a significant decline in accuracy as problem complexity grows.
arXiv Detail & Related papers (2025-02-03T06:44:49Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.