ReEfBench: Quantifying the Reasoning Efficiency of LLMs
- URL: http://arxiv.org/abs/2601.03550v1
- Date: Wed, 07 Jan 2026 03:33:07 GMT
- Title: ReEfBench: Quantifying the Reasoning Efficiency of LLMs
- Authors: Zhizhang Fu, Yuancheng Gu, Chenkai Hu, Hanmeng Liu, Yue Zhang
- Abstract summary: We propose a novel neuro-symbolic framework for the non-intrusive, comprehensive process-centric evaluation of reasoning. Our analysis reveals that extended token generation is not a prerequisite for deep reasoning.
- Score: 9.462320482705508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time scaling has enabled Large Language Models (LLMs) to tackle complex reasoning, yet the limitations of current Chain-of-Thought (CoT) evaluation obscure whether performance gains stem from genuine reasoning or mere verbosity. To address this, (1) we propose a novel neuro-symbolic framework for the non-intrusive, comprehensive process-centric evaluation of reasoning. (2) Through this lens, we identify four distinct behavioral prototypes and diagnose the failure modes. (3) We examine the impact of inference mode, training strategy, and model scale. Our analysis reveals that extended token generation is not a prerequisite for deep reasoning. Furthermore, we reveal critical constraints: mixing long and short CoT data in training risks premature saturation and collapse, while distillation into smaller models captures behavioral length but fails to replicate logical efficacy due to intrinsic capacity limits.
Related papers
- Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning [62.680551162054975]
We introduce an end-to-end framework where LLMs learn to self-regulate the granularity of the reasoning steps through dynamic summarization. We apply reinforcement learning to incentivize this capability further, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows. Our Accordion-Thinker demonstrates that with learned self-compression, LLMs can tackle complex reasoning tasks with minimal dependency token overhead.
arXiv Detail & Related papers (2026-02-03T08:34:20Z)
- Structured Reasoning for Large Language Models [59.215789462977206]
We propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. SCR substantially improves reasoning efficiency and self-verification. Compared with existing reasoning paradigms, it reduces output token length by up to 50%.
arXiv Detail & Related papers (2026-01-12T04:04:01Z)
- ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models [102.4511331368587]
ARISE (Adaptive Resolution-aware Scaling Evaluation) is a novel metric designed to assess the test-time scaling effectiveness of large reasoning models. We conduct comprehensive experiments evaluating state-of-the-art reasoning models across diverse domains.
arXiv Detail & Related papers (2025-10-07T15:10:51Z)
- Inverse Scaling in Test-Time Compute [51.16323216811257]
Extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance. We identify five distinct failure modes when models reason for longer. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns.
arXiv Detail & Related papers (2025-07-19T00:06:13Z)
- Lost at the Beginning of Reasoning [85.17612793300238]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction. We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps.
arXiv Detail & Related papers (2025-06-27T09:53:57Z)
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap [0.39073867995073247]
We argue that the observed failure is not evidence of a fundamental cognitive boundary, but rather a predictable outcome of system-level constraints. A model, initially declaring a puzzle impossible when confined to text-only generation, now employs agentic tools to not only solve it but also master variations of complexity far beyond the reasoning cliff it previously failed to surmount.
arXiv Detail & Related papers (2025-06-23T17:14:21Z)
- Excessive Reasoning Attack on Reasoning LLMs [26.52688123765127]
In this work, we expose a novel threat: adversarial inputs can be crafted to exploit excessive reasoning behaviors. Our results demonstrate a 3x to 9x increase in reasoning length with comparable utility performance. Our crafted adversarial inputs exhibit transferability, inducing computational overhead in o3-mini, o1-mini, DeepSeek-R1, and QWQ models.
arXiv Detail & Related papers (2025-06-17T10:16:52Z)
- Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills [42.1825027925353]
Large reasoning models (LRMs) have enabled strong chain-of-thought (CoT) generation through test-time computation. We show that conventional unlearning algorithms, originally designed for non-reasoning models, are inadequate for LRMs. We propose Reasoning-aware Representation Misdirection for Unlearning ($R2MU$), a novel method that effectively suppresses sensitive reasoning traces.
arXiv Detail & Related papers (2025-06-15T20:54:23Z)
- Dissecting Long-Chain-of-Thought Reasoning Models: An Empirical Study [91.78803511141975]
This work focuses on the roles of positive and negative samples in scaling reinforcement learning. We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes.
arXiv Detail & Related papers (2025-06-05T11:47:10Z)
- Scaling over Scaling: Exploring Test-Time Scaling Plateau in Large Reasoning Models [7.2703757624760526]
Large reasoning models (LRMs) have exhibited the capacity to enhance reasoning performance via internal test-time scaling. As we push these scaling boundaries, understanding the practical limits and achieving optimal resource allocation becomes a critical challenge. In this paper, we investigate the scaling plateau of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM).
arXiv Detail & Related papers (2025-05-26T20:58:45Z)
- When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs [19.354141845315276]
Chain-of-thought reasoning can significantly degrade instruction-following accuracy. This is the first work to systematically expose reasoning-induced failures in instruction-following.
arXiv Detail & Related papers (2025-05-16T16:36:00Z)
- The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning [56.574829311863446]
Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs). We demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. Our analysis uncovers a fundamental hybrid mechanism of explicit-implicit reasoning driving CoT's performance in pattern-based ICL.
arXiv Detail & Related papers (2025-04-07T13:51:06Z)
- Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models [104.55763564037831]
We train a regression model that leverages attention maps, probabilities on the current generation step, and recurrently computed uncertainty scores from previously generated tokens. Our evaluation shows that the proposed method is highly effective for selective generation, achieving substantial improvements over rivaling unsupervised and supervised approaches.
arXiv Detail & Related papers (2024-08-20T09:42:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.