EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2511.10201v1
- Date: Fri, 14 Nov 2025 01:38:30 GMT
- Title: EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models
- Authors: Junquan Huang, Haotian Wu, Yubo Gao, Yibo Yan, Junyan Zhang, Yonghua Hei, Song Dai, Jie Zhang, Puay Siew Tan, Xuming Hu
- Abstract summary: We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA. We propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities.
- Score: 32.041688648831794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.
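The abstract does not spell out the E3-Score formula, but the kind of smooth accuracy-versus-token-cost trade-off it describes can be illustrated with a minimal sketch. The function name, the logarithmic discount, and the alpha parameter below are illustrative assumptions, not the paper's definition:

```python
import math

def tradeoff_score(accuracy: float, tokens_used: float, tokens_baseline: float,
                   alpha: float = 1.0) -> float:
    """Illustrative accuracy-vs-cost trade-off (NOT the paper's E3-Score).

    Accuracy is discounted by a smooth function of the token budget relative
    to a baseline, so shortening a correct chain always raises the score and
    no hard thresholds or discontinuities are involved.
    """
    cost_ratio = max(tokens_used, 1.0) / max(tokens_baseline, 1.0)
    return accuracy / (1.0 + alpha * math.log1p(cost_ratio))

# Example: two methods with the same accuracy but different verbosity.
print(tradeoff_score(accuracy=0.82, tokens_used=900, tokens_baseline=600))
print(tradeoff_score(accuracy=0.82, tokens_used=300, tokens_baseline=600))
```

Because the discount varies smoothly with the cost ratio, the second (shorter) run scores higher at equal accuracy, which mirrors the behavior the abstract attributes to the E3-Score without claiming its actual functional form.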
Related papers
- Characterizing, Evaluating, and Optimizing Complex Reasoning [44.98294610511283]
Large Reasoning Models increasingly rely on reasoning traces with complex internal structures. Existing work lacks a unified answer to three fundamental questions: what defines high-quality reasoning, how to reliably evaluate long, implicitly structured reasoning traces, and how to use such evaluation signals for reasoning optimization.
arXiv Detail & Related papers (2026-02-09T10:51:14Z)
- Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs [20.82580343824728]
Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks. This saturation stems from the dominance of template-based computation and shallow arithmetic decomposition. We introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning.
arXiv Detail & Related papers (2026-01-31T07:09:17Z)
- AI Founding Fathers: A Case Study of GIS Search in Multi-Agent Pipelines [0.0]
Large Language Models (LLMs) show exceptional fluency, but efforts persist to extract stronger reasoning capabilities from them. This paper advances a systematic framework for understanding LLM reasoning and optimization.
arXiv Detail & Related papers (2025-11-12T05:52:55Z)
- Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal [13.035073453917088]
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in code reasoning by scaling up the length of Chain-of-Thought (CoT). We propose ASAP (Anchor-guided, Surprisal-based Pruning), a novel coarse-to-fine framework for CoT compression. ASAP achieves state-of-the-art accuracy across multiple code generation benchmarks while substantially reducing training and inference costs. (A minimal illustrative sketch of surprisal-based step pruning appears after this list.)
arXiv Detail & Related papers (2025-08-08T03:46:21Z)
- Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It [1.6261897792391753]
We conduct a systematic audit of three widely used reasoning benchmarks: SocialIQa, FauxPas-EAI, and ToMi. We uncover pervasive flaws in both benchmark items and evaluation methodology.
arXiv Detail & Related papers (2025-06-30T13:57:28Z)
- PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty. It learns to compress reasoning length in accordance with scene complexity and predictive confidence. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z)
- Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens [51.90059610606049]
This paper revisits the efficiency of such reasoning processes through an information-theoretic lens. We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution. Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high. (An illustrative sketch of entropy-based halting appears after this list.)
arXiv Detail & Related papers (2025-05-23T13:38:56Z)
- Efficient Inference for Large Reasoning Models: A Survey [74.17203483365171]
Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models (LLMs) by learning to reason. However, their deliberative reasoning process leads to inefficiencies in token usage, memory consumption, and inference time. This survey provides a review of efficient inference methods designed specifically for LRMs, focusing on mitigating token inefficiency while preserving the reasoning quality.
arXiv Detail & Related papers (2025-03-29T13:27:46Z)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [49.61246073215651]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains. However, they also introduce significant computational overhead due to verbose and redundant outputs.
arXiv Detail & Related papers (2025-03-20T17:59:38Z)
- Quantifying Logical Consistency in Transformers via Query-Key Alignment [20.636818928993684]
We propose a novel, lightweight evaluation strategy for logical reasoning. By performing a single forward pass and extracting a "QK-score" from carefully chosen attention heads, our method reveals latent representations that reliably separate valid from invalid inferences.
arXiv Detail & Related papers (2025-02-24T10:02:50Z)
- Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge [78.28188747489769]
We propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions. Our method achieves new state-of-the-art performance for generative reward models on RewardBench.
arXiv Detail & Related papers (2025-01-30T02:21:59Z)
- Language Model Preference Evaluation with Multiple Weak Evaluators [89.90733463933431]
We introduce PGED, a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for acyclic, non-contradictory evaluation results. We demonstrate PGED's superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning.
arXiv Detail & Related papers (2024-10-14T01:57:25Z)
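As referenced in the ASAP entry above, a minimal sketch of surprisal-based step pruning is shown below. The abstract only says that pruning is anchor-guided and driven by first-token surprisal; the helper name, the keep-ratio heuristic, and the use of precomputed log-probabilities are assumptions for illustration, not the paper's actual pipeline:

```python
from typing import List

def prune_by_first_token_surprisal(steps: List[str],
                                   first_token_logprobs: List[float],
                                   keep_ratio: float = 0.5) -> List[str]:
    """Keep the reasoning steps whose first token was most surprising
    (lowest log-probability), on the intuition that highly predictable
    steps add little information and can be dropped to shorten the chain."""
    assert len(steps) == len(first_token_logprobs)
    surprisal = [-lp for lp in first_token_logprobs]  # surprisal in nats
    k = max(1, int(len(steps) * keep_ratio))
    keep = sorted(range(len(steps)), key=lambda i: surprisal[i], reverse=True)[:k]
    return [steps[i] for i in sorted(keep)]           # preserve original order

# Toy example: the log-probs would come from a language model scoring each
# step's first token given the preceding context.
steps = ["Restate the problem.", "Note the key constraint.",
         "Therefore the answer is 42.", "Double-check the arithmetic."]
logps = [-0.2, -2.1, -3.0, -0.4]
print(prune_by_first_token_surprisal(steps, logps, keep_ratio=0.5))
```

Similarly, the "Think or Not?" entry mentions an entropy-based Adaptive Think strategy that halts reasoning once confidence is sufficiently high. The sketch below shows the general idea under assumed details; the threshold value, the per-step answer distributions, and the stopping rule are illustrative, not the paper's method:

```python
import math
from typing import Dict, List

def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a distribution over answer options."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def adaptive_think(answer_dist_per_step: List[Dict[str, float]],
                   threshold: float = 0.3) -> int:
    """Return the index of the first reasoning step at which the model's
    answer distribution is confident enough (entropy below threshold),
    i.e. where an adaptive strategy could stop generating further steps.
    If no step is confident enough, all steps are used."""
    for i, dist in enumerate(answer_dist_per_step):
        if entropy(dist) < threshold:
            return i
    return len(answer_dist_per_step) - 1

# Toy example: confidence rises as reasoning proceeds.
steps = [
    {"A": 0.4, "B": 0.35, "C": 0.25},   # high entropy -> keep thinking
    {"A": 0.7, "B": 0.2, "C": 0.1},     # still above threshold
    {"A": 0.93, "B": 0.05, "C": 0.02},  # confident -> stop here
]
print(adaptive_think(steps))  # prints 2
```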
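In practice, both sketches would be driven by the backbone model itself: the per-step log-probabilities and per-step answer distributions would be obtained by scoring candidate tokens or answer options with the LLM given the reasoning prefix generated so far.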
This list is automatically generated from the titles and abstracts of the papers on this site.