Entropy-Gated Branching for Efficient Test-Time Reasoning
- URL: http://arxiv.org/abs/2503.21961v3
- Date: Tue, 07 Oct 2025 16:06:25 GMT
- Title: Entropy-Gated Branching for Efficient Test-Time Reasoning
- Authors: Xianzhi Li, Ethan Callanan, Abdellah Ghassel, Xiaodan Zhu,
- Abstract summary: Test-time compute methods can significantly improve the reasoning capabilities and problem-solving accuracy of large language models (LLMs)<n>We propose Entropy-Gated Branching (EGB), which branches only at high-uncertainty steps and prunes expansions with a lightweight verifier.<n>On mathematical and financial reasoning benchmarks, EGB improves accuracy by 22.6% over standard inference while operating 31%-75% faster across math benchmarks than test-time beam search with higher performance.
- Score: 21.810952984561116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time compute methods can significantly improve the reasoning capabilities and problem-solving accuracy of large language models (LLMs). However, these approaches require substantially more computational resources, with most compute wasted on exploring low-diversity branches where the model already exhibits high confidence. We observe that a small subset of uncertain reasoning steps has a disproportionately large impact on final prediction accuracy, and branching at these critical junctures tends to yield more diverse and higher-quality candidate reasoning steps. We propose Entropy-Gated Branching (EGB), which branches only at high-uncertainty steps and prunes expansions with a lightweight verifier. On mathematical and financial reasoning benchmarks, EGB improves accuracy by 22.6% over standard inference while operating 31%-75% faster across math benchmarks than test-time beam search with higher performance. Our results show that dynamic resource allocation during inference can substantially improve both efficiency and effectiveness, offering a more scalable pathway to enhanced LLM reasoning capabilities.
Related papers
- ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference [60.958331943869126]
ODAR-Expert is an adaptive routing framework that optimize the accuracy-efficiency trade-off via principled resource allocation.<n>We show strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam.
arXiv Detail & Related papers (2026-02-27T05:22:01Z) - Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components.<n>The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points.<n> Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z) - Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention [46.18660010248197]
Minimal Test-Time Intervention (MTI) is a training-free framework that enhances reasoning accuracy and stability with minimal overhead.<n>MTI yields consistent gains across general, coding, and STEM tasks-e.g., +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning.
arXiv Detail & Related papers (2025-10-15T17:59:45Z) - Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling [12.835376812101323]
We introduce the hypothesis that PRMs are also Partial Reward Models.<n>This allows for principled early rejection based on intermediate token-level signals.<n>On math reasoning benchmarks, our method achieves up to 1.4$times$-9$times$ reduction in inference FLOPs without degrading final performance.
arXiv Detail & Related papers (2025-08-04T00:58:56Z) - Inverse Scaling in Test-Time Compute [51.16323216811257]
Extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance.<n>We identify five distinct failure modes when models reason for longer.<n>These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns.
arXiv Detail & Related papers (2025-07-19T00:06:13Z) - Thinking About Thinking: SAGE-nano's Inverse Reasoning for Self-Aware Language Models [0.0]
Large Language Models (LLMs) have demonstrated remarkable capabilities at solving complex reasoning tasks with Chain-of-Thought prompting.<n>We introduce textbfinverse reasoning, a novel paradigm enabling LLMs to decompose and explain their own reasoning chains post-hoc.<n>Our work creates new avenues for transparent AI systems and closes significant gaps in AI safety, education, and scientific discovery.
arXiv Detail & Related papers (2025-06-30T09:53:41Z) - Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning [71.3533541927459]
We propose a novel data selection paradigm termed Activation Reasoning Potential (RAP)<n>RAP identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning.<n>Our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%.
arXiv Detail & Related papers (2025-06-05T08:40:24Z) - Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models [0.0]
We show that errors are not uniformly distributed but are concentrated at sparse "key tokens" representing critical decision junctions.<n>We propose a framework for next-generation systems centered on selective preservation of semantically vital tokens.
arXiv Detail & Related papers (2025-05-30T03:57:31Z) - PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty.<n>It learns to compress reasoning length in accordance with scene complexity and predictive confidence.<n> Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z) - Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models [68.96619605651155]
Large reasoning models (LRMs) may drastically increase the output length due to overthinking.<n>We propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns.<n>Our method achieves up to a 12% accuracy improvement and reducing token usage from approximately 5,000 to 3,000 tokens.
arXiv Detail & Related papers (2025-05-27T20:59:29Z) - Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens [51.90059610606049]
This paper revisits the efficiency of such reasoning processes through an information-theoretic lens.<n>We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution.<n>Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high.
arXiv Detail & Related papers (2025-05-23T13:38:56Z) - Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning [81.50681925980135]
We propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps.<n>It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making.<n>Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results.
arXiv Detail & Related papers (2025-05-23T12:42:50Z) - Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning [5.774374541467044]
We present FAST-GRPO, a variant of GRPO that dynamically adapts reasoning depth based on question characteristics.<n>We show that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model.
arXiv Detail & Related papers (2025-04-25T16:11:23Z) - Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods [39.89239733570008]
This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models.
We find that non-reasoning models, even with an extremely high inference budget, still fall substantially behind reasoning models.
For reasoning models, majority voting proves to be a robust inference strategy, generally competitive or outperforming other more sophisticated ITC methods.
arXiv Detail & Related papers (2025-04-18T19:32:55Z) - THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models [65.39456695678713]
We introduce approximate measures of problem-level difficulty and demonstrate that a clear relationship between problem difficulty and optimal token spend exists.
We find that in general, reasoning models are poorly calibrated, particularly on easy problems.
We introduce THOUGHTTERMINATOR, a training-free black box decoding technique that significantly improves reasoning model calibration.
arXiv Detail & Related papers (2025-04-17T22:16:30Z) - Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models [17.673293240849787]
We introduce SPHERE, a self-evolving data generation pipeline that enhances reasoning in small language models (SLMs)<n> SPHERE operates in three stages: (i) Self-Generation, where the model autonomously constructs problem-solving steps; (ii) Self-Correction, enabling it to identify and rectify errors; and (iii) Diversity Induction, improving robustness through multiple valid reasoning trajectories.<n>We show that SPHERE-trained models achieve significant gains over their base versions and match/surpass GPT-4o on certain benchmarks.
arXiv Detail & Related papers (2025-03-04T14:43:25Z) - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks.<n>We explore whether scaling with longer CoTs can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z) - Quantifying Logical Consistency in Transformers via Query-Key Alignment [20.636818928993684]
We propose a novel, lightweight evaluation strategy for logical reasoning.<n>By computing a single forward pass and extracting a "QK-score" from carefully chosen heads, our method reveals latent representations that reliably separate valid from invalid inferences.
arXiv Detail & Related papers (2025-02-24T10:02:50Z) - Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving [0.0]
Recent advances in large language models (LLMs) have predominantly focused on maximizing accuracy and reasoning capabilities.<n>This paper investigates the potential synergy between reasoning enhancement and computational efficiency by analyzing the integration of two contrasting approaches.
arXiv Detail & Related papers (2024-12-20T08:42:45Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs)
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - Token-Supervised Value Models for Enhancing Mathematical Problem-Solving Capabilities of Large Language Models [56.32800938317095]
Existing verifiers are sub-optimal for tree search techniques at test time.<n>We propose token-supervised value models (TVMs)<n>TVMs assign each token a probability that reflects the likelihood of reaching the correct final answer.
arXiv Detail & Related papers (2024-07-12T13:16:50Z) - Language Model Cascades: Token-level uncertainty and beyond [65.38515344964647]
Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks.
Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs.
We show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform simple aggregation strategies.
arXiv Detail & Related papers (2024-04-15T21:02:48Z) - Exposing Attention Glitches with Flip-Flop Language Modeling [55.0688535574859]
This work identifies and analyzes the phenomenon of attention glitches in large language models.
We introduce flip-flop language modeling (FFLM), a family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models.
We find that Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques.
arXiv Detail & Related papers (2023-06-01T17:44:35Z) - Identifying and Mitigating Spurious Correlations for Improving
Robustness in NLP Models [19.21465581259624]
Many problems can be attributed to models exploiting spurious correlations, or shortcuts between the training data and the task labels.
In this paper, we aim to automatically identify such spurious correlations in NLP models at scale.
We show that our proposed method can effectively and efficiently identify a scalable set of "shortcuts", and mitigating these leads to more robust models in multiple applications.
arXiv Detail & Related papers (2021-10-14T21:40:03Z) - AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.