Related papers: Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning

Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning

URL: http://arxiv.org/abs/2511.04654v1
Date: Thu, 06 Nov 2025 18:43:16 GMT
Title: Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning
Authors: Mohammad Atif Quamar, Mohammad Areeb,
Abstract summary: Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models.<n>We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates the generation once both signals plateau, indicating the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30--35% and latency by 27%, while incurring a 10 p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.

Related papers

The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus [1.7798494187119376]
We introduce PoLR, the first inference-time method to leverage prefix consistency for compute-efficient reasoning.<n>PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands all paths in that cluster.<n> Empirically, PoLR consistently matches or exceeds SC across GSM8K, MATH500, AIME24/25, and GPQA-DIAMOND.
arXiv Detail & Related papers (2026-01-29T10:14:24Z)
ENTRA: Entropy-Based Redundancy Avoidance in Large Language Model Reasoning [30.786062954495403]
Large Reasoning Models (LRMs) often suffer from overthinking, generating unnecessarily long reasoning chains even for simple tasks.<n>We propose ENTRA, an entropy-based training framework that suppresses redundant reasoning while preserving performance.
arXiv Detail & Related papers (2026-01-12T01:26:30Z)
Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM)<n>CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector.<n>We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z)
SIM-CoT: Supervised Implicit Chain-of-Thought [108.30049193668083]
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models.<n>We identify a core latent instability issue when scaling the computational budget of implicit CoT.<n>We propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space.
arXiv Detail & Related papers (2025-09-24T17:01:32Z)
Adaptive Overclocking: Dynamic Control of Thinking Path Length via Real-Time Reasoning Signals [8.264189366042675]
We propose Adaptive Overclocking, a method that makes the hyper parameter $alpha$ dynamic and context-aware.<n>Our method adjusts reasoning speed in real time through two complementary signals.<n> Experiments on GSM8K, MATH, and SVAMP show that HAC achieves superior accuracy-latency trade-offs.
arXiv Detail & Related papers (2025-09-21T09:40:27Z)
R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances problem-solving ability of large language models.<n>CoT incurs substantial inference cost due to long autoregressive trajectories.<n>We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z)
Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach.<n>We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding.<n>As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z)
Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling.<n>We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
Continuous Speculative Decoding for Autoregressive Image Generation [27.308442169466975]
Continuous visual autoregressive (AR) models have demonstrated promising performance in image generation.<n> speculative decoding has effectively accelerated discrete autoregressive inference.<n>This work addresses challenges from low acceptance rate, inconsistent output distribution, and modified distribution without analytic expression.
arXiv Detail & Related papers (2024-11-18T09:19:15Z)
Training Chain-of-Thought via Latent-Variable Inference [30.21067593018967]
Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a chain-of-thought'' prompt. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers. We propose a fine-tuning strategy that tries to maximize the emphmarginal log-likelihood of generating a correct answer using CoT prompting.
arXiv Detail & Related papers (2023-11-28T17:47:32Z)
Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep. We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.