Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals
- URL: http://arxiv.org/abs/2511.00699v2
- Date: Tue, 04 Nov 2025 03:17:16 GMT
- Title: Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals
- Authors: Sophie Li, Nicholas Huang, Nayan Saxena, Nina Luo, Vincent Lin, Kevin Zhu, Sunishchal Dev
- Abstract summary: Self-Truncation Best-of-N (ST-BoN) mitigates the cost of fully generating all candidate branches by truncating unpromising paths early. We present the KL-Adjusted Pruned Path Algorithm (KAPPA), an inference-time method that combines Kullback-Leibler divergence, confidence, and entropy into a principled scoring function.
- Score: 6.5422130090856925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) improve reasoning accuracy when generating multiple candidate solutions at test time, but standard methods like Best-of-N (BoN) incur high computational cost by fully generating all branches. Self-Truncation Best-of-N (ST-BoN) mitigates this by truncating unpromising paths early, but its reliance on consistency-based heuristics is a limitation, as it does not directly evaluate branch quality. We present the KL-Adjusted Pruned Path Algorithm (KAPPA), an inference-time method that combines Kullback-Leibler divergence, confidence, and entropy into a principled scoring function to guide progressive pruning. By promoting diversity during exploration and selectively eliminating low-scoring branches, KAPPA maintains accuracy while substantially reducing memory and token usage. Experiments on GSM8K and MATH500 with DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct demonstrate that KAPPA stabilizes performance in smaller models and achieves up to ~60% reduction in peak memory and ~90% reduction in total token generation relative to BoN, with minimal impact on accuracy.
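As a rough illustration of the scoring idea described in the abstract, the sketch below combines a branch's KL divergence from a consensus distribution, its confidence, and its entropy into a single score, then keeps only the top-scoring branches. The function names, the weights `alpha`/`beta`/`gamma`, the use of the mean over live branches as the reference distribution, and the sign conventions are all assumptions for illustration; the paper's exact formulation is not reproduced here.

```python
import numpy as np

def kappa_score(branch_probs, ref_probs, alpha=1.0, beta=1.0, gamma=1.0, eps=1e-12):
    """Hypothetical KAPPA-style branch score (weights and signs are illustrative,
    not the paper's exact combination).

    branch_probs: next-token distribution of one candidate branch.
    ref_probs:    a reference distribution, e.g. the mean over all live branches.
    """
    p = np.clip(branch_probs, eps, 1.0)
    q = np.clip(ref_probs, eps, 1.0)
    kl = np.sum(p * np.log(p / q))      # divergence from the consensus (diversity)
    confidence = np.max(p)              # peak probability of the branch
    entropy = -np.sum(p * np.log(p))    # uncertainty of the branch
    # Higher divergence and confidence raise the score; higher entropy lowers it.
    return alpha * kl + beta * confidence - gamma * entropy

def prune_branches(branch_dists, keep_ratio=0.5):
    """Progressively drop the lowest-scoring branches, keeping a fixed fraction."""
    ref = np.mean(branch_dists, axis=0)
    scores = [kappa_score(p, ref) for p in branch_dists]
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-k:]      # indices of the top-k branches
    return sorted(keep.tolist())

# Toy usage: 4 candidate branches over a 3-token vocabulary.
dists = np.array([
    [0.70, 0.20, 0.10],
    [0.40, 0.40, 0.20],
    [0.10, 0.10, 0.80],
    [0.34, 0.33, 0.33],
])
print(prune_branches(dists, keep_ratio=0.5))
```

In an actual decoding loop this scoring would be applied repeatedly as branches grow, so low-scoring paths are cut before they are fully generated, which is where the memory and token savings come from.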
Related papers
- ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference [60.958331943869126]
ODAR-Expert is an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. We show strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam.
arXiv Detail & Related papers (2026-02-27T05:22:01Z) - Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models [31.422773877490613]
Reasoning LLMs (RLMs) deliver strong multi-step reasoning through chain-of-thought generation. RLMs' large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. We introduce RESP, a structured pruning framework that aligns pruning decisions with the model's reasoning dynamics.
arXiv Detail & Related papers (2025-12-01T20:27:05Z) - EDIT: Early Diffusion Inference Termination for dLLMs Based on Dynamics of Training Gradients [6.736735746633275]
Diffusion-based large language models (dLLMs) refine token generations through iterative denoising, but answers often stabilize before all steps complete. We propose EDIT, an inference-time criterion that adaptively stops denoising once sufficient reasoning stability relative to training-time reasoning is detected.
arXiv Detail & Related papers (2025-11-29T23:47:47Z) - TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs [57.217593337454026]
TokenSqueeze is a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. We show that TokenSqueeze reduces token usage while maintaining accuracy on the MATH500 benchmark.
arXiv Detail & Related papers (2025-11-17T10:38:56Z) - Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework built on two synergistic components. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z) - Intra-request branch orchestration for efficient LLM reasoning [52.68946975865865]
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks. Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
arXiv Detail & Related papers (2025-09-29T15:52:08Z) - NIRVANA: Structured pruning reimagined for large language models compression [50.651730342011014]
We introduce NIRVANA, a novel pruning method designed to balance immediate zero-shot accuracy preservation with robust fine-tuning. To further address the unique challenges posed by structured pruning, NIRVANA incorporates an adaptive sparsity allocation mechanism across layers and modules. Experiments conducted on Llama3, Qwen, and T5 models demonstrate that NIRVANA outperforms existing structured pruning methods under equivalent sparsity constraints.
arXiv Detail & Related papers (2025-09-17T17:59:00Z) - Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z) - Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence [38.30075427255948]
Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs. This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models.
arXiv Detail & Related papers (2025-05-23T18:19:09Z) - Stable Reinforcement Learning for Efficient Reasoning [2.838966689544288]
GRPO-$\lambda$ is an efficient and stabilized variant of GRPO. It dynamically adjusts the reward strategy by monitoring the correctness ratio. It improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.
arXiv Detail & Related papers (2025-05-23T16:43:03Z) - Fast Quiet-STaR: Thinking Without Thought Tokens [51.79231070632772]
Fast Quiet-STaR is a more efficient reasoning framework that preserves the benefits of token-level reasoning while reducing computational cost. Our method introduces a curriculum-learning-based training strategy that gradually reduces the number of thought tokens. Experiments on four benchmark datasets with Mistral 7B and Qwen2.5 7B demonstrate that Fast Quiet-STaR consistently outperforms Quiet-STaR in terms of average accuracy.
arXiv Detail & Related papers (2025-05-23T11:14:12Z) - Dynamic Early Exit in Reasoning Models [21.30793518631921]
Overthinking in long chain-of-thought (CoT) generation not only slows down problem solving but also risks accuracy loss. We propose a simple yet effective method that allows LLMs to self-truncate CoT sequences via early exit during generation. Our method requires no additional training and can be seamlessly integrated into existing o1-like reasoning LLMs.
arXiv Detail & Related papers (2025-04-22T13:36:53Z) - GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization [84.57695474130273]
Gate-based or importance-based pruning methods aim to remove channels whose importance is smallest.
GDP can be plugged in before convolutional layers, without bells and whistles, to control whether each channel is on or off.
Experiments conducted on the CIFAR-10 and ImageNet datasets show that the proposed GDP achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-09-06T03:17:10Z)