Do Stop Me Now: Detecting Boilerplate Responses with a Single Iteration
- URL: http://arxiv.org/abs/2510.22679v1
- Date: Sun, 26 Oct 2025 13:43:56 GMT
- Title: Do Stop Me Now: Detecting Boilerplate Responses with a Single Iteration
- Authors: Yuval Kainan, Shaked Zychlinski
- Abstract summary: Large Language Models (LLMs) often expend significant computational resources generating boilerplate responses. We propose a simple yet highly effective method for detecting such responses after only a single generation step.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) often expend significant computational resources generating boilerplate responses, such as refusals, simple acknowledgements and casual greetings, which adds unnecessary cost and latency. To address this inefficiency, we propose a simple yet highly effective method for detecting such responses after only a single generation step. We demonstrate that the log-probability distribution of the first generated token serves as a powerful signal for classifying the nature of the entire subsequent response. Our experiments, conducted across a diverse range of small, large, and reasoning-specialized models, show that the first-token log-probability vectors form distinctly separable clusters for different response types. Using a lightweight k-NN classifier, we achieve high accuracy in predicting whether a response will be a substantive answer or a form of boilerplate response, including user-specified refusals. The primary implication is a practical, computationally trivial technique, optimizing LLM inference by enabling early termination or redirection to a smaller model, thereby yielding significant savings in computational cost. This work presents a direct path toward more efficient and sustainable LLM deployment.
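The pipeline described in the abstract (take the first-token log-probability vector, then classify it with k-NN) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the four-entry vocabulary, the example logits, the label names, the Euclidean distance metric, and `k=3` are all assumptions made for the example. In a real deployment the vector would come from the model's first decoding step over its full vocabulary.

```python
import math
from collections import Counter


def log_softmax(logits):
    """Convert raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def knn_predict(train_vectors, train_labels, query, k=3):
    """Majority vote among the k training vectors nearest to `query`.

    Euclidean distance is an assumption here; the abstract only says
    "a lightweight k-NN classifier".
    """
    nearest = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical first-token logits over a toy 4-token vocabulary.
# Real vectors would have one entry per vocabulary token.
train_logits = [
    ([5.0, 1.0, 0.0, 0.0], "boilerplate"),
    ([4.5, 1.2, 0.1, 0.0], "boilerplate"),
    ([5.2, 0.8, 0.0, 0.2], "boilerplate"),
    ([0.0, 1.0, 4.0, 3.0], "substantive"),
    ([0.2, 0.8, 4.2, 2.9], "substantive"),
    ([0.0, 1.1, 3.8, 3.2], "substantive"),
]
train_vectors = [log_softmax(logits) for logits, _ in train_logits]
train_labels = [label for _, label in train_logits]

# Classify a new prompt after a single generation step.
query = log_softmax([4.8, 1.0, 0.1, 0.1])
predicted = knn_predict(train_vectors, train_labels, query, k=3)
stop_early = predicted == "boilerplate"
```

When `stop_early` is true, a serving layer could terminate generation or hand the request to a smaller model, which is the cost saving the abstract points to.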
Related papers
- GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts [10.808072653940263]
Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models. We propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold.
arXiv Detail & Related papers (2026-01-08T16:58:07Z)
- Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding. We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z)
- Learning to Reason in LLMs by Expectation Maximization [55.721496945401846]
We formalize reasoning as a latent variable model and derive an expectation-maximization objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution that generates rationales that justify correct answers.
arXiv Detail & Related papers (2025-12-23T08:56:49Z)
- LaSeR: Reinforcement Learning with Last-Token Self-Rewarding [54.72617309922891]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). Previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. We propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss.
arXiv Detail & Related papers (2025-10-16T17:55:11Z)
- Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling [38.27469349005585]
Test-time scaling is a powerful paradigm for enhancing the reasoning capabilities of large language models. Test-time scaling is inherently inefficient due to the generation of redundant and repetitive reasoning traces. We introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating test-time scaling.
arXiv Detail & Related papers (2025-08-30T01:54:55Z)
- Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework. GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution. We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z)
- Representation Consistency for Accurate and Coherent LLM Answer Aggregation [20.494987341489573]
Representation consistency (RC) is a test-time scaling method for aggregating answers drawn from multiple candidate responses of a large language model. RC enhances answer aggregation by considering the number of occurrences of each answer in the candidate response set. Our method only uses cached activations and lightweight similarity computations and requires no additional model queries.
arXiv Detail & Related papers (2025-06-18T05:07:47Z)
- LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling [39.61818305829112]
We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step. PIR identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components. Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets.
arXiv Detail & Related papers (2025-05-25T15:17:57Z)
- Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling. We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
- Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models. We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z)