Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling
- URL: http://arxiv.org/abs/2511.09345v1
- Date: Thu, 13 Nov 2025 01:47:53 GMT
- Title: Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling
- Authors: Shiyu Ji, Yixuan Wang, Yijun Liu, Qingfu Zhu, Wanxiang Che
- Abstract summary: Test-time scaling improves the inference performance of Large Language Models (LLMs) but also incurs substantial computational costs. We propose SeerSC, a dynamic self-consistency framework that simultaneously improves token efficiency and latency.
- Score: 55.026048429595384
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Test-time scaling improves the inference performance of Large Language Models (LLMs) but also incurs substantial computational costs. Although recent studies have reduced token consumption through dynamic self-consistency, they remain constrained by the high latency of sequential requests. In this paper, we propose SeerSC, a dynamic self-consistency framework that simultaneously improves token efficiency and latency by integrating System 1 and System 2 reasoning. Specifically, we use the rapid System 1 to compute the answer entropy of a given query. This score is then used to estimate whether the query will benefit from further scaling, enabling dynamic self-consistency under System 2. Benefiting from the accurate, ahead-of-time budget estimation provided by System 1, the proposed method reduces token usage while also achieving a significant decrease in latency through parallel generation. It outperforms existing methods, achieving up to a 47% reduction in token consumption and a 43% reduction in inference latency without significant performance loss.
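Since the abstract only outlines the mechanism, the snippet below is a minimal, hypothetical Python sketch of entropy-gated self-consistency in the spirit of SeerSC: cheap System-1 samples yield an answer-entropy score, and only high-entropy (uncertain) queries receive the full System-2 sampling budget. The helper functions `fast_generate` and `slow_generate`, the sample counts, and the entropy threshold are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical sketch of entropy-gated adaptive self-consistency.
# `fast_generate` / `slow_generate` and all thresholds are assumptions.
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_gated_self_consistency(prompt, fast_generate, slow_generate,
                                   n_fast=8, n_slow=16, entropy_threshold=0.5):
    # Step 1 (System 1): draw cheap, short-form answers and score their entropy.
    fast_answers = [fast_generate(prompt) for _ in range(n_fast)]
    h = answer_entropy(fast_answers)

    # Step 2: low entropy -> the query looks easy; keep the cheap majority answer.
    if h <= entropy_threshold:
        return Counter(fast_answers).most_common(1)[0][0]

    # Step 3 (System 2): high entropy -> spend the full reasoning budget and
    # aggregate by majority vote (sequential loop here only for readability).
    slow_answers = [slow_generate(prompt) for _ in range(n_slow)]
    return Counter(slow_answers).most_common(1)[0][0]
```

In the paper's setting, the System-2 samples would be issued in parallel once the budget is fixed, which is where the reported latency reduction comes from alongside the token savings.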
Related papers
- Agentic Test-Time Scaling for WebAgents [65.5178428849495]
We present Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over ReAct while using up to 2.3x fewer tokens than uniform scaling.
arXiv Detail & Related papers (2026-02-12T18:58:30Z) - TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs [57.217593337454026]
TokenSqueeze is a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. We show that TokenSqueeze reduces token usage while maintaining accuracy on the MATH500 benchmark.
arXiv Detail & Related papers (2025-11-17T10:38:56Z) - LaSeR: Reinforcement Learning with Last-Token Self-Rewarding [54.72617309922891]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). Previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. We propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss.
arXiv Detail & Related papers (2025-10-16T17:55:11Z) - Intra-request branch orchestration for efficient LLM reasoning [52.68946975865865]
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks. Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
arXiv Detail & Related papers (2025-09-29T15:52:08Z) - ATTS: Asynchronous Test-Time Scaling via Conformal Prediction [112.54016379556073]
Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency. We introduce ATTS (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive scaling framework. We show that ATTS delivers up to a 56.7x speedup in test-time scaling and a 4.14x throughput improvement.
arXiv Detail & Related papers (2025-09-18T16:55:09Z) - Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency [3.6199690908942546]
Self-Consistency (SC) generates multiple reasoning chains in parallel and selects the final answer via majority voting. We propose Slim-SC, a step-wise pruning strategy that identifies and removes redundant chains using inter-chain similarity at the thought level (a toy sketch of such pruning appears after this list). Experiments show that Slim-SC reduces latency and KVC usage by up to 45% and 26%, respectively, with R1-Distill.
arXiv Detail & Related papers (2025-09-17T14:00:51Z) - EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving [64.15371139980802]
Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP). We show that different test-time scaling strategies for ATP models introduce significant computational overhead for inference. We propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits.
arXiv Detail & Related papers (2025-09-16T03:00:13Z) - Latency and Token-Aware Test-Time Compute [3.573250939705335]
Inference-time scaling can improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. We formulate inference-time scaling as a problem of dynamic compute allocation and method selection. Our framework explicitly incorporates both token cost and wall-clock latency, the latter being critical for user experience and particularly for agentic models.
arXiv Detail & Related papers (2025-09-11T21:35:19Z) - Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z) - Path-Consistency with Prefix Enhancement for Efficient Inference in LLMs [3.6696973040141034]
Our empirical results demonstrate that path-consistency improves inference latency by up to 40.5%, while maintaining task accuracy across various tasks.
arXiv Detail & Related papers (2024-08-25T01:45:53Z) - An Efficiency Study for SPLADE Models [5.725475501578801]
In this paper, we focus on improving the efficiency of the SPLADE model.
We propose several techniques including L1 regularization for queries, a separation of document/query encoders, a FLOPS-regularized middle-training, and the use of faster query encoders.
arXiv Detail & Related papers (2022-07-08T11:42:05Z)
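As a companion to the Slim-SC entry above, the following toy Python sketch illustrates step-wise chain pruning: a chain whose partial reasoning is highly similar to an already-kept chain is dropped before further decoding. Jaccard overlap on whitespace tokens is only a stand-in for the paper's thought-level similarity, and the 0.8 threshold is an arbitrary illustrative choice, not the authors' setting.

```python
# Toy sketch of step-wise chain pruning in the spirit of Slim-SC; token-level
# Jaccard overlap is an assumed proxy for thought-level similarity.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)


def prune_redundant_chains(partial_chains: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a chain only if it is not a near-duplicate of an already-kept chain."""
    kept: list[str] = []
    for chain in partial_chains:
        if all(jaccard(chain, other) < threshold for other in kept):
            kept.append(chain)
    return kept


# Example: the second chain is nearly identical to the first and gets pruned.
chains = [
    "add 3 and 4 to get 7 then multiply by 2",
    "add 3 and 4 to get 7 then multiply it by 2",
    "compute 2 * (3 + 4) = 14 directly",
]
print(prune_redundant_chains(chains))
```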
This list is automatically generated from the titles and abstracts of the papers on this site.