Predicting and improving test-time scaling laws via reward tail-guided search
- URL: http://arxiv.org/abs/2602.01485v1
- Date: Sun, 01 Feb 2026 23:40:25 GMT
- Title: Predicting and improving test-time scaling laws via reward tail-guided search
- Authors: Muheng Li, Jian Qian, Wenlong Mou,
- Abstract summary: Test-time scaling has emerged as a critical avenue for enhancing the reasoning capabilities of Large Language Models.<n>We propose new methodologies to predict and improve scaling properties via tail-guided search.<n>By estimating the tail distribution of rewards, our method predicts the scaling law of LLMs without the need for exhaustive evaluations.
- Score: 11.49701649103495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-time scaling has emerged as a critical avenue for enhancing the reasoning capabilities of Large Language Models (LLMs). Though the straight-forward ''best-of-$N$'' (BoN) strategy has already demonstrated significant improvements in performance, it lacks principled guidance on the choice of $N$, budget allocation, and multi-stage decision-making, thereby leaving substantial room for optimization. While many works have explored such optimization, rigorous theoretical guarantees remain limited. In this work, we propose new methodologies to predict and improve scaling properties via tail-guided search. By estimating the tail distribution of rewards, our method predicts the scaling law of LLMs without the need for exhaustive evaluations. Leveraging this prediction tool, we introduce Scaling-Law Guided (SLG) Search, a new test-time algorithm that dynamically allocates compute to identify and exploit intermediate states with the highest predicted potential. We theoretically prove that SLG achieves vanishing regret compared to perfect-information oracles, and achieves expected rewards that would otherwise require a polynomially larger compute budget required when using BoN. Empirically, we validate our framework across different LLMs and reward models, confirming that tail-guided allocation consistently achieves higher reward yields than Best-of-$N$ under identical compute budgets. Our code is available at https://github.com/PotatoJnny/Scaling-Law-Guided-search.
Related papers
- $\
abla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop.<n>$nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z) - $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts [55.231201692232894]
$textttSPECS$ is a latency-aware test-time scaling method inspired by speculative decoding.<n>Our results show that $textttSPECS$matches or surpasses beam search accuracy while reducing latency by up to $sim$19.1%.
arXiv Detail & Related papers (2025-06-15T05:50:05Z) - Predictable Scale: Part II, Farseer: A Refined Scaling Law in Large Language Models [62.3458061002951]
We introduce Farseer, a novel and refined scaling law offering enhanced predictive accuracy across scales.<n>By systematically constructing a model loss surface $L(N,D)$, Farseer achieves a significantly better fit to empirical data than prior laws.<n>Our methodology yields accurate, robust, and highly generalizable predictions, demonstrating excellent extrapolation capabilities.
arXiv Detail & Related papers (2025-06-12T17:59:23Z) - Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order [39.25335214877435]
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks.<n>Traditional first-order algorithms incur prohibitive memory and computational costs that scale poorly with model size.<n>We propose zero-order (ZO) optimization methods as a memory- and compute-efficient alternative.
arXiv Detail & Related papers (2025-06-04T20:27:17Z) - Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function.<n>$A$*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks.<n>It reduces training time by up to 2$times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z) - Optimizing Anytime Reasoning via Budget Relative Policy Optimization [70.32755424260336]
We present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance.<n>We truncate the complete thinking process to fit within sampled token budgets from a prior distribution.<n>We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward.
arXiv Detail & Related papers (2025-05-19T17:58:44Z) - Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment [54.787826863212146]
Inference-time computation offers a powerful axis for scaling the performance of language models.<n>We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute.<n>We introduce $textttInferenceTimePessimism$, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute.
arXiv Detail & Related papers (2025-03-27T18:00:08Z) - $φ$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation [22.607133083903125]
In-time optimization scales computation to derive deliberate reasoning steps for effective performance.<n>We frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain globally optimal step estimation.<n>Experiments show $phi$-Decoding outperforms strong baselines in both performance and efficiency.
arXiv Detail & Related papers (2025-03-17T15:38:33Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement
Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.