Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding
- URL: http://arxiv.org/abs/2503.01422v1
- Date: Mon, 03 Mar 2025 11:21:01 GMT
- Title: Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding
- Authors: Yiming Wang, Pei Zhang, Siyuan Huang, Baosong Yang, Zhuosheng Zhang, Fei Huang, Rui Wang
- Abstract summary: Test-time scaling improves large language model performance by adding extra compute during decoding. Best-of-N sampling serves as a common scaling technique, broadening the search space for finding better solutions. We propose Self-Truncation Best-of-N (ST-BoN), a novel decoding method that avoids fully generating all samplings.
- Score: 64.2888389315149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time scaling improves large language model performance by adding extra compute during decoding. Best-of-N (BoN) sampling serves as a common scaling technique, broadening the search space for finding better solutions from the model distribution. However, traditional BoN requires N full generations, leading to high GPU memory overhead and time latency. Moreover, some methods depend on reward models, adding computational cost and limiting domain generalization. In this paper, we propose Self-Truncation Best-of-N (ST-BoN), a novel decoding method that avoids fully generating all samplings and eliminates the need for reward models. ST-BoN introduces early sampling consistency to estimate the most promising sample, truncating suboptimal ones to free memory and accelerate inference. This pushes the sampling-efficient test-time scaling. Compared to traditional BoN, ST-BoN can reduce dynamic GPU memory overhead by over 90% and time latency by 50%, while achieving comparable or even better performance across reasoning and open-ended domains.
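The abstract contrasts traditional Best-of-N with the proposed self-truncation idea. Below is a minimal, hedged Python sketch of both, assuming hypothetical helper callables (`sample_completion`, `score`, `sample_prefix`, `continue_completion`, `similarity`) that stand in for a real LM sampling stack; the pairwise-consistency heuristic is an assumption used to illustrate "early sampling consistency," not the paper's exact estimator.

```python
# Hedged sketch: traditional Best-of-N vs. an ST-BoN-style early truncation.
# All callables are hypothetical stand-ins for a real LLM sampling backend.
from typing import Callable

def best_of_n(prompt: str,
              sample_completion: Callable[[str], str],
              score: Callable[[str], float],
              n: int = 8) -> str:
    """Traditional BoN: fully generate N samples, keep the highest-scoring one.
    Requires N full generations and an external scorer (e.g. a reward model)."""
    candidates = [sample_completion(prompt) for _ in range(n)]
    return max(candidates, key=score)

def self_truncation_bon(prompt: str,
                        sample_prefix: Callable[[str, int], str],
                        continue_completion: Callable[[str], str],
                        similarity: Callable[[str, str], float],
                        n: int = 8,
                        early_len: int = 32) -> str:
    """ST-BoN-style decoding (sketch): generate only short prefixes for all N
    samples, pick the prefix most consistent with its peers, truncate the rest,
    and complete only the winner -- no reward model, no N full generations."""
    prefixes = [sample_prefix(prompt, early_len) for _ in range(n)]

    def consistency(i: int) -> float:
        # Assumed proxy for "early sampling consistency": average similarity
        # of prefix i to every other prefix.
        return sum(similarity(prefixes[i], prefixes[j])
                   for j in range(n) if j != i) / (n - 1)

    best = max(range(n), key=consistency)
    # In a real implementation, the other branches' KV caches would be freed
    # here, which is where the memory and latency savings come from.
    return continue_completion(prompt + prefixes[best])
```

In practice the similarity function might compare hidden-state or embedding representations of the prefixes rather than raw strings, but that choice is left open here.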
Related papers
- Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach.
As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt.
By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access (a generic, hedged sketch of this style of MCMC-based alignment appears after this list).
arXiv Detail & Related papers (2025-04-04T00:41:40Z) - Evaluation of Best-of-N Sampling Strategies for Language Model Alignment [6.4706370001155955]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at the time of decoding. Previous work proposes Regularized BoN sampling (RBoN), a BoN sampling with regularization added to the objective, and shows that it outperforms BoN sampling. This paper proposes an extension of the RBoN framework, called Stochastic RBoN sampling (SRBoN), which provides a theoretical guarantee on the worst-case RBoN proxy reward.
arXiv Detail & Related papers (2025-02-18T09:18:02Z) - Efficient NeRF Optimization -- Not All Samples Remain Equally Hard [9.404889815088161]
We propose an application of online hard sample mining for efficient training of Neural Radiance Fields (NeRF).
NeRF models produce state-of-the-art quality for many 3D reconstruction and rendering tasks but require substantial computational resources.
arXiv Detail & Related papers (2024-08-06T13:49:01Z) - Variational Best-of-N Alignment [57.617866305771756]
Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. We propose to fine-tune the language model to mimic what BoN does during inference. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN).
arXiv Detail & Related papers (2024-07-08T15:59:44Z) - An Efficient Rehearsal Scheme for Catastrophic Forgetting Mitigation during Multi-stage Fine-tuning [55.467047686093025]
A common approach to alleviate such forgetting is to rehearse samples from prior tasks during fine-tuning. We propose a sampling scheme, mix-cd, that prioritizes rehearsal of "collateral damage" samples. Our approach is computationally efficient, easy to implement, and outperforms several leading continual learning methods in compute-constrained settings.
arXiv Detail & Related papers (2024-02-12T22:32:12Z) - Decreasing the Computing Time of Bayesian Optimization using Generalizable Memory Pruning [56.334116591082896]
Running BO on high-dimensional or massive data sets becomes intractable due to the time complexity of the surrogate model. We show a wrapper of memory pruning and bounded optimization capable of being used with any surrogate model and acquisition function.
All model implementations are run on the MIT Supercloud state-of-the-art computing hardware.
arXiv Detail & Related papers (2023-09-08T14:05:56Z) - Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model [0.0]
Excessive overhead leads to high latency and computational costs. We propose a model acceleration approach for large language models.
Our model achieves an 18x FLOPs speedup with an accuracy degradation of less than 8% compared to BERT.
arXiv Detail & Related papers (2023-05-21T13:30:56Z) - Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
TSO identifies the best tensor slicing that minimizes execution time for a set of CNN models, and experimental results confirm its effectiveness.
arXiv Detail & Related papers (2023-04-06T12:03:03Z) - Fast Bayesian Optimization of Needle-in-a-Haystack Problems using Zooming Memory-Based Initialization [73.96101108943986]
A Needle-in-a-Haystack problem arises when there is an extreme imbalance of optimum conditions relative to the size of the dataset.
We present a Zooming Memory-Based Initialization algorithm that builds on conventional Bayesian optimization principles.
arXiv Detail & Related papers (2022-08-26T23:57:41Z)
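The QAlign entry above describes test-time alignment by sampling with MCMC rather than searching over candidates. As a rough illustration only (not QAlign's actual algorithm), the sketch below runs an independence Metropolis sampler whose target is the base model's distribution reweighted by a reward; proposals are fresh samples from the model, and acceptance depends only on reward values, so no logit access is needed. The helpers `sample_completion` and `reward` are hypothetical stand-ins.

```python
# Rough illustration of MCMC-based test-time alignment (independence Metropolis).
# Target distribution: pi(y) proportional to p_model(y) * exp(reward(y) / beta).
# With proposals drawn from p_model itself, the acceptance ratio reduces to
# exp((reward(y') - reward(y)) / beta), so only reward values are needed.
import math
import random
from typing import Callable

def mcmc_align(prompt: str,
               sample_completion: Callable[[str], str],  # hypothetical LM sampler
               reward: Callable[[str, str], float],      # hypothetical reward fn
               steps: int = 64,
               beta: float = 1.0) -> str:
    current = sample_completion(prompt)
    current_r = reward(prompt, current)
    for _ in range(steps):
        proposal = sample_completion(prompt)
        proposal_r = reward(prompt, proposal)
        # Metropolis acceptance for an independence sampler with proposal = p_model.
        accept_prob = min(1.0, math.exp((proposal_r - current_r) / beta))
        if random.random() < accept_prob:
            current, current_r = proposal, proposal_r
    return current
```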
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.