Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation
- URL: http://arxiv.org/abs/2505.17288v1
- Date: Thu, 22 May 2025 21:05:04 GMT
- Title: Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation
- Authors: Seamus Somerstep, Vinod Raman, Unique Subedi, Yuekai Sun,
- Abstract summary: We theoretically compare two methods for adapting large language models to new tasks.<n> supervised fine-tuning outperforms BoN if the learning setting is realizable.<n>If realizability fails, then depending on the failure mode, BoN can enjoy a better rate of convergence.
- Score: 18.735020493881006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Using the bit string generation problem as a case study, we theoretically compare two standard methods for adapting large language models to new tasks. The first, referred to as supervised fine-tuning, involves training a new next token predictor on good generations. The second method, Best-of-N, trains a reward model to select good responses from a collection generated by an unaltered base model. If the learning setting is realizable, we find that supervised fine-tuning outperforms BoN through a better dependence on the response length in its rate of convergence. If realizability fails, then depending on the failure mode, BoN can enjoy a better rate of convergence in either n or a rate of convergence with better dependence on the response length.
Related papers
- Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute [57.16286134405821]
We propose Fractional Reasoning, a framework that enables continuous control over reasoning intensity at inference time.<n>Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor.<n> Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
arXiv Detail & Related papers (2025-06-18T21:15:59Z) - Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding [64.2888389315149]
Test-time scaling improves large language model performance by adding extra compute during decoding.<n>Best-of-N sampling serves as a common scaling technique, broadening the search space for finding better solutions.<n>We propose Self-Truncation Best-of-N (ST-BoN), a novel decoding method that avoids fully generating all samplings.
arXiv Detail & Related papers (2025-03-03T11:21:01Z) - BOND: Aligning LLMs with Best-of-N Distillation [63.254031574394965]
We propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time.
Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution.
We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.
arXiv Detail & Related papers (2024-07-19T18:38:25Z) - Variational Best-of-N Alignment [57.617866305771756]
Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences.<n>We propose to fine-tune the language model to mimic what BoN does during inference.<n>Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN)
arXiv Detail & Related papers (2024-07-08T15:59:44Z) - Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for Language Model Alignment [7.349727826230864]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding.<n>Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise its performance on the true objective.<n>We propose a variant of BoN that aims to mitigate reward hacking at inference time by incorporating the Minimum Bayes Risk (MBR) objective as a proximity regularization term.
arXiv Detail & Related papers (2024-04-01T11:26:50Z) - Contrastive Neural Ratio Estimation for Simulation-based Inference [15.354874711988662]
Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task.
In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term.
We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on.
arXiv Detail & Related papers (2022-10-11T00:12:51Z) - Lookback for Learning to Branch [77.32867454769936]
Bipartite Graph Neural Networks (GNNs) have been shown to be an important component of deep learning based Mixed-Integer Linear Program (MILP) solvers.
Recent works have demonstrated the effectiveness of such GNNs in replacing the branching (variable selection) in branch-and-bound (B&B) solvers.
arXiv Detail & Related papers (2022-06-30T02:33:32Z) - Uncertainty Estimation for Language Reward Models [5.33024001730262]
Language models can learn a range of capabilities from unsupervised training on text corpora.
It is often easier for humans to choose between options than to provide labeled data, and prior work has achieved state-of-the-art performance by training a reward model from such preference comparisons.
We seek to address these problems via uncertainty estimation, which can improve sample efficiency and robustness using active learning and risk-averse reinforcement learning.
arXiv Detail & Related papers (2022-03-14T20:13:21Z) - Belief Propagation Neural Networks [103.97004780313105]
We introduce belief propagation neural networks (BPNNs)
BPNNs operate on factor graphs and generalize Belief propagation (BP)
We show that BPNNs converges 1.7x faster on Ising models while providing tighter bounds.
On challenging model counting problems, BPNNs compute estimates 100's of times faster than state-of-the-art handcrafted methods.
arXiv Detail & Related papers (2020-07-01T07:39:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.