ABoN: Adaptive Best-of-N Alignment
- URL: http://arxiv.org/abs/2505.12050v1
- Date: Sat, 17 May 2025 15:24:48 GMT
- Title: ABoN: Adaptive Best-of-N Alignment
- Authors: Vinod Raman, Hilal Asi, Satyen Kale
- Abstract summary: We propose a prompt-adaptive strategy for Best-of-N alignment that allocates inference-time compute more efficiently. Our method is simple, practical, and compatible with any LM/RM combination.
- Score: 19.22348775001393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in test-time alignment methods, such as Best-of-N sampling, offer a simple and effective way to steer language models (LMs) toward preferred behaviors using reward models (RMs). However, these approaches can be computationally expensive, especially when applied uniformly across prompts without accounting for differences in alignment difficulty. In this work, we propose a prompt-adaptive strategy for Best-of-N alignment that allocates inference-time compute more efficiently. Motivated by latency concerns, we develop a two-stage algorithm: an initial exploratory phase estimates the reward distribution for each prompt using a small exploration budget, and a second stage adaptively allocates the remaining budget using these estimates. Our method is simple, practical, and compatible with any LM/RM combination. Empirical results on the AlpacaEval dataset for 12 LM/RM pairs and 50 different batches of prompts show that our adaptive strategy consistently outperforms uniform allocation with the same inference budget. Moreover, our experiments show that our adaptive strategy remains competitive against uniform allocations with 20% larger inference budgets and even improves in performance as the batch size grows.
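The two-stage procedure described in the abstract lends itself to a short sketch. The Python below is a hedged illustration, not the authors' implementation: the `generate` and `score` callables, the per-prompt exploration size, and the rule that splits the remaining budget in proportion to the estimated score spread are all assumptions; the paper only states that an exploration phase estimates each prompt's reward distribution and that the remaining budget is then allocated adaptively from those estimates.

```python
import numpy as np

def adaptive_best_of_n(prompts, generate, score, total_budget, explore_per_prompt=2):
    """Two-stage, prompt-adaptive Best-of-N sketch.

    generate(prompt) -> one sampled LM response
    score(prompt, response) -> scalar reward-model score
    Assumes total_budget > explore_per_prompt * len(prompts).
    """
    # Stage 1: spend a small exploration budget on every prompt to estimate
    # its reward distribution.
    samples = {p: [] for p in prompts}
    for p in prompts:
        for _ in range(explore_per_prompt):
            resp = generate(p)
            samples[p].append((score(p, resp), resp))

    # Summarize each prompt's estimated reward distribution; the score spread
    # is used here as a stand-in difficulty signal (an assumption, not the
    # paper's allocation rule).
    spread = {p: float(np.std([s for s, _ in samples[p]])) + 1e-6 for p in prompts}

    # Stage 2: allocate the remaining budget in proportion to the estimates.
    # Rounding may shift the total by a few samples; ignored for brevity.
    remaining = total_budget - explore_per_prompt * len(prompts)
    total_spread = sum(spread.values())
    for p in prompts:
        extra = int(round(remaining * spread[p] / total_spread))
        for _ in range(extra):
            resp = generate(p)
            samples[p].append((score(p, resp), resp))

    # Standard Best-of-N selection: return the highest-scoring response per prompt.
    return {p: max(samples[p], key=lambda t: t[0])[1] for p in prompts}
```

A uniform baseline would instead give every prompt total_budget // len(prompts) samples; the adaptive stage shifts samples toward prompts whose exploration scores suggest more to gain from additional draws.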
Related papers
- Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models [23.37504394417425]
We propose HIA (Heuristic-Guided Inference-time Alignment), a tuning-free, black-box-compatible approach that uses a lightweight prompt. We find that HIA is effective under low inference budgets, with as few as one or two response queries.
arXiv Detail & Related papers (2025-08-07T08:54:27Z)
- An Experimental Approach for Running-Time Estimation of Multi-objective Evolutionary Algorithms in Numerical Optimization [16.66619776655723]
We propose an experimental approach for estimating upper bounds on the running time of MOEAs without algorithmic assumptions. We conduct comprehensive experiments on five representative MOEAs using the ZDT and DTLZ benchmark suites. Results demonstrate the effectiveness of our approach in estimating upper bounds on the running time without requiring algorithmic or problem simplifications.
arXiv Detail & Related papers (2025-07-03T07:06:14Z) - ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization [48.50761200321113]
We introduce ConfPO, a method for preference learning in Large Language Models (LLMs). It identifies and optimizes preference-critical tokens based solely on the training policy's confidence, without requiring any auxiliary models or compute. Experimental results on challenging alignment benchmarks, including AlpacaEval 2 and Arena-Hard, demonstrate that ConfPO consistently outperforms uniform DAAs.
arXiv Detail & Related papers (2025-06-10T11:54:22Z)
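The ConfPO entry above attributes token selection entirely to the training policy's confidence. Below is a hedged sketch of that idea, not the paper's algorithm: it assumes "low confidence" means a low log-probability for the realized token and that tokens below a confidence quantile count as preference-critical; the quantile cutoff and the masking step are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def confidence_critical_mask(logits, target_ids, quantile=0.25):
    """Flag low-confidence tokens as preference-critical (illustrative rule).

    logits: (seq_len, vocab_size) policy logits over a response
    target_ids: (seq_len,) ids of the realized response tokens
    Returns a boolean mask over positions; the quantile cutoff is an assumption.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Confidence of the policy in each realized token.
    token_conf = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    threshold = torch.quantile(token_conf, quantile)
    # Tokens the policy is least sure about are treated as critical.
    return token_conf <= threshold
```

A preference-optimization loss would then be computed only over the selected positions rather than uniformly over all tokens, which is the contrast the summary draws with "uniform DAAs".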
- Every Rollout Counts: Optimal Resource Allocation for Efficient Test-Time Scaling [19.673388630963807]
Test-Time Scaling (TTS) improves the performance of Large Language Models (LLMs). How to allocate a fixed rollout budget most effectively during search remains underexplored, often resulting in inefficient use of compute at test time. We propose Direction-Oriented Resource Allocation (DORA), a provably optimal method that mitigates this bias.
arXiv Detail & Related papers (2025-05-30T09:05:25Z)
- PIPA: Preference Alignment as Prior-Informed Statistical Estimation [57.24096291517857]
We introduce Prior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework. PIPA accommodates both paired and unpaired data, as well as answer- and step-level annotations. By integrating different types of prior information, we develop two variations of PIPA: PIPA-M and PIPA-N.
arXiv Detail & Related papers (2025-02-09T04:31:30Z)
- Sequential Stochastic Combinatorial Optimization Using Hierarchal Reinforcement Learning [5.57541853212632]
We propose a two-layer option-based framework that simultaneously decides adaptive budget allocation on the higher layer and node selection on the lower layer. Empirical results show that WS-option exhibits significantly improved effectiveness and generalizability compared to traditional methods.
arXiv Detail & Related papers (2025-02-08T12:00:30Z)
- The Differences Between Direct Alignment Algorithms are a Blur [3.0059120458540383]
Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) with direct optimization on preference data. DAAs can be classified by their ranking losses (pairwise vs. pointwise), by the rewards used in those losses (e.g., likelihood ratios of policy and reference policy, or odds ratios), or by whether a Supervised Fine-Tuning phase is required (two-stage vs. one-stage). These results highlight the importance of careful evaluation to avoid premature claims of performance gains or overall superiority in alignment algorithms.
arXiv Detail & Related papers (2025-02-03T10:54:14Z)
- Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment. We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
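The entry above says the proposed objective combines a preference-optimization loss with a supervised learning loss. The following is a minimal sketch of such a combined objective; the DPO-style pairwise term, the weighting coefficient `lam`, and applying the supervised term to the chosen responses are assumptions for illustration rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def regularized_preference_loss(policy_chosen_logps, policy_rejected_logps,
                                ref_chosen_logps, ref_rejected_logps,
                                beta=0.1, lam=1.0):
    """Preference-optimization loss plus a supervised (SFT) term.

    Inputs are summed log-probabilities of whole responses under the policy
    and a frozen reference model; beta and lam are illustrative constants.
    """
    # DPO-style pairwise preference term on the implicit rewards.
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    pref_loss = -F.logsigmoid(margin).mean()

    # Supervised term: push up the likelihood of the chosen responses; this
    # plays the role of the regularizer described in the summary above.
    sft_loss = -policy_chosen_logps.mean()

    return pref_loss + lam * sft_loss
```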
- $i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization [12.266207199002604]
Large Language Models (LLMs) can sometimes produce outputs that deviate from human expectations.
We propose a novel framework named $i$REPO, which utilizes implicit Reward pairwise difference regression for Empirical Preference Optimization.
We show that $i$REPO effectively achieves self-alignment using soft-label, self-generated responses and the logit of empirical AI annotators.
arXiv Detail & Related papers (2024-05-24T05:42:11Z)
- Active Preference Optimization for Sample Efficient RLHF [27.772423917657626]
Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models with human preferences.
Current methods rely on uniformly picking prompt-generation pairs from a dataset of prompt-generation pairs.
We develop an active-learning algorithm, $\texttt{APO}$, which enhances model alignment by querying preference data.
arXiv Detail & Related papers (2024-02-16T08:19:34Z)
- Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called $\texttt{MEX}$.
$\texttt{MEX}$ integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
- Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z)