Theoretical guarantees on the best-of-n alignment policy
- URL: http://arxiv.org/abs/2401.01879v1
- Date: Wed, 3 Jan 2024 18:39:13 GMT
- Title: Theoretical guarantees on the best-of-n alignment policy
- Authors: Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander D'Amour, Jacob Eisenstein, Chirag Nagpal, and Ananda Theertha Suresh
- Abstract summary: We show that the commonly claimed identity for the KL divergence between the best-of-$n$ policy and the base policy, $\log(n) - (n-1)/n$, is in fact only an upper bound on the true KL divergence.
We propose a new estimator for the KL divergence and empirically show that it provides a tight approximation through a few examples.
- Score: 110.21094183592358
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A simple and effective method for the alignment of generative models is the
best-of-$n$ policy, where $n$ samples are drawn from a base policy, and ranked
based on a reward function, and the highest ranking one is selected. A commonly
used analytical expression in the literature claims that the KL divergence
between the best-of-$n$ policy and the base policy is equal to $\log (n) -
(n-1)/n.$ We disprove the validity of this claim, and show that it is an upper
bound on the actual KL divergence. We also explore the tightness of this upper
bound in different regimes. Finally, we propose a new estimator for the KL
divergence and empirically show that it provides a tight approximation through
a few examples.
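
As a quick numerical illustration of the abstract's claim, here is a minimal, self-contained Python sketch. It is not the paper's proposed estimator: it builds a toy discrete base policy, forms the best-of-$n$ policy by sampling, and compares a naive plug-in Monte Carlo estimate of the KL divergence against $\log(n) - (n-1)/n$. On a discrete support the estimate should land at or below that expression, consistent with the claim that the formula is an upper bound rather than an identity. The vocabulary size, reward values, and trial count are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete base policy over a small "vocabulary"; sizes, rewards, and the
# trial count are illustrative assumptions, not taken from the paper.
vocab_size = 8
base_p = rng.dirichlet(np.ones(vocab_size))     # base policy probabilities
reward = rng.normal(size=vocab_size)            # one scalar reward per outcome

def mc_best_of_n_dist(n, num_trials=200_000):
    """Monte Carlo estimate of the best-of-n policy's distribution:
    per trial, draw n samples from the base policy and keep the one
    with the highest reward."""
    draws = rng.choice(vocab_size, size=(num_trials, n), p=base_p)
    winners = draws[np.arange(num_trials), np.argmax(reward[draws], axis=1)]
    return np.bincount(winners, minlength=vocab_size) / num_trials

for n in (1, 2, 4, 16):
    pi_n = mc_best_of_n_dist(n)
    support = pi_n > 0                          # avoid 0 * log(0) terms
    kl_est = np.sum(pi_n[support] * np.log(pi_n[support] / base_p[support]))
    print(f"n={n:2d}  plug-in KL estimate={kl_est:.4f}  "
          f"log(n)-(n-1)/n={np.log(n) - (n - 1) / n:.4f}")
```

For $n=1$ both quantities are zero; as $n$ grows, the estimated KL on this discrete toy example typically falls below $\log(n) - (n-1)/n$, matching the paper's point that the expression is an upper bound on, not equal to, the actual divergence.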
Related papers
- Information Theoretic Guarantees For Policy Alignment In Large Language Models [19.315342870604113]
We show that the $\sqrt{\mathsf{KL}}$ information-theoretic upper bound holds if the reward under the reference policy has sub-Gaussian tails.
We also prove, for the best-of-$n$ policy, that the $\mathsf{KL}$ upper bound can be obtained for any $f$-divergence.
arXiv Detail & Related papers (2024-06-09T18:41:50Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Asymptotics of Language Model Alignment [27.37118975691123]
We show that the optimal KL-constrained RL solution satisfies a large deviation principle (a sketch of this KL-regularized optimum appears after this list).
We also show that the rate of growth of the scaled cumulants of the reward is characterized by the proper Rényi cross entropy.
arXiv Detail & Related papers (2024-04-02T08:40:07Z) - Thompson Exploration with Best Challenger Rule in Best Arm Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on the optimal policy value $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z) - The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z) - On Gap-dependent Bounds for Offline Reinforcement Learning [40.92345387517103]
This paper presents a systematic study on gap-dependent sample complexity in offline reinforcement learning.
Under the optimal policy coverage assumption, the rate can be improved to $O\left(\frac{1}{\epsilon}\right)$ when there is a positive sub-optimality gap in the optimal $Q$-function.
We show that when the visitation probabilities of the behavior policy are uniformly lower bounded for states where an optimal policy's visitation probabilities are positive, the sample complexity of identifying an optimal policy is independent of $\frac{1}{\epsilon}$.
arXiv Detail & Related papers (2022-06-01T01:44:12Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Differentiable Bandit Exploration [38.81737411000074]
We learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$.
Our approach is a form of meta-learning and exploits properties of $\mathcal{P}$ without making strong assumptions about its form.
arXiv Detail & Related papers (2020-02-17T05:07:35Z)
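
For orientation on the KL-constrained alignment results referenced in the list above (in particular the Asymptotics of Language Model Alignment entry), the following LaTeX sketch records the standard closed form of the KL-regularized optimum: an exponential tilting of the reference policy. This is a well-known characterization stated here for context, not a restatement of any specific theorem from these papers; the symbols $\beta$, $r$, and $Z_\beta$ follow common RLHF notation and are assumptions of this sketch.

```latex
% KL-regularized alignment objective (sketch; requires amsmath/amssymb):
% maximize expected reward minus a KL penalty to the reference policy.
\[
\pi^{*}_{\beta}
  = \arg\max_{\pi}\;
    \mathbb{E}_{y \sim \pi(\cdot \mid x)}\bigl[ r(x, y) \bigr]
    - \beta\, \mathrm{KL}\bigl( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr),
\]
% whose well-known maximizer is an exponential tilting of \pi_{\mathrm{ref}}:
\[
\pi^{*}_{\beta}(y \mid x)
  = \frac{\pi_{\mathrm{ref}}(y \mid x)\, \exp\bigl( r(x, y)/\beta \bigr)}{Z_{\beta}(x)},
\qquad
Z_{\beta}(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\bigl( r(x, y)/\beta \bigr).
\]
```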
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.