Generalized Preference Optimization: A Unified Approach to Offline Alignment
- URL: http://arxiv.org/abs/2402.05749v2
- Date: Tue, 28 May 2024 23:25:15 GMT
- Title: Generalized Preference Optimization: A Unified Approach to Offline Alignment
- Authors: Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, Bilal Piot,
- Abstract summary: We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions.
GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases.
Our results present new algorithmic toolkits and empirical insights to alignment practitioners.
- Score: 54.97015778517253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions. GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases, while naturally introducing new variants. The GPO framework also sheds light on how offline algorithms enforce regularization, through the design of the convex function that defines the loss. Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. In a controlled setting akin to Gao et al 2023, we also show that different GPO variants achieve similar trade-offs between regularization and performance, though the optimal values of hyper-parameter might differ as predicted by theory. In all, our results present new algorithmic toolkits and empirical insights to alignment practitioners.
Related papers
- Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
We present a new offline alignment algorithm, $chi2$-Preference Optimization ($chi$PO)
$chi$PO implements the principle of pessimism in the face of uncertainty via regularization.
It is provably robust to overoptimization and achieves sample-complexity guarantees based on single-policy concentrability.
arXiv Detail & Related papers (2024-07-18T11:08:40Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO derived based on the optimal solution of problem leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z) - Amortized Proximal Optimization [11.441395750267052]
Amortized Proximal Optimization (APO) is a framework for online meta-optimization of parameters that govern optimization.
We show how APO can be used to adapt a learning rate or a structured preconditioning matrix.
We empirically test APO for online adaptation of learning rates and structured preconditioning for regression, image reconstruction, image classification, and natural language translation tasks.
arXiv Detail & Related papers (2022-02-28T20:50:48Z) - Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of textitamortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization, yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z) - Nonmyopic Gaussian Process Optimization with Macro-Actions [13.847308344546171]
This paper presents a multi-staged approach to nonmyopic adaptive Gaussian process optimization (GPO)
It exploits the notion of macro-actions for scaling up to a further lookahead to match up to a larger available budget.
We empirically evaluate the performance of our epsilon-Macro-GPO policy and its anytime variant in BO datasets with synthetic and real-world datasets.
arXiv Detail & Related papers (2020-02-22T09:56:20Z) - Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization [71.03797261151605]
Adaptivity is an important yet under-studied property in modern optimization theory.
Our algorithm is proved to achieve the best-available convergence for non-PL objectives simultaneously while outperforming existing algorithms for PL objectives.
arXiv Detail & Related papers (2020-02-13T05:42:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.