Generalized Preference Optimization: A Unified Approach to Offline Alignment
- URL: http://arxiv.org/abs/2402.05749v2
- Date: Tue, 28 May 2024 23:25:15 GMT
- Title: Generalized Preference Optimization: A Unified Approach to Offline Alignment
- Authors: Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, Bilal Piot
- Abstract summary: We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions.
GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases.
Our results present new algorithmic toolkits and empirical insights to alignment practitioners.
- Score: 54.97015778517253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions. GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases, while naturally introducing new variants. The GPO framework also sheds light on how offline algorithms enforce regularization, through the design of the convex function that defines the loss. Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. In a controlled setting akin to Gao et al. (2023), we also show that different GPO variants achieve similar trade-offs between regularization and performance, though the optimal values of the hyper-parameters might differ as predicted by theory. In all, our results present new algorithmic toolkits and empirical insights to alignment practitioners.
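For readers who want a concrete handle on the abstract above, here is a minimal sketch of a GPO-style pairwise loss: a single objective parameterized by a convex function, with different choices of that function recovering DPO-, IPO-, and SLiC-style losses. The β value, toy log-probabilities, and exact scaling conventions below are illustrative assumptions rather than the paper's precise formulation.

```python
import math

def gpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, f, beta=0.1):
    """Generic pairwise preference loss: a convex function f applied to the
    scaled difference of policy-vs-reference log-ratios for the chosen (w)
    and rejected (l) responses."""
    rho = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return f(beta * rho)

# Convex functions that recover familiar special cases (up to constants and
# scaling; see the paper for the exact correspondences).
dpo_f  = lambda t: math.log(1.0 + math.exp(-t))   # logistic loss (DPO-style)
ipo_f  = lambda t: (t - 0.5) ** 2                 # squared loss (IPO-style)
slic_f = lambda t: max(0.0, 1.0 - t)              # hinge loss (SLiC-style)

# Toy example with made-up sequence log-probabilities.
args = dict(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0)
for name, f in [("DPO", dpo_f), ("IPO", ipo_f), ("SLiC", slic_f)]:
    print(name, gpo_loss(f=f, beta=0.1, **args))
```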
Related papers
- Parameter Tracking in Federated Learning with Adaptive Optimization [14.111863825607001]
In Federated Learning (FL), model training performance is strongly impacted by data heterogeneity across clients.
Gradient Tracking (GT) has recently emerged as a solution which mitigates this issue by introducing correction terms to local model updates.
To date, GT has only been considered under Stochastic Gradient Descent (SGD)-based model training, while modern FL frameworks increasingly employ adaptive optimizers for improved convergence.
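For intuition only, here is a toy sketch of a gradient-tracking-style correction term in a local update, under simplified assumptions (scalar parameters, quadratic per-client objectives, full averaging between rounds); it is not the algorithm studied in the cited paper.

```python
# Toy sketch of a gradient-tracking (GT) style correction in federated training.
# Each client keeps a tracker y_i updated with the *change* in its local gradient,
# so local steps follow an estimate of the global gradient rather than the purely
# local one. Quadratics f_i(x) = 0.5 * (x - b_i)^2 stand in for heterogeneous
# client data; this is an illustration, not the cited method.

def grad(b_i, x):
    return x - b_i  # gradient of 0.5 * (x - b_i)^2

targets = [0.0, 4.0, 10.0]                      # heterogeneous client optima
x = [5.0 for _ in targets]                      # per-client models
y = [grad(b, xi) for b, xi in zip(targets, x)]  # trackers start at local grads
lr, local_steps, rounds = 0.1, 5, 200

for _ in range(rounds):
    for _ in range(local_steps):                # local updates use the tracker
        x_new = [xi - lr * yi for xi, yi in zip(x, y)]
        y = [yi + grad(b, xn) - grad(b, xo)     # tracker: add change in local grad
             for yi, b, xn, xo in zip(y, targets, x_new, x)]
        x = x_new
    x_avg = sum(x) / len(x)                     # server-style averaging of models
    y_avg = sum(y) / len(y)                     # ...and of trackers
    x, y = [x_avg] * len(x), [y_avg] * len(y)

print(round(x[0], 3))  # ~4.667, the global optimum (mean of the client targets)
```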
arXiv Detail & Related papers (2025-02-04T21:21:30Z) - Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
$\chi^2$-Preference Optimization ($\chi$PO) is an efficient offline alignment algorithm provably robust to overoptimization.
$\chi$PO implements the principle of pessimism in the face of uncertainty via regularization.
$\chi$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm provably robust to overoptimization.
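As a quick numerical aside (my own illustration, not from the paper), the sketch below shows why a $\chi^2$ penalty is a harsher regularizer than a KL penalty: as a tuned policy concentrates mass on a response the reference considers unlikely, the $\chi^2$ divergence blows up far faster than the KL divergence, which is the intuition behind pessimism-style regularization.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi2(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

q = [0.90, 0.09, 0.01]              # reference policy over 3 responses
for w in [0.1, 0.5, 0.9, 0.99]:     # mass the tuned policy puts on the rare response
    p = [(1 - w) * 0.9 / 0.99, (1 - w) * 0.09 / 0.99, w]
    print(f"w={w:.2f}  KL={kl(p, q):6.2f}  chi2={chi2(p, q):8.1f}")
```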
arXiv Detail & Related papers (2024-07-18T11:08:40Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
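The combined objective described above can be sketched roughly as a DPO-style pairwise term plus a supervised (maximum-likelihood) term on the preferred response. The logistic form and the weight alpha below are illustrative assumptions, not the paper's exact objective.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def combined_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, alpha=1.0):
    # DPO-style pairwise preference term on the policy-vs-reference log-ratios
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    preference_loss = -math.log(sigmoid(margin))
    # supervised term: negative log-likelihood of the preferred response
    sft_loss = -logp_w
    return preference_loss + alpha * sft_loss

# Toy sequence log-probabilities (made up).
print(combined_loss(logp_w=-12.0, logp_l=-15.0,
                    ref_logp_w=-13.0, ref_logp_l=-14.0))
```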
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of this problem, leads in practice to a compromised mean-seeking approximation of that solution.
We propose efficient exact optimization (EXO) of the alignment objective.
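The mean-seeking vs. mode-seeking distinction above can be illustrated numerically (my own example, not from the paper): fitting a distribution q by minimizing the forward KL(p||q) favors a broad q that covers both modes of a bimodal target p ("mean-seeking"), while minimizing the reverse KL(q||p) favors a q that locks onto a single mode ("mode-seeking").

```python
import math

def kl(a, b):
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

p       = [0.495, 0.01, 0.495]   # bimodal "target" over three outcomes
q_cover = [1/3, 1/3, 1/3]        # broad candidate covering both modes
q_mode  = [0.98, 0.01, 0.01]     # narrow candidate sitting on one mode

print("forward KL(p||q):", round(kl(p, q_cover), 3), "(cover) vs",
      round(kl(p, q_mode), 3), "(mode)")   # the covering q wins
print("reverse KL(q||p):", round(kl(q_cover, p), 3), "(cover) vs",
      round(kl(q_mode, p), 3), "(mode)")   # the single-mode q wins
```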
arXiv Detail & Related papers (2024-02-01T18:51:54Z) - Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs [113.8752163061151]
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs)
We propose the periodically restarted optimistic policy optimization algorithm (PROPO).
PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement.
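For intuition, here is a toy sketch (my own, not the PROPO algorithm) of the two mechanisms named above for handling non-stationarity: a sliding-window estimate that forgets stale data, and a periodic restart that resets accumulated statistics.

```python
from collections import deque

window, restart_every = deque(maxlen=20), 50
history = []

def drifting_reward(t):
    return 1.0 if (t // 100) % 2 == 0 else -1.0   # environment flips over time

for t in range(300):
    r = drifting_reward(t)
    window.append(r)                              # sliding-window evaluation
    history.append(r)
    if t % restart_every == 0:
        history = [r]                             # periodic restart of statistics
    if t % 100 == 99:
        print(t, "window:", sum(window) / len(window),
              "since restart:", sum(history) / len(history))
```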
arXiv Detail & Related papers (2021-10-18T02:33:20Z) - Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
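The contrast can be sketched on a toy problem (my own construction, not the paper's setup): direct amortization maps an input to an action proposal in one shot, while iterative amortized optimization refines that proposal with a few gradient-based updates of the objective.

```python
def objective(state, action):
    return -(action - 2.0 * state) ** 2       # best action is 2 * state

def grad_action(state, action, eps=1e-4):
    # finite-difference gradient of the objective with respect to the action
    return (objective(state, action + eps) - objective(state, action - eps)) / (2 * eps)

def direct_amortization(state):
    return 1.5 * state                        # an imperfect learned one-shot mapping

def iterative_amortization(state, steps=10, lr=0.3):
    action = direct_amortization(state)       # start from the one-shot proposal
    for _ in range(steps):
        action += lr * grad_action(state, action)   # refine using the objective
    return action

s = 3.0
print("direct:   ", direct_amortization(s), objective(s, direct_amortization(s)))
print("iterative:", iterative_amortization(s), objective(s, iterative_amortization(s)))
```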
arXiv Detail & Related papers (2020-10-20T23:25:42Z) - Nonmyopic Gaussian Process Optimization with Macro-Actions [13.847308344546171]
This paper presents a multi-staged approach to nonmyopic adaptive Gaussian process optimization (GPO)
It exploits the notion of macro-actions to scale up to a longer lookahead, matching a larger available budget.
We empirically evaluate the performance of our epsilon-Macro-GPO policy and its anytime variant on synthetic and real-world Bayesian optimization (BO) datasets.
arXiv Detail & Related papers (2020-02-22T09:56:20Z)
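Very roughly, the macro-action idea can be illustrated as follows (my own sketch): instead of greedily scoring single query points, score short sequences of points under a cheap surrogate and commit to the best sequence, giving a longer lookahead for the same planning structure. A trivial distance-based surrogate stands in for a Gaussian process, and belief updates within the lookahead are omitted; this is not the epsilon-Macro-GPO algorithm itself.

```python
import itertools

observed = {0.0: 0.1, 5.0: 0.9, 9.0: 0.3}   # x -> noisy objective values seen so far

def surrogate(x):
    # crude stand-in for a GP posterior mean: value of the nearest observed point
    nearest = min(observed, key=lambda xo: abs(xo - x))
    return observed[nearest]

candidates = [1.0, 3.0, 4.5, 6.0, 8.0]
macro_len = 2                                # lookahead of two queries at a time

best = max(itertools.combinations(candidates, macro_len),
           key=lambda seq: sum(surrogate(x) for x in seq))
print("best macro-action:", best)
```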
This list is automatically generated from the titles and abstracts of the papers on this site.