Amortized Proximal Optimization
- URL: http://arxiv.org/abs/2203.00089v1
- Date: Mon, 28 Feb 2022 20:50:48 GMT
- Title: Amortized Proximal Optimization
- Authors: Juhan Bae, Paul Vicol, Jeff Z. HaoChen, Roger Grosse
- Abstract summary: Amortized Proximal Optimization (APO) is a framework for online meta-optimization of parameters that govern optimization.
We show how APO can be used to adapt a learning rate or a structured preconditioning matrix.
We empirically test APO for online adaptation of learning rates and structured preconditioning for regression, image reconstruction, image classification, and natural language translation tasks.
- Score: 11.441395750267052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a framework for online meta-optimization of parameters that govern
optimization, called Amortized Proximal Optimization (APO). We first interpret
various existing neural network optimizers as approximate stochastic proximal
point methods which trade off the current-batch loss with proximity terms in
both function space and weight space. The idea behind APO is to amortize the
minimization of the proximal point objective by meta-learning the parameters of
an update rule. We show how APO can be used to adapt a learning rate or a
structured preconditioning matrix. Under appropriate assumptions, APO can
recover existing optimizers such as natural gradient descent and KFAC. It
enjoys low computational overhead and avoids expensive and numerically
sensitive operations required by some second-order optimizers, such as matrix
inverses. We empirically test APO for online adaptation of learning rates and
structured preconditioning matrices for regression, image reconstruction, image
classification, and natural language translation tasks. Empirically, the
learning rate schedules found by APO generally outperform optimal fixed
learning rates and are competitive with manually tuned decay schedules. Using
APO to adapt a structured preconditioning matrix generally results in
optimization performance competitive with second-order methods. Moreover, the
absence of matrix inversion provides numerical stability, making it effective
for low precision training.
Related papers
- Faster Adaptive Optimization via Expected Gradient Outer Product Reparameterization [11.394969272703014]
We show that for a broad class of functions, the sensitivity of adaptive algorithms to choice-of-basis is influenced by the decay of the EGOP matrix spectrum.
arXiv Detail & Related papers (2025-02-03T18:26:35Z) - Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z) - Minor DPO reject penalty to increase training robustness [8.971332948872185]
Learning from human preference is a paradigm used in large-scale language model (LLM) fine-tuning step to better align pretrained LLM to human preference for downstream task.
Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method.
In this article, we analyze the working mechanism of $beta$ in DPO, disclose its syntax difference between RL algorithm and DPO, and understand the potential shortage brought by the DPO simplification.
arXiv Detail & Related papers (2024-08-19T09:29:31Z) - Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
$chi2$-Preference Optimization ($chi$PO) is an efficient offline alignment algorithm provably robust to overoptimization.
$chi$PO implements the principle of pessimism in the face of uncertainty via regularization.
$chi$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm provably robust to overoptimization.
arXiv Detail & Related papers (2024-07-18T11:08:40Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Generalized Preference Optimization: A Unified Approach to Offline Alignment [54.97015778517253]
We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions.
GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases.
Our results present new algorithmic toolkits and empirical insights to alignment practitioners.
arXiv Detail & Related papers (2024-02-08T15:33:09Z) - Preference as Reward, Maximum Preference Optimization with Importance Sampling [3.7040071165219595]
We propose a simple and intuitive off-policy preference optimization algorithm from an importance sampling view, which we call Maximum Preference Optimization (MPO)
MPO achieves the best of both worlds by combining the objectives of RLHF and IPO while being an off-policy algorithm.
arXiv Detail & Related papers (2023-12-27T06:34:54Z) - Adaptive Neural Ranking Framework: Toward Maximized Business Goal for
Cascade Ranking Systems [33.46891569350896]
Cascade ranking is widely used for large-scale top-k selection problems in online advertising and recommendation systems.
Previous works on learning-to-rank usually focus on letting the model learn the complete order or top-k order.
We name this method as Adaptive Neural Ranking Framework (abbreviated as ARF)
arXiv Detail & Related papers (2023-10-16T14:43:02Z) - Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs [113.8752163061151]
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs)
We propose the underlineperiodically underlinerestarted underlineoptimistic underlinepolicy underlineoptimization algorithm (PROPO)
PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement.
arXiv Detail & Related papers (2021-10-18T02:33:20Z) - Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of textitamortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization, yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.