Amortized Proximal Optimization
- URL: http://arxiv.org/abs/2203.00089v1
- Date: Mon, 28 Feb 2022 20:50:48 GMT
- Title: Amortized Proximal Optimization
- Authors: Juhan Bae, Paul Vicol, Jeff Z. HaoChen, Roger Grosse
- Abstract summary: Amortized Proximal Optimization (APO) is a framework for online meta-optimization of parameters that govern optimization.
We show how APO can be used to adapt a learning rate or a structured preconditioning matrix.
We empirically test APO for online adaptation of learning rates and structured preconditioning for regression, image reconstruction, image classification, and natural language translation tasks.
- Score: 11.441395750267052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a framework for online meta-optimization of parameters that govern
optimization, called Amortized Proximal Optimization (APO). We first interpret
various existing neural network optimizers as approximate stochastic proximal
point methods which trade off the current-batch loss with proximity terms in
both function space and weight space. The idea behind APO is to amortize the
minimization of the proximal point objective by meta-learning the parameters of
an update rule. We show how APO can be used to adapt a learning rate or a
structured preconditioning matrix. Under appropriate assumptions, APO can
recover existing optimizers such as natural gradient descent and KFAC. It
enjoys low computational overhead and avoids expensive and numerically
sensitive operations required by some second-order optimizers, such as matrix
inverses. We empirically test APO for online adaptation of learning rates and
structured preconditioning matrices for regression, image reconstruction, image
classification, and natural language translation tasks. Empirically, the
learning rate schedules found by APO generally outperform optimal fixed
learning rates and are competitive with manually tuned decay schedules. Using
APO to adapt a structured preconditioning matrix generally results in
optimization performance competitive with second-order methods. Moreover, the
absence of matrix inversion provides numerical stability, making it effective
for low precision training.
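To make the idea in the abstract concrete, below is a minimal sketch (not the authors' implementation) of online learning-rate adaptation in the APO spirit: take a candidate SGD step on the current batch, evaluate a proximal objective that combines the post-step batch loss with function-space and weight-space proximity terms, and update the (log) learning rate by gradient descent on that objective. The toy network, the squared-error proximity terms, and names such as `lam_f`, `lam_w`, and `meta_lr` are illustrative assumptions, not the paper's exact choices.

```python
# Hedged sketch of the APO idea: meta-learn a scalar learning rate online by
# descending a one-step proximal objective (current-batch loss after a candidate
# step + function-space proximity + weight-space proximity).
import jax
import jax.numpy as jnp

def init_params(key, in_dim=10, hidden=32, out_dim=1):
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (in_dim, hidden)) / jnp.sqrt(in_dim),
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, out_dim)) / jnp.sqrt(hidden),
        "b2": jnp.zeros(out_dim),
    }

def forward(params, x):
    h = jnp.tanh(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

def batch_loss(params, x, y):
    return jnp.mean((forward(params, x) - y) ** 2)

def sgd_step(params, grads, lr):
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

def proximal_objective(log_lr, params, x, y, lam_f=1.0, lam_w=1e-3):
    """Post-step batch loss plus proximity penalties (assumed quadratic forms)."""
    lr = jnp.exp(log_lr)                      # parameterize the lr in log space
    grads = jax.grad(batch_loss)(params, x, y)
    new_params = sgd_step(params, grads, lr)
    # Function-space proximity: squared change in network outputs on the batch.
    func_prox = jnp.mean((forward(new_params, x) - forward(params, x)) ** 2)
    # Weight-space proximity: squared change in the parameters.
    weight_prox = sum(jnp.sum((n - o) ** 2)
                      for n, o in zip(jax.tree_util.tree_leaves(new_params),
                                      jax.tree_util.tree_leaves(params)))
    return batch_loss(new_params, x, y) + lam_f * func_prox + lam_w * weight_prox

@jax.jit
def train_step(params, log_lr, x, y, meta_lr=1e-2):
    # Meta-update: one gradient step on the proximal objective w.r.t. log_lr.
    meta_grad = jax.grad(proximal_objective)(log_lr, params, x, y)
    log_lr = log_lr - meta_lr * meta_grad
    # Base update: ordinary SGD step using the adapted learning rate.
    grads = jax.grad(batch_loss)(params, x, y)
    return sgd_step(params, grads, jnp.exp(log_lr)), log_lr

key = jax.random.PRNGKey(0)
params, log_lr = init_params(key), jnp.log(1e-2)
for step in range(100):
    key, kx = jax.random.split(key)
    x = jax.random.normal(kx, (64, 10))
    y = jnp.sum(x, axis=1, keepdims=True)     # toy regression target
    params, log_lr = train_step(params, log_lr, x, y)
```

In the full method, APO meta-learns the parameters of an update rule and can adapt a structured preconditioning matrix rather than a single scalar learning rate; the scalar case above is only the simplest instance.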
Related papers
- Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z)
- Minor DPO reject penalty to increase training robustness [8.971332948872185]
Learning from human preferences is a paradigm used in the large-scale language model (LLM) fine-tuning step to better align a pretrained LLM with human preferences for downstream tasks.
Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified, RL-free method.
In this article, we analyze the working mechanism of $\beta$ in DPO, clarify how its role differs between the RL formulation and DPO, and examine the potential shortcomings introduced by the DPO simplification (a schematic form of the DPO loss is sketched after this list).
arXiv Detail & Related papers (2024-08-19T09:29:31Z)
- Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
We present a new offline alignment algorithm, $\chi^2$-Preference Optimization ($\chi$PO).
$\chi$PO implements the principle of pessimism in the face of uncertainty via regularization.
It is provably robust to overoptimization and achieves sample-complexity guarantees based on single-policy concentrability.
arXiv Detail & Related papers (2024-07-18T11:08:40Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Generalized Preference Optimization: A Unified Approach to Offline Alignment [54.97015778517253]
We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions.
GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases (see the sketch after this list).
Our results present new algorithmic toolkits and empirical insights to alignment practitioners.
arXiv Detail & Related papers (2024-02-08T15:33:09Z)
- Preference as Reward, Maximum Preference Optimization with Importance Sampling [3.7040071165219595]
We propose a simple and intuitive off-policy preference optimization algorithm from an importance-sampling view, which we call Maximum Preference Optimization (MPO).
MPO achieves the best of both worlds by combining the objectives of RLHF and IPO while being an off-policy algorithm.
arXiv Detail & Related papers (2023-12-27T06:34:54Z)
- Adaptive Neural Ranking Framework: Toward Maximized Business Goal for Cascade Ranking Systems [33.46891569350896]
Cascade ranking is widely used for large-scale top-k selection problems in online advertising and recommendation systems.
Previous works on learning-to-rank usually focus on letting the model learn the complete order or the top-k order.
We name this method the Adaptive Neural Ranking Framework (abbreviated as ARF).
arXiv Detail & Related papers (2023-10-16T14:43:02Z)
- Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z)
- Adaptive pruning-based optimization of parameterized quantum circuits [62.997667081978825]
Variational hybrid quantum-classical algorithms are powerful tools to maximize the use of Noisy Intermediate-Scale Quantum devices.
We propose a strategy for such ansätze used in variational quantum algorithms, which we call "Efficient Circuit Training" (PECT).
Instead of optimizing all of the ansatz parameters at once, PECT launches a sequence of variational algorithms.
arXiv Detail & Related papers (2020-10-01T18:14:11Z)
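Several of the entries above (the DPO $\beta$ analysis and the GPO family in particular) revolve around offline preference losses built from policy-versus-reference log-ratios. The sketch below follows the commonly published formulas rather than any of the listed papers' code: it shows the $\beta$-scaled margin, the DPO loss as the negative log-sigmoid of that margin, and the GPO generalization in which an arbitrary convex function replaces the negative log-sigmoid. The toy log-probability values are made up for illustration.

```python
# Illustrative sketch of the offline preference losses discussed above.
import jax
import jax.numpy as jnp

def margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO: outer function f(z) = -log(sigmoid(z)) applied to the margin.
    return -jax.nn.log_sigmoid(margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta))

def gpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, f, beta=0.1):
    # GPO: any convex f recovers existing methods, e.g.
    #   f(z) = -log(sigmoid(z))  -> DPO
    #   f(z) = (z - 1)**2        -> IPO-style squared loss
    #   f(z) = max(0, 1 - z)     -> SLiC-style hinge loss
    return f(margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta))

# Example: per-example sequence log-probs for chosen (w) and rejected (l) responses.
logp_w, logp_l = jnp.array([-12.3]), jnp.array([-15.7])
ref_logp_w, ref_logp_l = jnp.array([-13.0]), jnp.array([-14.9])
print(dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l))
print(gpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, f=lambda z: (z - 1.0) ** 2))
```

Swapping in different convex functions for `f` is the unification the GPO entry refers to; the $\beta$ parameter analyzed in the "Minor DPO reject penalty" entry is the scale applied to the log-ratio margin above.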
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.