Optimistic Reinforcement Learning by Forward Kullback-Leibler Divergence Optimization
- URL: http://arxiv.org/abs/2105.12991v1
- Date: Thu, 27 May 2021 08:24:51 GMT
- Title: Optimistic Reinforcement Learning by Forward Kullback-Leibler Divergence Optimization
- Authors: Taisuke Kobayashi
- Abstract summary: This paper addresses a new interpretation of reinforcement learning (RL) as reverse Kullback-Leibler (KL) divergence optimization.
It derives a new optimization method using forward KL divergence.
In a realistic robotic simulation, the proposed method with moderate optimism outperformed one of the state-of-the-art RL methods.
- Score: 1.7970523486905976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses a new interpretation of reinforcement learning (RL) as
reverse Kullback-Leibler (KL) divergence optimization, and derives a new
optimization method using forward KL divergence. Although RL originally aims to
maximize return indirectly through the optimization of a policy, recent work by
Levine proposed a different derivation that explicitly treats optimality as a
stochastic variable. This paper follows this concept and formulates the
traditional learning laws for both the value function and the policy as
optimization problems with reverse KL divergence that include optimality.
Focusing on the asymmetry of KL divergence, the new optimization problems with
forward KL divergence are derived. Remarkably, such new optimization problems
can be regarded as optimistic RL. That optimism is intuitively specified by a
hyperparameter converted from an uncertainty parameter. In addition, it can be
enhanced when it is integrated with prioritized experience replay and
eligibility traces, both of which accelerate learning. The effects of this
expected optimism were investigated through learning tendencies in numerical
simulations using PyBullet. As a result, moderate optimism accelerated learning
and yielded higher rewards. In a realistic robotic simulation, the proposed
method with moderate optimism outperformed one of the state-of-the-art RL
methods.
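The asymmetry the abstract refers to can be made concrete with a small numerical sketch. In the control-as-inference view (following Levine), the target distribution over actions is the current policy reweighted by exponentiated advantage; the conventional derivation minimizes the reverse KL divergence to that target, while the paper's proposal uses the forward KL divergence instead. The snippet below is only an illustrative sketch of these two divergences for a discrete action space: the advantage values and distributions are hypothetical, and the paper's actual learning rules for the value function and policy are not reproduced here.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def reverse_kl(pi, target):
    # KL(pi || target): mode-seeking; used in the conventional derivation.
    return float(np.sum(pi * (np.log(pi) - np.log(target))))

def forward_kl(pi, target):
    # KL(target || pi): mass-covering; the "optimistic" alternative.
    return float(np.sum(target * (np.log(target) - np.log(pi))))

# Hypothetical example with four discrete actions.
advantages = np.array([1.0, 0.2, -0.5, -1.0])    # assumed A(s, a) values
pi = softmax(np.array([0.1, 0.0, 0.0, -0.1]))    # current (near-uniform) policy
target = softmax(np.log(pi) + advantages)        # optimality-weighted target

print("reverse KL(pi || target):", reverse_kl(pi, target))
print("forward  KL(target || pi):", forward_kl(pi, target))
```

Because the forward KL weights errors by the target rather than by the policy, minimizing it pulls probability mass toward every action the optimality-weighted target considers plausible, which is the intuition behind calling the resulting learning rule optimistic.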
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth distribution and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z)
- Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function; a hedged sketch of the standard DPO objective is given after this list.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z)
- Combining Automated Optimisation of Hyperparameters and Reward Shape [7.407166175374958]
We propose a methodology for the combined optimisation of hyperparameters and the reward function.
We conducted extensive experiments using Proximal Policy Optimisation and Soft Actor-Critic.
Our results show that combined optimisation significantly improves over baseline performance in half of the environments and achieves competitive performance in the others.
arXiv Detail & Related papers (2024-06-26T12:23:54Z)
- Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO)
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- End-to-End Learning for Fair Multiobjective Optimization Under Uncertainty [55.04219793298687]
The Predict-Then-Optimize (PtO) paradigm in machine learning aims to maximize downstream decision quality.
This paper extends the PtO methodology to optimization problems with nondifferentiable Ordered Weighted Averaging (OWA) objectives.
It shows how optimization of OWA functions can be effectively integrated with parametric prediction for fair and robust optimization under uncertainty.
arXiv Detail & Related papers (2024-02-12T16:33:35Z)
- Analyzing and Enhancing the Backward-Pass Convergence of Unrolled Optimization [50.38518771642365]
The integration of constrained optimization models as components in deep networks has led to promising advances on many specialized learning tasks.
A central challenge in this setting is backpropagation through the solution of an optimization problem, which often lacks a closed form.
This paper provides theoretical insights into the backward pass of unrolled optimization, showing that it is equivalent to the solution of a linear system by a particular iterative method.
A system called Folded Optimization is proposed to construct more efficient backpropagation rules from unrolled solver implementations.
arXiv Detail & Related papers (2023-12-28T23:15:18Z)
- Assessment of Reinforcement Learning Algorithms for Nuclear Power Plant Fuel Optimization [0.0]
This work presents a first-of-its-kind approach that utilizes deep RL to solve the loading pattern problem and could be leveraged for any engineering design optimization.
arXiv Detail & Related papers (2023-05-09T23:51:24Z)
- Accelerating the Evolutionary Algorithms by Gaussian Process Regression with $\epsilon$-greedy acquisition function [2.7716102039510564]
We propose a novel method to estimate the elite individual to accelerate the convergence of optimization.
Our proposal shows broad promise for estimating the elite individual and accelerating the convergence of optimization.
arXiv Detail & Related papers (2022-10-13T07:56:47Z)
- Teaching Networks to Solve Optimization Problems [13.803078209630444]
We propose to replace the iterative solvers altogether with a trainable parametric set function.
We show the feasibility of learning such parametric (set) functions to solve various classic optimization problems.
arXiv Detail & Related papers (2022-02-08T19:13:13Z)
- Better call Surrogates: A hybrid Evolutionary Algorithm for Hyperparameter optimization [18.359749929678635]
We propose a surrogate-assisted evolutionary algorithm (EA) for hyperparameter optimization of machine learning (ML) models.
The proposed STEADE model initially estimates the objective function landscape using a Radial Basis Function, and then transfers the knowledge to an EA technique called Differential Evolution.
We empirically evaluate our model on the hyperparameter optimization problems as a part of the black box optimization challenge at NeurIPS 2020 and demonstrate the improvement brought about by STEADE over the vanilla EA.
arXiv Detail & Related papers (2020-12-11T16:19:59Z)
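As referenced in the Accelerated Preference Optimization entry above, Direct Preference Optimization avoids an explicit reward model by scoring preference pairs with the policy's own log-probability ratios against a reference policy. For context only, the standard DPO objective (from the original DPO paper, not from this listing, and independent of the APO acceleration scheme) can be written for a preferred response $y_w$ and a dispreferred response $y_l$ to a prompt $x$ as:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

where $\sigma$ is the logistic function and $\beta$ controls the strength of the implicit KL regularization toward the reference policy $\pi_{\mathrm{ref}}$.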
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.