Optimistic Reinforcement Learning by Forward Kullback-Leibler Divergence Optimization
- URL: http://arxiv.org/abs/2105.12991v1
- Date: Thu, 27 May 2021 08:24:51 GMT
- Title: Optimistic Reinforcement Learning by Forward Kullback-Leibler Divergence Optimization
- Authors: Taisuke Kobayashi
- Abstract summary: This paper addresses a new interpretation of reinforcement learning (RL) as reverse Kullback-Leibler (KL) divergence optimization.
It derives a new optimization method using forward KL divergence.
In a realistic robotic simulation, the proposed method with moderate optimism outperformed one of the state-of-the-art RL methods.
- Score: 1.7970523486905976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses a new interpretation of reinforcement learning (RL) as
reverse Kullback-Leibler (KL) divergence optimization, and derives a new
optimization method using forward KL divergence. Although RL originally aims to
maximize return indirectly through the optimization of a policy, recent work by
Levine proposed a different derivation that explicitly treats optimality as a
stochastic variable. This paper follows this concept and formulates the
traditional learning laws for both the value function and the policy as
optimization problems with reverse KL divergence that include optimality.
Focusing on the asymmetry of KL divergence, the new optimization problems with
forward KL divergence are derived. Remarkably, such new optimization problems
can be regarded as optimistic RL. That optimism is intuitively specified by a
hyperparameter converted from an uncertainty parameter. In addition, it can be
enhanced when it is integrated with prioritized experience replay and
eligibility traces, both of which accelerate learning. The effects of this
expected optimism were investigated through learning tendencies in numerical
simulations using PyBullet. As a result, moderate optimism accelerated learning
and yielded higher rewards. In a realistic robotic simulation, the proposed
method with moderate optimism outperformed one of the state-of-the-art RL
methods.
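The asymmetry the abstract refers to can be made concrete with a small numerical sketch. In the control-as-inference view (following Levine), the target distribution over actions is the current policy reweighted by exponentiated advantage; the conventional derivation minimizes the reverse KL divergence to that target, while the paper's proposal uses the forward KL divergence instead. The snippet below is only an illustrative sketch of these two divergences for a discrete action space: the advantage values and distributions are hypothetical, and the paper's actual learning rules for the value function and policy are not reproduced here.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def reverse_kl(pi, target):
    # KL(pi || target): mode-seeking; used in the conventional derivation.
    return float(np.sum(pi * (np.log(pi) - np.log(target))))

def forward_kl(pi, target):
    # KL(target || pi): mass-covering; the "optimistic" alternative.
    return float(np.sum(target * (np.log(target) - np.log(pi))))

# Hypothetical example with four discrete actions.
advantages = np.array([1.0, 0.2, -0.5, -1.0])    # assumed A(s, a) values
pi = softmax(np.array([0.1, 0.0, 0.0, -0.1]))    # current (near-uniform) policy
target = softmax(np.log(pi) + advantages)        # optimality-weighted target

print("reverse KL(pi || target):", reverse_kl(pi, target))
print("forward  KL(target || pi):", forward_kl(pi, target))
```

Because the forward KL weights errors by the target rather than by the policy, minimizing it pulls probability mass toward every action the optimality-weighted target considers plausible, which is the intuition behind calling the resulting learning rule optimistic.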
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth distribution and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z)
- Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function; a hedged sketch of the standard DPO objective is given after this list.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z)
- Combining Automated Optimisation of Hyperparameters and Reward Shape [7.407166175374958]
We propose a methodology for the combined optimisation of hyperparameters and the reward function.
We conducted extensive experiments using Proximal Policy Optimisation and Soft Actor-Critic.
Our results show that combined optimisation significantly improves over baseline performance in half of the environments and achieves competitive performance in the others.
arXiv Detail & Related papers (2024-06-26T12:23:54Z)
- Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO)
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- End-to-End Learning for Fair Multiobjective Optimization Under Uncertainty [55.04219793298687]
The Predict-Then-Optimize (PtO) paradigm in machine learning aims to maximize downstream decision quality.
This paper extends the PtO methodology to optimization problems with nondifferentiable Ordered Weighted Averaging (OWA) objectives.
It shows how optimization of OWA functions can be effectively integrated with parametric prediction for fair and robust optimization under uncertainty.
arXiv Detail & Related papers (2024-02-12T16:33:35Z)
- Analyzing and Enhancing the Backward-Pass Convergence of Unrolled Optimization [50.38518771642365]
The integration of constrained optimization models as components in deep networks has led to promising advances on many specialized learning tasks.
A central challenge in this setting is backpropagation through the solution of an optimization problem, which often lacks a closed form.
This paper provides theoretical insights into the backward pass of unrolled optimization, showing that it is equivalent to the solution of a linear system by a particular iterative method.
A system called Folded Optimization is proposed to construct more efficient backpropagation rules from unrolled solver implementations.
arXiv Detail & Related papers (2023-12-28T23:15:18Z)
- Assessment of Reinforcement Learning Algorithms for Nuclear Power Plant Fuel Optimization [0.0]
This work presents a first-of-its-kind approach that utilizes deep RL to solve the loading pattern problem and could be leveraged for any engineering design optimization.
arXiv Detail & Related papers (2023-05-09T23:51:24Z)
- Accelerating the Evolutionary Algorithms by Gaussian Process Regression with $\epsilon$-greedy acquisition function [2.7716102039510564]
We propose a novel method to estimate the elite individual to accelerate the convergence of optimization.
Our proposal shows broad promise for estimating the elite individual and accelerating the convergence of optimization.
arXiv Detail & Related papers (2022-10-13T07:56:47Z)
- Teaching Networks to Solve Optimization Problems [13.803078209630444]
We propose to replace the iterative solvers altogether with a trainable parametric set function.
We show the feasibility of learning such parametric (set) functions to solve various classic optimization problems.
arXiv Detail & Related papers (2022-02-08T19:13:13Z)
- Better call Surrogates: A hybrid Evolutionary Algorithm for Hyperparameter optimization [18.359749929678635]
We propose a surrogate-assisted evolutionary algorithm (EA) for hyperparameter optimization of machine learning (ML) models.
The proposed STEADE model initially estimates the objective function landscape using a Radial Basis Function, and then transfers the knowledge to an EA technique called Differential Evolution.
We empirically evaluate our model on the hyperparameter optimization problems as a part of the black box optimization challenge at NeurIPS 2020 and demonstrate the improvement brought about by STEADE over the vanilla EA.
arXiv Detail & Related papers (2020-12-11T16:19:59Z)
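As referenced in the Accelerated Preference Optimization entry above, Direct Preference Optimization avoids an explicit reward model by scoring preference pairs with the policy's own log-probability ratios against a reference policy. For context only, the standard DPO objective (from the original DPO paper, not from this listing, and independent of the APO acceleration scheme) can be written for a preferred response $y_w$ and a dispreferred response $y_l$ to a prompt $x$ as:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

where $\sigma$ is the logistic function and $\beta$ controls the strength of the implicit KL regularization toward the reference policy $\pi_{\mathrm{ref}}$.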
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.