Greedification Operators for Policy Optimization: Investigating Forward
and Reverse KL Divergences
- URL: http://arxiv.org/abs/2107.08285v1
- Date: Sat, 17 Jul 2021 17:09:18 GMT
- Title: Greedification Operators for Policy Optimization: Investigating Forward
and Reverse KL Divergences
- Authors: Alan Chan, Hugo Silva, Sungsu Lim, Tadashi Kozuno, A. Rupam Mahmood,
Martha White
- Abstract summary: We investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values.
We show that the reverse KL has stronger policy improvement guarantees, but that reducing the forward KL can result in a worse policy.
No significant differences were observed in the discrete-action setting or on a suite of benchmark problems.
- Score: 33.471102483095315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Approximate Policy Iteration (API) algorithms alternate between (approximate)
policy evaluation and (approximate) greedification. Many different approaches
have been explored for approximate policy evaluation, but less is understood
about approximate greedification and what choices guarantee policy improvement.
In this work, we investigate approximate greedification when reducing the KL
divergence between the parameterized policy and the Boltzmann distribution over
action values. In particular, we investigate the difference between the forward
and reverse KL divergences, with varying degrees of entropy regularization. We
show that the reverse KL has stronger policy improvement guarantees, but that
reducing the forward KL can result in a worse policy. We also demonstrate,
however, that a large enough reduction of the forward KL can induce improvement
under additional assumptions. Empirically, we show on simple continuous-action
environments that the forward KL can induce more exploration, but at the cost
of a more suboptimal policy. No significant differences were observed in the
discrete-action setting or on a suite of benchmark problems. Throughout, we
highlight that many policy gradient methods can be seen as instances of API,
with either the forward or reverse KL for the policy update, and discuss next
steps for understanding and improving our policy optimization algorithms.
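To make the two greedification objectives concrete, here is a minimal NumPy sketch (not taken from the paper's code) for a discrete-action setting: it forms the Boltzmann distribution over the action values at temperature tau and evaluates both directions of the KL, following the paper's convention that the forward KL places the Boltzmann target in the first argument. The names boltzmann, forward_kl, and reverse_kl, and the example values, are illustrative.
```python
import numpy as np

def boltzmann(q_values, tau=1.0):
    """Boltzmann (softmax) distribution over action values at temperature tau."""
    logits = q_values / tau
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def forward_kl(pi, q_values, tau=1.0, eps=1e-12):
    """KL(B_tau(Q) || pi): the mass-covering ('mean-seeking') objective."""
    b = boltzmann(q_values, tau)
    return np.sum(b * (np.log(b + eps) - np.log(pi + eps)))

def reverse_kl(pi, q_values, tau=1.0, eps=1e-12):
    """KL(pi || B_tau(Q)): the mode-seeking objective."""
    b = boltzmann(q_values, tau)
    return np.sum(pi * (np.log(pi + eps) - np.log(b + eps)))

# Illustrative example: a policy spreading its mass over two near-optimal actions.
q = np.array([1.0, 0.1, 0.9])
pi = np.array([0.45, 0.10, 0.45])
print(forward_kl(pi, q, tau=0.5), reverse_kl(pi, q, tau=0.5))
```
Reducing either quantity pulls the parameterized policy toward the Boltzmann target, but the two directions penalize mismatches differently, which is the source of the differing improvement guarantees discussed above.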
Related papers
- WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP).
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
arXiv Detail & Related papers (2024-06-24T16:24:34Z)
- Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence [22.400759435696102]
We investigate a generalized KL divergence, called the Tsallis KL divergence, which uses the $q$-logarithm in its definition (a minimal sketch of one common form appears after this list).
We characterize the types of policies learned under the Tsallis KL, and motivate when $q >1$ could be beneficial.
We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
arXiv Detail & Related papers (2023-01-27T00:31:51Z)
- Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized distribution over actions allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
arXiv Detail & Related papers (2022-12-14T22:43:56Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce three different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces [23.186300629667134]
We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is establishing that this scheme converges with constant step and batch sizes.
arXiv Detail & Related papers (2022-01-28T18:54:30Z)
- Optimization Issues in KL-Constrained Approximate Policy Iteration [48.24321346619156]
Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API).
While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL divergence to the previous policy.
Popular practical algorithms such as TRPO, MPO, and VMPO replace regularization with a constraint on the KL divergence between consecutive policies.
arXiv Detail & Related papers (2021-02-11T19:35:33Z)
- Policy Optimization as Online Learning with Mediator Feedback [46.845765216238135]
Policy Optimization (PO) is a widely used approach to address continuous control tasks.
In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space.
We propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RIST) for regret minimization.
arXiv Detail & Related papers (2020-12-15T11:34:29Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
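As referenced in the Tsallis KL entry above, the sketch below shows one common convention for the $q$-logarithm and the induced Tsallis KL divergence; conventions differ across papers, so this is an illustrative form rather than the exact definition used in the MVI($q$) work, and the names log_q and tsallis_kl are hypothetical.
```python
import numpy as np

def log_q(x, q):
    """q-logarithm: ln_q(x) = (x**(1-q) - 1) / (1-q); recovers ln(x) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_kl(p, p_ref, q, eps=1e-12):
    """One common form of the Tsallis KL divergence between discrete distributions:
    D_q(p || p_ref) = -E_p[ln_q(p_ref / p)].  At q = 1 this reduces to KL(p || p_ref)."""
    ratio = (p_ref + eps) / (p + eps)
    return -np.sum(p * log_q(ratio, q))

# Illustrative check against a uniform reference distribution.
p = np.array([0.7, 0.2, 0.1])
mu = np.array([1 / 3, 1 / 3, 1 / 3])
print(tsallis_kl(p, mu, q=1.0))   # matches the standard KL(p || mu)
print(tsallis_kl(p, mu, q=2.0))   # q = 2 penalizes large p/mu ratios more strongly than KL
```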