Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence
- URL: http://arxiv.org/abs/2301.11476v4
- Date: Mon, 18 Mar 2024 15:53:34 GMT
- Title: Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence
- Authors: Lingwei Zhu, Zheng Chen, Matthew Schlegel, Martha White
- Abstract summary: We investigate a generalized KL divergence, called the Tsallis KL divergence, which uses the $q$-logarithm in the definition.
We characterize the types of policies learned under the Tsallis KL, and motivate when $q > 1$ could be beneficial.
We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
- Score: 22.400759435696102
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence -- called the Tsallis KL divergence -- which uses the $q$-logarithm in the definition. The approach is a strict generalization, as $q = 1$ corresponds to the standard KL divergence; $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q > 1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
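To make the claim that $q = 1$ recovers the standard KL divergence concrete, here is a minimal Python sketch of a $q$-logarithm and the resulting Tsallis KL divergence between two discrete distributions. The abstract does not spell out the definitions, so the convention used here, $\ln_q(x) = (x^{1-q} - 1)/(1-q)$ with $D_q(p \| r) = \sum_i p_i \, (-\ln_q(r_i / p_i))$, is an assumption, and the function names `q_log` and `tsallis_kl` are hypothetical. The paper's own notation may differ.

```python
import math


def q_log(x, q):
    """q-logarithm, assuming the convention ln_q(x) = (x**(1-q) - 1) / (1 - q).

    At q = 1 the expression is defined by its limit, the natural logarithm.
    """
    if x <= 0:
        raise ValueError("q_log requires x > 0")
    if math.isclose(q, 1.0):
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def tsallis_kl(p, r, q):
    """Tsallis KL divergence D_q(p || r) = sum_i p_i * (-ln_q(r_i / p_i)).

    Assumes p and r are discrete distributions of equal length and that
    r_i > 0 wherever p_i > 0. Entries with p_i = 0 contribute nothing.
    """
    return sum(pi * -q_log(ri / pi, q) for pi, ri in zip(p, r) if pi > 0)


if __name__ == "__main__":
    p = [0.7, 0.2, 0.1]
    r = [0.5, 0.3, 0.2]
    for q in (1.0, 1.5, 2.0):
        print(f"q = {q}: D_q(p || r) = {tsallis_kl(p, r, q):.6f}")
```

Under this convention, $q = 1$ gives $\sum_i p_i \ln(p_i / r_i)$, the usual KL divergence, and $q = 2$ gives $\sum_i p_i^2 / r_i - 1$.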
Related papers
- WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP).
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
arXiv Detail & Related papers (2024-06-24T16:24:34Z) - Theoretical guarantees on the best-of-n alignment policy [110.21094183592358]
We show that the KL divergence between the best-of-$n$ policy and the base policy is equal to $\log(n) - (n-1)/n$.
We propose a new estimator for the KL divergence and empirically show that it provides a tight approximation through a few examples.
arXiv Detail & Related papers (2024-01-03T18:39:13Z) - Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences [33.471102483095315]
We investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values.
We show that the reverse KL has stronger policy improvement guarantees, but that reducing the forward KL can result in a worse policy.
No significant differences were observed in the discrete-action setting or on a suite of benchmark problems.
arXiv Detail & Related papers (2021-07-17T17:09:18Z) - Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z) - Optimization Issues in KL-Constrained Approximate Policy Iteration [48.24321346619156]
Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API).
While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy.
Popular practical algorithms such as TRPO, MPO, and VMPO replace regularization by a constraint on KL-divergence of consecutive policies.
arXiv Detail & Related papers (2021-02-11T19:35:33Z) - When Will Generative Adversarial Imitation Learning Algorithms Attain Global Convergence [56.40794592158596]
We study generative adversarial imitation learning (GAIL) under general MDP and for nonlinear reward function classes.
This is the first systematic theoretical study of GAIL for global convergence.
arXiv Detail & Related papers (2020-06-24T06:24:37Z) - Markovian Score Climbing: Variational Inference with KL(p||q) [16.661889249333676]
We develop a simple algorithm for reliably minimizing the "inclusive Kullback-Leibler (KL)" divergence KL(p||q).
This method converges to a local optimum of the inclusive KL.
It does not suffer from the systematic errors inherent in existing methods, such as Reweighted Wake-Sleep and Neural Adaptive Monte Carlo.
arXiv Detail & Related papers (2020-03-23T16:38:10Z) - Learning Near Optimal Policies with Low Inherent Bellman Error [115.16037976819331]
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning.
We show that exploration is possible using only batch assumptions with an algorithm that achieves the optimal statistical rate for the setting we consider.
arXiv Detail & Related papers (2020-02-29T02:02:40Z) - Differentiable Bandit Exploration [38.81737411000074]
We learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$.
Our approach is a form of meta-learning and exploits properties of $\mathcal{P}$ without making strong assumptions about its form.
arXiv Detail & Related papers (2020-02-17T05:07:35Z)