Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence
- URL: http://arxiv.org/abs/2301.11476v4
- Date: Mon, 18 Mar 2024 15:53:34 GMT
- Title: Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence
- Authors: Lingwei Zhu, Zheng Chen, Matthew Schlegel, Martha White
- Abstract summary: We investigate a generalized KL divergence, called the Tsallis KL divergence, which uses the $q$-logarithm in its definition.
We characterize the types of policies learned under the Tsallis KL, and motivate when $q > 1$ could be beneficial.
We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
- Score: 22.400759435696102
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence -- called the Tsallis KL divergence -- which uses the $q$-logarithm in the definition. The approach is a strict generalization, as $q = 1$ corresponds to the standard KL divergence; $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q > 1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
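For concreteness, the $q$-logarithm underlying the Tsallis KL is $\ln_q(x) = (x^{1-q} - 1)/(1-q)$, which recovers the natural logarithm as $q \to 1$. The sketch below illustrates one common form of the Tsallis relative entropy between two discrete policies; it is an illustration under that assumed form, not the paper's implementation, and the function names and example policies are hypothetical.

```python
import numpy as np

def q_log(x, q):
    """q-logarithm: (x^(1-q) - 1) / (1 - q); reduces to log(x) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_kl(p, mu, q):
    """One common form of the Tsallis relative entropy,
    D_q(p || mu) = -sum_a p(a) * ln_q(mu(a) / p(a)),
    which equals the standard KL divergence when q = 1."""
    p, mu = np.asarray(p, dtype=float), np.asarray(mu, dtype=float)
    return float(-np.sum(p * q_log(mu / p, q)))

# Two discrete policies over four actions (illustrative values only).
pi_new = np.array([0.5, 0.3, 0.15, 0.05])
pi_old = np.array([0.4, 0.3, 0.2, 0.1])

for q in (1.0, 1.5, 2.0):
    print(f"q={q}: D_q(pi_new || pi_old) = {tsallis_kl(pi_new, pi_old, q):.4f}")
# q=1.0 matches sum(pi_new * log(pi_new / pi_old)), the standard KL regularizer.
```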
Related papers
- KL Penalty Control via Perturbation for Direct Preference Optimization [53.67494512877768]
We propose $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO), which allows adaptive control of the KL penalty strength $\beta$ for each preference pair.
Experimental results show that $varepsilon$-DPO outperforms existing direct alignment algorithms.
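For context on where $\beta$ enters, the standard DPO objective applies a logistic loss to the implicit reward margin scaled by $\beta$. The toy sketch below only shows how a per-pair $\beta$ slots into that loss; it does not reproduce $\varepsilon$-DPO's perturbation-based rule for choosing each $\beta$, and all names and numbers are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss, written so `beta` may be a scalar or one value per pair.

    logp_*     : log-probs of chosen (w) / rejected (l) responses under the policy
    ref_logp_* : the same quantities under the frozen reference policy
    beta       : KL-penalty strength (scalar or one value per preference pair)
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return float(-np.mean(np.log(sigmoid(beta * margin))))

# Toy batch of three preference pairs (illustrative values only).
logp_w, logp_l = np.array([-4.0, -3.5, -5.0]), np.array([-4.5, -4.0, -5.2])
ref_w, ref_l = np.array([-4.2, -3.8, -5.1]), np.array([-4.4, -3.9, -5.0])

print(dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1))                         # fixed beta
print(dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=np.array([0.05, 0.1, 0.2])))  # per-pair beta
```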
arXiv Detail & Related papers (2025-02-18T06:44:10Z) - Logarithmic Regret for Online KL-Regularized Reinforcement Learning [51.113248212150964]
KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models.
Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored.
We propose an optimism-based KL-regularized online contextual bandit algorithm and provide a novel analysis of its regret.
arXiv Detail & Related papers (2025-02-11T11:11:05Z) - Nearly Optimal Sample Complexity of Offline KL-Regularized Contextual Bandits under Single-Policy Concentrability [49.96531901205305]
We propose the first algorithm with $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for offline contextual bandits.
Our proof leverages the strong convexity of the KL regularization, and the conditional non-negativity of the gap between the true reward and its pessimistic estimator.
We extend our algorithm to contextual dueling bandits and achieve a similar nearly optimal sample complexity.
arXiv Detail & Related papers (2025-02-09T22:14:45Z) - WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP)
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
arXiv Detail & Related papers (2024-06-24T16:24:34Z) - Theoretical guarantees on the best-of-n alignment policy [110.21094183592358]
We show that a commonly used analytical formula for the KL divergence between the best-of-$n$ policy and the reference policy is an upper bound on the actual KL divergence.
We also propose a new estimator for the KL divergence and empirically show that it provides a tight approximation.
We conclude with analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy.
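For reference, the analytical expression commonly cited for this divergence is $\log n - (n-1)/n$; the paper's result, as summarized above, is that this expression upper-bounds the true KL divergence rather than equaling it. A minimal sketch of the formula follows (the function name is an assumption).

```python
import math

def best_of_n_kl_bound(n: int) -> float:
    """Commonly cited expression log(n) - (n - 1)/n for the KL divergence between
    the best-of-n policy and the reference policy; per the paper summarized above,
    it upper-bounds the actual KL divergence."""
    return math.log(n) - (n - 1) / n

for n in (1, 4, 16, 64):
    print(n, round(best_of_n_kl_bound(n), 4))
# n = 1 gives 0.0: taking a single sample leaves the reference policy unchanged.
```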
arXiv Detail & Related papers (2024-01-03T18:39:13Z) - Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences [33.471102483095315]
We investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values.
We show that the reverse KL has stronger policy improvement guarantees, but that reducing the forward KL can result in a worse policy.
No significant differences were observed in the discrete-action setting or on a suite of benchmark problems.
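To make the two objectives concrete, the sketch below evaluates the forward KL, KL(Boltzmann || pi), and the reverse KL, KL(pi || Boltzmann), between a parameterized policy and a Boltzmann distribution over action values. It is an illustrative calculation under assumed values, not the paper's experimental setup.

```python
import numpy as np

def softmax(x, tau=1.0):
    """Boltzmann distribution over action values with temperature tau."""
    z = x / tau
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """Discrete KL divergence KL(p || q) = sum_a p(a) log(p(a) / q(a))."""
    return float(np.sum(p * np.log(p / q)))

q_values = np.array([1.0, 0.5, 0.2, -0.3])    # assumed action values
boltzmann = softmax(q_values, tau=0.5)        # greedification target
policy = np.array([0.55, 0.25, 0.15, 0.05])   # some parameterized policy

print("forward KL  KL(Boltzmann || pi):", kl(boltzmann, policy))
print("reverse KL  KL(pi || Boltzmann):", kl(policy, boltzmann))
# Greedification reduces one of these divergences with respect to the policy
# parameters; the paper compares which choice gives better policy improvement.
```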
arXiv Detail & Related papers (2021-07-17T17:09:18Z) - Optimization Issues in KL-Constrained Approximate Policy Iteration [48.24321346619156]
Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API)
While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy.
Popular practical algorithms such as TRPO, MPO, and VMPO replace regularization by a constraint on KL-divergence of consecutive policies.
arXiv Detail & Related papers (2021-02-11T19:35:33Z) - Markovian Score Climbing: Variational Inference with KL(p||q) [16.661889249333676]
We develop a simple algorithm for reliably minimizing the inclusive Kullback-Leibler (KL) divergence, KL(p || q).
This method converges to a local optimum of the inclusive KL.
It does not suffer from the systematic errors inherent in existing methods, such as Reweighted Wake-Sleep and Neural Adaptive Sequential Monte Carlo.
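As a quick reminder of the terminology, the inclusive KL is KL(p || q) (an expectation under the target p), while the exclusive KL is KL(q || p) (an expectation under the approximation q, the objective standard variational inference typically minimizes). A tiny numerical sketch with assumed distributions:

```python
import numpy as np

def kl(a, b):
    """Discrete KL divergence KL(a || b) = sum_x a(x) log(a(x) / b(x))."""
    return float(np.sum(a * np.log(a / b)))

p = np.array([0.7, 0.2, 0.1])   # target distribution (assumed values)
q = np.array([0.4, 0.4, 0.2])   # variational approximation (assumed values)

print("inclusive KL(p || q):", kl(p, q))   # objective targeted by Markovian Score Climbing
print("exclusive KL(q || p):", kl(q, p))   # objective standard VI typically minimizes
```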
arXiv Detail & Related papers (2020-03-23T16:38:10Z) - Differentiable Bandit Exploration [38.81737411000074]
We learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$.
Our approach is a form of meta-learning and exploits properties of $\mathcal{P}$ without making strong assumptions about its form.
arXiv Detail & Related papers (2020-02-17T05:07:35Z)