Optimization Issues in KL-Constrained Approximate Policy Iteration
- URL: http://arxiv.org/abs/2102.06234v1
- Date: Thu, 11 Feb 2021 19:35:33 GMT
- Title: Optimization Issues in KL-Constrained Approximate Policy Iteration
- Authors: Nevena Lazić, Botao Hao, Yasin Abbasi-Yadkori, Dale Schuurmans, Csaba Szepesvári
- Abstract summary: Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API).
While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy.
Popular practical algorithms such as TRPO, MPO, and VMPO replace regularization by a constraint on KL-divergence of consecutive policies.
- Score: 48.24321346619156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many reinforcement learning algorithms can be seen as versions of approximate
policy iteration (API). While standard API often performs poorly, it has been
shown that learning can be stabilized by regularizing each policy update by the
KL-divergence to the previous policy. Popular practical algorithms such as
TRPO, MPO, and VMPO replace regularization by a constraint on KL-divergence of
consecutive policies, arguing that this is easier to implement and tune. In
this work, we study this implementation choice in more detail. We compare the
use of KL divergence as a constraint vs. as a regularizer, and point out
several optimization issues with the widely-used constrained approach. We show
that the constrained algorithm is not guaranteed to converge even on simple
problem instances where the constrained problem can be solved exactly, and in
fact incurs linear expected regret. With approximate implementation using
softmax policies, we show that regularization can improve the optimization
landscape of the original objective. We demonstrate these issues empirically on
several bandit and RL environments.
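The comparison can be made concrete for exact softmax policies. Below is a minimal sketch (a hypothetical numpy illustration, not the authors' implementation) of the two update rules: the KL-regularized update has a closed form, while the KL-constrained update here picks the temperature by binary search so that the exponentiated update just meets the trust-region radius; the step size eta and radius epsilon are illustrative choices.

# Minimal sketch (assumptions only, not the paper's implementation) of the two
# update rules being compared, for an exact softmax policy on a K-armed bandit.
# The step size `eta` and trust-region radius `epsilon` below are illustrative.
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def regularized_update(pi_old, q_values, eta):
    """Maximize <pi, q> - (1/eta) * KL(pi || pi_old).
    Closed form: pi_new proportional to pi_old * exp(eta * q)."""
    logits = np.log(pi_old + 1e-12) + eta * q_values
    pi_new = np.exp(logits - logits.max())
    return pi_new / pi_new.sum()

def constrained_update(pi_old, q_values, epsilon, eta_max=1e4, iters=50):
    """Maximize <pi, q> subject to KL(pi || pi_old) <= epsilon.
    Binary-search the temperature eta so that the exponentiated update
    meets the KL radius (the KL grows monotonically with eta)."""
    lo, hi = 0.0, eta_max
    for _ in range(iters):
        eta = 0.5 * (lo + hi)
        if kl(regularized_update(pi_old, q_values, eta), pi_old) > epsilon:
            hi = eta  # constraint violated: shrink the step
        else:
            lo = eta  # constraint slack: enlarge the step
    return regularized_update(pi_old, q_values, lo)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=5)   # hypothetical action-value estimates
    pi = np.ones(5) / 5      # uniform initial policy
    print("regularized:", regularized_update(pi, q, eta=1.0))
    print("constrained:", constrained_update(pi, q, epsilon=0.1))

The regularized form fixes eta across updates, while the constrained form effectively re-chooses it at every step; that implementation choice is what the paper scrutinizes.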
Related papers
- Convergence of Policy Mirror Descent Beyond Compatible Function Approximation [66.4260157478436]
We develop theory for PMD with general policy classes, where we assume a strictly weaker variational dominance and obtain convergence to the best-in-class policy.
Our main result leverages a novel notion induced by the local norm of the occupancy-gradient measure.
arXiv Detail & Related papers (2025-02-16T08:05:46Z)
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum Markov Games [10.805520579293747]
We show that a simple variant of naive policy iteration for games converges exponentially fast.
We also show that lookahead policies can be implemented efficiently in the function approximation setting of linear Markov games.
arXiv Detail & Related papers (2023-03-17T01:20:22Z)
- Offline Policy Optimization in RL with Variance Regularizaton [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- A policy gradient approach for Finite Horizon Constrained Markov Decision Processes [6.682382456607199]
We present an algorithm for constrained RL in the finite-horizon setting, where the horizon terminates after a fixed (finite) time.
To the best of our knowledge, our paper presents the first policy gradient algorithm for the finite horizon setting with constraints.
arXiv Detail & Related papers (2022-10-10T09:52:02Z)
- Policy Optimization for Stochastic Shortest Path [43.2288319750466]
We study policy optimization for the stochastic shortest path (SSP) problem.
We propose a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model.
For most settings, our algorithm is shown to achieve a near-optimal regret bound.
arXiv Detail & Related papers (2022-02-07T16:25:14Z)
- Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences [33.471102483095315]
We investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values.
We show that the reverse KL has stronger policy improvement guarantees, but that reducing the forward KL can result in a worse policy.
No significant differences were observed in the discrete-action setting or on a suite of benchmark problems; a toy sketch of the two KL directions appears after this list.
arXiv Detail & Related papers (2021-07-17T17:09:18Z)
- Policy Optimization as Online Learning with Mediator Feedback [46.845765216238135]
Policy Optimization (PO) is a widely used approach to address continuous control tasks.
In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space.
We propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization.
arXiv Detail & Related papers (2020-12-15T11:34:29Z)
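For the greedification entry above, the sketch below (hypothetical Q-values, temperature, and parameters; not that paper's code) illustrates the two objectives it compares: the KL divergence in each direction between a parameterized softmax policy and the Boltzmann distribution over action values.

# Toy sketch of forward vs. reverse KL greedification targets; the Q-values,
# temperature tau, and policy parameters below are illustrative assumptions.
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def forward_kl(q_values, theta, tau=1.0):
    """KL(Boltzmann(Q / tau) || pi_theta): a weighted cross-entropy toward the target."""
    b = softmax(q_values / tau)
    p = softmax(theta)
    return float(np.sum(b * (np.log(b + 1e-12) - np.log(p + 1e-12))))

def reverse_kl(q_values, theta, tau=1.0):
    """KL(pi_theta || Boltzmann(Q / tau)): tends to concentrate mass on high-value actions."""
    b = softmax(q_values / tau)
    p = softmax(theta)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(b + 1e-12))))

if __name__ == "__main__":
    q = np.array([1.0, 0.5, -0.2])  # hypothetical action values
    theta = np.zeros(3)             # uniform softmax policy
    print("forward KL :", forward_kl(q, theta))
    print("reverse KL :", reverse_kl(q, theta))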