Optimization Issues in KL-Constrained Approximate Policy Iteration
- URL: http://arxiv.org/abs/2102.06234v1
- Date: Thu, 11 Feb 2021 19:35:33 GMT
- Title: Optimization Issues in KL-Constrained Approximate Policy Iteration
- Authors: Nevena Lazić, Botao Hao, Yasin Abbasi-Yadkori, Dale Schuurmans, Csaba Szepesvári
- Abstract summary: Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API).
While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy.
Popular practical algorithms such as TRPO, MPO, and VMPO replace regularization by a constraint on KL-divergence of consecutive policies.
- Score: 48.24321346619156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many reinforcement learning algorithms can be seen as versions of approximate
policy iteration (API). While standard API often performs poorly, it has been
shown that learning can be stabilized by regularizing each policy update by the
KL-divergence to the previous policy. Popular practical algorithms such as
TRPO, MPO, and VMPO replace regularization by a constraint on KL-divergence of
consecutive policies, arguing that this is easier to implement and tune. In
this work, we study this implementation choice in more detail. We compare the
use of KL divergence as a constraint vs. as a regularizer, and point out
several optimization issues with the widely-used constrained approach. We show
that the constrained algorithm is not guaranteed to converge even on simple
problem instances where the constrained problem can be solved exactly, and in
fact incurs linear expected regret. With approximate implementation using
softmax policies, we show that regularization can improve the optimization
landscape of the original objective. We demonstrate these issues empirically on
several bandit and RL environments.
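To make the comparison concrete, below is a minimal numpy sketch (illustrative only, not the paper's code) of a single policy update for a tabular softmax policy on a toy bandit. The KL-regularized update has a closed-form exponentiated reweighting of the previous policy; when the constraint is active, the KL-constrained update takes the same form with the temperature chosen so that the KL radius is met, which can be found by bisection. The Q-values, regularization strength tau, and KL radius epsilon below are assumptions chosen for illustration.

```python
# Minimal illustrative sketch: KL as a regularizer vs. as a constraint
# for a single softmax-policy update on a toy bandit (not the paper's code).
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def regularized_update(pi_old, q_values, tau):
    """Maximize E_pi[Q] - tau * KL(pi || pi_old).
    Closed form: pi_new(a) is proportional to pi_old(a) * exp(Q(a) / tau)."""
    logits = np.log(pi_old) + q_values / tau
    pi_new = np.exp(logits - logits.max())
    return pi_new / pi_new.sum()

def constrained_update(pi_old, q_values, epsilon, lo=1e-3, hi=1e3, iters=100):
    """Maximize E_pi[Q] subject to KL(pi || pi_old) <= epsilon.
    When the constraint binds, the solution has the same exponential form;
    bisect on the temperature until the KL radius epsilon is met."""
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        pi = regularized_update(pi_old, q_values, tau)
        if kl(pi, pi_old) > epsilon:
            lo = tau   # update moved too far: raise the temperature
        else:
            hi = tau   # constraint has slack: lower the temperature
    return regularized_update(pi_old, q_values, hi)  # hi keeps KL <= epsilon

if __name__ == "__main__":
    pi_old = np.array([0.5, 0.3, 0.2])      # previous policy (assumed)
    q_values = np.array([1.0, 0.0, -1.0])   # action values (assumed)
    pi_reg = regularized_update(pi_old, q_values, tau=0.5)
    pi_con = constrained_update(pi_old, q_values, epsilon=0.1)
    print("regularized:", pi_reg, "KL:", kl(pi_reg, pi_old))
    print("constrained:", pi_con, "KL:", kl(pi_con, pi_old))
```

On this convex one-step sub-problem the two updates coincide for a matched pairing of tau and epsilon. The abstract's claim is that, across iterations, the constrained variant can nonetheless fail to converge even when each constrained sub-problem is solved exactly, whereas with approximate softmax policies regularization improves the optimization landscape of the original objective.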
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Simple Policy Optimization [7.228021064624876]
We introduce an algorithm named SPO, which incorporates a novel clipping method for the KL divergence between the old and new policies (a generic illustration of KL-based clipping appears after this list).
SPO achieves better sample efficiency, extremely low KL divergence, and higher policy entropy, while also being robust to increases in network depth or complexity.
arXiv Detail & Related papers (2024-01-29T10:17:54Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum Markov Games [10.805520579293747]
We show that a simple variant of naive policy iteration for games converges exponentially fast.
We also show that lookahead policies can be implemented efficiently in the function approximation setting of linear Markov games.
arXiv Detail & Related papers (2023-03-17T01:20:22Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- A policy gradient approach for Finite Horizon Constrained Markov Decision Processes [6.682382456607199]
We present an algorithm for constrained RL in the finite-horizon setting, where the horizon terminates after a fixed (finite) time.
To the best of our knowledge, our paper presents the first policy gradient algorithm for the finite horizon setting with constraints.
arXiv Detail & Related papers (2022-10-10T09:52:02Z)
- Policy Optimization for Stochastic Shortest Path [43.2288319750466]
We study policy optimization for the stochastic shortest path (SSP) problem.
We propose a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model.
For most settings, our algorithm is shown to achieve a near-optimal regret bound.
arXiv Detail & Related papers (2022-02-07T16:25:14Z)
- Policy Optimization as Online Learning with Mediator Feedback [46.845765216238135]
Policy Optimization (PO) is a widely used approach to address continuous control tasks.
In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space.
We propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RIST) for regret minimization.
arXiv Detail & Related papers (2020-12-15T11:34:29Z)
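As flagged in the Simple Policy Optimization entry above, one way to control a policy update is to clip it based on the KL divergence between the old and new policies. The SPO objective itself is not reproduced here; the following hypothetical PyTorch sketch only illustrates a generic form of KL-based clipping, in which states whose KL already exceeds a threshold stop contributing to the gradient. All names and the threshold value are assumptions.

```python
# Hypothetical sketch of a KL-clipped policy-gradient surrogate (not the SPO
# objective): states whose KL(old || new) exceeds kl_max are masked out.
import torch

def kl_clipped_surrogate(new_logits, old_logits, actions, advantages, kl_max=0.02):
    new_log_probs = torch.log_softmax(new_logits, dim=-1)
    old_log_probs = torch.log_softmax(old_logits, dim=-1).detach()
    # Per-state KL divergence between the frozen old policy and the current one.
    kl = (old_log_probs.exp() * (old_log_probs - new_log_probs)).sum(dim=-1)
    # Importance ratio pi_new(a|s) / pi_old(a|s) for the actions actually taken.
    idx = actions.unsqueeze(-1)
    ratio = torch.exp(new_log_probs.gather(-1, idx) - old_log_probs.gather(-1, idx)).squeeze(-1)
    # Drop the contribution of states that have already moved too far in KL.
    mask = (kl <= kl_max).float()
    return -(mask * ratio * advantages).mean()

# Example usage on random data: a batch of 4 states with 3 actions each.
new_logits = torch.randn(4, 3, requires_grad=True)
old_logits = torch.randn(4, 3)
actions = torch.randint(0, 3, (4,))
advantages = torch.randn(4)
kl_clipped_surrogate(new_logits, old_logits, actions, advantages).backward()
```

Hard-masking is only one of several ways to realize such a clip; a KL penalty or a trust-region line search would serve the same role.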