Cautious Policy Programming: Exploiting KL Regularization in Monotonic
Policy Improvement for Reinforcement Learning
- URL: http://arxiv.org/abs/2107.05798v1
- Date: Tue, 13 Jul 2021 01:03:10 GMT
- Title: Cautious Policy Programming: Exploiting KL Regularization in Monotonic
Policy Improvement for Reinforcement Learning
- Authors: Lingwei Zhu, Toshinori Kitamura, Takamitsu Matsubara
- Abstract summary: We propose a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning.
We demonstrate that the proposed algorithm can trade o? performance and stability in both didactic classic control problems and challenging high-dimensional Atari games.
- Score: 11.82492300303637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose cautious policy programming (CPP), a novel
value-based reinforcement learning (RL) algorithm that can ensure monotonic
policy improvement during learning. Based on the nature of entropy-regularized
RL, we derive a new entropy regularization-aware lower bound of policy
improvement that only requires estimating the expected policy advantage
function. CPP leverages this lower bound as a criterion for adjusting the
degree of a policy update for alleviating policy oscillation. Different from
similar algorithms that are mostly theory-oriented, we also propose a novel
interpolation scheme that makes CPP better scale in high dimensional control
problems. We demonstrate that the proposed algorithm can trade o? performance
and stability in both didactic classic control problems and challenging
high-dimensional Atari games.
Related papers
- Beyond Expected Returns: A Policy Gradient Algorithm for Cumulative Prospect Theoretic Reinforcement Learning [0.46040036610482665]
Cumulative Prospect Theory (CPT) has been developed to provide a better model for human-based decision-making supported by empirical evidence.
A few years ago, CPT has been combined with Reinforcement Learning (RL) to formulate a CPT policy optimization problem.
We show that our policy gradient algorithm scales better to larger state spaces compared to the existing zeroth order algorithm for solving the same problem.
arXiv Detail & Related papers (2024-10-03T15:45:39Z) - Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline
Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also out-performs competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted Markov decision process (constrained MDP)
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, conservative policy update guarantees gradually improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z) - Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees [8.610425739792284]
We revisit the domain of off-policy policy optimization in RL.
One commonly-used approach is to leverage the off-policy policy gradient to optimize a surrogate objective.
This approach has been shown to suffer from the distribution mismatch issue.
arXiv Detail & Related papers (2022-12-10T07:47:04Z) - Offline Reinforcement Learning with Closed-Form Policy Improvement
Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z) - Hinge Policy Optimization: Rethinking Policy Improvement and
Reinterpreting PPO [6.33198867705718]
Policy optimization is a fundamental principle for designing reinforcement learning algorithms.
Despite its superior empirical performance, PPO-clip has not been justified via theoretical proof up to date.
This is the first ever that can prove global convergence to an optimal policy for a variant of PPO-clip.
arXiv Detail & Related papers (2021-10-26T15:56:57Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Ensuring Monotonic Policy Improvement in Entropy-regularized Value-based
Reinforcement Learning [14.325835899564664]
entropy-regularized value-based reinforcement learning method can ensure the monotonic improvement of policies at each policy update.
We propose a novel reinforcement learning algorithm that exploits this lower-bound as a criterion for adjusting the degree of a policy update for alleviating policy oscillation.
arXiv Detail & Related papers (2020-08-25T04:09:18Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL)
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.