Proximal Policy Optimization with Adaptive Threshold for Symmetric
Relative Density Ratio
- URL: http://arxiv.org/abs/2203.09809v1
- Date: Fri, 18 Mar 2022 09:13:13 GMT
- Title: Proximal Policy Optimization with Adaptive Threshold for Symmetric
Relative Density Ratio
- Authors: Taisuke Kobayashi
- Abstract summary: A popular method, the so-called proximal policy optimization (PPO), and its variants constrain the density ratio of the latest and baseline policies when the density ratio exceeds a given threshold.
This paper proposes a new PPO variant derived from the relative Pearson (RPE) divergence, hence called PPO-RPE, to design the threshold adaptively.
- Score: 8.071506311915396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep reinforcement learning (DRL) is one of the promising approaches for
introducing robots into complicated environments. The recent remarkable
progress of DRL rests on policy regularization, which allows the policy to
improve stably and efficiently. A popular method, so-called proximal policy
optimization (PPO), and its variants constrain the density ratio of the latest
and baseline policies when the density ratio exceeds a given threshold. This
threshold can be designed relatively intuitively, and in fact a recommended
value range has been suggested. However, the density ratio is asymmetric about
its center, and the possible error scale from that center, which should be
close to the threshold, depends on how the baseline policy is given. To
maximize the value of policy regularization, this paper proposes a new PPO
derived using the relative Pearson (RPE) divergence, hence called PPO-RPE, to
design the threshold adaptively. In PPO-RPE, the relative density ratio, which
can be formed symmetrically, replaces the raw density ratio. Thanks to this
symmetry, its error scale from the center can easily be estimated, and hence
the threshold can be adapted to the estimated error scale. Three simple
benchmark simulations reveal the importance of algorithm-dependent threshold
design. Simulations of four additional locomotion tasks verify that the
proposed method statistically contributes to task accomplishment by
appropriately restricting the policy updates.
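To make the mechanism concrete, the minimal sketch below contrasts the standard
PPO clipped surrogate with a surrogate built on a relative density ratio in the
spirit of PPO-RPE. It is an illustration under stated assumptions, not the
authors' implementation: the mixing coefficient beta, the batch-std error-scale
proxy, and the resulting threshold rule stand in for the adaptive design
derived in the paper.

import torch

def ppo_clip_loss(logp_new, logp_old, advantage, epsilon=0.2):
    # Standard PPO: clip the raw density ratio pi_new / pi_old around 1.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return -torch.min(ratio * advantage, clipped * advantage).mean()

def relative_density_ratio(logp_new, logp_old, beta=0.5):
    # Relative density ratio r_beta = p / (beta * p + (1 - beta) * q).
    # For beta in (0, 1] it is bounded in [0, 1/beta] and equals 1 when p = q,
    # unlike the raw ratio, which is unbounded above.
    p, q = torch.exp(logp_new), torch.exp(logp_old)
    return p / (beta * p + (1.0 - beta) * q)

def ppo_rpe_style_loss(logp_new, logp_old, advantage, beta=0.5, kappa=1.0):
    # PPO-RPE-style sketch: clip the *relative* ratio around its center 1 with
    # a threshold scaled by an estimated error scale; the batch standard
    # deviation below is an assumed stand-in for that estimate.
    r = relative_density_ratio(logp_new, logp_old, beta)
    eps = kappa * (r.detach().std() + 1e-8)
    clipped = torch.clamp(r, 1.0 - eps, 1.0 + eps)
    return -torch.min(r * advantage, clipped * advantage).mean()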
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is usually employed in RL as a passive tool for re-weighting historical samples.
We instead look for the best behavioral policy from which to collect samples so as to reduce the policy-gradient variance (a generic IS-weighted estimator is sketched after this entry).
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
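For reference, here is a minimal, generic sketch of a trajectory-level
IS-weighted REINFORCE gradient (an assumed illustration only; the paper's
contribution is choosing the behavior policy that minimizes the variance of
such an estimator, which this sketch does not attempt).

import torch

def is_weighted_pg_loss(pi_logp, beta_logp, returns):
    # pi_logp, beta_logp: (N, T) log-probabilities of the taken actions under
    # the target policy pi and the behavior policy beta; returns: (N,) returns.
    # Trajectory-level IS weight: prod_t pi(a_t|s_t) / beta(a_t|s_t).
    log_w = (pi_logp - beta_logp).sum(dim=1)
    w = torch.exp(log_w).detach()  # stop gradients through the weights
    # REINFORCE surrogate whose gradient is the IS-weighted policy gradient.
    surrogate = (w * returns * pi_logp.sum(dim=1)).mean()
    return -surrogate  # negate so gradient descent ascends the objective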
- Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems [1.747623282473278]
We introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distribution commonly obtained from Markov decision processes (MDPs) in networks.
Specifically, when the stationary distribution of the MDP is parametrized by the policy parameters, we can improve existing policy-gradient methods for average-reward estimation.
arXiv Detail & Related papers (2023-12-05T14:44:58Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy (a generic primal-dual update is sketched after this entry).
To the best of our knowledge, this is the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
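For orientation, the sketch below shows a generic single-time-scale Lagrangian
primal-dual update for a constrained MDP (maximize the reward value V_r subject
to a budget on the cost value V_c). It is a textbook-style illustration with
assumed helper callables (grad_Vr, grad_Vc, estimate_Vc), not the specific
algorithms analyzed in that paper.

def primal_dual_step(theta, lam, grad_Vr, grad_Vc, estimate_Vc, budget, eta=1e-2):
    # Primal step: gradient ascent on the Lagrangian
    #   L(theta, lam) = V_r(theta) - lam * (V_c(theta) - budget).
    theta = [t + eta * (gr - lam * gc)
             for t, gr, gc in zip(theta, grad_Vr(theta), grad_Vc(theta))]
    # Dual step: ascend lam on the constraint violation using the *same* step
    # size (single time scale), then project back onto lam >= 0.
    lam = max(0.0, lam + eta * (estimate_Vc(theta) - budget))
    return theta, lam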
- The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state-value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z)
- Supported Policy Optimization for Offline Reinforcement Learning [74.1011309005488]
Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint.
arXiv Detail & Related papers (2022-02-13T07:38:36Z)
- Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration [39.250754806600135]
Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy.
Conventional methods for off-policy PG estimation often suffer from significant bias or exponentially large variance.
In this paper, we propose the double Fitted PG estimation (FPG) algorithm.
arXiv Detail & Related papers (2022-01-31T20:23:52Z)
- On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces [23.186300629667134]
We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is the establishment that this scheme converges with constant step and batch sizes.
arXiv Detail & Related papers (2022-01-28T18:54:30Z)
- Minimax Off-Policy Evaluation for Multi-Armed Bandits [58.7013651350436]
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
arXiv Detail & Related papers (2021-01-19T18:55:29Z)
- Proximal Policy Optimization with Relative Pearson Divergence [8.071506311915396]
PPO clips the density ratio of the latest and baseline policies with a threshold, while its minimization target is unclear.
This paper proposes a new variant of PPO by considering a regularization problem of the relative Pearson (RPE) divergence, so-called PPO-RPE.
Through four benchmark tasks, PPO-RPE performed as well as or better than the conventional methods in terms of the task performance of the learned policy.
arXiv Detail & Related papers (2020-10-07T09:11:22Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.