PTR-PPO: Proximal Policy Optimization with Prioritized Trajectory Replay
- URL: http://arxiv.org/abs/2112.03798v2
- Date: Wed, 8 Dec 2021 02:12:33 GMT
- Title: PTR-PPO: Proximal Policy Optimization with Prioritized Trajectory Replay
- Authors: Xingxing Liang and Yang Ma and Yanghe Feng and Zhong Liu
- Abstract summary: On-policy deep reinforcement learning algorithms have low data utilization and require significant experience for policy improvement.
This paper proposes proximal policy optimization with prioritized trajectory replay (PTR-PPO), which combines on-policy and off-policy methods to improve sampling efficiency.
We evaluate the performance of PTR-PPO in a set of Atari discrete control tasks, achieving state-of-the-art performance.
- Score: 4.0388304511445146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On-policy deep reinforcement learning algorithms have low data utilization
and require significant experience for policy improvement. This paper proposes
a proximal policy optimization algorithm with prioritized trajectory replay
(PTR-PPO) that combines on-policy and off-policy methods to improve sampling
efficiency by prioritizing the replay of trajectories generated by old
policies. We first design three trajectory priorities based on trajectory
characteristics: the first two are max and mean trajectory priorities computed
from one-step empirical generalized advantage estimation (GAE) values, and the
third is a reward trajectory priority computed from the normalized
undiscounted cumulative reward. Then, we incorporate the prioritized trajectory
replay into the PPO algorithm, propose a truncated importance weight method to
overcome the high variance caused by large importance weights under multistep
experience, and design a policy improvement loss function for PPO under
off-policy conditions. We evaluate the performance of PTR-PPO in a set of Atari
discrete control tasks, achieving state-of-the-art performance. In addition, by
analyzing the heatmap of priority changes at various locations in the priority
memory during training, we find that memory size and rollout length can have a
significant impact on the distribution of trajectory priorities and, hence, on
the performance of the algorithm.
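A minimal sketch of the ideas above, assuming NumPy arrays and hypothetical helper names (the paper's exact definitions of the one-step GAE estimator, the normalization bounds, and the truncation constant may differ):

```python
import numpy as np

def one_step_deltas(rewards, values, gamma=0.99):
    """One-step TD errors delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    i.e. the one-step GAE values referred to in the abstract.
    rewards has shape (T,), values has shape (T+1,)."""
    return rewards + gamma * values[1:] - values[:-1]

def trajectory_priorities(rewards, values, ret_min, ret_max, gamma=0.99):
    """The three candidate priorities for a single trajectory:
    max / mean of the one-step GAE magnitudes, and the normalized
    undiscounted return. ret_min / ret_max are assumed to be tracked
    over the trajectories currently held in the priority memory."""
    deltas = np.abs(one_step_deltas(rewards, values, gamma))
    p_max = deltas.max()                       # max trajectory priority
    p_mean = deltas.mean()                     # mean trajectory priority
    undiscounted_return = rewards.sum()        # undiscounted cumulative reward
    p_reward = (undiscounted_return - ret_min) / max(ret_max - ret_min, 1e-8)
    return p_max, p_mean, p_reward

def truncated_importance_weights(logp_new, logp_old, c_bar=1.0):
    """Per-step importance ratios pi_new / pi_old, truncated at c_bar to
    curb the variance that large ratios cause when replaying multistep
    off-policy experience."""
    return np.minimum(np.exp(logp_new - logp_old), c_bar)
```

Trajectories would then be sampled from the priority memory in proportion to one of these priorities, and the truncated weights would enter the off-policy PPO policy-improvement loss the abstract describes.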
Related papers
- A dynamical clipping approach with task feedback for Proximal Policy Optimization [29.855219523565786]
There is no theoretical proof that the optimal PPO clipping bound remains consistent throughout the entire training process.
Past studies have aimed to dynamically adjust PPO clipping bound to enhance PPO's performance.
We propose Preference-based Proximal Policy Optimization (Pb-PPO) to better reflect the preference (maximizing return) of reinforcement learning tasks.
arXiv Detail & Related papers (2023-12-12T06:35:56Z)
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Dropout Strategy in Reinforcement Learning: Limiting the Surrogate Objective Variance in Policy Optimization Methods [0.0]
Policy-based reinforcement learning algorithms are widely used in various fields.
These algorithms introduce importance sampling into policy iteration.
This can introduce high variance into the surrogate objective and indirectly affect the stability and convergence of the algorithm.
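For context, a small illustration (not taken from the paper) of how the importance-sampling ratios enter the surrogate objective and why their spread drives its variance:

```python
import numpy as np

def surrogate_stats(logp_new, logp_old, adv):
    """Per-sample surrogate terms L_i = (pi_new / pi_old)(a_i | s_i) * A_i.
    As the updated policy drifts away from the policy that collected the data,
    the ratios spread out and the variance of these terms grows, which is the
    instability the dropout strategy above aims to limit."""
    ratios = np.exp(logp_new - logp_old)
    terms = ratios * adv
    return terms.mean(), terms.var()
```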
arXiv Detail & Related papers (2023-10-31T11:38:26Z)
- Decoupled Prioritized Resampling for Offline RL [120.49021589395005]
We propose Offline Prioritized Experience Replay (OPER) for offline reinforcement learning.
OPER features a class of priority functions designed to prioritize highly-rewarding transitions, making them more frequently visited during training.
We show that this class of priority functions induces an improved behavior policy, and when constrained to this improved policy, a policy-constrained offline RL algorithm is likely to yield a better solution.
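A generic sketch of return-based prioritized resampling in this spirit; OPER's actual priority functions are defined in the paper, so the exponential weighting below is only an assumed stand-in:

```python
import numpy as np

def prioritized_batch_indices(returns, batch_size, temperature=1.0, rng=None):
    """Sample dataset indices with probability increasing in the normalized
    return assigned to each transition (e.g. the return of its trajectory),
    so highly-rewarding data is visited more often during training."""
    rng = rng or np.random.default_rng()
    r = np.asarray(returns, dtype=np.float64)
    z = (r - r.min()) / max(r.max() - r.min(), 1e-8)
    probs = np.exp(z / temperature)
    probs /= probs.sum()
    return rng.choice(len(r), size=batch_size, p=probs)
```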
arXiv Detail & Related papers (2023-06-08T17:56:46Z)
- You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios.
We further show that the paper's proposed alternative, ESPO, can be easily scaled up to distributed training with many workers, delivering strong performance as well.
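One way to read this point, sketched under the assumption that ESPO bounds the ratios by stopping the surrogate-optimization epochs early rather than by clipping each ratio (the exact criterion is defined in the paper):

```python
import numpy as np

def should_stop_updates(logp_new, logp_old, deviation_threshold=0.25):
    """Hypothetical early-stopping check: halt further optimization epochs on
    the current batch once the probability ratios have drifted too far from 1,
    instead of clipping each ratio individually."""
    ratios = np.exp(logp_new - logp_old)
    return np.abs(ratios - 1.0).mean() > deviation_threshold
```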
arXiv Detail & Related papers (2022-01-31T20:26:56Z)
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
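As an illustration of one such single-step scheme (exponentially weighted imitation is one of the one-step variants considered in this line of work; the exact algorithm evaluated in the paper may differ):

```python
import numpy as np

def one_step_weighted_imitation_loss(logp_behavior_actions, q_values, v_values,
                                     beta=1.0, weight_clip=20.0):
    """One constrained policy-improvement step: weight the behavior policy's
    actions by exp(A / beta), where A = Q - V comes from an on-policy estimate
    of the behavior policy, then fit the new policy by weighted log-likelihood
    (loss to be minimized). Weights are clipped for numerical stability."""
    adv = q_values - v_values
    weights = np.minimum(np.exp(adv / beta), weight_clip)
    return -(weights * logp_behavior_actions).mean()
```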
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
- Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z)
- Proximal Deterministic Policy Gradient [20.951797549505986]
We introduce two techniques to improve off-policy Reinforcement Learning (RL) algorithms.
We exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action value estimate.
We demonstrate significant performance improvement over state-of-the-art algorithms on standard continuous-control RL benchmarks.
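For reference, the common twin-critic baseline that such methods start from; the paper builds an improved combination of the two critics, whose exact form is given there:

```python
import numpy as np

def clipped_double_q(q1, q2):
    """Standard twin-critic estimate: take the elementwise minimum of the two
    action-value predictions to counter overestimation bias. This is only the
    common starting point that the paper's improved estimate refines."""
    return np.minimum(q1, q2)
```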
arXiv Detail & Related papers (2020-08-03T10:19:59Z)
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations": algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm.
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
- Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z)