Dropout Strategy in Reinforcement Learning: Limiting the Surrogate
Objective Variance in Policy Optimization Methods
- URL: http://arxiv.org/abs/2310.20380v3
- Date: Fri, 3 Nov 2023 04:12:09 GMT
- Title: Dropout Strategy in Reinforcement Learning: Limiting the Surrogate
Objective Variance in Policy Optimization Methods
- Authors: Zhengpeng Xie, Changdong Yu, Weizheng Qiao
- Abstract summary: Policy-based reinforcement learning algorithms are widely used in various fields.
Mainstream policy optimization algorithms such as TRPO and PPO introduce importance sampling into policy iteration.
This can lead to high variance of the surrogate objective, which in turn harms the stability and convergence of the algorithm.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy-based reinforcement learning algorithms are widely used in various
fields. Among them, mainstream policy optimization algorithms such as TRPO and
PPO introduce importance sampling into policy iteration, which allows the reuse
of historical data. However, this can also lead to a high variance of the
surrogate objective, which in turn harms the stability and convergence of the
algorithm. In this paper, we first derive an upper bound on the variance of the
surrogate objective, which can grow quadratically with the surrogate objective
itself. Next, we propose a dropout technique to prevent the excessive growth of
the surrogate objective variance caused by importance sampling. We then
introduce a general reinforcement learning framework applicable to mainstream
policy optimization methods and apply the dropout technique to the PPO
algorithm to obtain the D-PPO variant. Finally, we conduct comparative
experiments between D-PPO and PPO in the Atari 2600 environment; the results
show that D-PPO achieves significant performance improvements over PPO and
effectively limits the excessive growth of the surrogate objective variance
during training.
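For context, the surrogate objective here is the importance-sampling estimate
L(theta) = E_{pi_old}[ (pi_theta(a|s) / pi_old(a|s)) * A(s, a) ], whose variance is
driven by the size of the importance ratios. Below is a minimal sketch of how a
sample-dropout step could be combined with the standard PPO clipped surrogate:
samples with extreme ratios are masked out of the average. The ratio threshold and
the masking rule are illustrative assumptions, not the exact dropout criterion
derived in the paper.

```python
import torch

def dropout_ppo_surrogate(logp_new, logp_old, advantages,
                          clip_eps=0.2, ratio_limit=2.0):
    """Sketch of a PPO clipped surrogate with an extra sample-dropout step.

    Samples whose importance ratio pi_new / pi_old falls outside
    [1 / ratio_limit, ratio_limit] are masked out, so a few extreme ratios
    cannot dominate (and inflate the variance of) the surrogate estimate.
    The masking rule and ratio_limit are illustrative assumptions, not the
    exact D-PPO criterion from the paper.
    """
    ratio = torch.exp(logp_new - logp_old)                 # importance weights
    keep = ((ratio < ratio_limit) & (ratio > 1.0 / ratio_limit)).float()
    # Standard PPO clipped surrogate, averaged over the kept samples only.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped)
    return (surrogate * keep).sum() / keep.sum().clamp(min=1.0)
```

Dropping (rather than merely clipping) the offending samples removes their
contribution to the estimate entirely, which is one simple way to keep a handful
of large importance ratios from inflating the variance of the surrogate objective.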
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
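As a rough illustration of an objective that combines a preference optimization loss with a supervised learning loss (the exact loss form and weighting in the cited paper may differ; the names and defaults below are assumptions), a DPO-style preference term plus an SFT log-likelihood term on the preferred response could look like:

```python
import torch
import torch.nn.functional as F

def preference_plus_sft_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             beta=0.1, sft_weight=1.0):
    """Illustrative combination of a DPO-style preference loss with an SFT
    (log-likelihood) term on the preferred response; not the cited paper's
    exact formulation. Inputs are per-example sums of token log-probs, (N,).
    """
    # Preference term: implicit reward margin between chosen and rejected.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    preference_loss = -F.logsigmoid(margin).mean()
    # Supervised term: keep likelihood of the preferred responses high.
    sft_loss = -logp_chosen.mean()
    return preference_loss + sft_weight * sft_loss
```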
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO).
Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
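A minimal sketch of the Q-weighting idea, assuming per-action denoising losses from a diffusion policy and a learned Q function; the softmax weighting and temperature below are illustrative assumptions, not QVPO's actual weight transform:

```python
import torch

def q_weighted_diffusion_loss(denoise_loss, q_values, temperature=1.0):
    """Sketch of a Q-weighted variational (denoising) loss for a diffusion
    policy: each sampled action's diffusion loss is weighted by a nonnegative
    weight that increases with its estimated Q-value, pulling the policy
    toward higher-value actions. The weight transform used by QVPO may differ.

    denoise_loss: per-action diffusion reconstruction losses, shape (N,)
    q_values:     critic estimates Q(s, a_i) for the same actions, shape (N,)
    """
    weights = torch.softmax(q_values / temperature, dim=0)  # nonnegative, sums to 1
    return (weights * denoise_loss).sum()
```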
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Adversarial Style Transfer for Robust Policy Optimization in Deep Reinforcement Learning [13.652106087606471]
This paper proposes an algorithm that aims to improve generalization for reinforcement learning agents by removing overfitting to confounding features.
A policy network updates its parameters to minimize the effect of such perturbations, thus staying robust while maximizing the expected future reward.
We evaluate our approach on Procgen and Distracting Control Suite for generalization and sample efficiency.
arXiv Detail & Related papers (2023-08-29T18:17:35Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- PTR-PPO: Proximal Policy Optimization with Prioritized Trajectory Replay [4.0388304511445146]
On-policy deep reinforcement learning algorithms have low data utilization and require significant experience for policy improvement.
This paper proposes proximal policy optimization with prioritized trajectory replay (PTR-PPO), which combines on-policy and off-policy methods to improve sampling efficiency.
We evaluate the performance of PTR-PPO in a set of Atari discrete control tasks, achieving state-of-the-art performance.
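A minimal sketch of the trajectory-level prioritized replay idea (the priority metrics and the way replayed trajectories enter the PPO update in PTR-PPO are not reproduced here; this only illustrates the buffer mechanics):

```python
import random
from collections import deque

class PrioritizedTrajectoryReplay:
    """Minimal sketch of trajectory-level prioritized replay: whole
    trajectories are stored with a scalar priority and sampled with
    probability proportional to that priority."""

    def __init__(self, capacity=256):
        self.trajectories = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def add(self, trajectory, priority):
        self.trajectories.append(trajectory)
        self.priorities.append(max(float(priority), 1e-6))

    def sample(self, k):
        # Sample k trajectories (with replacement) proportional to priority.
        return random.choices(list(self.trajectories),
                              weights=list(self.priorities), k=k)
```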
arXiv Detail & Related papers (2021-12-07T16:15:13Z)
- Hinge Policy Optimization: Rethinking Policy Improvement and Reinterpreting PPO [6.33198867705718]
Policy optimization is a fundamental principle for designing reinforcement learning algorithms.
Despite its superior empirical performance, PPO-clip has not been justified theoretically to date.
This is the first result proving global convergence to an optimal policy for a variant of PPO-clip.
arXiv Detail & Related papers (2021-10-26T15:56:57Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations".
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
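Schematically, the regularized objective adds a proximity penalty to the usual importance-sampling surrogate. In the sketch below the penalty is just a placeholder scalar, whereas the cited paper's contribution is an estimator of the divergence between discounted state-action visitation distributions; the function name and coefficient are assumptions.

```python
import torch

def divergence_regularized_surrogate(logp_new, logp_old, advantages,
                                     visitation_divergence, coef=1.0):
    """Schematic objective: importance-sampling surrogate minus a proximity
    penalty. `visitation_divergence` stands in for an estimate of how far the
    new policy's discounted state-action visitation distribution has drifted
    from the previous policy's; that estimator is not reproduced here."""
    ratio = torch.exp(logp_new - logp_old)       # importance weights
    surrogate = (ratio * advantages).mean()
    return surrogate - coef * visitation_divergence
```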
arXiv Detail & Related papers (2020-03-09T13:05:47Z)