PPO-UE: Proximal Policy Optimization via Uncertainty-Aware Exploration
- URL: http://arxiv.org/abs/2212.06343v1
- Date: Tue, 13 Dec 2022 02:51:43 GMT
- Title: PPO-UE: Proximal Policy Optimization via Uncertainty-Aware Exploration
- Authors: Qisheng Zhang, Zhen Guo, Audun Jøsang, Lance M. Kaplan, Feng Chen, Dong H. Jeong, Jin-Hee Cho
- Abstract summary: We propose PPO-UE, a PPO variant equipped with self-adaptive uncertainty-aware explorations.
Our proposed PPO-UE considerably outperforms the baseline PPO in Roboschool continuous control tasks.
- Score: 14.17825337817933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Proximal Policy Optimization (PPO) is a highly popular policy-based deep
reinforcement learning (DRL) approach. However, we observe that the homogeneous
exploration process in PPO could cause an unexpected stability issue in the
training phase. To address this issue, we propose PPO-UE, a PPO variant
equipped with self-adaptive uncertainty-aware explorations (UEs) based on a
ratio uncertainty level. The proposed PPO-UE is designed to improve convergence
speed and performance with an optimized ratio uncertainty level. Extensive
sensitivity analysis over the ratio uncertainty level shows that our proposed
PPO-UE considerably outperforms the baseline PPO in Roboschool continuous
control tasks.
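The abstract describes exploration gated by a "ratio uncertainty level" but does not spell out the mechanism, so the following is only a minimal Python sketch under stated assumptions: the uncertainty level is taken (hypothetically) to be the mean deviation of the importance ratio from 1, and extra Gaussian noise is injected into actions when that deviation exceeds a threshold. The function names, threshold, and noise model are illustrative, not the paper's formulation.

```python
# Minimal sketch of uncertainty-aware exploration on top of PPO's clipped
# surrogate. The uncertainty measure and noise injection below are assumptions
# made for illustration, not the exact PPO-UE mechanism.
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))

def ratio_uncertainty(ratio):
    """Hypothetical uncertainty level: mean absolute deviation of the
    importance ratio pi_new / pi_old from 1."""
    return float(np.mean(np.abs(ratio - 1.0)))

def maybe_explore(action, ratio, uncertainty_threshold=0.1, noise_scale=0.05,
                  rng=None):
    """Inject extra Gaussian exploration noise only when the ratio uncertainty
    exceeds the threshold; otherwise return the policy's action unchanged."""
    rng = rng or np.random.default_rng()
    if ratio_uncertainty(ratio) > uncertainty_threshold:
        return action + noise_scale * rng.standard_normal(action.shape)
    return action

# Toy usage with random data standing in for one rollout batch.
rng = np.random.default_rng(0)
ratio = rng.uniform(0.8, 1.2, size=256)   # pi_new / pi_old per sample
advantage = rng.standard_normal(256)      # advantage estimates
action = rng.standard_normal(6)           # a continuous-control action
print("surrogate  :", clipped_surrogate(ratio, advantage))
print("uncertainty:", ratio_uncertainty(ratio))
print("action     :", maybe_explore(action, ratio, rng=rng))
```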
Related papers
- PPO in the Fisher-Rao geometry [0.0]
Proximal Policy Optimization (PPO) has become a widely adopted algorithm for reinforcement learning.
Despite its popularity, PPO lacks formal theoretical guarantees for policy improvement and convergence.
In this paper, we derive a tighter surrogate in the Fisher-Rao geometry, yielding a novel variant, Fisher-Rao PPO (FR-PPO).
arXiv Detail & Related papers (2025-06-04T09:23:27Z) - On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm.
OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration.
Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z) - Gradient Imbalance in Direct Preference Optimization [26.964127989679596]
We propose Balanced-DPO, a simple yet effective modification to the DPO objective that introduces a computationally efficient gradient reweighting mechanism.
Our experiments demonstrate the effectiveness of Balanced-DPO, validating the theoretical findings and confirming that addressing gradient imbalance is key to improving DPO's performance.
arXiv Detail & Related papers (2025-02-28T08:47:03Z) - Beyond the Boundaries of Proximal Policy Optimization [17.577317574595206]
This work offers an alternative perspective on PPO, in which it is decomposed into the inner-loop estimation of update vectors.
We propose outer proximal policy optimization (outer-PPO), a framework wherein these update vectors are applied using an arbitrary gradient-based optimizer.
Methods are evaluated against an aggressively tuned PPO baseline on Brax, Jumanji and MinAtar environments.
arXiv Detail & Related papers (2024-11-01T15:29:10Z) - Understanding Reference Policies in Direct Preference Optimization [50.67309013764383]
Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs).
This work explores an under-investigated aspect of DPO - its dependency on the reference model or policy.
arXiv Detail & Related papers (2024-07-18T17:08:10Z) - Transductive Off-policy Proximal Policy Optimization [27.954910833441705]
This paper introduces a novel off-policy extension to the original PPO method, christened Transductive Off-policy PPO (ToPPO).
Our contribution includes a novel formulation of the policy improvement lower bound for prospective policies derived from off-policy data.
Comprehensive experimental results across six representative tasks underscore ToPPO's promising performance.
arXiv Detail & Related papers (2024-06-06T09:29:40Z) - Dropout Strategy in Reinforcement Learning: Limiting the Surrogate Objective Variance in Policy Optimization Methods [0.0]
Policy-based reinforcement learning algorithms are widely used in various fields.
These algorithms introduce importance sampling into policy iteration.
This can lead to high variance of the surrogate objective and indirectly affect the stability and convergence of the algorithm.
arXiv Detail & Related papers (2023-10-31T11:38:26Z) - Fine-Tuning Language Models with Advantage-Induced Policy Alignment [80.96507425217472]
We propose a novel algorithm for aligning large language models to human preferences.
We show that it consistently outperforms PPO in language tasks by a large margin.
We also provide a theoretical justification supporting the design of our loss function.
arXiv Detail & Related papers (2023-06-04T01:59:40Z) - Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO [66.5384483339413]
We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL).
We show that a trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training.
arXiv Detail & Related papers (2022-01-31T20:39:48Z) - You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option, as it can fail to effectively bound the ratios (the standard clipped surrogate that these variants modify is reproduced after this list for reference).
We show that the proposed ESPO algorithm can be easily scaled up to distributed training with many workers, delivering strong performance as well.
arXiv Detail & Related papers (2022-01-31T20:26:56Z) - CIM-PPO: Proximal Policy Optimization with Liu-Correntropy Induced Metric [0.5409700620900998]
Proximal Policy Optimization (PPO) is a popular Deep Reinforcement Learning (DRL) algorithm.
In this paper, we analyze the impact of asymmetry in KL divergence on PPO-KL.
We show that the asymmetry of the KL divergence can affect the policy improvement of PPO-KL, and that PPO-CIM can perform better than both PPO-KL and PPO-Clip in most tasks.
arXiv Detail & Related papers (2021-10-20T12:20:52Z) - Offline Policy Selection under Uncertainty [113.57441913299868]
We consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset.
Access to the full distribution over one's belief of the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics.
We show how the proposed BayesDICE approach may be used to rank policies with respect to an arbitrary downstream policy selection metric.
arXiv Detail & Related papers (2020-12-12T23:09:21Z) - Proximal Policy Optimization with Relative Pearson Divergence [8.071506311915396]
PPO clips the density ratio between the latest and baseline policies with a threshold, but its minimization target is unclear.
This paper proposes a new variant of PPO, called PPO-RPE, by considering a regularization problem of the relative Pearson (RPE) divergence.
Across four benchmark tasks, PPO-RPE performed as well as or better than the conventional methods in terms of the task performance of the learned policy.
arXiv Detail & Related papers (2020-10-07T09:11:22Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
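For reference, since several of the entries above modify PPO's probability ratio or its clipping (ratio clipping, independent-ratio bounds for decentralized PPO, the ratio uncertainty level in PPO-UE), the standard clipped surrogate objective from the original PPO paper (Schulman et al., 2017) is:

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$

Here $\hat{A}_t$ is the advantage estimate and $\epsilon$ the clipping threshold; $r_t(\theta)$ is the importance ratio that the methods above bound, regularize, or monitor.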
This list is automatically generated from the titles and abstracts of the papers on this site.