CIM-PPO: Proximal Policy Optimization with Liu-Correntropy Induced Metric
- URL: http://arxiv.org/abs/2110.10522v1
- Date: Wed, 20 Oct 2021 12:20:52 GMT
- Title: CIM-PPO: Proximal Policy Optimization with Liu-Correntropy Induced Metric
- Authors: Yunxiao Guo, Han Long, Xiaojun Duan, Kaiyuan Feng, Maochu Li, Xiaying Ma
- Abstract summary: As an algorithm based on deep reinforcement learning, Proximal Policy Optimization (PPO) performs well in many complex tasks.
Clip-PPO is widely used in a variety of practical scenarios and has attracted the attention of many researchers.
As a more theoretical algorithm, KL-PPO was neglected because its performance was not as good as Clip-PPO.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a deep reinforcement learning algorithm, Proximal Policy
Optimization (PPO) performs well in many complex tasks and has become one of
the most popular RL algorithms in recent years. According to the penalty
mechanism in the surrogate objective, PPO can be divided into PPO with KL
divergence (KL-PPO) and PPO with a clip function (Clip-PPO). Clip-PPO is
widely used in a variety of practical scenarios, has attracted the attention
of many researchers, and has therefore spawned many variants that keep
improving the algorithm. However, as the more theoretically grounded
algorithm, KL-PPO has been neglected because its performance is not as good
as Clip-PPO's. In this article, we analyze the asymmetry effect of KL
divergence on PPO's objective function and give an inequality that indicates
when this asymmetry will affect the efficiency of KL-PPO. We propose PPO with
a Correntropy Induced Metric (CIM-PPO), which applies the theory of
correntropy (a symmetric metric widely used in M-estimation to evaluate the
difference between two distributions) to PPO. We then design experiments
based on OpenAI Gym to test the effectiveness of the new algorithm and
compare it with KL-PPO and Clip-PPO.
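As a rough illustration of the penalty mechanisms the abstract contrasts, the NumPy sketch below writes out the clipped surrogate, the KL-penalty surrogate, and a correntropy-induced-metric (CIM) penalty built from the standard Gaussian-kernel definition of correntropy. The choice of applying the kernel to per-sample action probabilities, the bandwidth `sigma`, and the coefficient `beta` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the three surrogate/penalty forms discussed above.
# The CIM terms are assumptions in the spirit of CIM-PPO, not the paper's code.
import numpy as np

def clip_surrogate(ratio, adv, eps=0.2):
    """Clip-PPO objective: E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)]."""
    return np.mean(np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv))

def kl_surrogate(ratio, adv, logp_old, logp_new, beta=1.0):
    """KL-PPO objective: E[r * A] minus a KL(pi_old || pi_new) penalty,
    with the KL term estimated crudely from samples drawn under pi_old."""
    kl_estimate = np.mean(logp_old - logp_new)
    return np.mean(ratio * adv) - beta * kl_estimate

def cim_penalty(p_old, p_new, sigma=0.5):
    """Correntropy induced metric between two arrays of action probabilities:
    CIM(X, Y) = sqrt(kappa(0) - E[kappa(X - Y)]) with a Gaussian kernel kappa,
    so kappa(0) = 1 for the unnormalized kernel used here."""
    kappa = np.exp(-((p_old - p_new) ** 2) / (2.0 * sigma ** 2))
    return np.sqrt(np.maximum(1.0 - np.mean(kappa), 0.0))

def cim_surrogate(ratio, adv, p_old, p_new, beta=1.0, sigma=0.5):
    """Assumed CIM-PPO-style objective: the asymmetric KL penalty of KL-PPO
    is swapped for the symmetric CIM penalty."""
    return np.mean(ratio * adv) - beta * cim_penalty(p_old, p_new, sigma)
```

Because the Gaussian kernel depends only on the difference between its arguments, the CIM penalty is symmetric in the old and new policies, which is the property the abstract contrasts with the asymmetry of KL divergence.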
Related papers
- TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization [73.16975077770765]
Recent advancements in reinforcement learning have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO). It is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO). This work decomposes PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance.
arXiv Detail & Related papers (2025-06-17T14:30:06Z) - PPO in the Fisher-Rao geometry [0.0]
Proximal Policy Optimization (PPO) has become a widely adopted algorithm for reinforcement learning. Despite its popularity, PPO lacks formal theoretical guarantees for policy improvement and convergence. In this paper, we derive a tighter surrogate in the Fisher-Rao geometry, yielding a novel variant, Fisher-Rao PPO (FR-PPO).
arXiv Detail & Related papers (2025-06-04T09:23:27Z) - Understanding Reference Policies in Direct Preference Optimization [50.67309013764383]
Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs)
This work explores an under-investigated aspect of DPO - its dependency on the reference model or policy.
arXiv Detail & Related papers (2024-07-18T17:08:10Z) - Transductive Off-policy Proximal Policy Optimization [27.954910833441705]
This paper introduces a novel off-policy extension to the original PPO method, christened Transductive Off-policy PPO (ToPPO)
Our contribution includes a novel formulation of the policy improvement lower bound for prospective policies derived from off-policy data.
Comprehensive experimental results across six representative tasks underscore ToPPO's promising performance.
arXiv Detail & Related papers (2024-06-06T09:29:40Z) - Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO)
Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z) - A dynamical clipping approach with task feedback for Proximal Policy Optimization [29.855219523565786]
There is no theoretical proof that the optimal PPO clipping bound remains consistent throughout the entire training process.
Past studies have aimed to dynamically adjust the PPO clipping bound to enhance PPO's performance.
We propose Preference-based Proximal Policy Optimization (Pb-PPO) to better reflect the preference (maximizing return) of reinforcement learning tasks.
arXiv Detail & Related papers (2023-12-12T06:35:56Z) - PPO-UE: Proximal Policy Optimization via Uncertainty-Aware Exploration [14.17825337817933]
We propose PPO-UE, a PPO variant equipped with self-adaptive uncertainty-aware explorations.
Our proposed PPO-UE considerably outperforms the baseline PPO in Roboschool continuous control tasks.
arXiv Detail & Related papers (2022-12-13T02:51:43Z) - Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO [66.5384483339413]
We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL)
We show that a trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training.
arXiv Detail & Related papers (2022-01-31T20:39:48Z) - You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option, as it can fail to effectively bound the ratios.
The proposed alternative, ESPO, can be easily scaled up to distributed training with many workers while delivering strong performance as well (a sketch of an early-stopping alternative to clipping appears after this list).
arXiv Detail & Related papers (2022-01-31T20:26:56Z) - Hinge Policy Optimization: Rethinking Policy Improvement and Reinterpreting PPO [6.33198867705718]
Policy optimization is a fundamental principle for designing reinforcement learning algorithms.
Despite its superior empirical performance, PPO-clip has not been justified via theoretical proof to date.
This is the first work to prove global convergence to an optimal policy for a variant of PPO-clip.
arXiv Detail & Related papers (2021-10-26T15:56:57Z) - Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach [128.62787284435007]
We propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture.
We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence.
In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors.
arXiv Detail & Related papers (2021-05-18T04:35:41Z) - On Proximal Policy Optimization's Heavy-tailed Gradients [150.08522793940708]
We study the heavy-tailed nature of the gradients of the Proximal Policy Optimization surrogate reward function.
In this paper, we study the effects of the standard PPO clipping tricks, demonstrating that they primarily serve to offset the heavy-tailedness of the gradients.
We propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks.
arXiv Detail & Related papers (2021-02-20T05:51:28Z) - Proximal Policy Optimization with Relative Pearson Divergence [8.071506311915396]
PPO clips the density ratio between the latest and baseline policies at a threshold, but its minimization target is unclear.
This paper proposes a new variant of PPO, called PPO-RPE, that regularizes with the relative Pearson (RPE) divergence.
Across four benchmark tasks, PPO-RPE performed as well as or better than the conventional methods in terms of the task performance of the learned policy.
arXiv Detail & Related papers (2020-10-07T09:11:22Z)
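Several of the entries above revolve around how the probability ratio between the updated policy and the data-collecting policy is kept under control. As referenced in the "You May Not Need Ratio Clipping in PPO" entry, the sketch below illustrates an early-stopping alternative to ratio clipping: surrogate-optimization epochs halt once the sampled ratios drift too far from 1. The deviation statistic, the threshold `delta`, and the helper callables `take_gradient_step` and `current_logp` are hypothetical stand-ins, not that paper's exact algorithm.

```python
# Illustrative sketch only: stop the PPO optimization epochs by monitoring ratio
# drift instead of clipping the ratios inside the surrogate objective.
import numpy as np

def ratio_deviation(logp_new, logp_old):
    """Mean absolute deviation of the probability ratio r = pi_new / pi_old from 1."""
    return np.mean(np.abs(np.exp(logp_new - logp_old) - 1.0))

def optimize_with_early_stopping(minibatches, take_gradient_step, current_logp,
                                 old_logp, max_epochs=10, delta=0.25):
    """Run surrogate-optimization epochs on the collected rollout, but stop early
    once the sampled ratios drift further than `delta` from 1 on average."""
    for epoch in range(max_epochs):
        for batch in minibatches:
            take_gradient_step(batch)          # unclipped surrogate gradient step
        if ratio_deviation(current_logp(), old_logp) > delta:
            return epoch + 1                   # number of epochs actually used
    return max_epochs
```

Here `current_logp` is assumed to return the current policy's log-probabilities on the rollout actions, and `old_logp` is the fixed array recorded at collection time; checking the deviation once per epoch simply keeps the sketch short.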
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.