CIM-PPO:Proximal Policy Optimization with Liu-Correntropy Induced Metric
- URL: http://arxiv.org/abs/2110.10522v1
- Date: Wed, 20 Oct 2021 12:20:52 GMT
- Title: CIM-PPO:Proximal Policy Optimization with Liu-Correntropy Induced Metric
- Authors: Yunxiao Guo, Han Long, Xiaojun Duan, Kaiyuan Feng, Maochu Li, Xiaying
Ma
- Abstract summary: As an algorithm based on deep reinforcement learning, Proximal Policy Optimization (PPO) performs well in many complex tasks.
Clip-PPO is widely used in a variety of practical scenarios and has attracted the attention of many researchers.
As a more theoretical algorithm, KL-PPO was neglected because its performance was not as good as CliP-PPO.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As an algorithm based on deep reinforcement learning, Proximal Policy
Optimization (PPO) performs well in many complex tasks and has become one of
the most popular RL algorithms in recent years. According to the mechanism of
penalty in surrogate objective, PPO can be divided into PPO with KL Divergence
(KL-PPO) and PPO with Clip function(Clip-PPO). Clip-PPO is widely used in a
variety of practical scenarios and has attracted the attention of many
researchers. Therefore, many variations have also been created, making the
algorithm better and better. However, as a more theoretical algorithm, KL-PPO
was neglected because its performance was not as good as CliP-PPO. In this
article, we analyze the asymmetry effect of KL divergence on PPO's objective
function , and give the inequality that can indicate when the asymmetry will
affect the efficiency of KL-PPO. Proposed PPO with Correntropy Induced Metric
algorithm(CIM-PPO) that use the theory of correntropy(a symmetry metric method
that was widely used in M-estimation to evaluate two distributions'
difference)and applied it in PPO. Then, we designed experiments based on
OpenAIgym to test the effectiveness of the new algorithm and compare it with
KL-PPO and CliP-PPO.
Related papers
- Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study [16.99550556866219]
Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences.
In academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO)
We show that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.
arXiv Detail & Related papers (2024-04-16T16:51:53Z) - Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for
LLM Alignment [37.52249093928251]
This paper proposes a new framework, reinforcement learning with relative feedback, and a novel trajectory-wise policy gradient algorithm.
We show theoretically that P3O is invariant to equivalent rewards and avoids the complexity of PPO.
Empirical evaluations demonstrate that P3O outperforms PPO in the KL-Reward trade-off and can align with human preferences as well as or better than prior methods.
arXiv Detail & Related papers (2023-09-30T01:23:22Z) - Fine-Tuning Language Models with Advantage-Induced Policy Alignment [80.96507425217472]
We propose a novel algorithm for aligning large language models to human preferences.
We show that it consistently outperforms PPO in language tasks by a large margin.
We also provide a theoretical justification supporting the design of our loss function.
arXiv Detail & Related papers (2023-06-04T01:59:40Z) - Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios.
We show that ESPO can be easily scaled up to distributed training with many workers, delivering strong performance as well.
arXiv Detail & Related papers (2022-01-31T20:26:56Z) - Permutation Invariant Policy Optimization for Mean-Field Multi-Agent
Reinforcement Learning: A Principled Approach [128.62787284435007]
We propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture.
We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence.
In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors.
arXiv Detail & Related papers (2021-05-18T04:35:41Z) - The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games [67.47961797770249]
Multi-Agent PPO (MAPPO) is a multi-agent PPO variant which adopts a centralized value function.
We show that MAPPO achieves performance comparable to the state-of-the-art in three popular multi-agent testbeds.
arXiv Detail & Related papers (2021-03-02T18:59:56Z) - Proximal Policy Optimization Smoothed Algorithm [0.0]
We present a PPO variant, named Proximal Policy Optimization Smooth Algorithm (PPOS)
Its critical improvement is the use of a functional clipping method instead of a flat clipping method.
We show that it outperforms the latest PPO variants on both performance and stability in challenging continuous control tasks.
arXiv Detail & Related papers (2020-12-04T07:43:50Z) - Proximal Policy Optimization with Relative Pearson Divergence [8.071506311915396]
PPO clips density ratio of latest and baseline policies with a threshold, while its minimization target is unclear.
This paper proposes a new variant of PPO by considering a regularization problem of relative Pearson (RPE) divergence, so-called PPO-RPE.
Through four benchmark tasks, PPO-RPE performed as well as or better than the conventional methods in terms of the task performance by the learned policy.
arXiv Detail & Related papers (2020-10-07T09:11:22Z) - Implementation Matters in Deep Policy Gradients: A Case Study on PPO and
TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations:"
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.