Acceleration in Policy Optimization
- URL: http://arxiv.org/abs/2306.10587v2
- Date: Tue, 5 Sep 2023 23:11:42 GMT
- Title: Acceleration in Policy Optimization
- Authors: Veronica Chelu, Tom Zahavy, Arthur Guez, Doina Precup, Sebastian
Flennerhag
- Abstract summary: We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
- Score: 50.323182853069184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We work towards a unifying paradigm for accelerating policy optimization
methods in reinforcement learning (RL) by integrating foresight in the policy
improvement step via optimistic and adaptive updates. Leveraging the connection
between policy iteration and policy gradient methods, we view policy
optimization algorithms as iteratively solving a sequence of surrogate
objectives, local lower bounds on the original objective. We define optimism as
predictive modelling of the future behavior of a policy, and adaptivity as
taking immediate and anticipatory corrective actions to mitigate accumulating
errors from overshooting predictions or delayed responses to change. We use
this shared lens to jointly express other well-known algorithms, including
model-based policy improvement based on forward search, and optimistic
meta-learning algorithms. We analyze properties of this formulation, and show
connections to other accelerated optimization algorithms. Then, we design an
optimistic policy gradient algorithm, adaptive via meta-gradient learning, and
empirically highlight several design choices pertaining to acceleration, in an
illustrative task.
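To make the abstract concrete, below is a minimal numpy sketch of one way an optimistic, adaptive policy-gradient step can look: the gradient is extrapolated one step ahead (optimism), and the extrapolation coefficient `lam` is adjusted by a one-step meta-gradient (adaptivity). The extrapolation form, the toy objective, and the meta-update rule are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal numpy sketch of an optimistic, adaptive policy-gradient step.
# The extrapolated-gradient form and the meta-learned optimism coefficient
# `lam` are illustrative assumptions, not the paper's exact algorithm.
import numpy as np

def objective_grad(theta):
    # Stand-in for a policy-gradient estimate of J(theta); here the gradient
    # of a toy concave surrogate -||theta - theta_star||^2.
    theta_star = np.array([1.0, -2.0])
    return -2.0 * (theta - theta_star)

theta = np.zeros(2)
prev_grad = objective_grad(theta)
lam, eta, meta_lr = 0.5, 0.05, 0.01

for t in range(200):
    grad = objective_grad(theta)
    # Optimism: extrapolate the next gradient from the last observed change.
    optimistic_grad = grad + lam * (grad - prev_grad)
    theta_next = theta + eta * optimistic_grad
    # Adaptivity: nudge lam with a one-step meta-gradient, i.e. the alignment
    # between the prediction (grad - prev_grad) and the realized next gradient.
    realized = objective_grad(theta_next)
    lam += meta_lr * eta * np.dot(grad - prev_grad, realized)
    lam = float(np.clip(lam, 0.0, 1.0))
    theta, prev_grad = theta_next, grad
```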
Related papers
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads in practice to a compromised, mean-seeking approximation of that solution.
We propose efficient exact optimization (EXO) of the alignment objective.
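For reference, here is a minimal PyTorch sketch of the standard DPO loss this entry analyzes; the tensor values are placeholders, and EXO's own objective is not reproduced here.

```python
# Minimal PyTorch sketch of the standard DPO loss this entry builds on.
# `logp_*` are summed token log-probabilities from the policy and a frozen
# reference model; beta is the usual KL-strength hyperparameter.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward margins: log-ratio of policy to reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Maximize the probability that the chosen response out-scores the rejected one.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                torch.tensor([-13.0]), torch.tensor([-15.1]))
```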
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
- Gradient Informed Proximal Policy Optimization [35.22712034665224]
We introduce a novel policy learning method that integrates analytical gradients from differentiable environments with the Proximal Policy Optimization (PPO) algorithm.
By adaptively modifying the blending coefficient alpha, the method manages the influence of the analytical policy gradients during learning.
Our proposed approach outperforms baseline algorithms in various scenarios, such as function optimization, physics simulations, and traffic control environments.
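A rough sketch of the blending idea, assuming a simple convex combination; the paper adapts the coefficient alpha during learning rather than fixing it.

```python
# Sketch of the core blending idea: interpolate between the likelihood-ratio
# (PPO) gradient and an analytical gradient from a differentiable simulator.
# The fixed interpolation below is an assumption; the paper adapts alpha.
import numpy as np

def blended_gradient(ppo_grad, analytical_grad, alpha):
    # alpha = 0 recovers pure PPO; alpha = 1 trusts the simulator gradient fully.
    return (1.0 - alpha) * ppo_grad + alpha * analytical_grad

g = blended_gradient(np.array([0.2, -0.1]), np.array([0.25, -0.4]), alpha=0.3)
```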
arXiv Detail & Related papers (2023-12-14T07:50:21Z)
- Policy Gradient Algorithms Implicitly Optimize by Continuation [7.351769270728942]
We argue that exploration in policy-gradient algorithms amounts to computing a continuation of the return of the policy at hand, and that policies should be history-dependent in order to maximize the return.
arXiv Detail & Related papers (2023-05-11T14:50:20Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- A Parametric Class of Approximate Gradient Updates for Policy Optimization [47.69337420768319]
We develop a unified perspective that re-expresses the underlying updates in terms of a limited choice of gradient form and scaling function.
We obtain novel yet well motivated updates that generalize existing algorithms in a way that can deliver benefits both in terms of convergence speed and final result quality.
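As a hedged illustration of such a parameterized family, the sketch below exposes a gradient form and a scaling function as interchangeable components; the two example choices are assumptions, not the paper's actual parameterization.

```python
# Illustrative numpy sketch of an update family parameterized by a "gradient
# form" and a "scaling function", in the spirit of this entry; the example
# choices below are assumptions, not the paper's exact parameterization.
import numpy as np

def update(score, q_value, grad_form, scale_fn, lr=0.1):
    # score: grad_theta log pi(a|s); q_value: action-value estimate.
    return lr * scale_fn(q_value) * grad_form(score, q_value)

def reinforce_form(score, q):     # classic policy-gradient form
    return q * score

def exp_scale(q):                 # e.g. an exponentiated weighting
    return np.exp(np.clip(q, -5.0, 5.0))

delta = update(np.array([0.5, -1.0]), 0.8, reinforce_form, exp_scale)
```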
arXiv Detail & Related papers (2022-06-17T01:28:38Z)
- Near Optimal Policy Optimization via REPS [33.992374484681704]
Relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains.
However, there are no guarantees on REPS's performance when it is used with gradient-based solvers.
We introduce a technique that uses generative access to the underlying decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
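A minimal sketch of the REPS-style reweighting underlying such updates, assuming the usual exponentiated-advantage weights; the temperature `eta` and the helper names are placeholders.

```python
# Minimal numpy sketch of the REPS-style reweighting: state-action samples are
# weighted by exp(advantage / eta), with eta set by the relative-entropy
# constraint, and the policy is then fit to the weighted samples.
import numpy as np

def reps_weights(advantages, eta=1.0):
    # Subtract the max for numerical stability; weights are then normalized.
    w = np.exp((advantages - advantages.max()) / eta)
    return w / w.sum()

adv = np.array([0.1, 1.3, -0.4, 0.8])
weights = reps_weights(adv)
# A new policy would be fit by weighted maximum likelihood on the sampled actions.
```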
arXiv Detail & Related papers (2021-03-17T16:22:59Z)
- Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
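A toy sketch of the iterative-amortization idea: refine the policy's distribution parameters with a few inner gradient steps instead of emitting them in a single forward pass. The quadratic stand-in for the critic is an assumption.

```python
# Sketch of iterative (rather than direct) amortization: instead of emitting
# distribution parameters in one forward pass, refine them with a few gradient
# steps on the estimated objective. The quadratic stand-in Q is an assumption.
import numpy as np

def q_grad(action_mean):
    return -2.0 * (action_mean - 0.7)  # gradient of a toy critic peaking at 0.7

mu = np.zeros(3)                 # initial Gaussian mean, e.g. from a network
for _ in range(5):               # a handful of inner refinement steps
    mu = mu + 0.3 * q_grad(mu)   # gradient ascent on the value estimate
```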
arXiv Detail & Related papers (2020-10-20T23:25:42Z)
- Optimistic Distributionally Robust Policy Optimization [2.345728642535161]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are prone to converge to a sub-optimal solution as they limit policy representation to a particular parametric distribution class.
We develop an innovative Optimistic Distributionally Robust Policy Optimization (ODRO) algorithm to solve the trust region constrained optimization problem without parameterizing the policies.
Our algorithm improves on TRPO and PPO with higher sample efficiency and better final-policy performance while retaining learning stability.
arXiv Detail & Related papers (2020-06-14T06:36:18Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
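A short PyTorch sketch of a proximity-regularized surrogate in this spirit; penalizing a per-state KL between consecutive policies is a simplification of the visitation-distribution constraint described above.

```python
# PyTorch sketch of a proximity-regularized surrogate: a clipped PPO term plus
# a divergence penalty tying the new policy to the previous one. Penalizing a
# per-state KL is a simplification of the visitation-level constraint above.
import torch

def regularized_loss(ratio, adv, kl_prev_new, clip=0.2, beta=1.0):
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip)
    ppo_term = torch.min(ratio * adv, clipped * adv)
    return -(ppo_term - beta * kl_prev_new).mean()

loss = regularized_loss(torch.tensor([1.1, 0.9]), torch.tensor([0.5, -0.2]),
                        torch.tensor([0.02, 0.01]))
```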
arXiv Detail & Related papers (2020-03-09T13:05:47Z)