A general class of surrogate functions for stable and efficient
reinforcement learning
- URL: http://arxiv.org/abs/2108.05828v5
- Date: Tue, 31 Oct 2023 00:56:56 GMT
- Title: A general class of surrogate functions for stable and efficient
reinforcement learning
- Authors: Sharan Vaswani, Olivier Bachem, Simone Totaro, Robert Mueller, Shivam
Garg, Matthieu Geist, Marlos C. Machado, Pablo Samuel Castro, Nicolas Le Roux
- Abstract summary: We propose a general framework based on functional mirror ascent that gives rise to an entire family of surrogate functions.
We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions.
The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate.
- Score: 45.31904153659212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Common policy gradient methods rely on the maximization of a sequence of
surrogate functions. In recent years, many such surrogate functions have been
proposed, most without strong theoretical guarantees, leading to algorithms
such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we
instead propose a general framework (FMA-PG) based on functional mirror ascent
that gives rise to an entire family of surrogate functions. We construct
surrogate functions that enable policy improvement guarantees, a property not
shared by most existing surrogate functions. Crucially, these guarantees hold
regardless of the choice of policy parameterization. Moreover, a particular
instantiation of FMA-PG recovers important implementation heuristics (e.g.,
using forward vs reverse KL divergence) resulting in a variant of TRPO with
additional desirable properties. Via experiments on simple bandit problems, we
evaluate the algorithms instantiated by FMA-PG. The proposed framework also
suggests an improved variant of PPO, whose robustness and efficiency we
empirically demonstrate on the MuJoCo suite.
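To make the surrogate family concrete, here is a minimal sketch of one instantiation in the spirit of the abstract: a linearized return plus a reverse-KL penalty to the previous policy (what functional mirror ascent with a negative-entropy mirror map yields), applied to a simple bandit with a softmax policy. The function names, step sizes, and the numerical inner loop are illustrative assumptions, not the authors' implementation.
```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def surrogate(theta, theta_old, rewards, eta):
    """Linearized expected reward around pi_old, minus a reverse-KL penalty.

    surrogate(theta) = <pi_theta, rewards> - (1/eta) * KL(pi_theta || pi_old)
    Repeatedly maximizing this is one instance of mirror ascent on the policy
    with the negative-entropy mirror map (an illustrative assumption here).
    """
    pi, pi_old = softmax(theta), softmax(theta_old)
    kl = np.sum(pi * (np.log(pi + 1e-12) - np.log(pi_old + 1e-12)))
    return pi @ rewards - kl / eta

def policy_update(theta_old, rewards, eta, inner_steps=100, lr=0.5):
    """Approximately maximize the surrogate with plain (numerical) gradient ascent."""
    theta = theta_old.copy()
    for _ in range(inner_steps):
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            e = np.zeros_like(theta)
            e[i] = 1e-5
            grad[i] = (surrogate(theta + e, theta_old, rewards, eta)
                       - surrogate(theta - e, theta_old, rewards, eta)) / 2e-5
        theta = theta + lr * grad
    return theta

if __name__ == "__main__":
    rewards = np.array([0.1, 0.5, 0.9])   # mean rewards of a 3-armed bandit
    theta = np.zeros(3)
    for _ in range(20):                    # outer loop: one surrogate per policy
        theta = policy_update(theta, rewards, eta=1.0)
    print(softmax(theta))                  # mass concentrates on the best arm (index 2)
```
Swapping the penalty for a forward KL, or changing the mirror map, gives other members of the same surrogate family, which is the kind of choice the framework makes explicit.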
Related papers
- Optimization Solution Functions as Deterministic Policies for Offline Reinforcement Learning [7.07623669995408]
We propose an implicit actor-critic (iAC) framework that employs optimization solution functions as a deterministic policy (actor) and a monotone function over the optimal value of the optimization problem as a critic.
We show that the learned policies are robust to the suboptimality of the learned actor parameters via the exponentially decaying sensitivity (EDS) property.
We validate the proposed framework on two real-world applications and show a significant improvement over state-of-the-art (SOTA) offline RL methods.
arXiv Detail & Related papers (2024-08-27T19:04:32Z)
- Optimistic Multi-Agent Policy Gradient [23.781837938235036]
Relative overgeneralization (RO) occurs when agents converge towards a suboptimal joint policy.
No methods have been proposed for addressing RO in multi-agent policy gradient (MAPG) methods.
We propose a general, yet simple, framework to enable optimistic updates in MAPG methods that alleviate the RO problem.
arXiv Detail & Related papers (2023-11-03T14:47:54Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Model-Based Decentralized Policy Optimization [27.745312627153012]
Decentralized policy optimization has been commonly used in cooperative multi-agent tasks.
We propose model-based decentralized policy optimization (MDPO).
We theoretically show that the policy optimization of MDPO is more stable than that of model-free decentralized policy optimization.
arXiv Detail & Related papers (2023-02-16T08:15:18Z)
- Mono-surrogate vs Multi-surrogate in Multi-objective Bayesian Optimisation [0.0]
We build a surrogate model for each objective function and show that the scalarising function distribution is not Gaussian.
Results and comparison with existing approaches on standard benchmark and real-world optimisation problems show the potential of the multi-surrogate approach.
arXiv Detail & Related papers (2022-05-02T09:25:04Z)
- CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning [14.999515900425305]
We propose a Conservative Update Policy with a theoretical safety guarantee.
We provide a rigorous theoretical analysis that extends the surrogate functions to generalized advantage estimation (GAE).
Experiments show the effectiveness of CUP in designing safe constraints.
arXiv Detail & Related papers (2022-02-15T16:49:28Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
- Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach [128.62787284435007]
We propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture.
We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence.
In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors.
arXiv Detail & Related papers (2021-05-18T04:35:41Z)
- Multi-agent Policy Optimization with Approximatively Synchronous Advantage Estimation [55.96893934962757]
In multi-agent systems, the policies of different agents need to be evaluated jointly.
In current methods, value functions or advantage functions use counterfactual joint actions, which are evaluated asynchronously.
In this work, we propose approximatively synchronous advantage estimation.
arXiv Detail & Related papers (2020-12-07T07:29:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.