Trust-Region-Free Policy Optimization for Stochastic Policies
- URL: http://arxiv.org/abs/2302.07985v1
- Date: Wed, 15 Feb 2023 23:10:06 GMT
- Title: Trust-Region-Free Policy Optimization for Stochastic Policies
- Authors: Mingfei Sun, Benjamin Ellis, Anuj Mahajan, Sam Devlin, Katja Hofmann,
Shimon Whiteson
- Abstract summary: We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree) as it is free of any explicit trust region constraints.
- Score: 60.52463923712565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Trust Region Policy Optimization (TRPO) is an iterative method that
simultaneously maximizes a surrogate objective and enforces a trust region
constraint over consecutive policies in each iteration. The combination of the
surrogate objective maximization and the trust region enforcement has been
shown to be crucial to guarantee a monotonic policy improvement. However,
solving a trust-region-constrained optimization problem can be computationally
intensive as it requires many steps of conjugate gradient and a large number of
on-policy samples. In this paper, we show that the trust region constraint over
policies can be safely substituted by a trust-region-free constraint without
compromising the underlying monotonic improvement guarantee. The key idea is to
generalize the surrogate objective used in TRPO in a way that a monotonic
improvement guarantee still emerges as a result of constraining the maximum
advantage-weighted ratio between policies. This new constraint outlines a
conservative mechanism for iterative policy optimization and sheds light on
practical ways to optimize the generalized surrogate objective. We show that
the new constraint can be effectively enforced by being conservative when
optimizing the generalized objective function in practice. We call the
resulting algorithm Trust-REgion-Free Policy Optimization (TREFree) as it is
free of any explicit trust region constraints. Empirical results show that
TREFree outperforms TRPO and Proximal Policy Optimization (PPO) in terms of
policy performance and sample efficiency.
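For orientation, the per-iteration problem that TRPO solves (and that the abstract contrasts against) can be written in the standard form below. The final comment is only a schematic paraphrase of the "maximum advantage-weighted ratio" constraint, since the abstract does not spell out its exact expression.

```latex
% Standard TRPO iteration: maximize the advantage-weighted surrogate under a KL trust region.
\[
  \max_{\pi}\; L_{\pi_k}(\pi)
    = \mathbb{E}_{s,a \sim \pi_k}\!\left[
        \frac{\pi(a \mid s)}{\pi_k(a \mid s)}\, A^{\pi_k}(s,a)
      \right]
  \quad \text{s.t.} \quad
  \mathbb{E}_{s \sim \pi_k}\!\left[
    D_{\mathrm{KL}}\!\bigl(\pi_k(\cdot \mid s)\,\Vert\,\pi(\cdot \mid s)\bigr)
  \right] \le \delta .
\]
% The trust region is what yields the usual monotonic-improvement lower bound:
\[
  \eta(\pi) \;\ge\; L_{\pi_k}(\pi)
    \;-\; C \,\max_{s} D_{\mathrm{KL}}\!\bigl(\pi_k(\cdot \mid s)\,\Vert\,\pi(\cdot \mid s)\bigr).
\]
% Schematic only: TREFree instead constrains a maximum advantage-weighted ratio,
% roughly of the form
%   \max_{s,a} \frac{\pi(a \mid s)}{\pi_k(a \mid s)}\,\lvert A^{\pi_k}(s,a)\rvert \le \epsilon,
% and shows that a monotonic improvement guarantee still follows.
```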
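The abstract's final claim, that the new constraint can be enforced "by being conservative when optimizing the generalized objective function in practice", follows the same general pattern that PPO popularized. The sketch below illustrates that pattern with standard PPO-style ratio clipping; it is not TREFree's actual mechanism, and the tensor names and hyperparameters are illustrative assumptions.

```python
# A minimal sketch of "being conservative when optimizing a surrogate objective".
# This uses standard PPO-style ratio clipping as the conservative mechanism; it is
# NOT the TREFree constraint from the paper, only an illustration of the general
# pattern of limiting how far the advantage-weighted ratio can push each update.
import torch


def conservative_surrogate_loss(
    log_probs_new: torch.Tensor,   # log pi(a|s) under the current policy
    log_probs_old: torch.Tensor,   # log pi_old(a|s), fixed from the sampling policy
    advantages: torch.Tensor,      # advantage estimates A^{pi_old}(s, a)
    clip_eps: float = 0.2,         # how conservative each update is allowed to be
) -> torch.Tensor:
    """Clipped advantage-weighted surrogate (PPO-style), to be minimized."""
    ratio = torch.exp(log_probs_new - log_probs_old)            # pi / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum keeps the objective from rewarding ratios
    # that move far from 1, which is what makes the update conservative.
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    # Toy usage with random data standing in for a rollout batch.
    torch.manual_seed(0)
    logp_old = torch.randn(64)
    logp_new = (logp_old + 0.1 * torch.randn(64)).requires_grad_(True)
    adv = torch.randn(64)
    loss = conservative_surrogate_loss(logp_new, logp_old, adv)
    loss.backward()
    print(f"surrogate loss: {loss.item():.4f}")
```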
Related papers
- Embedding Safety into RL: A New Take on Trust Region Methods [1.5733417396701983]
Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to unsafe behaviors.
We propose Constrained Trust Region Policy Optimization (C-TRPO), a novel approach that modifies the geometry of the policy space based on the safety constraints.
arXiv Detail & Related papers (2024-11-05T09:55:50Z)
- Guaranteed Trust Region Optimization via Two-Phase KL Penalization [11.008537121214104]
We show that applying KL penalization alone is nearly sufficient to enforce trust regions.
We then show that introducing a "fixup" phase is sufficient to guarantee a trust region is enforced on every policy update.
The resulting algorithm, which we call FixPO, is able to train a variety of policy architectures and action spaces.
arXiv Detail & Related papers (2023-12-08T23:29:57Z)
- Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, assuming no approximation or sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
arXiv Detail & Related papers (2023-11-15T13:16:16Z)
- Provably Convergent Policy Optimization via Metric-aware Trust Region Methods [21.950484108431944]
Trust-region methods are pervasively used to stabilize policy optimization in reinforcement learning.
We exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions, namely Wasserstein policy optimization (WPO) and Sinkhorn policy optimization (SPO).
We show that WPO guarantees a monotonic performance improvement, and that SPO provably converges to WPO as the entropic regularizer diminishes.
arXiv Detail & Related papers (2023-06-25T05:41:38Z)
- Feasible Policy Iteration [28.29623882912745]
We propose an indirect safe RL framework called feasible policy iteration.
It guarantees that the feasible region monotonically expands and converges to the maximum one.
Experiments show that our algorithm learns strictly safe and near-optimal policies with accurate feasible regions.
arXiv Detail & Related papers (2023-04-18T09:18:37Z)
- Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO [66.5384483339413]
We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL).
We show that a trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training.
arXiv Detail & Related papers (2022-01-31T20:39:48Z)
- Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with guarantees of convergence to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.