Guaranteed Trust Region Optimization via Two-Phase KL Penalization
- URL: http://arxiv.org/abs/2312.05405v1
- Date: Fri, 8 Dec 2023 23:29:57 GMT
- Title: Guaranteed Trust Region Optimization via Two-Phase KL Penalization
- Authors: K.R. Zentner, Ujjwal Puri, Zhehui Huang, Gaurav S. Sukhatme
- Abstract summary: We show that applying KL penalization alone is nearly sufficient to enforce trust regions.
We then show that introducing a "fixup" phase is sufficient to guarantee a trust region is enforced on every policy update.
The resulting algorithm, which we call FixPO, is able to train a variety of policy architectures and action spaces.
- Score: 11.008537121214104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On-policy reinforcement learning (RL) has become a popular framework for
solving sequential decision problems due to its computational efficiency and
theoretical simplicity. Some on-policy methods guarantee every policy update is
constrained to a trust region relative to the prior policy to ensure training
stability. These methods often require computationally intensive non-linear
optimization or require a particular form of action distribution. In this work,
we show that applying KL penalization alone is nearly sufficient to enforce
such trust regions. Then, we show that introducing a "fixup" phase is
sufficient to guarantee a trust region is enforced on every policy update while
adding fewer than 5% additional gradient steps in practice. The resulting
algorithm, which we call FixPO, is able to train a variety of policy
architectures and action spaces, is easy to implement, and produces results
competitive with other trust region methods.
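The abstract describes a two-phase scheme: ordinary KL-penalized policy-gradient steps, followed by a short "fixup" phase that keeps taking gradient steps on the KL penalty alone until the trust region bound holds. The PyTorch-style sketch below is one plausible reading of that description, not the authors' implementation; the interface (`policy.distribution`), the hyper-parameters (`kl_target`, `beta`), and the loop structure are assumptions.

```python
import torch

def fixpo_style_update(policy, optimizer, batch, kl_target=0.01, beta=3.0,
                       epochs=10, max_fixup_steps=100):
    """Two-phase update: KL-penalized policy optimization, then KL-only 'fixup' steps."""
    obs, actions, advantages, old_dist = batch  # old_dist: frozen (detached) pre-update policy

    def mean_kl():
        # KL(old || new), averaged over the batch.
        return torch.distributions.kl_divergence(old_dist, policy.distribution(obs)).mean()

    # Phase 1: maximize the importance-weighted advantage, penalized by the KL term.
    for _ in range(epochs):
        new_dist = policy.distribution(obs)
        ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
        surrogate = -(ratio * advantages).mean()
        loss = surrogate + beta * torch.distributions.kl_divergence(old_dist, new_dist).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Phase 2: "fixup" -- extra gradient steps on the KL penalty only,
    # so the trust region bound holds before new data is collected.
    fixup_steps = 0
    for _ in range(max_fixup_steps):
        kl = mean_kl()
        if kl.item() <= kl_target:
            break
        optimizer.zero_grad()
        (beta * kl).backward()
        optimizer.step()
        fixup_steps += 1
    return fixup_steps  # extra steps spent enforcing the trust region
```

In this reading, Phase 1 usually lands inside the trust region on its own, and the few extra steps in Phase 2 supply the guarantee, which matches the reported overhead of fewer than 5% additional gradient steps.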
Related papers
- Embedding Safety into RL: A New Take on Trust Region Methods [1.5733417396701983]
Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to unsafe behaviors.
We propose Constrained Trust Region Policy Optimization (C-TRPO), a novel approach that modifies the geometry of the policy space based on the safety constraints.
arXiv Detail & Related papers (2024-11-05T09:55:50Z)
- Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
arXiv Detail & Related papers (2023-11-15T13:16:16Z)
- Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is free of any explicit trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
- Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO [66.5384483339413]
We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL).
We show that a trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training.
arXiv Detail & Related papers (2022-01-31T20:39:48Z)
- Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via Trust Region Decomposition [52.06086375833474]
Non-stationarity is a thorny issue in multi-agent reinforcement learning.
We introduce a $\delta$-stationarity measurement to explicitly model the stationarity of a policy sequence.
We propose a trust region decomposition network based on message passing to estimate the joint policy divergence.
arXiv Detail & Related papers (2021-02-21T14:46:50Z)
- Differentiable Trust Region Layers for Deep Reinforcement Learning [19.33011160278043]
We propose differentiable neural network layers to enforce trust regions for deep Gaussian policies via closed-form projections.
We derive trust region projections based on the Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius norm for Gaussian distributions.
arXiv Detail & Related papers (2021-01-22T16:52:06Z)
- Deep Reinforcement Learning with Robust and Smooth Policy [90.78795857181727]
We propose to learn a policy that behaves smoothly with respect to states.
We develop a new framework, Smooth Regularized Reinforcement Learning (SR$^2$L), where the policy is trained with smoothness-inducing regularization.
Such regularization effectively constrains the search space and enforces smoothness in the learned policy (a rough sketch of this kind of regularizer appears after this list).
arXiv Detail & Related papers (2020-03-21T00:10:29Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
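Following up on the smoothness-regularization entry above (SR$^2$L), here is a rough PyTorch-style sketch of that kind of smoothness-inducing penalty: the policy's output distribution at a state is pulled toward its output at a nearby perturbed state. The random perturbation (standing in for an adversarial search), the radius `eps`, and the weight `lam` are assumptions made for brevity, not details taken from that paper.

```python
import torch

def smoothness_penalty(policy, obs, eps=0.05):
    """KL between the policy at `obs` and at a nearby perturbed copy of `obs`.
    A random perturbation stands in for the adversarial one a full method might use."""
    noise = eps * torch.randn_like(obs)
    d_clean = policy.distribution(obs)
    d_perturbed = policy.distribution(obs + noise)
    return torch.distributions.kl_divergence(d_clean, d_perturbed).mean()

def smooth_regularized_loss(policy, obs, actions, advantages, lam=0.1):
    # Standard policy-gradient surrogate plus the smoothness-inducing term.
    logp = policy.distribution(obs).log_prob(actions)
    return -(logp * advantages).mean() + lam * smoothness_penalty(policy, obs)
```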