Guaranteed Trust Region Optimization via Two-Phase KL Penalization
- URL: http://arxiv.org/abs/2312.05405v1
- Date: Fri, 8 Dec 2023 23:29:57 GMT
- Title: Guaranteed Trust Region Optimization via Two-Phase KL Penalization
- Authors: K.R. Zentner, Ujjwal Puri, Zhehui Huang, Gaurav S. Sukhatme
- Abstract summary: We show that applying KL penalization alone is nearly sufficient to enforce trust regions.
We then show that introducing a "fixup" phase is sufficient to guarantee a trust region is enforced on every policy update.
The resulting algorithm, which we call FixPO, is able to train a variety of policy architectures and action spaces.
- Score: 11.008537121214104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On-policy reinforcement learning (RL) has become a popular framework for
solving sequential decision problems due to its computational efficiency and
theoretical simplicity. Some on-policy methods guarantee every policy update is
constrained to a trust region relative to the prior policy to ensure training
stability. These methods often require computationally intensive non-linear
optimization or require a particular form of action distribution. In this work,
we show that applying KL penalization alone is nearly sufficient to enforce
such trust regions. Then, we show that introducing a "fixup" phase is
sufficient to guarantee a trust region is enforced on every policy update while
adding fewer than 5% additional gradient steps in practice. The resulting
algorithm, which we call FixPO, is able to train a variety of policy
architectures and action spaces, is easy to implement, and produces results
competitive with other trust region methods.
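The abstract describes a two-phase scheme: ordinary KL-penalized policy-gradient steps, followed by a short "fixup" phase that keeps taking gradient steps on the KL penalty alone until the trust region bound holds. The PyTorch-style sketch below is one plausible reading of that description, not the authors' implementation; the interface (`policy.distribution`), the hyper-parameters (`kl_target`, `beta`), and the loop structure are assumptions.

```python
import torch

def fixpo_style_update(policy, optimizer, batch, kl_target=0.01, beta=3.0,
                       epochs=10, max_fixup_steps=100):
    """Two-phase update: KL-penalized policy optimization, then KL-only 'fixup' steps."""
    obs, actions, advantages, old_dist = batch  # old_dist: frozen (detached) pre-update policy

    def mean_kl():
        # KL(old || new), averaged over the batch.
        return torch.distributions.kl_divergence(old_dist, policy.distribution(obs)).mean()

    # Phase 1: maximize the importance-weighted advantage, penalized by the KL term.
    for _ in range(epochs):
        new_dist = policy.distribution(obs)
        ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
        surrogate = -(ratio * advantages).mean()
        loss = surrogate + beta * torch.distributions.kl_divergence(old_dist, new_dist).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Phase 2: "fixup" -- extra gradient steps on the KL penalty only,
    # so the trust region bound holds before new data is collected.
    fixup_steps = 0
    for _ in range(max_fixup_steps):
        kl = mean_kl()
        if kl.item() <= kl_target:
            break
        optimizer.zero_grad()
        (beta * kl).backward()
        optimizer.step()
        fixup_steps += 1
    return fixup_steps  # extra steps spent enforcing the trust region
```

In this reading, Phase 1 usually lands inside the trust region on its own, and the few extra steps in Phase 2 supply the guarantee, which matches the reported overhead of fewer than 5% additional gradient steps.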
Related papers
- Embedding Safety into RL: A New Take on Trust Region Methods [1.5733417396701983]
Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to unsafe behaviors.
We propose Constrained Trust Region Policy Optimization (C-TRPO), a novel approach that modifies the geometry of the policy space based on the safety constraints.
arXiv Detail & Related papers (2024-11-05T09:55:50Z)
- Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
arXiv Detail & Related papers (2023-11-15T13:16:16Z)
- Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is free of any explicit trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
- Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO [66.5384483339413]
We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL).
We show that a trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training.
arXiv Detail & Related papers (2022-01-31T20:39:48Z)
- Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via Trust Region Decomposition [52.06086375833474]
Non-stationarity is a thorny issue in multi-agent reinforcement learning.
We introduce a $\delta$-stationarity measurement to explicitly model the stationarity of a policy sequence.
We propose a trust region decomposition network based on message passing to estimate the joint policy divergence.
arXiv Detail & Related papers (2021-02-21T14:46:50Z)
- Differentiable Trust Region Layers for Deep Reinforcement Learning [19.33011160278043]
We propose differentiable neural network layers to enforce trust regions for deep Gaussian policies via closed-form projections.
We derive trust region projections based on the Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius norm for Gaussian distributions.
arXiv Detail & Related papers (2021-01-22T16:52:06Z)
- Deep Reinforcement Learning with Robust and Smooth Policy [90.78795857181727]
We propose to learn a policy that behaves smoothly with respect to states.
We develop a new framework, Smooth Regularized Reinforcement Learning (SR$^2$L), where the policy is trained with smoothness-inducing regularization.
Such regularization effectively constrains the search space and enforces smoothness in the learned policy (a rough sketch of this kind of regularizer appears after this list).
arXiv Detail & Related papers (2020-03-21T00:10:29Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
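Following up on the smoothness-regularization entry above (SR$^2$L), here is a rough PyTorch-style sketch of that kind of smoothness-inducing penalty: the policy's output distribution at a state is pulled toward its output at a nearby perturbed state. The random perturbation (standing in for an adversarial search), the radius `eps`, and the weight `lam` are assumptions made for brevity, not details taken from that paper.

```python
import torch

def smoothness_penalty(policy, obs, eps=0.05):
    """KL between the policy at `obs` and at a nearby perturbed copy of `obs`.
    A random perturbation stands in for the adversarial one a full method might use."""
    noise = eps * torch.randn_like(obs)
    d_clean = policy.distribution(obs)
    d_perturbed = policy.distribution(obs + noise)
    return torch.distributions.kl_divergence(d_clean, d_perturbed).mean()

def smooth_regularized_loss(policy, obs, actions, advantages, lam=0.1):
    # Standard policy-gradient surrogate plus the smoothness-inducing term.
    logp = policy.distribution(obs).log_prob(actions)
    return -(logp * advantages).mean() + lam * smoothness_penalty(policy, obs)
```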