Differentiable Trust Region Layers for Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2101.09207v2
- Date: Tue, 9 Mar 2021 08:44:43 GMT
- Title: Differentiable Trust Region Layers for Deep Reinforcement Learning
- Authors: Fabian Otto, Philipp Becker, Ngo Anh Vien, Hanna Carolin Ziesche, and
Gerhard Neumann
- Abstract summary: We propose differentiable neural network layers to enforce trust regions for deep Gaussian policies via closed-form projections.
We derive trust region projections based on the Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius norm for Gaussian distributions.
- Score: 19.33011160278043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Trust region methods are a popular tool in reinforcement learning as they
yield robust policy updates in continuous and discrete action spaces. However,
enforcing such trust regions in deep reinforcement learning is difficult.
Hence, many approaches, such as Trust Region Policy Optimization (TRPO) and
Proximal Policy Optimization (PPO), are based on approximations. Due to those
approximations, they violate the constraints or fail to find the optimal
solution within the trust region. Moreover, they are difficult to implement,
often lack sufficient exploration, and have been shown to depend on seemingly
unrelated implementation choices. In this work, we propose differentiable
neural network layers to enforce trust regions for deep Gaussian policies via
closed-form projections. Unlike existing methods, those layers formalize trust
regions for each state individually and can complement existing reinforcement
learning algorithms. We derive trust region projections based on the
Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius
norm for Gaussian distributions. We empirically demonstrate that those
projection layers achieve similar or better results than existing methods while
being almost agnostic to specific implementation choices. The code is available
at https://git.io/Jthb0.
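To make the projection-layer idea concrete, the following minimal sketch projects only the mean of a diagonal Gaussian policy back into a per-state trust region, using a Mahalanobis-style bound and a simple rescaling toward the old mean. The function name, the diagonal-covariance restriction, and the interpolation rule are illustrative assumptions; the paper's closed-form projections additionally handle the covariance and the KL, Wasserstein, and Frobenius cases.

```python
import torch

def project_gaussian_mean(mu, mu_old, var_old, eps):
    """Illustrative per-state trust-region layer for the mean of a diagonal
    Gaussian policy: if the Mahalanobis distance to the old mean (measured
    under the old variances) exceeds eps, interpolate back toward the old
    mean so the bound holds with equality. The operation is differentiable,
    so gradients flow through the projection during training."""
    diff = mu - mu_old.detach()
    # Squared Mahalanobis distance per state, shape (batch,).
    d = (diff ** 2 / var_old.detach()).sum(-1)
    # Shrink factor: 1 inside the trust region, sqrt(eps / d) outside.
    scale = torch.clamp(torch.sqrt(eps / (d + 1e-12)), max=1.0)
    return mu_old.detach() + scale.unsqueeze(-1) * diff
```

Because the projection is built from differentiable operations, it can sit on top of an arbitrary policy network and be trained end-to-end, which is what allows it to complement existing reinforcement learning algorithms.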
Related papers
- Diffusion Policies creating a Trust Region for Offline Reinforcement Learning [66.17291150498276]
We introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy.
DTQL eliminates the need for iterative denoising sampling during both training and inference, making it remarkably computationally efficient.
We show that DTQL not only outperforms other methods on the majority of the D4RL benchmark tasks but is also efficient in both training and inference speed.
arXiv Detail & Related papers (2024-05-30T05:04:33Z)
- Image Copy-Move Forgery Detection via Deep PatchMatch and Pairwise Ranking Learning [39.85737063875394]
This study develops a novel end-to-end CMFD framework that integrates the strengths of conventional and deep learning methods.
Unlike existing deep models, our approach utilizes features extracted from high-resolution scales to seek explicit and reliable point-to-point matching.
By leveraging the strong prior of point-to-point matches, the framework can identify subtle differences and effectively discriminate between source and target regions.
arXiv Detail & Related papers (2024-04-26T10:38:17Z)
- Guaranteed Trust Region Optimization via Two-Phase KL Penalization [11.008537121214104]
We show that applying KL penalization alone is nearly sufficient to enforce trust regions.
We then show that introducing a "fixup" phase is sufficient to guarantee a trust region is enforced on every policy update.
The resulting algorithm, which we call FixPO, is able to train a variety of policy architectures and action spaces.
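A rough sketch of that two-phase idea, assuming a PyTorch policy module that maps states to a torch distribution (the function name, penalty weight, and step budget are placeholders rather than the paper's choices): after the usual penalized update, a fixup loop keeps minimizing only the KL penalty until the mean KL to the old policy is back inside the bound.

```python
import torch

def kl_fixup(policy, old_policy, states, eps_kl, beta=1.0, lr=1e-4, max_steps=50):
    """Hypothetical 'fixup' loop: after the ordinary penalized policy update,
    keep taking gradient steps on the KL penalty alone until the mean KL to
    the old policy is within eps_kl, so the trust region holds on every
    update."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    with torch.no_grad():
        old_dist = old_policy(states)          # frozen reference policy
    for _ in range(max_steps):
        kl = torch.distributions.kl_divergence(old_dist, policy(states)).mean()
        if kl.item() <= eps_kl:                # constraint satisfied: stop
            break
        opt.zero_grad()
        (beta * kl).backward()                 # penalty-only objective
        opt.step()
```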
arXiv Detail & Related papers (2023-12-08T23:29:57Z)
- Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
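As a loose illustration of a support constraint, the snippet below marks state-action pairs as in-support when an estimated behavior policy assigns them a joint log-density above a threshold; the hard mask, the threshold value, and the assumption that the behavior policy returns a distribution with a joint log_prob are simplifications, not the paper's formulation.

```python
import torch

def in_support_mask(behavior_policy, states, actions, log_threshold=-5.0):
    """Hypothetical support check: an action counts as in-support if the
    behavior policy's joint log-density at (state, action) exceeds a
    threshold. A support-constrained update would then only trust policy
    improvement on pairs where this mask is 1."""
    with torch.no_grad():
        logp = behavior_policy(states).log_prob(actions)
    return (logp >= log_threshold).float()
```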
arXiv Detail & Related papers (2023-11-15T13:16:16Z)
- Provably Convergent Policy Optimization via Metric-aware Trust Region Methods [21.950484108431944]
Trust-region methods are pervasively used to stabilize policy optimization in reinforcement learning.
We exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions.
We show that Wasserstein policy optimization (WPO) guarantees a monotonic performance improvement, and that Sinkhorn policy optimization (SPO) provably converges to WPO as the entropic regularizer diminishes.
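For diagonal Gaussian policies, the Wasserstein-2 distance between consecutive policies has a simple closed form, which is part of what makes it attractive as a trust-region measure; the helper below is a generic sketch, not code from the paper.

```python
import torch

def w2_squared_diag_gaussians(mu1, std1, mu2, std2):
    """Squared 2-Wasserstein distance between diagonal Gaussians
    N(mu1, diag(std1^2)) and N(mu2, diag(std2^2)):
    ||mu1 - mu2||^2 + ||std1 - std2||^2, computed per state."""
    return ((mu1 - mu2) ** 2).sum(-1) + ((std1 - std2) ** 2).sum(-1)
```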
arXiv Detail & Related papers (2023-06-25T05:41:38Z)
- Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is free of any explicit trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
- Diversity Through Exclusion (DTE): Niche Identification for Reinforcement Learning through Value-Decomposition [63.67574523750839]
We propose a generic reinforcement learning (RL) algorithm that performs better than baseline deep Q-learning algorithms in environments with multiple variably-valued niches.
We show that agents trained this way can escape poor-but-attractive local optima to instead converge to harder-to-discover higher value strategies.
arXiv Detail & Related papers (2023-02-02T16:00:19Z)
- Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via Trust Region Decomposition [52.06086375833474]
Non-stationarity is one thorny issue in multi-agent reinforcement learning.
We introduce a $\delta$-stationarity measurement to explicitly model the stationarity of a policy sequence.
We propose a trust region decomposition network based on message passing to estimate the joint policy divergence.
arXiv Detail & Related papers (2021-02-21T14:46:50Z)
- Deep Reinforcement Learning with Robust and Smooth Policy [90.78795857181727]
We propose to learn a policy that behaves smoothly with respect to states.
We develop a new framework, Smooth Regularized Reinforcement Learning (SR$^2$L), where the policy is trained with smoothness-inducing regularization.
Such regularization effectively constrains the search space, and enforces smoothness in the learned policy.
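A minimal sketch of a smoothness-inducing regularizer in this spirit, assuming a policy that maps states to a torch distribution; the random perturbation and symmetrized KL used here are placeholders for the paper's regularizer.

```python
import torch

def smoothness_penalty(policy, states, noise_scale=0.01):
    """Hypothetical smoothness regularizer: penalize the divergence between
    the action distribution at a state and at a slightly perturbed state,
    encouraging the policy to change little when the state changes little."""
    perturbed = states + noise_scale * torch.randn_like(states)
    pi, pi_pert = policy(states), policy(perturbed)
    # Symmetrized KL as a simple measure of how much the policy moves.
    kl = torch.distributions.kl_divergence
    return 0.5 * (kl(pi, pi_pert) + kl(pi_pert, pi)).mean()
```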
arXiv Detail & Related papers (2020-03-21T00:10:29Z)
- Quasi-Newton Trust Region Policy Optimization [5.9999375710781]
Gradient descent is the de facto algorithm for reinforcement learning tasks with continuous controls.
We propose a trust region method for policy optimization that employs a Quasi-Newton approximation of the Hessian.
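The snippet below shows a generic quasi-Newton trust-region step rather than the paper's algorithm: an inverse-Hessian estimate is refreshed with the standard BFGS formula and the resulting step is rescaled to the trust-region radius; all names and the radius handling are illustrative.

```python
import torch

def quasi_newton_tr_step(H, grad, s_prev, y_prev, radius):
    """Generic quasi-Newton trust-region step: update an inverse-Hessian
    estimate H with the BFGS formula from the last parameter change s_prev
    and gradient change y_prev, then take the step -H @ grad, rescaled so
    its length never exceeds the trust-region radius."""
    rho = 1.0 / torch.dot(y_prev, s_prev)
    I = torch.eye(H.shape[0])
    V = I - rho * torch.outer(s_prev, y_prev)
    H = V @ H @ V.T + rho * torch.outer(s_prev, s_prev)   # BFGS update
    step = -H @ grad                                       # quasi-Newton direction
    norm = torch.linalg.norm(step)
    if norm > radius:
        step = step * (radius / norm)                      # stay inside trust region
    return step, H
```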
arXiv Detail & Related papers (2019-12-26T18:29:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.