Differentiable Trust Region Layers for Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2101.09207v2
- Date: Tue, 9 Mar 2021 08:44:43 GMT
- Title: Differentiable Trust Region Layers for Deep Reinforcement Learning
- Authors: Fabian Otto, Philipp Becker, Ngo Anh Vien, Hanna Carolin Ziesche, and
Gerhard Neumann
- Abstract summary: We propose differentiable neural network layers to enforce trust regions for deep Gaussian policies via closed-form projections.
We derive trust region projections based on the Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius norm for Gaussian distributions.
- Score: 19.33011160278043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Trust region methods are a popular tool in reinforcement learning as they
yield robust policy updates in continuous and discrete action spaces. However,
enforcing such trust regions in deep reinforcement learning is difficult.
Hence, many approaches, such as Trust Region Policy Optimization (TRPO) and
Proximal Policy Optimization (PPO), are based on approximations. Due to those
approximations, they violate the constraints or fail to find the optimal
solution within the trust region. Moreover, they are difficult to implement,
often lack sufficient exploration, and have been shown to depend on seemingly
unrelated implementation choices. In this work, we propose differentiable
neural network layers to enforce trust regions for deep Gaussian policies via
closed-form projections. Unlike existing methods, those layers formalize trust
regions for each state individually and can complement existing reinforcement
learning algorithms. We derive trust region projections based on the
Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius
norm for Gaussian distributions. We empirically demonstrate that those
projection layers achieve similar or better results than existing methods while
being almost agnostic to specific implementation choices. The code is available
at https://git.io/Jthb0.
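To make the projection-layer idea concrete, the following minimal sketch projects only the mean of a diagonal Gaussian policy back into a per-state trust region, using a Mahalanobis-style bound and a simple rescaling toward the old mean. The function name, the diagonal-covariance restriction, and the interpolation rule are illustrative assumptions; the paper's closed-form projections additionally handle the covariance and the KL, Wasserstein, and Frobenius cases.

```python
import torch

def project_gaussian_mean(mu, mu_old, var_old, eps):
    """Illustrative per-state trust-region layer for the mean of a diagonal
    Gaussian policy: if the Mahalanobis distance to the old mean (measured
    under the old variances) exceeds eps, interpolate back toward the old
    mean so the bound holds with equality. The operation is differentiable,
    so gradients flow through the projection during training."""
    diff = mu - mu_old.detach()
    # Squared Mahalanobis distance per state, shape (batch,).
    d = (diff ** 2 / var_old.detach()).sum(-1)
    # Shrink factor: 1 inside the trust region, sqrt(eps / d) outside.
    scale = torch.clamp(torch.sqrt(eps / (d + 1e-12)), max=1.0)
    return mu_old.detach() + scale.unsqueeze(-1) * diff
```

Because the projection is built from differentiable operations, it can sit on top of an arbitrary policy network and be trained end-to-end, which is what allows it to complement existing reinforcement learning algorithms.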
Related papers
- Diffusion Policies creating a Trust Region for Offline Reinforcement Learning [66.17291150498276]
We introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy.
DTQL eliminates the need for iterative denoising sampling during both training and inference, making it remarkably computationally efficient.
We show that DTQL not only outperforms other methods on the majority of the D4RL benchmark tasks but is also efficient in both training and inference speed.
arXiv Detail & Related papers (2024-05-30T05:04:33Z)
- Image Copy-Move Forgery Detection via Deep PatchMatch and Pairwise Ranking Learning [39.85737063875394]
This study develops a novel end-to-end CMFD framework that integrates the strengths of conventional and deep learning methods.
Unlike existing deep models, our approach utilizes features extracted from high-resolution scales to seek explicit and reliable point-to-point matching.
By leveraging the strong prior of point-to-point matches, the framework can identify subtle differences and effectively discriminate between source and target regions.
arXiv Detail & Related papers (2024-04-26T10:38:17Z)
- Guaranteed Trust Region Optimization via Two-Phase KL Penalization [11.008537121214104]
We show that applying KL penalization alone is nearly sufficient to enforce trust regions.
We then show that introducing a "fixup" phase is sufficient to guarantee a trust region is enforced on every policy update.
The resulting algorithm, which we call FixPO, is able to train a variety of policy architectures and action spaces.
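A rough sketch of that two-phase idea, assuming a PyTorch policy module that maps states to a torch distribution (the function name, penalty weight, and step budget are placeholders rather than the paper's choices): after the usual penalized update, a fixup loop keeps minimizing only the KL penalty until the mean KL to the old policy is back inside the bound.

```python
import torch

def kl_fixup(policy, old_policy, states, eps_kl, beta=1.0, lr=1e-4, max_steps=50):
    """Hypothetical 'fixup' loop: after the ordinary penalized policy update,
    keep taking gradient steps on the KL penalty alone until the mean KL to
    the old policy is within eps_kl, so the trust region holds on every
    update."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    with torch.no_grad():
        old_dist = old_policy(states)          # frozen reference policy
    for _ in range(max_steps):
        kl = torch.distributions.kl_divergence(old_dist, policy(states)).mean()
        if kl.item() <= eps_kl:                # constraint satisfied: stop
            break
        opt.zero_grad()
        (beta * kl).backward()                 # penalty-only objective
        opt.step()
```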
arXiv Detail & Related papers (2023-12-08T23:29:57Z)
- Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
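As a loose illustration of a support constraint, the snippet below marks state-action pairs as in-support when an estimated behavior policy assigns them a joint log-density above a threshold; the hard mask, the threshold value, and the assumption that the behavior policy returns a distribution with a joint log_prob are simplifications, not the paper's formulation.

```python
import torch

def in_support_mask(behavior_policy, states, actions, log_threshold=-5.0):
    """Hypothetical support check: an action counts as in-support if the
    behavior policy's joint log-density at (state, action) exceeds a
    threshold. A support-constrained update would then only trust policy
    improvement on pairs where this mask is 1."""
    with torch.no_grad():
        logp = behavior_policy(states).log_prob(actions)
    return (logp >= log_threshold).float()
```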
arXiv Detail & Related papers (2023-11-15T13:16:16Z)
- Provably Convergent Policy Optimization via Metric-aware Trust Region Methods [21.950484108431944]
Trust-region methods are pervasively used to stabilize policy optimization in reinforcement learning.
We exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions.
We show that Wasserstein policy optimization (WPO) guarantees a monotonic performance improvement, and that Sinkhorn policy optimization (SPO) provably converges to WPO as the entropic regularizer diminishes.
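For diagonal Gaussian policies, the Wasserstein-2 distance between consecutive policies has a simple closed form, which is part of what makes it attractive as a trust-region measure; the helper below is a generic sketch, not code from the paper.

```python
import torch

def w2_squared_diag_gaussians(mu1, std1, mu2, std2):
    """Squared 2-Wasserstein distance between diagonal Gaussians
    N(mu1, diag(std1^2)) and N(mu2, diag(std2^2)):
    ||mu1 - mu2||^2 + ||std1 - std2||^2, computed per state."""
    return ((mu1 - mu2) ** 2).sum(-1) + ((std1 - std2) ** 2).sum(-1)
```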
arXiv Detail & Related papers (2023-06-25T05:41:38Z)
- Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is free of any explicit trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
- Diversity Through Exclusion (DTE): Niche Identification for Reinforcement Learning through Value-Decomposition [63.67574523750839]
We propose a generic reinforcement learning (RL) algorithm that performs better than baseline deep Q-learning algorithms in environments with multiple variably-valued niches.
We show that agents trained this way can escape poor-but-attractive local optima to instead converge to harder-to-discover higher value strategies.
arXiv Detail & Related papers (2023-02-02T16:00:19Z)
- Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via Trust Region Decomposition [52.06086375833474]
Non-stationarity is one thorny issue in multi-agent reinforcement learning.
We introduce a $\delta$-stationarity measurement to explicitly model the stationarity of a policy sequence.
We propose a trust region decomposition network based on message passing to estimate the joint policy divergence.
arXiv Detail & Related papers (2021-02-21T14:46:50Z)
- Deep Reinforcement Learning with Robust and Smooth Policy [90.78795857181727]
We propose to learn a policy that behaves smoothly with respect to states.
We develop a new framework, Smooth Regularized Reinforcement Learning (SR$^2$L), where the policy is trained with smoothness-inducing regularization.
Such regularization effectively constrains the search space, and enforces smoothness in the learned policy.
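A minimal sketch of a smoothness-inducing regularizer in this spirit, assuming a policy that maps states to a torch distribution; the random perturbation and symmetrized KL used here are placeholders for the paper's regularizer.

```python
import torch

def smoothness_penalty(policy, states, noise_scale=0.01):
    """Hypothetical smoothness regularizer: penalize the divergence between
    the action distribution at a state and at a slightly perturbed state,
    encouraging the policy to change little when the state changes little."""
    perturbed = states + noise_scale * torch.randn_like(states)
    pi, pi_pert = policy(states), policy(perturbed)
    # Symmetrized KL as a simple measure of how much the policy moves.
    kl = torch.distributions.kl_divergence
    return 0.5 * (kl(pi, pi_pert) + kl(pi_pert, pi)).mean()
```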
arXiv Detail & Related papers (2020-03-21T00:10:29Z)
- Quasi-Newton Trust Region Policy Optimization [5.9999375710781]
Gradient descent is the de facto algorithm for reinforcement learning tasks with continuous controls.
We propose a trust region method for policy optimization that employs a Quasi-Newton approximation of the Hessian.
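The snippet below shows a generic quasi-Newton trust-region step rather than the paper's algorithm: an inverse-Hessian estimate is refreshed with the standard BFGS formula and the resulting step is rescaled to the trust-region radius; all names and the radius handling are illustrative.

```python
import torch

def quasi_newton_tr_step(H, grad, s_prev, y_prev, radius):
    """Generic quasi-Newton trust-region step: update an inverse-Hessian
    estimate H with the BFGS formula from the last parameter change s_prev
    and gradient change y_prev, then take the step -H @ grad, rescaled so
    its length never exceeds the trust-region radius."""
    rho = 1.0 / torch.dot(y_prev, s_prev)
    I = torch.eye(H.shape[0])
    V = I - rho * torch.outer(s_prev, y_prev)
    H = V @ H @ V.T + rho * torch.outer(s_prev, s_prev)   # BFGS update
    step = -H @ grad                                       # quasi-Newton direction
    norm = torch.linalg.norm(step)
    if norm > radius:
        step = step * (radius / norm)                      # stay inside trust region
    return step, H
```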
arXiv Detail & Related papers (2019-12-26T18:29:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.