Regularized Policies are Reward Robust
- URL: http://arxiv.org/abs/2101.07012v1
- Date: Mon, 18 Jan 2021 11:38:47 GMT
- Title: Regularized Policies are Reward Robust
- Authors: Hisham Husain and Kamil Ciosek and Ryota Tomioka
- Abstract summary: We study the effects of regularization of policies in Reinforcement Learning (RL).
We find that the optimal policy found by a regularized objective is precisely an optimal policy of a reinforcement learning problem under a worst-case adversarial reward.
Our results thus give insights into the effects of regularization of policies and deepen our understanding of exploration through robust rewards at large.
- Score: 33.05828095421357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Entropic regularization of policies in Reinforcement Learning (RL) is a
commonly used heuristic to ensure that the learned policy explores the
state-space sufficiently before overfitting to a locally optimal policy. The
primary motivation for using entropy is for exploration and disambiguating
optimal policies; however, the theoretical effects are not entirely understood.
In this work, we study the more general regularized RL objective and, using
Fenchel duality, derive the dual problem, which takes the form of an
adversarial reward problem. In particular, we find that the optimal policy
found by a regularized objective is precisely an optimal policy of a
reinforcement learning problem under a worst-case adversarial reward. Our
result allows us to reinterpret the popular entropic regularization scheme as a
form of robustification. Furthermore, due to the generality of our results,
they apply to other existing regularization schemes. Our results thus give insights
into the effects of regularization of policies and deepen our understanding of
exploration through robust rewards at large.
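The summary above compresses the central derivation. A minimal sketch of the Fenchel-duality step, written in generic notation (occupancy measure $\mu_\pi$, convex regularizer $\Omega$ with convex conjugate $\Omega^*$; the paper's exact notation may differ), is:

$$
\max_{\pi}\;\langle \mu_\pi, r\rangle - \Omega(\mu_\pi)
\;=\;
\max_{\pi}\;\min_{\tilde r}\;\Big[\langle \mu_\pi, \tilde r\rangle + \Omega^*(r - \tilde r)\Big],
$$

which follows from $\Omega(\mu) = \sup_{r'}\langle \mu, r'\rangle - \Omega^*(r')$ with the substitution $\tilde r = r - r'$. The inner minimization is the worst-case adversarial reward problem: the adversary replaces $r$ with a perturbed reward $\tilde r$ and pays $\Omega^*(r - \tilde r)$ for the perturbation.

As a purely illustrative sanity check of the "regularization as a modified reward" reading in the entropic case (this is not the paper's construction; the temperature `tau` and the bandit setup are assumptions), the entropy-regularized value of any stochastic policy equals its plain expected return under the shifted reward $r - \tau \log \pi$:

```python
# Toy check: E_pi[r] + tau * H(pi)  ==  E_pi[r - tau * log(pi)]
# for a 5-armed bandit with an arbitrary stochastic policy.
import numpy as np

rng = np.random.default_rng(0)
tau = 0.1                                   # assumed temperature
r = rng.normal(size=5)                      # toy per-arm rewards
logits = rng.normal(size=5)
pi = np.exp(logits) / np.exp(logits).sum()  # arbitrary stochastic policy

entropy = -(pi * np.log(pi)).sum()
regularized_value = pi @ r + tau * entropy  # entropy-regularized objective
shifted_reward = r - tau * np.log(pi)       # policy-dependent "adversarial" shift
value_under_shift = pi @ shifted_reward     # expected return under shifted reward

assert np.isclose(regularized_value, value_under_shift)
print(regularized_value, value_under_shift)
```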
Related papers
- Analyzing and Bridging the Gap between Maximizing Total Reward and Discounted Reward in Deep Reinforcement Learning [17.245293915129942]
In deep reinforcement learning applications, maximizing discounted reward is often employed instead of maximizing total reward.
We analyzed the suboptimality of the policy obtained through maximizing discounted reward in relation to the policy that maximizes total reward.
We developed methods to align the optimal policies of the two objectives in certain situations, which can improve the performance of reinforcement learning algorithms.
arXiv Detail & Related papers (2024-07-18T08:33:10Z) - CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies [62.39667564455059]
We study a distribution of optimal policies.
In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems.
We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability.
arXiv Detail & Related papers (2022-05-19T09:48:56Z) - Do You Need the Entropy Reward (in Practice)? [29.811723497181486]
It is believed that the regularization imposed by entropy, on both policy improvement and policy evaluation, together contributes to good exploration, training convergence, and robustness of learned policies.
This paper takes a closer look at entropy as an intrinsic reward, by conducting various ablation studies on soft actor-critic (SAC).
Our findings reveal that in general, entropy rewards should be applied with caution to policy evaluation.
arXiv Detail & Related papers (2022-01-28T21:43:21Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with the state of the art on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z) - State Augmented Constrained Reinforcement Learning: Overcoming the
Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
arXiv Detail & Related papers (2021-02-23T21:07:35Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Variational Policy Gradient Method for Reinforcement Learning with
General Utilities [38.54243339632217]
In recent years, reinforcement learning systems with general goals beyond a cumulative sum of rewards have gained traction.
In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function.
We derive a new Variational Policy Gradient Theorem for RL with general utilities.
arXiv Detail & Related papers (2020-07-04T17:51:53Z) - Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)