Do You Need the Entropy Reward (in Practice)?
- URL: http://arxiv.org/abs/2201.12434v1
- Date: Fri, 28 Jan 2022 21:43:21 GMT
- Title: Do You Need the Entropy Reward (in Practice)?
- Authors: Haonan Yu, Haichao Zhang, Wei Xu
- Abstract summary: It is believed that the regularization imposed by entropy, on both policy improvement and policy evaluation, together contributes to good exploration, training convergence, and robustness of learned policies.
This paper takes a closer look at entropy as an intrinsic reward by conducting various ablation studies on soft actor-critic (SAC).
Our findings reveal that in general, entropy rewards should be applied with caution to policy evaluation.
- Score: 29.811723497181486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Maximum entropy (MaxEnt) RL maximizes a combination of the original task
reward and an entropy reward. It is believed that the regularization imposed by
entropy, on both policy improvement and policy evaluation, together contributes
to good exploration, training convergence, and robustness of learned policies.
This paper takes a closer look at entropy as an intrinsic reward by conducting
various ablation studies on soft actor-critic (SAC), a popular representative
of MaxEnt RL. Our findings reveal that in general, entropy rewards should be
applied with caution to policy evaluation. On one hand, the entropy reward,
like any other intrinsic reward, could obscure the main task reward if it is
not properly managed. We identify some failure cases of the entropy reward
especially in episodic Markov decision processes (MDPs), where it could cause
the policy to be overly optimistic or pessimistic. On the other hand, our
large-scale empirical study shows that using entropy regularization alone in
policy improvement leads to comparable or even better performance and
robustness than using it in both policy improvement and policy evaluation.
Based on these observations, we recommend either normalizing the entropy reward
to a zero mean (SACZero), or simply removing it from policy evaluation
(SACLite) for better practical results.
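To make the two recommendations concrete, the snippet below sketches how the critic (policy evaluation) target differs between standard SAC, SACLite (entropy reward removed from the target), and SACZero (entropy reward normalized to zero mean). This is a minimal illustration assuming a PyTorch-style SAC implementation; the function and tensor names are our own, and SACZero's normalization is shown as a simple batch-mean version rather than the authors' exact scheme.

```python
# Hypothetical sketch of the three critic-target variants discussed above.
# Names (critic_target, next_log_prob, alpha, ...) are illustrative only.
import torch


def critic_target(reward, done, next_q, next_log_prob, alpha,
                  gamma=0.99, variant="sac"):
    """Bellman target y = r + gamma * (Q(s', a') + entropy_bonus).

    variant:
      "sac"     - standard SAC: entropy bonus -alpha * log pi(a'|s') kept
                  in policy evaluation.
      "saclite" - entropy reward dropped from the target; entropy
                  regularization acts only on policy improvement.
      "saczero" - entropy reward shifted to zero mean over the batch so it
                  no longer biases the scale of the value estimates.
    """
    entropy_bonus = -alpha * next_log_prob
    if variant == "saclite":
        entropy_bonus = torch.zeros_like(entropy_bonus)
    elif variant == "saczero":
        entropy_bonus = entropy_bonus - entropy_bonus.mean()
    return reward + gamma * (1.0 - done) * (next_q + entropy_bonus)


# Toy usage on a random batch of 4 transitions.
B = 4
y = critic_target(reward=torch.randn(B), done=torch.zeros(B),
                  next_q=torch.randn(B), next_log_prob=torch.randn(B),
                  alpha=0.2, variant="saclite")
```

In all three variants the actor (policy improvement) loss would keep its entropy term; per the abstract, it is only the use of the entropy reward in policy evaluation that the paper recommends normalizing or removing.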
Related papers
- Maximum Entropy On-Policy Actor-Critic via Entropy Advantage Estimation [0.276240219662896]
A notable form of entropy regularisation is augmenting the objective with an entropy term, thereby simultaneously optimising the expected return and the entropy.
This framework, known as maximum entropy reinforcement learning (MaxEnt RL), has shown theoretical and empirical successes.
This paper proposes a simple method of separating the entropy objective from the MaxEnt RL objective, which facilitates the implementation of MaxEnt RL in on-policy settings.
arXiv Detail & Related papers (2024-07-25T15:48:24Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations (a short derivation is sketched after this list).
arXiv Detail & Related papers (2024-03-11T15:25:03Z)
- Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization [17.845518684835913]
In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors.
We propose a novel method to induce predictable behavior in RL agents, referred to as Predictability-Aware RL (PA-RL).
We show how the entropy rate can be formulated as an average reward objective, and since its entropy reward function is policy-dependent, we introduce an action-dependent surrogate entropy.
arXiv Detail & Related papers (2023-11-30T16:53:32Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Maximum Entropy Reinforcement Learning with Mixture Policies [54.291331971813364]
We construct a tractable approximation of the mixture entropy using MaxEnt algorithms.
We show that it is closely related to the sum of marginal entropies.
We derive an algorithmic variant of Soft Actor-Critic (SAC) to the mixture policy case and evaluate it on a series of continuous control tasks.
arXiv Detail & Related papers (2021-03-18T11:23:39Z)
- Regularized Policies are Reward Robust [33.05828095421357]
We study the effects of regularization of policies in Reinforcement Learning (RL).
We find that the optimal policy found by a regularized objective is precisely an optimal policy of a reinforcement learning problem under a worst-case adversarial reward.
Our results thus give insights into the effects of regularization of policies and deepen our understanding of exploration through robust rewards at large.
arXiv Detail & Related papers (2021-01-18T11:38:47Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate [40.97686031763918]
In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy?
We argue that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target.
We present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, $k$-nearest neighbors estimate of the state distribution entropy.
arXiv Detail & Related papers (2020-07-09T08:44:39Z)
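The "translates to $O\left(\log(T)\right)$ regret" step in the average reward policy gradient entry above follows from summing the stated $O\left(\frac{1}{T}\right)$ suboptimality over iterations; a sketch, assuming the bound holds for every iterate $t$ with some constant $C$:

```latex
% Per-iterate suboptimality J(\pi^*) - J(\pi_t) \le C / t summed over T iterations:
\mathrm{Regret}(T)
  = \sum_{t=1}^{T} \bigl( J(\pi^*) - J(\pi_t) \bigr)
  \le \sum_{t=1}^{T} \frac{C}{t}
  \le C \, (1 + \log T)
  = O(\log T).
```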