Target Entropy Annealing for Discrete Soft Actor-Critic
- URL: http://arxiv.org/abs/2112.02852v1
- Date: Mon, 6 Dec 2021 08:21:27 GMT
- Title: Target Entropy Annealing for Discrete Soft Actor-Critic
- Authors: Yaosheng Xu and Dailin Hu and Litian Liang and Stephen McAleer and
Pieter Abbeel and Roy Fox
- Abstract summary: Soft Actor-Critic (SAC) is considered the state-of-the-art algorithm for continuous action settings.
Counter-intuitively, empirical evidence shows that SAC does not perform well in discrete domains.
We propose Target Entropy Scheduled SAC (TES-SAC), an annealing method for the target entropy parameter applied on SAC.
We compare our method on Atari 2600 games against SAC with different constant target entropies, and analyze how our scheduling affects SAC.
- Score: 64.71285903492183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Soft Actor-Critic (SAC) is considered the state-of-the-art algorithm in
continuous action space settings. It uses the maximum entropy framework for
efficiency and stability, and applies a heuristic temperature Lagrange term to
tune the temperature $\alpha$, which determines how "soft" the policy should
be. Counter-intuitively, empirical evidence shows that SAC does not perform
well in discrete domains. In this paper we investigate possible explanations
for this phenomenon and propose Target Entropy Scheduled SAC (TES-SAC), an
annealing method for the target entropy parameter applied to SAC. The target
entropy is a constant in the temperature Lagrange term and represents the
target policy entropy in discrete SAC. We compare our method on Atari 2600
games against SAC with different constant target entropies, and analyze how
our scheduling affects SAC.
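The abstract above describes tuning the temperature $\alpha$ against a target policy entropy through a Lagrange term, with TES-SAC annealing that target over training. The following is a minimal sketch of how such an update could look in discrete SAC, assuming a PyTorch setup; the linear schedule in schedule_target_entropy is a purely illustrative assumption, since the abstract does not specify the actual TES-SAC schedule or hyperparameters.

```python
# Minimal sketch of the discrete-SAC temperature update with an annealed
# target entropy. The linear schedule is an illustrative assumption, not
# the schedule used in TES-SAC.
import numpy as np
import torch

log_alpha = torch.zeros(1, requires_grad=True)          # alpha = exp(log_alpha) > 0
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def schedule_target_entropy(step, total_steps, n_actions,
                            start_frac=0.98, end_frac=0.3):
    """Hypothetical linear anneal between two fractions of the maximum
    discrete entropy log(n_actions)."""
    mix = min(step / total_steps, 1.0)
    frac = start_frac + (end_frac - start_frac) * mix
    return frac * np.log(n_actions)

def update_alpha(action_probs, step, total_steps):
    """One gradient step on the temperature Lagrange term.

    action_probs: (batch, n_actions) policy probabilities pi(a|s),
    detached from the policy graph.
    """
    n_actions = action_probs.shape[-1]
    target_entropy = schedule_target_entropy(step, total_steps, n_actions)
    # Exact entropy of the discrete policy for each state in the batch.
    entropy = -(action_probs * torch.log(action_probs + 1e-8)).sum(dim=-1)
    # J(alpha) = E[alpha * (H(pi(.|s)) - target)]: minimizing this raises
    # alpha when the policy entropy is below the annealed target and
    # lowers alpha when it is above.
    alpha_loss = (log_alpha.exp() * (entropy - target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```

The resulting $\alpha$ would then be used in the actor and critic losses exactly as in constant-target-entropy discrete SAC; only the target itself changes over training in this sketch.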
Related papers
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Do You Need the Entropy Reward (in Practice)? [29.811723497181486]
It is believed that the regularization imposed by entropy on both policy improvement and policy evaluation contributes to good exploration, training convergence, and robustness of learned policies.
This paper takes a closer look at entropy as an intrinsic reward, by conducting various ablation studies on soft actor-critic (SAC)
Our findings reveal that in general, entropy rewards should be applied with caution to policy evaluation.
arXiv Detail & Related papers (2022-01-28T21:43:21Z)
- Soft Actor-Critic with Cross-Entropy Policy Optimization [0.45687771576879593]
We propose Soft Actor-Critic with Cross-Entropy Policy Optimization (SAC-CEPO)
SAC-CEPO uses the Cross-Entropy Method (CEM) to optimize the policy network of SAC.
We show that SAC-CEPO achieves competitive performance against the original SAC.
arXiv Detail & Related papers (2021-12-21T11:38:12Z)
- Continuous-Time Fitted Value Iteration for Robust Policies [93.25997466553929]
Solving the Hamilton-Jacobi-Bellman equation is important in many domains including control, robotics and economics.
We propose continuous fitted value iteration (cFVI) and robust fitted value iteration (rFVI)
These algorithms leverage the non-linear control-affine dynamics and separable state and action reward of many continuous control problems.
arXiv Detail & Related papers (2021-10-05T11:33:37Z)
- Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with On-Policy Experience [9.06635747612495]
Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm.
SAC trains a policy by maximizing the trade-off between expected return and entropy (written out as an equation in the note after this list).
It has achieved state-of-the-art performance on a range of continuous-control benchmark tasks.
arXiv Detail & Related papers (2021-09-24T06:46:28Z)
- On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We analyze the stochastic ExtraGradient (SEG) method with constant step size and present variations of the method that yield favorable convergence.
We prove that, when augmented with averaging, SEG converges to the Nash equilibrium, and that this convergence is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
- Maximum Entropy Reinforcement Learning with Mixture Policies [54.291331971813364]
We construct a tractable approximation of the mixture entropy for use in MaxEnt algorithms.
We show that it is closely related to the sum of marginal entropies (generic bounds to this effect are sketched in the note after this list).
We derive an algorithmic variant of Soft Actor-Critic (SAC) to the mixture policy case and evaluate it on a series of continuous control tasks.
arXiv Detail & Related papers (2021-03-18T11:23:39Z)
- Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via Metagradient [5.100592488212484]
Our method is built upon the Soft Actor-Critic (SAC) algorithm, which uses an "entropy temperature" that balances the original task reward and the policy entropy.
We show that Meta-SAC achieves promising performance on several of the MuJoCo benchmark tasks, and outperforms SAC-v2 by over 10% in one of the most challenging tasks.
arXiv Detail & Related papers (2020-07-03T20:26:50Z)
- Band-limited Soft Actor Critic Model [15.11069042369131]
Soft Actor Critic (SAC) algorithms show remarkable performance in complex simulated environments.
We take this idea one step further by artificially bandlimiting the target critic spatial resolution.
We derive the closed form solution in the linear case and show that bandlimiting reduces the interdependency between the low frequency components of the state-action value approximation.
arXiv Detail & Related papers (2020-06-19T22:52:43Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
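Note (referenced from the Improved Soft Actor-Critic entry above): the return-entropy trade-off that SAC maximizes is the standard maximum-entropy objective
$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot \mid s_t))\big]$,
with $\mathcal{H}(\pi(\cdot \mid s)) = -\sum_a \pi(a \mid s)\log\pi(a \mid s)$ in the discrete case. This is the common SAC formulation rather than a contribution of any single paper listed here; the temperature $\alpha$ is the quantity that the Lagrange term in the main abstract adjusts toward the target entropy.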
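Note (referenced from the Maximum Entropy Reinforcement Learning with Mixture Policies entry above): for a mixture policy $\pi(\cdot \mid s) = \sum_k w_k \pi_k(\cdot \mid s)$, the mixture entropy is sandwiched by the weighted sum of component entropies,
$\sum_k w_k\,\mathcal{H}(\pi_k(\cdot \mid s)) \le \mathcal{H}(\pi(\cdot \mid s)) \le \sum_k w_k\,\mathcal{H}(\pi_k(\cdot \mid s)) + \mathcal{H}(w)$,
where $\mathcal{H}(w)$ is the entropy of the mixture weights. These are generic information-theoretic bounds that make "closely related to the sum of marginal entropies" concrete; they are not the specific approximation derived in that paper.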