Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via
Metagradient
- URL: http://arxiv.org/abs/2007.01932v2
- Date: Fri, 31 Jul 2020 04:34:20 GMT
- Title: Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via
Metagradient
- Authors: Yufei Wang, Tianwei Ni
- Abstract summary: Our method is built upon the Soft Actor-Critic (SAC) algorithm, which uses an "entropy temperature" that balances the original task reward and the policy entropy.
We show that Meta-SAC achieves promising performance on several of the MuJoCo benchmark tasks, and outperforms SAC-v2 by over 10% on one of the most challenging tasks.
- Score: 5.100592488212484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exploration-exploitation dilemma has long been a crucial issue in
reinforcement learning. In this paper, we propose a new approach to
automatically balance between the two. Our method is built upon the Soft
Actor-Critic (SAC) algorithm, which uses an "entropy temperature" that balances
the original task reward and the policy entropy, and hence controls the
trade-off between exploitation and exploration. It has been shown empirically
that SAC is very sensitive to this hyperparameter, and that the follow-up work
(SAC-v2), which uses constrained optimization for automatic adjustment, has
some limitations. The core of our method, namely Meta-SAC, is to use
metagradient along with a novel meta objective to automatically tune the
entropy temperature in SAC. We show that Meta-SAC achieves promising
performance on several of the MuJoCo benchmark tasks, and outperforms SAC-v2 by
over 10% on one of the most challenging tasks, Humanoid-v2.
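The metagradient idea in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: the scalar policy parameter `theta`, the quadratic stand-ins for task loss and entropy, and all learning rates are illustrative assumptions. It shows the core mechanism only: differentiate a meta objective (task loss alone, no entropy term) through one entropy-regularized inner update to obtain a gradient with respect to the temperature `alpha`.

```python
# Hedged sketch of metagradient temperature tuning (illustrative names and
# objectives, not the paper's implementation).

def inner_loss(theta, alpha):
    task = (theta - 2.0) ** 2        # stand-in for the (negated) task reward
    entropy = -theta ** 2            # stand-in for policy entropy
    return task - alpha * entropy    # SAC-style reward/entropy trade-off

def meta_loss(theta):
    return (theta - 2.0) ** 2        # meta objective: task performance only

def meta_sac_step(theta, alpha, lr=0.1, meta_lr=0.05):
    # inner update: theta' = theta - lr * d inner_loss / d theta
    d_inner = 2 * (theta - 2.0) + 2 * alpha * theta
    theta_new = theta - lr * d_inner
    # metagradient: d meta_loss(theta') / d alpha via the chain rule,
    # since theta' depends on alpha through the inner update
    d_theta_new_d_alpha = -lr * 2 * theta
    d_meta = 2 * (theta_new - 2.0) * d_theta_new_d_alpha
    alpha_new = alpha - meta_lr * d_meta
    return theta_new, max(alpha_new, 0.0)   # keep the temperature non-negative

theta, alpha = 0.0, 1.0
for _ in range(200):
    theta, alpha = meta_sac_step(theta, alpha)
```

In this toy setting the entropy term pulls `theta` away from the task optimum, so the metagradient steadily lowers `alpha` and the meta loss decreases, which is the qualitative behavior the temperature update is meant to produce.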
Related papers
- RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End Robust Estimation [74.47709320443998]
We propose RLSAC, a novel Reinforcement Learning enhanced SAmple Consensus framework for end-to-end robust estimation.
RLSAC employs a graph neural network to utilize both data and memory features to guide exploring directions for sampling the next minimum set.
Our experimental results demonstrate that RLSAC can learn from features to gradually explore a better hypothesis.
arXiv Detail & Related papers (2023-08-10T03:14:19Z)
- Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policies [64.2210390071609]
We present a novel Heavy-Tailed Policy Gradient (HT-PSG) algorithm to deal with the challenges of sparse rewards in continuous control problems.
We show consistent performance improvement across all tasks in terms of high average cumulative reward.
arXiv Detail & Related papers (2022-06-12T04:09:39Z)
- Evolving Pareto-Optimal Actor-Critic Algorithms for Generalizability and Stability [67.8426046908398]
Generalizability and stability are two key objectives for operating reinforcement learning (RL) agents in the real world.
This paper presents MetaPG, an evolutionary method for automated design of actor-critic loss functions.
arXiv Detail & Related papers (2022-04-08T20:46:16Z)
- Soft Actor-Critic with Cross-Entropy Policy Optimization [0.45687771576879593]
We propose Soft Actor-Critic with Cross-Entropy Policy Optimization (SAC-CEPO).
SAC-CEPO uses the Cross-Entropy Method (CEM) to optimize the policy network of SAC.
We show that SAC-CEPO achieves competitive performance against the original SAC.
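The Cross-Entropy Method at the heart of SAC-CEPO can be sketched in a few lines. This is a generic CEM loop on a toy 1-D objective, not the paper's policy-network variant; the function name, hyperparameters, and objective are all illustrative.

```python
import numpy as np

# Minimal sketch of the Cross-Entropy Method (CEM): sample candidates,
# keep the elites, refit the sampling distribution, repeat.
def cem_maximize(objective, mu=0.0, sigma=2.0, pop=64, elite_frac=0.2,
                 iters=30, seed=0):
    rng = np.random.default_rng(seed)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=pop)                    # propose
        elites = samples[np.argsort(objective(samples))[-n_elite:]]  # select
        mu, sigma = elites.mean(), elites.std() + 1e-6               # refit
    return mu

best = cem_maximize(lambda x: -(x - 1.5) ** 2)   # maximum at x = 1.5
```

Because the distribution is refit only on elites, `sigma` shrinks over iterations and the search concentrates around the optimum, which is the gradient-free behavior that SAC-CEPO exploits for policy improvement.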
arXiv Detail & Related papers (2021-12-21T11:38:12Z)
- Target Entropy Annealing for Discrete Soft Actor-Critic [64.71285903492183]
Soft Actor-Critic (SAC) is considered the state-of-the-art algorithm for continuous action settings.
Counter-intuitively, empirical evidence shows that SAC does not perform well in discrete domains.
We propose Target Entropy Scheduled SAC (TES-SAC), an annealing method for the target entropy parameter applied on SAC.
We compare our method on Atari 2600 games against SAC with different constant target entropies, and analyze how our scheduling affects SAC.
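A target-entropy schedule can be as simple as linear interpolation over training. The sketch below is a generic linear anneal with illustrative endpoint values; it is not necessarily the scheme TES-SAC uses, only an example of annealing the target entropy from an exploratory start value toward a lower final value.

```python
# Illustrative linear annealing schedule for the target entropy
# (endpoints `start` and `end` are assumptions, not TES-SAC's values).
def target_entropy(step, total_steps, start=1.0, end=0.1):
    frac = min(step / total_steps, 1.0)   # clamp once training is past the horizon
    return start + frac * (end - start)
```

Early in training the high target entropy keeps the policy stochastic; as the schedule advances, the temperature controller is pushed to tolerate less entropy, sharpening the policy.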
arXiv Detail & Related papers (2021-12-06T08:21:27Z)
- Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with On-Policy Experience [9.06635747612495]
Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm.
SAC trains a policy by maximizing the trade-off between expected return and entropy.
It has achieved state-of-the-art performance on a range of continuous-control benchmark tasks.
arXiv Detail & Related papers (2021-09-24T06:46:28Z)
- Context-Based Soft Actor Critic for Environments with Non-stationary Dynamics [8.318823695156974]
We propose the Latent Context-based Soft Actor Critic (LC-SAC) method to address the aforementioned issues.
By minimizing the contrastive prediction loss function, the learned context variables capture the information of the environment dynamics and the recent behavior of the agent.
Experimental results show that the performance of LC-SAC is significantly better than the SAC algorithm on the MetaWorld ML1 tasks.
arXiv Detail & Related papers (2021-05-07T15:00:59Z)
- Meta-Learning with Neural Tangent Kernels [58.06951624702086]
We propose the first meta-learning paradigm in the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK).
Within this paradigm, we introduce two meta-learning algorithms, which no longer need a sub-optimal iterative inner-loop adaptation as in the MAML framework.
We achieve this goal by 1) replacing the adaptation with a fast-adaptive regularizer in the RKHS; and 2) solving the adaptation analytically based on the NTK theory.
arXiv Detail & Related papers (2021-02-07T20:53:23Z)
- Band-limited Soft Actor Critic Model [15.11069042369131]
Soft Actor Critic (SAC) algorithms show remarkable performance in complex simulated environments.
We take this idea one step further by artificially bandlimiting the target critic spatial resolution.
We derive the closed form solution in the linear case and show that bandlimiting reduces the interdependency between the low frequency components of the state-action value approximation.
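Band-limiting itself is just a low-pass operation. The sketch below applies a hard FFT cutoff to a 1-D signal to show the operation in isolation; the paper band-limits the target critic, and both the function name and `keep_frac` are illustrative assumptions, not the authors' parameters.

```python
import numpy as np

# Illustrative hard low-pass filter: zero out high-frequency FFT bins.
def bandlimit(signal, keep_frac=0.25):
    spec = np.fft.rfft(signal)
    cutoff = max(1, int(len(spec) * keep_frac))
    spec[cutoff:] = 0.0                      # discard high frequencies
    return np.fft.irfft(spec, n=len(signal))

t = np.arange(64) / 64
low = np.sin(2 * np.pi * t)        # 1 cycle: inside the pass band
high = np.sin(2 * np.pi * 20 * t)  # 20 cycles: above the cutoff
```

Components below the cutoff pass through essentially unchanged while those above it are removed, mirroring the idea of restricting the target's spatial resolution to its low-frequency content.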
arXiv Detail & Related papers (2020-06-19T22:52:43Z)
- Meta Reinforcement Learning with Autonomous Inference of Subtask Dependencies [57.27944046925876]
We propose and address a novel few-shot RL problem, where a task is characterized by a subtask graph.
Instead of directly learning a meta-policy, we develop a Meta-learner with Subtask Graph Inference.
Our experiment results on two grid-world domains and StarCraft II environments show that the proposed method is able to accurately infer the latent task parameter.
arXiv Detail & Related papers (2020-01-01T17:34:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all listed content) and is not responsible for any consequences.