Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via
Metagradient
- URL: http://arxiv.org/abs/2007.01932v2
- Date: Fri, 31 Jul 2020 04:34:20 GMT
- Title: Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via
Metagradient
- Authors: Yufei Wang, Tianwei Ni
- Abstract summary: Our method is built upon the Soft Actor-Critic (SAC) algorithm, which uses an "entropy temperature" that balances the original task reward and the policy entropy.
We show that Meta-SAC achieves promising performance on several of the MuJoCo benchmark tasks, and outperforms SAC-v2 by over 10% on one of the most challenging tasks.
- Score: 5.100592488212484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exploration-exploitation dilemma has long been a crucial issue in
reinforcement learning. In this paper, we propose a new approach to
automatically balance between the two. Our method is built upon the Soft
Actor-Critic (SAC) algorithm, which uses an "entropy temperature" that balances
the original task reward and the policy entropy, and hence controls the
trade-off between exploitation and exploration. It has been shown empirically
that SAC is very sensitive to this hyperparameter, and that the follow-up work
(SAC-v2), which uses constrained optimization for automatic adjustment, has
some limitations. The core of our method, namely Meta-SAC, is to use
metagradient along with a novel meta objective to automatically tune the
entropy temperature in SAC. We show that Meta-SAC achieves promising
performance on several of the MuJoCo benchmark tasks, and outperforms SAC-v2 by
over 10% on one of the most challenging tasks, Humanoid-v2.
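The metagradient idea in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: the scalar policy parameter `theta`, the quadratic stand-ins for task loss and entropy, and all learning rates are illustrative assumptions. It shows the core mechanism only: differentiate a meta objective (task loss alone, no entropy term) through one entropy-regularized inner update to obtain a gradient with respect to the temperature `alpha`.

```python
# Hedged sketch of metagradient temperature tuning (illustrative names and
# objectives, not the paper's implementation).

def inner_loss(theta, alpha):
    task = (theta - 2.0) ** 2        # stand-in for the (negated) task reward
    entropy = -theta ** 2            # stand-in for policy entropy
    return task - alpha * entropy    # SAC-style reward/entropy trade-off

def meta_loss(theta):
    return (theta - 2.0) ** 2        # meta objective: task performance only

def meta_sac_step(theta, alpha, lr=0.1, meta_lr=0.05):
    # inner update: theta' = theta - lr * d inner_loss / d theta
    d_inner = 2 * (theta - 2.0) + 2 * alpha * theta
    theta_new = theta - lr * d_inner
    # metagradient: d meta_loss(theta') / d alpha via the chain rule,
    # since theta' depends on alpha through the inner update
    d_theta_new_d_alpha = -lr * 2 * theta
    d_meta = 2 * (theta_new - 2.0) * d_theta_new_d_alpha
    alpha_new = alpha - meta_lr * d_meta
    return theta_new, max(alpha_new, 0.0)   # keep the temperature non-negative

theta, alpha = 0.0, 1.0
for _ in range(200):
    theta, alpha = meta_sac_step(theta, alpha)
```

In this toy setting the entropy term pulls `theta` away from the task optimum, so the metagradient steadily lowers `alpha` and the meta loss decreases, which is the qualitative behavior the temperature update is meant to produce.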
Related papers
- RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End Robust Estimation [74.47709320443998]
We propose RLSAC, a novel Reinforcement Learning enhanced SAmple Consensus framework for end-to-end robust estimation.
RLSAC employs a graph neural network to utilize both data and memory features to guide exploring directions for sampling the next minimum set.
Our experimental results demonstrate that RLSAC can learn from features to gradually explore a better hypothesis.
arXiv Detail & Related papers (2023-08-10T03:14:19Z)
- Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policies [64.2210390071609]
We present a novel Heavy-Tailed Policy Gradient (HT-PSG) algorithm to deal with the challenges of sparse rewards in continuous control problems.
We show consistent performance improvement across all tasks in terms of high average cumulative reward.
arXiv Detail & Related papers (2022-06-12T04:09:39Z)
- Evolving Pareto-Optimal Actor-Critic Algorithms for Generalizability and Stability [67.8426046908398]
Generalizability and stability are two key objectives for operating reinforcement learning (RL) agents in the real world.
This paper presents MetaPG, an evolutionary method for automated design of actor-critic loss functions.
arXiv Detail & Related papers (2022-04-08T20:46:16Z)
- Soft Actor-Critic with Cross-Entropy Policy Optimization [0.45687771576879593]
We propose Soft Actor-Critic with Cross-Entropy Policy Optimization (SAC-CEPO).
SAC-CEPO uses the Cross-Entropy Method (CEM) to optimize the policy network of SAC.
We show that SAC-CEPO achieves competitive performance against the original SAC.
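The Cross-Entropy Method at the heart of SAC-CEPO can be sketched in a few lines. This is a generic CEM loop on a toy 1-D objective, not the paper's policy-network variant; the function name, hyperparameters, and objective are all illustrative.

```python
import numpy as np

# Minimal sketch of the Cross-Entropy Method (CEM): sample candidates,
# keep the elites, refit the sampling distribution, repeat.
def cem_maximize(objective, mu=0.0, sigma=2.0, pop=64, elite_frac=0.2,
                 iters=30, seed=0):
    rng = np.random.default_rng(seed)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=pop)                    # propose
        elites = samples[np.argsort(objective(samples))[-n_elite:]]  # select
        mu, sigma = elites.mean(), elites.std() + 1e-6               # refit
    return mu

best = cem_maximize(lambda x: -(x - 1.5) ** 2)   # maximum at x = 1.5
```

Because the distribution is refit only on elites, `sigma` shrinks over iterations and the search concentrates around the optimum, which is the gradient-free behavior that SAC-CEPO exploits for policy improvement.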
arXiv Detail & Related papers (2021-12-21T11:38:12Z)
- Target Entropy Annealing for Discrete Soft Actor-Critic [64.71285903492183]
Soft Actor-Critic (SAC) is considered the state-of-the-art algorithm for continuous action settings.
Counter-intuitively, empirical evidence shows that SAC does not perform well in discrete domains.
We propose Target Entropy Scheduled SAC (TES-SAC), an annealing method for the target entropy parameter applied on SAC.
We compare our method on Atari 2600 games against SAC with different constant target entropies, and analyze how our scheduling affects SAC.
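A target-entropy schedule can be as simple as linear interpolation over training. The sketch below is a generic linear anneal with illustrative endpoint values; it is not necessarily the scheme TES-SAC uses, only an example of annealing the target entropy from an exploratory start value toward a lower final value.

```python
# Illustrative linear annealing schedule for the target entropy
# (endpoints `start` and `end` are assumptions, not TES-SAC's values).
def target_entropy(step, total_steps, start=1.0, end=0.1):
    frac = min(step / total_steps, 1.0)   # clamp once training is past the horizon
    return start + frac * (end - start)
```

Early in training the high target entropy keeps the policy stochastic; as the schedule advances, the temperature controller is pushed to tolerate less entropy, sharpening the policy.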
arXiv Detail & Related papers (2021-12-06T08:21:27Z)
- Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with On-Policy Experience [9.06635747612495]
Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm.
SAC trains a policy by maximizing the trade-off between expected return and entropy.
It has achieved state-of-the-art performance on a range of continuous-control benchmark tasks.
arXiv Detail & Related papers (2021-09-24T06:46:28Z)
- Context-Based Soft Actor Critic for Environments with Non-stationary Dynamics [8.318823695156974]
We propose the Latent Context-based Soft Actor Critic (LC-SAC) method to address the aforementioned issues.
By minimizing the contrastive prediction loss function, the learned context variables capture the information of the environment dynamics and the recent behavior of the agent.
Experimental results show that the performance of LC-SAC is significantly better than the SAC algorithm on the MetaWorld ML1 tasks.
arXiv Detail & Related papers (2021-05-07T15:00:59Z)
- Meta-Learning with Neural Tangent Kernels [58.06951624702086]
We propose the first meta-learning paradigm in the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK).
Within this paradigm, we introduce two meta-learning algorithms, which no longer need a sub-optimal iterative inner-loop adaptation as in the MAML framework.
We achieve this goal by 1) replacing the adaptation with a fast-adaptive regularizer in the RKHS; and 2) solving the adaptation analytically based on the NTK theory.
arXiv Detail & Related papers (2021-02-07T20:53:23Z)
- Band-limited Soft Actor Critic Model [15.11069042369131]
Soft Actor Critic (SAC) algorithms show remarkable performance in complex simulated environments.
We take this idea one step further by artificially bandlimiting the target critic spatial resolution.
We derive the closed form solution in the linear case and show that bandlimiting reduces the interdependency between the low frequency components of the state-action value approximation.
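Band-limiting itself is just a low-pass operation. The sketch below applies a hard FFT cutoff to a 1-D signal to show the operation in isolation; the paper band-limits the target critic, and both the function name and `keep_frac` are illustrative assumptions, not the authors' parameters.

```python
import numpy as np

# Illustrative hard low-pass filter: zero out high-frequency FFT bins.
def bandlimit(signal, keep_frac=0.25):
    spec = np.fft.rfft(signal)
    cutoff = max(1, int(len(spec) * keep_frac))
    spec[cutoff:] = 0.0                      # discard high frequencies
    return np.fft.irfft(spec, n=len(signal))

t = np.arange(64) / 64
low = np.sin(2 * np.pi * t)        # 1 cycle: inside the pass band
high = np.sin(2 * np.pi * 20 * t)  # 20 cycles: above the cutoff
```

Components below the cutoff pass through essentially unchanged while those above it are removed, mirroring the idea of restricting the target's spatial resolution to its low-frequency content.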
arXiv Detail & Related papers (2020-06-19T22:52:43Z)
- Meta Reinforcement Learning with Autonomous Inference of Subtask Dependencies [57.27944046925876]
We propose and address a novel few-shot RL problem, where a task is characterized by a subtask graph.
Instead of directly learning a meta-policy, we develop a Meta-learner with Subtask Graph Inference.
Our experiment results on two grid-world domains and StarCraft II environments show that the proposed method is able to accurately infer the latent task parameter.
arXiv Detail & Related papers (2020-01-01T17:34:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all listed content) and is not responsible for any consequences.